A.I. Kanev¹, K.S. Myshenkov², S.W. Panjukova³, Jiawen Xie⁴
¹–⁴ Bauman Moscow State Technical University (Moscow, Russia)
¹ aikanev@bmstu.ru, ² myshenkovks@bmstu.ru, ³ Panyukova@bmstu.ru, ⁴ xz10052001@gmail.com
As intercultural exchange grows, literature is translated into many languages, which makes the comparative study of different versions of a literary work highly relevant. One such work is the novel Oliver Twist, which is available in English, Russian, and Chinese versions. A key feature of written Chinese is the absence of spaces between words, so Chinese text requires an additional word segmentation step.
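To illustrate what word segmentation involves, here is a minimal sketch of greedy forward maximum matching, one simple dictionary-based approach. It is not necessarily the method used in the study; the function name `segment_fmm`, the `max_len` parameter, the toy vocabulary, and the sample sentence are all assumptions made for illustration.

```python
def segment_fmm(text, vocabulary, max_len=4):
    """Greedy forward maximum matching (illustrative sketch).

    At each position, take the longest substring (up to max_len characters)
    found in the vocabulary; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary and sentence, not taken from the study.
toy_vocabulary = {"奥利弗", "孤儿", "是", "一个"}
sentence = "奥利弗是一个孤儿"

word_tokens = segment_fmm(sentence, toy_vocabulary)  # text with segmentation
char_tokens = list(sentence)                         # text without segmentation
```

Here `word_tokens` holds multi-character words, while `char_tokens` treats every character as its own unit; these are the two kinds of token streams that a with/without-segmentation comparison would count.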
The aim of the study is to conduct a statistical analysis of word frequency in three versions of the novel using Shannon’s and Hartley’s formulas. This makes it possible to study the distribution of words and the information complexity of the text in different languages, and to compare two variants of the Chinese text separately: with and without word segmentation.
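As a rough sketch of how such measures can be computed from token counts, the code below assumes the standard forms of the two formulas: Shannon entropy H = -Σ pᵢ log₂ pᵢ over relative token frequencies, and the Hartley measure H₀ = log₂ N, where N is the number of distinct tokens. The function names and the sample sentence are illustrative assumptions; the study's exact preprocessing is not described here.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy in bits per token, from relative token frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("tokens must not be empty")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def hartley_measure(tokens):
    """Hartley measure in bits: log2 of the number of distinct tokens."""
    distinct = len(set(tokens))
    if distinct == 0:
        raise ValueError("tokens must not be empty")
    return math.log2(distinct)


# Hypothetical sample text, split on whitespace for simplicity.
sample_tokens = "the cat saw the dog and the dog saw the cat".split()

entropy_bits = shannon_entropy(sample_tokens)
hartley_bits = hartley_measure(sample_tokens)
```

The Hartley measure is the upper bound that Shannon entropy reaches only when all distinct tokens are equally frequent, so the gap between the two values indicates how uneven the frequency distribution is.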
The study revealed significant differences in lexical composition and word frequency across the English, Russian, and Chinese versions of Oliver Twist. The English version demonstrated stable word frequencies, while the Russian version showed greater variability in lexical usage. The Chinese version, in turn, was characterized by a rich vocabulary: information entropy was high when the full vocabulary was used and significantly lower when individual characters were used. The results for the Chinese text with word segmentation were closer to those for English and Russian than the results for the variant without segmentation.
The results of the study are important for linguistics and translation theory, as well as for cultural studies and the development of translation software. Statistical analysis provides a better understanding of how different languages convey information and can aid in the creation of more effective translation and text adaptation methods.
Kanev A.I., Myshenkov K.S., Panjukova S.W., Jiawen Xie. Distribution of words in different language texts. Dynamics of complex systems. 2025. V. 19. № 5. P. 20−28. DOI: 10.18127/j19997493-202505-03 (in Russian).

