The article addresses a topical problem of natural language processing: the development of methods for assessing the repetitiveness of textual documents and an empirical assessment of the potential of these methods for detecting the semantic load of texts. Until now, this problem has been approached mainly through the laws of statistical linguistics, such as Zipf’s, Pareto’s and Heaps’ laws, as well as through analyses of word-clustering phenomena and long-range correlations of word tokens. We have developed software for quantitative analysis of the repetitiveness of texts based on Ukkonen’s suffix-tree algorithm. We suggest analyzing the average repetitiveness parameter v0 and the corresponding standard deviation Δv. It is shown empirically that the algorithm implemented in our software has an approximately linear (quasilinear) time complexity O(L log L), where L denotes the length of the symbolic series, in agreement with theoretical predictions. The texts analyzed for repetitiveness were natural texts in English and a number of other alphabetic (non-hieroglyphic) languages, as well as random texts of the Miller’s-monkey type and natural texts randomized gradually at the linguistic levels of symbols, words and sentences. It is confirmed that, beyond the initial section of a natural text, where the repetition function fluctuates, this function saturates at a value v0 ≈ 0.5. Texts based on Miller’s random model are studied in detail. It is found that the behavior of the repetitiveness function and the parameters v0 and Δv for these texts is governed primarily by the alphabet size and the preset distribution of relative frequencies of the alphabet symbols. It is shown that the average repetition v0 for these texts is related to the Shannon information entropy in its simplest representation. We have ascertained that the repetitiveness for the “bag-of-words” model correlates with the average semantic load of a text. Finally, we suggest employing the repetition parameters v0 and Δv for distinguishing semantically meaningful natural texts from semantically empty stochastic symbolic time series.
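To make the suffix-tree-based measure concrete, a minimal Python sketch is given below. Since the abstract does not spell out the exact definition of the repetition function, the sketch assumes one common repetitiveness measure of this kind, the ratio of internal (branching) suffix-tree nodes to the text length L, which is known to saturate near 0.5 for natural-language texts (Golcher’s constant). The naive enumeration stands in for the Ukkonen construction mentioned above and is suitable only for short texts; the function name and the exact normalization are illustrative assumptions, not the article’s definitions.

```python
from collections import defaultdict

def repetition_ratio(text: str) -> float:
    """Illustrative repetitiveness measure (an assumption, not the article's
    exact definition): the number of right-branching repeated substrings,
    i.e. internal suffix-tree nodes other than the root, divided by the
    text length L.  Naive enumeration for short texts only; a long text
    would require Ukkonen's quasilinear suffix-tree construction instead."""
    followers = defaultdict(set)   # substring -> set of characters that follow it
    L = len(text)
    for i in range(L):
        for j in range(i + 1, L):
            followers[text[i:j]].add(text[j])
    # a substring followed by >= 2 distinct characters marks a branching node
    branching = sum(1 for succ in followers.values() if len(succ) >= 2)
    return branching / L

print(repetition_ratio("to be or not to be, that is the question"))
```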
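Likewise, a Miller-type random (“monkey”) text and the simplest, single-symbol form of the Shannon entropy mentioned above can be sketched as follows; the concrete alphabet, the preset symbol frequencies and the function names are illustrative assumptions rather than parameters taken from the article.

```python
import math
import random

def monkey_text(symbol_probs: dict, length: int, seed: int = 0) -> str:
    """Miller-type random ('monkey') text: every character, including the
    word-separating space, is drawn independently with a preset relative
    frequency.  The concrete probabilities are an illustrative assumption."""
    rng = random.Random(seed)
    symbols = list(symbol_probs)
    weights = list(symbol_probs.values())
    return "".join(rng.choices(symbols, weights=weights, k=length))

def shannon_entropy(symbol_probs: dict) -> float:
    """Shannon entropy in its simplest (i.i.d., single-symbol) representation,
    H = -sum_i p_i * log2(p_i), in bits per symbol."""
    return -sum(p * math.log2(p) for p in symbol_probs.values() if p > 0)

# Example: a 5-letter alphabet plus a space, with uneven preset frequencies.
probs = {"a": 0.3, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05, " ": 0.1}
text = monkey_text(probs, length=10_000)
print(shannon_entropy(probs), text[:60])
```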