A statistical approach to coronavirus classification based on nucleotide distributions

2024; pp. 987–994
https://doi.org/10.23939/mmc2024.04.987
Received: May 24, 2024
Accepted: October 18, 2024

Husiev M., Rovenchak A.  A statistical approach to coronavirus classification based on nucleotide distributions.  Mathematical Modeling and Computing. Vol. 11, No. 4, pp. 987–994 (2024)

1 Professor Ivan Vakarchuk Department for Theoretical Physics, Ivan Franko National University of Lviv
2 Professor Ivan Vakarchuk Department for Theoretical Physics, Ivan Franko National University of Lviv; SoftServe, Inc.

The objective of this study is to analyze the RNA genomes of coronaviruses using parameters derived from the distributions of nucleotide sequences in their RNA.  The nucleotide sequences were obtained by replacing one nucleotide base (adenine) with a "whitespace", so that the viral RNA splits into sequences of the remaining bases; empty sequences were denoted as "x".  Statistical spectra constructed for these sequences exhibited three distinct peaks that were consistent across the studied species.  Parameters based on the rank–frequency distributions of the obtained nucleotide sequences, on sequence lengths, and on some other statistical characteristics were calculated.  Principal components were built from these parameters and served as the basis for grouping the studied viruses.  The most relevant parameters were used to build a naïve Bayes classifier, which estimates the probability that a virus belongs to a given group of viruses in the model.
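The splitting step and the rank–frequency distribution described above can be illustrated with a minimal Python sketch.  It is not the authors' code: the function names, the toy fragment, and the choice to map every empty piece (including any at the ends of the string) to "x" are assumptions made only for illustration.

```python
from collections import Counter


def split_on_adenine(rna):
    """Split an RNA string at every adenine ('A').

    Assumption: every empty piece, e.g. between two adjacent
    adenines, is recorded as the placeholder token 'x'.
    """
    return [piece if piece else "x" for piece in rna.upper().split("A")]


def rank_frequency(tokens):
    """Return (rank, token, count) triples, most frequent token first.

    Ties in count are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, tok, n) for rank, (tok, n) in enumerate(ordered, start=1)]


def mean_length(tokens):
    """Mean token length, counting the empty token 'x' as length 0."""
    return sum(0 if t == "x" else len(t) for t in tokens) / len(tokens)


# Made-up fragment, not a real coronavirus sequence.
toy = "GAACUAGGACUAAG"
tokens = split_on_adenine(toy)
print(tokens)
for rank, tok, n in rank_frequency(tokens):
    print(rank, tok, n)
print(round(mean_length(tokens), 3))
```

Quantities of this kind, computed per genome, could then serve as input features for a principal component analysis and a naïve Bayes classifier; which parameters the study actually uses is specified in the full paper, not in this sketch.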

  1. Artime O., De Domenico M.  From the origin of life to pandemics: emergent phenomena in complex systems.  Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.  380 (2227), 20200410 (2022).
  2. Canfora G., Mercaldo F., Santone A.  A novel classification technique based on formal methods.  ACM Transactions on Knowledge Discovery from Data.  17 (8), 1–30 (2023).
  3. Raman R., Gupta N., Jeppu Y.  Framework for formal verification of machine learning based complex system-of-systems.  Insight.  26 (1), 91–102 (2023).
  4. Holovatch Y., Kenna R., Thurner S.  Complex systems: physics beyond physics.  European Journal of Physics.  38 (2), 023002 (2017).
  5. Newman M.  Networks.  Oxford University Press; 2nd edition (2018).
  6. Tabish M., Azim S., Hussain M. A., Rehman S. U., Sarwar T., Ishqi H. M.  Bioinformatics approaches in studying microbial diversity.  In: Malik A., Grohmann E., Alves M. (eds.) Management of Microbial Resources in the Environment, pp. 119–140. Springer, Dordrecht (2013).
  7. Borkin L. J., Litvinchuk S. N., Rosanov Yu. M., Skorinov D. V.  On cryptic species (an example of amphibians).  Entomological Review.  84 (Suppl 1), S75–S98 (2004).
  8. Husev M., Rovenchak A.  On the verge of life: Distribution of nucleotide sequences in viral RNAs.  Biosemiotics.  14 (2), 253–269 (2021).
  9. Husev M., Rovenchak A.  Parametrization of rank-frequency distributions of nucleotide sequences in virus RNAs.  Visnyk Lviv Univ. Ser. Phys.  58, 72–84 (2021).
  10. Looi M.-K.  Covid-19: Scientists sound alarm over new BA.2.86 "Pirola" variant.  BMJ.  2023, p1964 (2023).
  11. Meo S. A., Meo A. S., Klonoff D. C.  Omicron new variant BA.2.86 (Pirola): Epidemiological, biological, and clinical characteristics – a global data-based analysis.  European Review for Medical and Pharmacological Sciences.  27 (19), 9470–9476 (2023).
  12. Hemo M. K., Islam M. A.  JN.1 as a new variant of COVID-19 – editorial.  Annals of Medicine & Surgery.  86 (4), 1833–1835 (2024).
  13. Abou-Nouh H., El Khomsi M.  Viable control of COVID-19 spread with vaccination.  Mathematical Modeling and Computing.  11 (1), 203–210 (2024).
  14. Chen Yuzhou, Gel Y. R., Marathe M. V., Poor H. V.  A simplicial epidemic model for COVID-19 spread analysis.  Proceedings of the National Academy of Sciences.  121 (1), e2313171120 (2024).
  15. Rovenchak A.  Telling apart Felidae and Ursidae from the distribution of nucleotides in mitochondrial DNA.  Modern Physics Letters B.  32 (05), 1850057 (2018).
  16. Shannon C. E.  A mathematical theory of communication.  The Bell System Technical Journal.  27 (3), 379–423 (1948).
  17. Kelih E., Antić G., Grzybek P., Stadlober E.  Classification of author and/or genre? The impact of word length.  In: Weihs C., Gaul W. (eds.), Classification – the Ubiquitous Challenge, pp. 498–505. Springer-Verlag, Berlin–Heidelberg (2005).
  18. Zörnig P., Kelih E., Fuks L.  Classification of Serbian texts based on lexical characteristics and multivariate statistical analysis.  Glottotheory.  7 (1), 41–66 (2016).
  19. Rovenchak A., Rovenchak O.  Quantifying comprehensibility of Christmas and Easter addresses from the Ukrainian Greek Catholic Church hierarchs.  Glottometrics.  41, 57–66 (2018).
  20. Rovenchak A.  Approaches to the classification of complex systems: Words, texts, and more.  In: Holovatch Yu. (ed.), Order, Disorder and Criticality, vol. 7, pp. 209–246. World Scientific (2023).
  21. Chua K. C., Chandran V., Acharya U. R., Lim C. M.  Application of higher order statistics/spectra in biomedical signals–A review.  Medical Engineering & Physics.  32 (7), 679–689 (2010).
  22. Bland M., Altman D.  Statistics notes: Measurement error.  BMJ.  312 (7047), 1654 (1996).
  23. Tipping M. E., Bishop C. M.  Probabilistic principal component analysis.  Journal of the Royal Statistical Society Series B: Statistical Methodology.  61 (3), 611–622 (1999).
  24. Jolliffe I. T., Cadima J.  Principal component analysis: a review and recent developments.  Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.  374 (2065), 20150202 (2016).
  25. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay E.  Scikit-learn: Machine learning in Python.  Journal of Machine Learning Research.  12, 2825–2830 (2011).
  26. Principal component analysis (PCA).  https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.