Method for Recognizing the Characteristic Elements of Protein Secondary Structure From the LLM of Its Amino Acid Sequence

2024; pp. 460-468
Authors:
1 Lviv Polytechnic National University, ISN, Lviv, Ukraine

The spatial structure of a protein determines its biochemical properties and, consequently, its function. The same holds for elements of secondary structure, which form helices, coiled coils, strands, sheets, and other shapes in three-dimensional space. Automatically detecting such formations from the corresponding amino acid sequences would make it possible to catalog these sequence fragments and to examine and systematize how they correspond to spatial protein formations. This, in turn, should simplify the search for complementarity and functional similarity among different proteins. For this purpose, a method based on covariance, autocorrelation, and spatial-spectral analysis of embeddings of protein amino acid sequences has been developed and tested.
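The abstract names the ingredients of the method (covariance, autocorrelation, and spectral analysis of sequence embeddings) but does not spell out the algorithm. The Python sketch below is therefore only an illustration of how such an analysis might look, not the authors' implementation. It assumes a per-residue embedding track is already available as a list of vectors. Here the track is synthetic and built to repeat every 3.6 residues, the well-known pitch of an alpha helix. All function names (`center`, `autocorrelation`, `power_spectrum`, `dominant_period`) are hypothetical.

```python
import cmath
import math


def center(track):
    """Subtract the per-dimension mean from every residue embedding."""
    n, dim = len(track), len(track[0])
    means = [sum(vec[j] for vec in track) / n for j in range(dim)]
    return [[vec[j] - means[j] for j in range(dim)] for vec in track]


def autocorrelation(track, max_lag):
    """Lagged covariance along the sequence, normalized by the zero-lag value."""
    x = center(track)
    n = len(x)

    def lagged_cov(k):
        # Mean dot product between embeddings that are k residues apart.
        pairs = zip(x[: n - k], x[k:])
        return sum(sum(a * b for a, b in zip(u, v)) for u, v in pairs) / (n - k)

    c0 = lagged_cov(0)
    return [lagged_cov(k) / c0 for k in range(max_lag + 1)]


def power_spectrum(track):
    """Naive DFT power per frequency bin, summed over embedding dimensions."""
    x = center(track)
    n, dim = len(x), len(x[0])
    spectrum = {}
    for f in range(1, n // 2 + 1):
        power = 0.0
        for j in range(dim):
            coeff = sum(
                x[t][j] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)
            )
            power += abs(coeff) ** 2 / n
        spectrum[f] = power
    return spectrum


def dominant_period(track):
    """Period (in residues) of the strongest spectral component."""
    spectrum = power_spectrum(track)
    best = max(spectrum, key=spectrum.get)
    return len(track) / best


# ASSUMPTION: a synthetic 2-dimensional "embedding" with helix-like periodicity.
# A real track would come from a protein language model, which is not shown here.
HELIX_PERIOD = 3.6
track = [
    [math.cos(2 * math.pi * i / HELIX_PERIOD), math.sin(2 * math.pi * i / HELIX_PERIOD)]
    for i in range(36)
]

ac = autocorrelation(track, max_lag=18)
period = dominant_period(track)
print("autocorrelation at lags 0, 9, 18:", ac[0], ac[9], ac[18])
print("dominant period (residues):", period)
```

In this toy setting, a sequence window whose dominant period lies near 3.6 residues would be a candidate helical fragment. Real embeddings are high-dimensional and noisy, so any decision threshold would have to be calibrated against known structures.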
