Information system for extraction of information from open web resources

Petro Zdebskyi; Andrii Berko; Lyubomyr Chyrun

The purpose of this work is to design an information and reference system that finds answers to questions posed in the superlative (highest) degree of comparison, using text content from open English-language web resources. Examples of such questions are “What is the best book ever?” and “What is the most popular IDE for Python?”. The system outputs a ranked list of answers based on how frequently each answer option appears. Each element of the list is also assigned a numerical estimate of the probability that this answer is preferred over the others, and the results are ranked by this metric. The system works with questions that have no unequivocal answer, which distinguishes it from classic question-answering (QA) systems. The latter assume that a question has exactly one true answer and often work with well-known facts, such as the date of birth of a famous person or the population of a certain country. In contrast, the proposed system answers subjective questions, for example “What is the best book in the fantasy genre?” or “What is the best programming language?”, and bases its answers on the popularity of each candidate. Proper names, identified through N-gram analysis, also serve as keywords for forming the answer to the question.
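
The frequency-based ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that proper names can be approximated by runs of up to three consecutive capitalized tokens, that words from the question and a small hypothetical stop list are not answers, that each text fragment votes at most once per candidate, and that the share of votes stands in for the preference probability. The function names and sample fragments are invented for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*")
# Hypothetical stop list of words that may be capitalized but are not names.
STOPWORDS = {"i", "the", "it", "my", "this", "but", "in"}


def proper_name_candidates(text, excluded, max_n=3):
    """Return runs of up to max_n consecutive capitalized tokens (a crude proper-name proxy)."""
    candidates, run = [], []
    for token in TOKEN.findall(text) + [""]:  # "" is a sentinel that flushes the last run
        is_name_part = bool(token) and token[0].isupper() and token.lower() not in excluded
        if is_name_part and len(run) < max_n:
            run.append(token)
            continue
        if run:
            candidates.append(" ".join(run))
        run = [token] if is_name_part else []
    return candidates


def rank_answers(question, snippets):
    """Rank candidate answers by the share of fragment-level mentions they receive."""
    excluded = STOPWORDS | {t.lower() for t in TOKEN.findall(question)}
    counts = Counter()
    for snippet in snippets:
        # Assumption: each fragment votes once per candidate, however often it repeats it.
        counts.update(set(proper_name_candidates(snippet, excluded)))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(name, count / total) for name, count in counts.most_common()]


question = "What is the most popular IDE for Python?"
snippets = [
    "I think PyCharm is the best IDE for Python.",
    "PyCharm is great, but Visual Studio Code is more popular.",
    "Visual Studio Code is the most popular editor.",
    "PyCharm wins for me.",
]
ranking = rank_answers(question, snippets)
for name, share in ranking:
    print(f"{name}: {share:.2f}")
# PyCharm: 0.60
# Visual Studio Code: 0.40
```

In a fuller system the capitalization heuristic would presumably give way to part-of-speech tagging, and the fragments would come from retrieved web pages rather than a hard-coded list.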

Keywords: similarity of text fragments; part-of-speech tagging; N-gram; TF-IDF; TextRank
