natural language processing

SED-UA-Small: Ukrainian Synthetic Dataset for Text Embedding Models

This paper presents the Small Synthetic Embedding Dataset, a fully synthetic Ukrainian dataset designed for training, fine-tuning, and evaluating text embedding models. The use of large language models (LLMs) allows for controlling the diversity of generated data in aspects such as NLP tasks, asymmetry between queries and documents, the presence of instructions, support for various languages, and avoidance of social biases. A zero-shot generation approach was used to create a set of Ukrainian query-document pairs with corresponding similarity scores.

Development of an Automated Natural Language Text Analysis System Using Transformers

The article studies the development of an automated medical text analysis system using modern artificial intelligence and natural language processing technologies. The current state and prospects for the development of automated medical text analysis are analyzed. The main methods and technologies used in this field, including machine learning, deep learning, and natural language processing, are examined.

Development of a Unified Output Format for Text Parsers in the Ontology Construction System From Text Documents

The challenge of effectively constructing ontologies from text documents remains unresolved, posing a critical gap in modern knowledge extraction methodologies. One of the primary obstacles is the lack of a standardized output format across different NLP tools, particularly text parsers, which serve as the foundational step in multi-stage knowledge extraction processes. While several widely used text parsers exist, each excels in specific functions, making it beneficial to leverage multiple parsers for more comprehensive ontology construction.

Comparison and Clustering of Textual Information Sources Based on the Cosine Similarity Algorithm

This article presents a study aimed at developing an optimal concept for analyzing and comparing information sources based on large amounts of text information using natural language processing (NLP) methods. The object of the study was Telegram news channels, which are used as sources of text data. Pre-processing of texts was carried out, including cleaning, tokenization and lemmatization, to form a global dictionary consisting of unique words from all information sources.
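The pipeline described in this abstract (pre-processing, a global dictionary of unique words, and cosine similarity between sources) can be illustrated with a minimal sketch. It is not the authors' implementation: the sample texts, the channel names, and the helper functions `tokenize`, `build_vocabulary`, `to_vector`, and `cosine_similarity` are hypothetical, the vectors are assumed to be raw word counts, and lemmatization is omitted to keep the example dependency-free.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (no lemmatization here)."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(token_lists):
    """Collect the unique words of all sources into one sorted global dictionary."""
    return sorted({tok for tokens in token_lists for tok in tokens})


def to_vector(tokens, vocabulary):
    """Represent one source as word counts over the global dictionary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for the text collected from three news channels.
sources = {
    "channel_a": "Markets rise as the central bank holds rates",
    "channel_b": "Central bank holds rates and markets rise",
    "channel_c": "Local team wins the football cup final",
}

tokens = {name: tokenize(text) for name, text in sources.items()}
vocabulary = build_vocabulary(tokens.values())
vectors = {name: to_vector(toks, vocabulary) for name, toks in tokens.items()}

sim_ab = cosine_similarity(vectors["channel_a"], vectors["channel_b"])
sim_ac = cosine_similarity(vectors["channel_a"], vectors["channel_c"])
print(f"channel_a vs channel_b: {sim_ab:.3f}")
print(f"channel_a vs channel_c: {sim_ac:.3f}")
```

In this sketch, sources that share most of their vocabulary score close to 1 and unrelated sources score close to 0, so the pairwise scores could serve as input to a clustering step.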

Decision Support System for Disinformation, Fakes and Propaganda Detection Based on Machine Learning

Because creating and distributing news via the Internet has become simpler, and because it is physically impossible to check the large volumes of information circulating online, the spread of disinformation and fake news has increased significantly. A decision support system for identifying disinformation, fakes and propaganda based on machine learning has been built. The method of news text analysis for identifying fakes and detecting disinformation in news texts has been studied.

Intelligent Fake News Prediction System Based on NLP and Machine Learning Technologies

The article describes a study of fake news identification based on natural language processing, big data analysis and deep learning. The developed system automatically checks news for signs of fakery, such as the use of manipulative language, unverified sources and unreliable information. Data visualization is implemented through a user-friendly interface that displays the results of news analysis in a convenient and understandable format.

Intelligent System for Complex Military Information Analysis Based on Machine Learning and NLP to Assist Tactical Links Commanders

The article describes the results of research into the complex analysis of military information based on machine learning and natural language processing to help commanders of tactical units. The system should give users the following capabilities: combining the dictionary and information material, adding terms and abbreviations to the dictionary, classifying objects for radio technical intelligence, visualizing aerial objects, classifying aerial objects, using information materials, and organizing information materials.

An Arabic question generation system based on a shared BERT-base encoder-decoder architecture

A Question Generation System (QGS) is a sophisticated piece of AI technology designed to automatically generate questions from a given text, document, or context. Recently, this technology has gained significant attention in various fields, including education and content creation. As AI continues to evolve, these systems are likely to become even more advanced and to be viewed as an inherent part of any modern e-learning or knowledge assessment system. In this research paper, we showcase the effectiveness of leveraging pre-trained checkpoints for Arabic questions generat