SED-UA-Small: Ukrainian Synthetic Dataset for Text Embedding Models
This paper presents the Small Synthetic Embedding Dataset (SED-UA-Small), a fully synthetic Ukrainian-language dataset designed for training, fine-tuning, and evaluating text embedding models. Generating data with large language models (LLMs) makes it possible to control the diversity of the generated data along several dimensions: NLP task, asymmetry between queries and documents, presence of instructions, language coverage, and avoidance of social biases. A zero-shot generation approach was used to create a set of Ukrainian query-document pairs with corresponding similarity scores.