This paper presents the Small Synthetic Embedding Dataset for Ukrainian (SED-UA-small), a fully synthetic dataset designed for training, fine-tuning, and evaluating text embedding models. The use of large language models (LLMs) allows the diversity of the generated data to be controlled along aspects such as the NLP task, the asymmetry between queries and documents, the presence of instructions, support for various languages, and the avoidance of social biases. A zero-shot generation approach was used to create a set of Ukrainian query-document pairs with corresponding similarity scores. The dataset can be used to evaluate the quality of multilingual embedding models, as well as to train or fine-tune models to improve their effectiveness on Ukrainian texts. The paper provides a comprehensive description of the dataset construction process, including the parameters that influence the diversity of the generated texts and the large language models used to generate the data, and gives an example of using the dataset to evaluate and compare selected multilingual embedding models on the task of semantic textual similarity. Unlike existing Ukrainian datasets, which are mainly based on real texts, SED-UA-small is fully synthetic, providing greater flexibility in controlling the diversity and specificity of the data for the needs of training and evaluating embedding models, and allowing fast and cost-effective expansion of the dataset with high-quality entries if needed. We used a combination of open and proprietary large language models of different sizes to generate the first version of the dataset, consisting of 112 thousand text pairs divided into training (~50%), testing (25%), and validation (25%) sets. The data is publicly available at https://huggingface.co/datasets/suntez13/sed-ua-small-sts-v1.
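As a rough illustration (not taken from the paper itself), the snippet below sketches how such a semantic textual similarity evaluation might look with the published dataset. The split name and the column names `query`, `document`, and `score` are assumptions and should be checked against the dataset card; any of the multilingual embedding models compared in the paper could be substituted for the model used here.

```python
# Minimal sketch of an STS evaluation on SED-UA-small.
# Assumptions (verify on the dataset card): a "test" split exists, and the
# columns are named "query", "document", and "score".
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from scipy.stats import spearmanr

ds = load_dataset("suntez13/sed-ua-small-sts-v1", split="test")

# Any multilingual embedding model can be plugged in here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

queries = model.encode(ds["query"], convert_to_tensor=True)
documents = model.encode(ds["document"], convert_to_tensor=True)

# Cosine similarity between each query and its paired document
# (the diagonal of the full pairwise similarity matrix).
predicted = cos_sim(queries, documents).diagonal().cpu().numpy()

# Spearman correlation between model similarities and the annotated scores.
corr, _ = spearmanr(predicted, ds["score"])
print(f"Spearman correlation: {corr:.4f}")
```

Ranking models by Spearman correlation on the test split would mirror the kind of comparison of multilingual embedding models described in the paper.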