This paper presents the Small Synthetic Embedding Dataset for Ukrainian (SED-UA-small), a fully synthetic dataset designed for training, fine-tuning, and evaluating text embedding models. The use of large language models (LLMs) allows the diversity of the generated data to be controlled along aspects such as the NLP task, the asymmetry between queries and documents, the presence of instructions, support for various languages, and the avoidance of social biases. A zero-shot generation approach was used to create a set of Ukrainian query-document pairs with corresponding similarity scores. The dataset can be used to evaluate the quality of multilingual embedding models, as well as to train or fine-tune models to improve their effectiveness on Ukrainian texts. The paper provides a comprehensive description of the dataset construction process, including the parameters that influence the diversity of the generated texts and the large language models used to generate the data, and gives an example of using the dataset to evaluate and compare selected multilingual embedding models on the task of semantic textual similarity. Unlike existing Ukrainian datasets, which are mainly based on real texts, SED-UA-small is fully synthetic, providing greater flexibility in controlling the diversity and specificity of the data for the needs of training and evaluating embedding models, and allowing fast and cost-effective expansion of the dataset with high-quality entries if needed. We used a combination of open and proprietary large language models of different sizes to generate the first version of the dataset, consisting of 112 thousand text pairs divided into training (~50%), testing (25%), and validation (25%) sets. The data is publicly available at https://huggingface.co/datasets/suntez13/sed-ua-small-sts-v1.
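As a rough illustration (not taken from the paper itself), the snippet below sketches how such a semantic textual similarity evaluation might look with the published dataset. The split name and the column names `query`, `document`, and `score` are assumptions and should be checked against the dataset card; any of the multilingual embedding models compared in the paper could be substituted for the model used here.

```python
# Minimal sketch of an STS evaluation on SED-UA-small.
# Assumptions (verify on the dataset card): a "test" split exists, and the
# columns are named "query", "document", and "score".
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from scipy.stats import spearmanr

ds = load_dataset("suntez13/sed-ua-small-sts-v1", split="test")

# Any multilingual embedding model can be plugged in here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

queries = model.encode(ds["query"], convert_to_tensor=True)
documents = model.encode(ds["document"], convert_to_tensor=True)

# Cosine similarity between each query and its paired document
# (the diagonal of the full pairwise similarity matrix).
predicted = cos_sim(queries, documents).diagonal().cpu().numpy()

# Spearman correlation between model similarities and the annotated scores.
corr, _ = spearmanr(predicted, ds["score"])
print(f"Spearman correlation: {corr:.4f}")
```

Ranking models by Spearman correlation on the test split would mirror the kind of comparison of multilingual embedding models described in the paper.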