Information Systems for Working With Text Corpora: Classification and Comparative Analysis

2024; pp. 273-289
1 Lviv Polytechnic National University, Information Systems and Networks Department
2 Lviv Polytechnic National University, Information Systems and Networks Department

The article examines information systems for working with text corpora, in particular their use in linguistic analysis and in managing large volumes of text data. Such systems are analyzed, classified, and compared in terms of their historical development and functional capabilities. The main focus is a comparison of the two most widely used systems that qualify functionally as corpus managers: ‘AntConc’ and ‘Sketch Engine’. They are evaluated against the following key criteria: corpus creation, text processing, annotation, storage and export, data analysis and visualization, interface intuitiveness, support for the Ukrainian language, and availability under an open license. The aim of the research was to compare these systems using the analytic hierarchy process (AHP) in order to determine their strengths and weaknesses under different usage conditions. ‘Sketch Engine’ was found to provide advanced capabilities for creating and managing large corpora and for annotating and visualizing data, which makes it the better choice for large research projects. ‘AntConc’, in turn, is a more accessible and efficient system for individual or small-scale research owing to its simplicity, absence of licensing costs, and support for specific text analysis parameters. The findings can help corpus and applied linguists choose systems for creating and working with text corpora, and can support decisions on selecting appropriate tools according to specific research needs, workload, and budget constraints. In addition, the authors can apply the results to improving existing information systems for corpus support and developing new ones in future scientific projects.
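The analytic hierarchy process mentioned above can be sketched as follows. This is a minimal illustration, not the authors' actual computation: it assumes the row geometric mean approximation of the priority vector, and every pairwise judgment in it is a made-up placeholder. The alternatives are deliberately named "System A" and "System B", only three of the eight criteria are used, and the function names are hypothetical.

```python
import math

# Saaty's random consistency index (RI) for matrix sizes 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_priorities(matrix):
    """Approximate the AHP priority vector by the row geometric mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Return CR = CI / RI; values below about 0.1 are usually considered acceptable."""
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    # Estimate lambda_max as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


def overall_scores(criteria_weights, alternative_matrices):
    """Combine local priorities of the alternatives using the criteria weights."""
    n_alt = len(alternative_matrices[0])
    scores = [0.0] * n_alt
    for weight, matrix in zip(criteria_weights, alternative_matrices):
        local = ahp_priorities(matrix)
        for k in range(n_alt):
            scores[k] += weight * local[k]
    return scores


# Hypothetical judgments, for illustration only (NOT taken from the article).
criteria = ["corpus creation", "annotation", "open license"]
criteria_matrix = [
    [1,     2,     6],
    [1 / 2, 1,     3],
    [1 / 6, 1 / 3, 1],
]
# One 2x2 comparison of System A vs System B per criterion.
alternative_matrices = [
    [[1, 1 / 3], [3, 1]],   # corpus creation: B moderately preferred
    [[1, 1 / 5], [5, 1]],   # annotation: B strongly preferred
    [[1, 7], [1 / 7, 1]],   # open license: A very strongly preferred
]

weights = ahp_priorities(criteria_matrix)
cr = consistency_ratio(criteria_matrix, weights)
scores = overall_scores(weights, alternative_matrices)

for name, score in zip(["System A", "System B"], scores):
    print(f"{name}: {score:.4f}")
print(f"criteria consistency ratio: {cr:.4f}")
```

Changing the criteria matrix to reflect a different usage scenario (for example, weighting the open license more heavily for a project with no budget) changes the ranking, which is how a single set of per-criterion comparisons can yield different recommendations under different usage conditions.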
