Text normalization during pre-corpus preparation: experience of application

Ihor Kulchytskyy

The article analyses the experience of normalizing texts before they are added to the corpus of literary works of Naddnistrian Ukraine. Work on the corpus began at the Department of Applied Linguistics of Lviv Polytechnic National University. Normalization here means a set of procedures that make texts suitable for inclusion in the corpus: converting all texts to a single character encoding, checking punctuation for consistency (entities with the same meaning should be marked with the same character), removing unnecessary characters (for example, empty paragraphs or several spaces in a row), unifying formatting tools and methods, and so on. The MS Word editor is proposed as the normalization environment, and the Python programming language is used to create additional software tools. The text normalization process consists of the following stages: normalization of encoding, normalization of graphics, proofreading, and technical normalization of punctuation. The characteristics of each stage are presented, the problems that arise during their implementation are identified, and ways to overcome them are suggested. Conclusions are drawn.
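The procedures listed above can be illustrated with a minimal Python sketch. It is not the author's tooling: the function names, the assumed legacy encoding (cp1251), and the apostrophe mapping are all hypothetical choices made for illustration. The sketch covers only three of the described steps: bringing text to one encoding, marking identical entities with one character, and removing redundant spaces and empty paragraphs.

```python
import re
import unicodedata

# Illustrative assumption: several apostrophe look-alikes are mapped to one
# character. Which character a real corpus should use is a project decision.
APOSTROPHE_VARIANTS = {"\u2019": "'", "\u02BC": "'", "`": "'"}


def decode_text(raw: bytes, encoding: str = "cp1251") -> str:
    """Decode bytes to Unicode and apply NFC normalization.

    The source encoding is an assumption; real files may need detection.
    """
    return unicodedata.normalize("NFC", raw.decode(encoding))


def unify_characters(text: str, table: dict = APOSTROPHE_VARIANTS) -> str:
    """Replace every variant character in `table` with its chosen equivalent."""
    return text.translate(str.maketrans(table))


def clean_whitespace(text: str) -> str:
    """Collapse runs of spaces and drop empty paragraphs."""
    # Runs of spaces, tabs, and no-break spaces become a single space.
    text = re.sub(r"[ \t\u00A0]+", " ", text)
    # Each line is treated as a paragraph; empty ones are removed.
    paragraphs = [p.strip() for p in text.splitlines()]
    return "\n".join(p for p in paragraphs if p)


def normalize(raw: bytes, encoding: str = "cp1251") -> str:
    """Run the three illustrative steps in order."""
    return clean_whitespace(unify_characters(decode_text(raw, encoding)))
```

For example, calling `normalize` on cp1251-encoded bytes that contain a typographic apostrophe, doubled spaces, and blank lines returns a Unicode string with a single apostrophe form, single spaces, and no empty paragraphs. Proofreading and the other stages named in the abstract are outside the scope of this sketch.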
