The Polish-Ukrainian Parallel Corpus PolUKR and Its Successor PolUKR-2

Kotsyba, Natalia

Abstract

The paper discusses the current state of one part of an ongoing project to create electronic resources for the Ukrainian language. Parallel corpora form an important part of this project. The Polish-Ukrainian Parallel Corpus (PolUKR) was developed in 2004-2010, first at the Institute of Slavic Studies of the Polish Academy of Sciences and later at the Faculty “Artes Liberales” of the University of Warsaw. The first two versions of PolUKR can be searched online at http://domeczek.pl/~polukr. PolUKR consists of texts written originally in either Polish or Ukrainian; that is, it contains no texts translated from a third language, only direct translations of its own texts. The corpus was first aligned automatically at the sentence level, and the alignments were then edited manually. Both the Polish and the Ukrainian sentences were given a layer of morphosyntactic annotation. A characteristic feature of PolUKR is its purpose-built system of morphosyntactic categories, shared by the two corpus languages, and the morphosyntactic tagsets based on it. The tagsets are also used in the multilingual European project MULTEXT-East (1996-2010), version 4 “MONDILEX”, available at http://nl.ijs.si/ME/V4/. While the pilot versions of PolUKR concentrated on developing corpus-building technologies, in both their technical and theoretical linguistic aspects, the new version, currently being developed in cooperation with the National University of Lviv and Lviv Polytechnical University in Ukraine, aims at: 1) above all, extending the corpus to 30 million words (as before, giving the greatest possible priority to texts originally written in Polish or Ukrainian, but without making this a strict requirement); 2) optimizing the morphosyntactic description of Ukrainian, i.e., disambiguating ambiguous interpretations and extending the grammatical dictionary to cover new, unknown words. Work on shallow syntax for Ukrainian is also planned. PolUKR-2 will serve as the basic corpus resource for creating a large Ukrainian-Polish dictionary with ca. 80 thousand entries.

Description

Gruszczyńska, Ewa; Leńko-Szymańska, Agnieszka (eds.) (2016). Polskojęzyczne korpusy równoległe. Polish-language Parallel Corpora. Warszawa: Instytut Lingwistyki Stosowanej, pp. 133-142.