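Jak stworzyć korpus równoległy „dla wszystkich”? O pracy nad Polsko-Niemieckim i Niemiecko-Polskim Korpusem Równoległym [How to create a parallel corpus “for all”? On the work on the Polish-German and German-Polish Parallel Corpus]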
Date
2016
Author
Meger, Andreas
Woźniak, Michał
von Waldenfels, Ruprecht
Abstract
The article presents the Polish-German and German-Polish Parallel Corpus, which is currently being developed under the auspices of the University of Mainz, Germany. The corpus comprises about 1 million tokens of text in both translation directions and from various genres, at present mainly press and fictional prose. An expansion to other genres, e.g. legal documents and other specialized text types, is planned. The texts are tagged, lemmatized, and automatically aligned at sentence and word level using standard tools (UPlug, Hunalign). The article focuses on a new interface that was developed on the basis of the existing ParaVoz interface and published as open source. This new query interface aims to be “for all” in the sense that it includes a graphical query builder and also allows the user to enter sophisticated CQP queries directly, thus providing both ease of use and access to the full possibilities of the CQP query language, a close relative of the query language used with the IPI PAN query interface to the NKJP.
Besides being convenient, the interface has an educational aspect: inexperienced users can watch correct CQP queries being constructed on the fly as they make choices in the graphical interface, which helps them learn a query language that is straightforward but also rather strict, formal, and technical. The interface thus flattens what is often a rather steep learning curve for users who are not used to such query languages, such as many traditionally inclined linguists. The interface is available in German, Polish, and English and is implemented in AngularJS, a modern framework that affords smooth interaction and uncomplicated customization and maintenance of the interface.
Search facilities offer queries by lemma and grammatical tag, as well as filtering of results on the basis of metadata, for example by source language and genre.
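To illustrate how choices in a graphical builder can map onto a CQP query string, here is a minimal sketch in Python. It is not the interface's actual code (the abstract says the interface is implemented in AngularJS). The function names and the attribute names `lemma`, `tag`, and `text_genre` are hypothetical, since the real names depend on how the corpus is encoded.

```python
# Hypothetical sketch: turn form choices (lemma, tag, genre) into a CQP query.
# Attribute names "lemma", "tag" and "text_genre" are assumptions, not taken
# from the corpus described in the article.

_SPECIAL = set('\\"()[]{}.*+?|^$')


def escape_literal(value):
    """Backslash-escape regex metacharacters and quotes so that the value
    is matched literally inside a CQP string."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in value)


def build_query(lemma=None, tag=None, genre=None):
    """Build a single-token CQP query from optional form choices."""
    conditions = []
    if lemma:
        conditions.append('lemma="' + escape_literal(lemma) + '"')
    if tag:
        # The tag is passed through as a regular expression (e.g. "subst.*");
        # only double quotes are escaped.
        conditions.append('tag="' + tag.replace('"', '\\"') + '"')
    query = "[" + " & ".join(conditions) + "]"
    if genre:
        # Metadata filter written as a CQP global constraint on the match.
        query += ' :: match.text_genre="' + escape_literal(genre) + '"'
    return query + ";"


if __name__ == "__main__":
    print(build_query(lemma="dom", tag="subst.*", genre="press"))
```

Showing the generated string next to the form, as the abstract describes, lets a user see how each choice becomes one condition inside the square brackets.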
The queries generated in this interface are then evaluated by an Open Corpus Workbench (CWB) backend, which has been modified to output XML. The output is transformed to HTML using client-side XSLT. Unlike earlier versions of the interface, word alignment is now routinely visualized: the equivalents of the word forms matched by the query in the first language are highlighted in the results in the second language. The article gives an in-depth description of the rationale and the solutions adopted, and concludes with an outlook on future developments.
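The word-alignment highlighting described above can be sketched as follows. This is only an illustration, assuming the alignment is available as pairs of (source token index, target token index). The function name, the data layout, and the `<b>` markup are hypothetical and are not taken from the interface, which renders its results through XSLT.

```python
# Hypothetical sketch: highlight target-language tokens that are aligned
# with the source-language tokens matched by a query.
from html import escape


def highlight_equivalents(target_tokens, alignment, matched_source):
    """Return the target sentence as HTML, wrapping in <b>...</b> every token
    aligned to one of the matched source token indices.

    alignment: iterable of (source_index, target_index) pairs.
    matched_source: indices of the source tokens found by the query.
    """
    matched = set(matched_source)
    hit_targets = {t for s, t in alignment if s in matched}
    rendered = []
    for i, token in enumerate(target_tokens):
        text = escape(token)  # keep the output safe to embed in HTML
        rendered.append("<b>" + text + "</b>" if i in hit_targets else text)
    return " ".join(rendered)


if __name__ == "__main__":
    # Source: "Ich sehe das Haus" (query matched "Haus", index 3)
    target = ["Widzę", "dom"]
    pairs = [(1, 0), (3, 1)]
    print(highlight_equivalents(target, pairs, [3]))
```

Because one source token may align with several target tokens or with none, the sketch collects all aligned target indices in a set rather than assuming a one-to-one mapping.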
