Towards robust tags for scientific publications from natural language processing tools and Wikipedia

Łopuszyński, Michał; Bolikowski, Łukasz

dc.contributor.author	Łopuszyński, Michał
dc.contributor.author	Bolikowski, Łukasz
dc.date.accessioned	2015-04-28T08:05:55Z
dc.date.available	2015-04-28T08:05:55Z
dc.date.issued	2015
dc.identifier.citation	International Journal of Digital Libraries, vol. 16, pp. 25-36 (2015)	pl_PL
dc.identifier.uri	http://dx.doi.org/10.1007/s00799-014-0132-0
dc.identifier.uri	https://depot.ceon.pl/handle/123456789/6690
dc.description.abstract	In this work, two simple methods of tagging scientific publications with labels reflecting their content are presented and compared. As a first source of labels, Wikipedia is employed. A second label set is constructed from the noun phrases occurring in the analyzed corpus. The corpus itself consists of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We present a comparison of both approaches, which shows that the discussed methods are to a large extent complementary. Moreover, the results give interesting insights into the completeness of Wikipedia knowledge in various scientific domains. As a next step, we examine the statistical properties of the obtained tags. It turns out that both methods show qualitatively similar rank–frequency dependence, which is best approximated by the stretched exponential curve. The number of distinct tags per document also follows the same distribution for both methods and is well described by the negative binomial distribution. The developed tags are meant for use as features in various text mining tasks. Therefore, as a final step, we show preliminary results on their application to topic modeling.	pl_PL
dc.language.iso	en	pl_PL
dc.publisher	Springer	pl_PL
dc.rights	Creative Commons Attribution 3.0 Poland
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/pl/legalcode
dc.subject	ArXiv preprint collection	en
dc.subject	Wikipedia	en
dc.subject	natural language processing	en
dc.subject	tagging document collections	en
dc.title	Towards robust tags for scientific publications from natural language processing tools and Wikipedia	en
dc.type	info:eu-repo/semantics/article	pl_PL
dc.contributor.organization	Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw	pl_PL
dc.description.eperson	Michał Łopuszyński
dc.rights.DELETETHISFIELD	info:eu-repo/semantics/openAccess

Files in this item

Name: Lopuszynski_M_Bolikowski_L_Tow ...
Size: 1.701MB
Format: PDF

This item appears in the following collections

Artykuły ICM / ICM Articles [79]

Except where otherwise noted, this item's license is described as Creative Commons Attribution 3.0 Poland