dc.contributor.author | Łopuszyński, Michał | |
dc.contributor.author | Bolikowski, Łukasz | |
dc.date.accessioned | 2015-04-28T08:05:55Z | |
dc.date.available | 2015-04-28T08:05:55Z | |
dc.date.issued | 2015 | |
dc.identifier.citation | International Journal of Digital Libraries, vol. 16, pp. 25-36 (2015) | pl_PL |
dc.identifier.uri | http://dx.doi.org/10.1007/s00799-014-0132-0 | |
dc.identifier.uri | https://depot.ceon.pl/handle/123456789/6690 | |
dc.description.abstract | In this work, two simple methods of tagging scientific publications with labels reflecting their content are presented and compared. As a first source of labels, Wikipedia is employed. A second label set is constructed from the noun phrases occurring in the analyzed corpus. The corpus itself consists of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We present a comparison of both approaches, which shows that discussed methods are to a large extent complementary. Moreover, the results give interesting insights into the completeness of Wikipedia knowledge in various scientific domains. As a next step, we examine the statistical properties of the obtained tags. It turns out that both methods show qualitatively similar rank–frequency dependence, which is best approximated by the stretched exponential curve. The distribution of the number of distinct tags per document follows also the same distribution for both methods and is well described by the negative binomial distribution. The developed tags are meant for use as features in various text mining tasks. Therefore, as a final step we show the preliminary results on their application to topic modeling. | pl_PL |
dc.language.iso | en | pl_PL |
dc.publisher | Springer | pl_PL |
dc.rights | Creative Commons Attribution 3.0 Poland | |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/pl/legalcode | |
dc.subject | ArXiv preprint collection | en |
dc.subject | Wikipedia | en |
dc.subject | natural language processing | en |
dc.subject | tagging document collections | en |
dc.title | Towards robust tags for scientific publications from natural language processing tools and Wikipedia | en |
dc.type | info:eu-repo/semantics/article | pl_PL |
dc.contributor.organization | Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw | pl_PL |
dc.description.eperson | Michał Łopuszyński | |
dc.rights.DELETETHISFIELD | info:eu-repo/semantics/openAccess | |