dc.contributor.author: Łopuszyński, Michał
dc.contributor.author: Bolikowski, Łukasz
dc.identifier.citation: International Journal of Digital Libraries, vol. 16, pp. 25-36 (2015)
dc.description.abstract: In this work, two simple methods of tagging scientific publications with labels reflecting their content are presented and compared. As a first source of labels, Wikipedia is employed. A second label set is constructed from the noun phrases occurring in the analyzed corpus. The corpus itself consists of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We present a comparison of both approaches, which shows that the discussed methods are to a large extent complementary. Moreover, the results give interesting insights into the completeness of Wikipedia knowledge in various scientific domains. As a next step, we examine the statistical properties of the obtained tags. It turns out that both methods show qualitatively similar rank–frequency dependence, which is best approximated by the stretched exponential curve. The number of distinct tags per document also follows the same distribution for both methods and is well described by the negative binomial distribution. The developed tags are meant for use as features in various text mining tasks. Therefore, as a final step, we show preliminary results on their application to topic modeling.
dc.rights: Creative Commons Attribution 3.0 Poland
dc.subject: ArXiv preprint collection
dc.subject: natural language processing
dc.subject: tagging document collections
dc.title: Towards robust tags for scientific publications from natural language processing tools and Wikipedia
dc.contributor.organization: Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Pawińskiego 5a, 02-106 Warsaw, Poland
dc.description.eperson: Michał Łopuszyński

Except where otherwise noted, this item's license is described as Creative Commons Attribution 3.0 Poland