Martin, German Hurtado, Schockaert, Steven ORCID: https://orcid.org/0000-0002-9256-2881, Cornelis, Chris and Naessens, Helga 2013. Using semi-structured data for assessing research paper similarity. Information Sciences 221, pp. 245-261. 10.1016/j.ins.2012.09.044
Abstract
The task of assessing the similarity of research papers is of interest in a variety of application contexts. It is a challenging task, however, as the full text of the papers is often not available, and similarity needs to be determined based on the papers’ abstracts and some additional features such as their authors, keywords, and the journals in which they were published. Our work explores several methods to exploit this information, first by using methods based on the vector space model and then by adapting language modeling techniques to this end. In the first case, in addition to a number of standard approaches, we experiment with a form of explicit semantic analysis. In the second case, the basic strategy we pursue is to augment the information contained in the abstract by interpolating the corresponding language model with language models for the authors, keywords and journal of the paper. This strategy is then extended by revealing the latent topic structure of the collection using an adaptation of Latent Dirichlet Allocation, in which the keywords that were provided by the authors are used to guide the process. Experimental analysis shows that a well-considered use of these techniques significantly improves the results of the standard vector space model approach.
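The interpolation strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the additive smoothing, the mixture weights (0.7 and 0.3), the use of KL divergence as the comparison measure, the toy papers, and all function names are assumptions chosen for illustration. Only the keyword model is mixed in here; author and journal models would be added to the mixture in the same way.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=0.01):
    """Additively smoothed unigram model over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(models, weights):
    """Linear mixture of language models; the weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    vocab = models[0].keys()
    return {w: sum(lam * m[w] for lam, m in zip(weights, models)) for w in vocab}


def kl_divergence(p, q):
    """KL(p || q); lower values mean q is closer to p."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


# Toy papers (hypothetical data): abstract tokens and keyword tokens.
papers = {
    "a": ("topic models for paper similarity".split(),
          "topic models similarity".split()),
    "b": ("language models for document similarity".split(),
          "language models similarity".split()),
    "c": ("protein folding with molecular dynamics".split(),
          "protein dynamics".split()),
}
vocab = {t for abstract, keywords in papers.values() for t in abstract + keywords}


def paper_model(abstract, keywords, weights=(0.7, 0.3)):
    """Abstract model interpolated with the keyword model (weights are illustrative)."""
    return interpolate(
        [unigram_lm(abstract, vocab), unigram_lm(keywords, vocab)],
        list(weights),
    )


model_a = paper_model(*papers["a"])
model_b = paper_model(*papers["b"])
model_c = paper_model(*papers["c"])

print("KL(a || b) =", kl_divergence(model_a, model_b))
print("KL(a || c) =", kl_divergence(model_a, model_c))
```

In this toy setting, paper `a` shares vocabulary with paper `b` but none with paper `c`, so the divergence from `a` to `b` comes out smaller than from `a` to `c`. In practice the mixture weights would need to be tuned on held-out data.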
Item Type: | Article |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Computer Science & Informatics |
ISSN: | 0020-0255 |
Last Modified: | 25 Oct 2022 09:42 |
URI: | https://orca.cardiff.ac.uk/id/eprint/59680 |
Citation Data
Cited 20 times in Scopus.