De Ribaupierre, Hélène and Falquet, Gilles 2018. Extracting discourse elements and annotating scientific documents using the SciAnnotDoc model: a use case in gender documents. International Journal on Digital Libraries 19 (2-3), pp. 271-286. 10.1007/s00799-017-0227-5
PDF - Published Version (1MB)
Abstract
When scientists are searching for information, they generally have a precise objective in mind. Instead of looking for documents “about a topic T”, they try to answer specific questions such as finding the definition of a concept, finding results for a particular problem, checking whether an idea has already been tested, or comparing the scientific conclusions of two articles. Answering these precise or complex queries on a corpus of scientific documents requires precise modelling of the full content of the documents. In particular, each document element must be characterised by its discourse type (hypothesis, definition, result, method, etc.). In this paper we present a scientific document model (SciAnnotDoc ontology), developed from an empirical study conducted with scientists, that models the discourse types. We developed an automated process that analyses documents, effectively identifying the discourse type of each element. Using syntactic rules (patterns), we evaluated the process output in terms of precision and recall using a previously annotated corpus in Gender Studies. We chose to annotate documents in Humanities, as these documents are well known to be less formalised than those in “hard science”. The process output has been used to create a SciAnnotDoc representation of the corpus on top of which we built a faceted search interface. Experiments with users show that searches using this interface clearly outperform standard keyword searches for precise or complex queries.
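As a rough illustration of the evaluation described in the abstract, the sketch below tags sentences with discourse types using a few surface patterns and then computes precision and recall against hand-assigned labels. It assumes the syntactic rules can be approximated by regular expressions over sentences. The patterns, sentences, gold labels, and function names are all invented for this example and are not the rules, corpus, or tooling used in the paper.

```python
import re

# Hypothetical patterns, invented for illustration only; the paper's
# actual syntactic rules are not reproduced here.
PATTERNS = {
    "definition": re.compile(r"\b(is defined as|refers to)\b", re.I),
    "hypothesis": re.compile(r"\b(we hypothesi[sz]e|we expect that)\b", re.I),
    "result": re.compile(r"\b(results? (show|indicate|suggest)|we found that)\b", re.I),
}


def classify(sentence):
    """Return the set of discourse types whose pattern matches the sentence."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(sentence)}


def precision_recall(predicted, gold, label):
    """Compute precision and recall for one discourse type.

    `predicted` and `gold` are parallel lists of label sets, one per sentence.
    """
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if label in p and label in g)
    fp = sum(1 for p, g in pairs if label in p and label not in g)
    fn = sum(1 for p, g in pairs if label not in p and label in g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy corpus with hand-assigned (hypothetical) gold labels.
sentences = [
    "Gender is defined as a social construct.",
    "We hypothesise that pay gaps widen with age.",
    "The results show a persistent gap.",
    "Previous work found that roles differ.",  # a result the patterns miss
]
gold = [{"definition"}, {"hypothesis"}, {"result"}, {"result"}]

predicted = [classify(s) for s in sentences]
print(precision_recall(predicted, gold, "result"))  # (1.0, 0.5)
```

In this toy run the "result" patterns make no wrong predictions but miss one of the two gold results, which is the kind of trade-off that precision and recall are meant to expose.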
Item Type: | Article |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Computer Science & Informatics |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Publisher: | Springer Verlag |
ISSN: | 1432-5012 |
Date of First Compliant Deposit: | 9 August 2017 |
Date of Acceptance: | 1 August 2017 |
Last Modified: | 23 May 2023 19:13 |
URI: | https://orca.cardiff.ac.uk/id/eprint/103443 |
Citation Data
Cited 11 times in Scopus.