Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Extracting discourse elements and annotating scientific documents using the SciAnnotDoc model: a use case in gender documents

De Ribaupierre, Hélène and Falquet, Gilles 2018. Extracting discourse elements and annotating scientific documents using the SciAnnotDoc model: a use case in gender documents. International Journal on Digital Libraries 19 (2-3), pp. 271-286. 10.1007/s00799-017-0227-5

PDF - Published Version


When scientists are searching for information, they generally have a precise objective in mind. Instead of looking for documents “about a topic T”, they try to answer specific questions such as finding the definition of a concept, finding results for a particular problem, checking whether an idea has already been tested, or comparing the scientific conclusions of two articles. Answering these precise or complex queries on a corpus of scientific documents requires precise modelling of the full content of the documents. In particular, each document element must be characterised by its discourse type (hypothesis, definition, result, method, etc.). In this paper we present a scientific document model (SciAnnotDoc ontology), developed from an empirical study conducted with scientists, that models the discourse types. We developed an automated process that analyses documents, effectively identifying the discourse type of each element. Using syntactic rules (patterns), we evaluated the process output in terms of precision and recall using a previously annotated corpus in Gender Studies. We chose to annotate documents in Humanities, as these documents are well known to be less formalised than those in “hard science”. The process output has been used to create a SciAnnotDoc representation of the corpus on top of which we built a faceted search interface. Experiments with users show that searches using this interface clearly outperform standard keyword searches for precise or complex queries.

Item Type: Article
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Publisher: Springer Verlag
ISSN: 1432-5012
Date of First Compliant Deposit: 9 August 2017
Date of Acceptance: 1 August 2017
Last Modified: 23 May 2023 19:13
