Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

A benchmark dataset of herbarium specimen images with label data

Dillen, Mathias, Groom, Quentin, Chagnoux, Simon, Güntsch, Anton, Hardisty, Alex, Haston, Elspeth, Livermore, Laurence, Runnel, Veljo, Schulman, Leif, Willemse, Luc, Wu, Zengzhe and Phillips, Sarah 2019. A benchmark dataset of herbarium specimen images with label data. Biodiversity Data Journal 7, e31817. 10.3897/BDJ.7.e31817

PDF - Published Version
Available under License Creative Commons Attribution No Derivatives.


More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons. To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.

Item Type: Article
Date Type: Published Online
Status: Published
Schools: Computer Science & Informatics
Subjects: G Geography. Anthropology. Recreation > GE Environmental Sciences; Q Science > QA Mathematics > QA76 Computer software; Q Science > QH Natural history
Uncontrolled Keywords: biodiversity, informatics, test dataset, images, herbaria
Publisher: Pensoft Publishers
ISSN: 1314-2828
Funders: European Commission (GA No. 777483)
Date of First Compliant Deposit: 11 February 2019
Date of Acceptance: 4 February 2019
Last Modified: 11 Feb 2019 11:31
