Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Developing computational infrastructure for the CorCenCC corpus - the National Corpus of Contemporary Welsh

Knight, Dawn ORCID: https://orcid.org/0000-0002-4745-6502, Loizides, Fernando ORCID: https://orcid.org/0000-0003-0531-6760, Neale, Steven, Anthony, Laurence and Spasic, Irena ORCID: https://orcid.org/0000-0002-8132-3885 2021. Developing computational infrastructure for the CorCenCC corpus - the National Corpus of Contemporary Welsh. Language Resources and Evaluation 55, pp. 789-816. doi:10.1007/s10579-020-09501-9

PDF - Published Version
Available under License Creative Commons Attribution.

Abstract

CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes - National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to be reflective of language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC and the subsequent uses of the corpus, which include lexicography, pedagogical research and corpus analysis. A grass-roots approach to design has been adopted, adapting and extending previous corpus-building work and introducing new features as required for this specific context and language. The key pillars of the infrastructure include a framework that supports metadata collection, an innovative mobile application designed to collect spoken data (utilising a crowdsourcing approach), a backend database that stores curated data and a web-based interface that allows users to query the data online. A usability study was conducted to evaluate the user-facing tools and to suggest directions for future improvements. Though the infrastructure was developed for Welsh language collection, its design can be re-used to support corpus development in other minority or major language contexts, broadening the potential utility and impact of this work.

Item Type: Article
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics; English, Communication and Philosophy
Additional Information: This article is licensed under a Creative Commons Attribution 4.0 International License
Publisher: Springer Verlag (Germany)
ISSN: 1574-020X
Date of First Compliant Deposit: 21 July 2020
Date of Acceptance: 20 July 2020
Last Modified: 28 May 2023 09:39
URI: https://orca.cardiff.ac.uk/id/eprint/133636

Citation Data

Cited 5 times in Scopus.
