Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Simple TICO-19: A dataset for joint translation and simplification of COVID-19 texts

Shardlow, Matthew and Alva Manchego, Fernando 2022. Simple TICO-19: A dataset for joint translation and simplification of COVID-19 texts. Presented at: LREC 2022: Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20-25 June 2022. Published in: Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan and Piperidis, Stelios eds. Proceedings of the Thirteenth Language Resources and Evaluation Conference. European Language Resources Association, 3093–3102.

Full text not available from this repository.

Abstract

Specialist high-quality information is typically first available in English, and it is written in language that may be difficult for most readers to understand. While Machine Translation technologies help mitigate the first issue, the translated content will most likely still contain complex language. In order to investigate and address both problems simultaneously, we introduce Simple TICO-19, a new language resource containing manual simplifications of the English and Spanish portions of the TICO-19 corpus for Machine Translation of COVID-19 literature. We provide an in-depth description of the annotation process, which entailed designing an annotation manual and employing four annotators (two native English speakers and two native Spanish speakers) who simplified over 6,000 sentences from these portions of the corpus. We report several statistics on the new dataset, focusing on the improvements in readability from the original texts to their simplified versions. In addition, we propose baseline methodologies for automatically generating the simplifications, translations, and joint translations and simplifications contained in our dataset.

Item Type: Conference or Workshop Item (Paper)
Status: Published
Schools: Computer Science & Informatics
Publisher: European Language Resources Association
Last Modified: 03 Oct 2023 15:00
URI: https://orca.cardiff.ac.uk/id/eprint/161902
