Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

AGORA: An intelligent system for the anonymization, information extraction and automatic mapping of sensitive documents

Juez-Hernandez, Rodrigo, Quijano-Sánchez, Lara, Liberatore, Federico ORCID: https://orcid.org/0000-0001-9900-5108 and Gomez, Jesus 2023. AGORA: An intelligent system for the anonymization, information extraction and automatic mapping of sensitive documents. Applied Soft Computing 145, 110540. 10.1016/j.asoc.2023.110540

PDF - Published Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.


Abstract

Public institutions, such as law enforcement agencies or health centers, hold vast volumes of unstructured text documents, e.g. police reports. Currently, before this data can be shared (e.g. with research institutions), it must go through a lengthy and costly manual anonymization procedure. This paper addresses this issue by presenting AGORA, a cutting-edge tool that automatically identifies key entities and anonymizes sensitive data in text documents. AGORA has been developed in partnership with the Spanish National Office Against Hate Crimes and validated in the police and medical domains. The tool can export both anonymized texts and identified entities to structured files, thus simplifying their use in analysis. AGORA can also plot the location entities identified in the documents, and obtain and display relevant information about their geographical surroundings. It thereby simplifies the task of generating comprehensive datasets for subsequent data analysis or predictive tasks. Its main goal is to foster cooperation between public institutions and research centers by easing document sharing, and to serve as a foundation for the subsequent phases of a data science workflow. The paper conducts a comprehensive assessment of the literature on Named Entity Recognition (NER) methodologies and technologies, followed by extensive computational experiments to identify the best configuration for the NER models embedded in AGORA, which include both successful state-of-the-art model setups and newly proposed ones. Finally, the methodology, conclusions and software provided can be easily reused in similar application scenarios.

Item Type: Article
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
Publisher: Elsevier
ISSN: 1568-4946
Date of First Compliant Deposit: 21 June 2023
Date of Acceptance: 13 June 2023
Last Modified: 03 Aug 2023 07:47
URI: https://orca.cardiff.ac.uk/id/eprint/160480

Citation Data

Cited 4 times in Scopus.
