Semi-Automatic Dataset Annotation Applied to Automatic Violent Message Detection

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/140443
Item information
Title: Semi-Automatic Dataset Annotation Applied to Automatic Violent Message Detection
Authors: Botella-Gil, Beatriz | Sepúlveda-Torres, Robiert | Bonet-Jover, Alba | Martínez-Barco, Patricio | Saquete Boró, Estela
Research Group/s: Procesamiento del Lenguaje y Sistemas de Información (GPLSI)
Center, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords: Natural Language Processing | Violent Language | Hate Speech Detection | Assisted Annotation | Dataset Construction | Human-in-the-Loop | Active Learning
Issue Date: 1-Feb-2024
Publisher: IEEE
Citation: IEEE Access. 2024, 12: 19651-19664. https://doi.org/10.1109/ACCESS.2024.3361404
Abstract: Annotated corpora are indispensable tools to train computational models in Artificial Intelligence and Natural Language Processing. However, manual annotation is a costly, arduous, and time-consuming task, especially when the annotation is semantically complex. To address the problem, this work applies a methodology for semi-automatic annotation of datasets based on the Human-in-the-Loop paradigm. The methodology supports building a resource that benefits from fine-grained annotation, to aid in the detection of violent messages in Spanish sourced from social media (Twitter/X). After implementing the proposed methodology for semi-automatic violence annotation, a high-quality resource was obtained (hereafter referred to as VILLANOS). The methodology consists of annotating the dataset incrementally, which delivers an increase in annotator efficiency, thereby validating the suitability of the proposal. Annotation time was reduced by 52% compared to manual annotation, and a model trained on the VILLANOS dataset obtains an F1 of 85.2%. These results demonstrate the efficiency and effectiveness of the methodology, evidencing its validity.
Sponsor: This research work is funded by MCIN/AEI/10.13039/501100011033 and, as appropriate, by “ERDF A way of making Europe”, by the “European Union” or by the “European Union NextGenerationEU/PRTR” through the project TRIVIAL: Technological Resources for Intelligent VIral AnaLysis through NLP (PID2021-122263OB-C22) and the project SOCIALTRUST: Assessing trustworthiness in digital media (PDC2022-133146-C22). Also, it is funded by Generalitat Valenciana through the project NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation (CIPROM/2021/21). Finally, this research work was conducted as part of the ClearText project (TED2021-130707B-I00), funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR.
URI: http://hdl.handle.net/10045/140443
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3361404
Language: eng
Type: info:eu-repo/semantics/article
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
Peer Review: yes
Publisher version: https://doi.org/10.1109/ACCESS.2024.3361404
Appears in Collections: INV - GPLSI - Artículos de Revistas

Files in This Item:
File: Botella-Gil_etal_2024_IEEEAccess.pdf
Size: 6.69 MB
Format: Adobe PDF


Items in RUA are protected by copyright, with all rights reserved, unless otherwise indicated.