Automatic extension of corpora from the intelligent ensembling of eHealth knowledge discovery systems outputs

Always use this identifier to cite or link to this item: http://hdl.handle.net/10045/113584
Item information
Title: Automatic extension of corpora from the intelligent ensembling of eHealth knowledge discovery systems outputs
Authors: Consuegra-Ayala, Juan Pablo | Gutiérrez, Yoan | Piad-Morffis, Alejandro | Almeida-Cruz, Yudivian | Palomar, Manuel
Research groups or GITE: Procesamiento del Lenguaje y Sistemas de Información (GPLSI)
Center, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords: Ensemble methods | Annotated corpora | Information extraction | Entity recognition | Relation extraction | Natural language processing
Knowledge areas: Lenguajes y Sistemas Informáticos
Publication date: April 2021
Publisher: Elsevier
Bibliographic citation: Journal of Biomedical Informatics. 2021, 116: 103716. https://doi.org/10.1016/j.jbi.2021.103716
Abstract: Corpora are one of the most valuable resources at present for building machine learning systems. However, building new corpora is an expensive task, which makes the automatic extension of corpora a highly attractive task to develop. Hence, finding new strategies that reduce the cost and effort involved in this task, while at the same time guaranteeing quality, remains an open and important challenge for the research community. In this paper, we present a set of ensembling strategies oriented toward entity and relation extraction tasks. The main goal is to combine several automatically annotated versions of corpora to produce a single version with improved quality. An ensembler is built by exploring a configuration space in search of the combination that maximizes the fitness of the ensembled collection according to a reference collection. The eHealth-KD 2019 challenge was chosen for the case study. The submitted systems’ outputs were ensembled, resulting in the construction of an automatically annotated collection of 8000 sentences. We show that using this collection as additional training input for a baseline algorithm has a positive impact on its performance. Additionally, the ensembling pipeline was used as a participant system in the 2020 edition of the challenge. The ensembled run achieved a slightly better performance than the individual runs.
Sponsors: This research has been partially funded by the University of Alicante and the University of Havana, the Generalitat Valenciana (Conselleria d’Educació, Investigació, Cultura i Esport) and the Spanish Government through the projects LIVING-LANG (RTI2018-094653-B-C22) and SIIA (PROMETEO/2018/089, PROMETEU/2018/089). Moreover, it has been backed by the work of both COST Actions: CA19134 - “Distributed Knowledge Graphs” and CA19142 - “Leading Platform for European Citizens, Industries, Academia and Policymakers in Media Accessibility”.
URI: http://hdl.handle.net/10045/113584
ISSN: 1532-0464 (Print) | 1532-0480 (Online)
DOI: 10.1016/j.jbi.2021.103716
Language: eng
Type: info:eu-repo/semantics/article
Rights: © 2021 Elsevier Inc.
Peer reviewed: yes
Publisher's version: https://doi.org/10.1016/j.jbi.2021.103716
Appears in collection: INV - GPLSI - Artículos de Revistas

Files in this item:
File | Description | Size | Format
Consuegra-Ayala_etal_2021_JBiomedInformatics_final.pdf | Final version (restricted access) | 1.63 MB | Adobe PDF
Consuegra-Ayala_etal_2021_JBiomedInformatics_preprint.pdf | Preprint (open access) | 3.84 MB | Adobe PDF


All documents deposited in RUA are protected by copyright. Some rights reserved.