DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10045/67734
Title: | DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics |
---|---|
Author(s): | Moreno, Isabel | Boldrini, Ester | Moreda, Paloma | Romá-Ferri, María Teresa |
Research group(s) or GITE: | Procesamiento del Lenguaje y Sistemas de Información (GPLSI) |
Center, Department or Service: | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos | Universidad de Alicante. Departamento de Enfermería |
Keywords: | Corpus | Reliability | Precision | Named Entity Recognition | Spanish | Summary of Product Characteristics |
Knowledge area(s): | Computer Languages and Systems | Nursing |
Publication date: | Aug-2017 |
Publisher: | Elsevier |
Bibliographic citation: | Journal of Biomedical Informatics. 2017, 72: 8-22. doi:10.1016/j.jbi.2017.06.013 |
Abstract: | For the healthcare sector, it is critical to exploit the vast amount of textual health-related information. Nevertheless, healthcare providers have difficulties to benefit from such quantity of data during pharmacotherapeutic care. The problem is that such information is stored in different sources and their consultation time is limited. In this context, Natural Language Processing techniques can be applied to efficiently transform textual data into structured information so that it could be used in critical healthcare applications, being of help for physicians in their daily workload, such as: decision support systems, cohort identification, patient management, etc. Any development of these techniques requires annotated corpora. However, there is a lack of such resources in this domain and, in most cases, the few ones available concern English. This paper presents the definition and creation of DrugSemantics corpus, a collection of Summaries of Product Characteristics in Spanish. It was manually annotated with pharmacotherapeutic named entities, detailed in DrugSemantics annotation scheme. Annotators were a Registered Nurse (RN) and two students from the Degree in Nursing. The quality of DrugSemantics corpus has been assessed by measuring its annotation reliability (overall F = 79.33% [95%CI: 78.35–80.31]), as well as its annotation precision (overall P = 94.65% [95%CI: 94.11–95.19]). Besides, the gold-standard construction process is described in detail. In total, our corpus contains more than 2000 named entities, 780 sentences and 226,729 tokens. Last, a Named Entity Classification module trained on DrugSemantics is presented aiming at showing the quality of our corpus, as well as an example of how to use it. |
Sponsor(s): | This work was supported by the Spanish Government (Grants No. TIN2015-65100-R; TIN2015-65136-C02-2-R), the Generalitat Valenciana (Grant No. PROMETEOII/2014/001) and the BBVA Foundation Grant to scientific research teams (Análisis de Sentimientos Aplicado a la Prevención del Suicidio en las Redes Sociales). |
URI: | http://hdl.handle.net/10045/67734 |
ISSN: | 1532-0464 (Print) | 1532-0480 (Online) |
DOI: | 10.1016/j.jbi.2017.06.013 |
Language: | eng |
Type: | info:eu-repo/semantics/article |
Rights: | © 2017 Elsevier Inc. |
Peer reviewed: | yes |
Publisher's version: | http://dx.doi.org/10.1016/j.jbi.2017.06.013 |
Appears in collections: | INV - GPLSI - Artículos de Revistas |
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
2017_Moreno_etal_JBiomedInf_final.pdf | Final version (restricted access) | 537.61 kB | Adobe PDF | |
2017_Moreno_etal_JBiomedInf_accepted.pdf | Accepted Manuscript (open access) | 634.97 kB | Adobe PDF | |
All documents in RUA are protected by copyright. Some rights reserved.