MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/135117
Item information
Title: MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Author(s): Bañón, Marta | Chichirău, Mălina | Esplà-Gomis, Miquel | Forcada, Mikel L. | Galiano Jiménez, Aarón | Kuzman, Taja | Ljubešić, Nikola | van Noord, Rik | Pla Sempere, Leopoldo | Ramírez Sánchez, Gema | Rupnik, Peter | Suchomel, Vít | Toral, Antonio | Zaragoza Bernabeu, Jaume
Research group(s) or GITE: Transducens
Centre, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords: MaCoCu | Machine translation | European languages | Under-resourced languages
Publication date: Jun-2023
Publisher: European Association for Machine Translation (EAMT)
Bibliographic citation: Bañón, Marta, et al. “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”. In: Nurminen, Mary, et al. (Eds.). Proceedings of the 24th Annual Conference of the European Association for Machine Translation: 12 – 15 June 2023, Tampere, Finland. European Association for Machine Translation (EAMT), 2023. ISBN 978-952-03-2947-1, pp. 505-506
Abstract: We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. Parallel and monolingual corpora have been produced for eleven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show its usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.
Sponsor(s): This action has received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341.
URI: http://hdl.handle.net/10045/135117
ISBN: 978-952-03-2947-1
Language: eng
Type: info:eu-repo/semantics/conferenceObject
Rights: © 2023 The authors. This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.
Peer reviewed: yes
Appears in collections: INV - TRANSDUCENS - Comunicaciones a Congresos, Conferencias, etc.

Files in this item:
File: Banon_etal_Proceedings-EAMT-2023.pdf | Size: 214.84 kB | Format: Adobe PDF


This item is licensed under a Creative Commons License.