MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10045/135117
Title: | MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages |
---|---|
Authors: | Bañón, Marta; Chichirău, Mălina; Esplà-Gomis, Miquel; Forcada, Mikel L.; Galiano Jiménez, Aarón; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Pla Sempere, Leopoldo; Ramírez Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; Zaragoza Bernabeu, Jaume |
Research group(s) or GITE: | Transducens |
Center, Department or Service: | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos |
Keywords: | MaCoCu; Machine translation; European languages; Under-resourced languages |
Publication date: | Jun-2023 |
Publisher: | European Association for Machine Translation (EAMT) |
Bibliographic citation: | Bañón, Marta, et al. “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”. In: Nurminen, Mary, et al. (Eds.). Proceedings of the 24th Annual Conference of the European Association for Machine Translation: 12 – 15 June 2023, Tampere, Finland. European Association for Machine Translation (EAMT), 2023. ISBN 978-952-03-2947-1, pp. 505-506 |
Abstract: | We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. Parallel and monolingual corpora have been produced for eleven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show its usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data. |
Sponsor(s): | This action has received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. |
URI: | http://hdl.handle.net/10045/135117 |
ISBN: | 978-952-03-2947-1 |
Language: | eng |
Type: | info:eu-repo/semantics/conferenceObject |
Rights: | © 2023 The authors. This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CCBY-ND. |
Peer reviewed: | yes |
Appears in collections: | INV - TRANSDUCENS - Comunicaciones a Congresos, Conferencias, etc. |
Files in this item:
File | Description | Size | Format |
---|---|---|---|
Banon_etal_Proceedings-EAMT-2023.pdf | | 214.84 kB | Adobe PDF |
This item is licensed under a Creative Commons License