Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10045/135635
Title: | Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair |
---|---|
Author(s): | Ljubešić, Nikola | Esplà-Gomis, Miquel | Toral, Antonio | Ortiz Rojas, Sergio | Klubička, Filip |
Research group(s) or GITE: | Transducens |
Center, Department or Service: | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos |
Keywords: | Crawling | Top-level domain | Monolingual corpus | Parallel corpus |
Publication date: | May 2016 |
Publisher: | European Language Resources Association (ELRA) |
Bibliographic citation: | Ljubešić, Nikola, et al. “Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair”. In: Calzolari, Nicoletta, et al. (Eds.). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. ISBN 978-2-9517408-9-1, pp. 2949-2956 |
Abstract: | This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain .hr and the Slovene top-level domain .si, and extrinsically on the English–Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English–Croatian, English–Finnish, English–Serbian and English–Slovene language pairs. |
Sponsor(s): | This research is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (AbuMaTran). |
URI: | http://hdl.handle.net/10045/135635 |
ISBN: | 978-2-9517408-9-1 |
Language: | eng |
Type: | info:eu-repo/semantics/conferenceObject |
Rights: | Creative Commons Attribution 4.0 International License. |
Peer reviewed: | yes |
Publisher's version: | https://aclanthology.org/L16-1 |
Appears in collections: | INV - TRANSDUCENS - Comunicaciones a Congresos, Conferencias, etc. | Investigaciones financiadas por la UE |
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
Ljubesic_etal_Producing-Monolingual-and-Parallel-Web-Corpora-at-the-Same-Time.pdf | | 190.41 kB | Adobe PDF | |
This item is licensed under a Creative Commons License