Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/135635
Item information
Title: Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair
Author(s): Ljubešić, Nikola | Esplà-Gomis, Miquel | Toral, Antonio | Ortiz Rojas, Sergio | Klubička, Filip
Research group(s) or GITE: Transducens
Center, Department or Service: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords: Crawling | Top-level domain | Monolingual corpus | Parallel corpus
Publication date: May 2016
Publisher: European Language Resources Association (ELRA)
Bibliographic citation: Ljubešić, Nikola, et al. “Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair”. In: Calzolari, Nicoletta, et al. (Eds.). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. ISBN 978-2-9517408-9-1, pp. 2949-2956
Abstract: This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain .hr and the Slovene top-level domain .si, and extrinsically on the English–Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. Finally, we present parallel datasets collected with our approach for the English–Croatian, English–Finnish, English–Serbian and English–Slovene language pairs.
Sponsor(s): This research is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (AbuMaTran).
URI: http://hdl.handle.net/10045/135635
ISBN: 978-2-9517408-9-1
Language: eng
Type: info:eu-repo/semantics/conferenceObject
Rights: Creative Commons Attribution 4.0 International License.
Peer reviewed: yes
Publisher's version: https://aclanthology.org/L16-1
Appears in collections: INV - TRANSDUCENS - Comunicaciones a Congresos, Conferencias, etc.
Investigaciones financiadas por la UE

Files in this item:
File: Ljubesic_etal_Producing-Monolingual-and-Parallel-Web-Corpora-at-the-Same-Time.pdf | Size: 190.41 kB | Format: Adobe PDF


This item is licensed under a Creative Commons License.