Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair

Ljubešić, Nikola; Esplà-Gomis, Miquel; Toral, Antonio; Ortiz Rojas, Sergio; Klubička, Filip

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/135635

Item information
Title:	Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair
Author(s):	Ljubešić, Nikola | Esplà-Gomis, Miquel | Toral, Antonio | Ortiz Rojas, Sergio | Klubička, Filip
Research group(s) or GITE:	Transducens
Center, Department or Service:	Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords:	Crawling | Top-level domain | Monolingual corpus | Parallel corpus
Publication date:	May 2016
Publisher:	European Language Resources Association (ELRA)
Citation:	Ljubešić, Nikola, et al. “Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair”. In: Calzolari, Nicoletta, et al. (Eds.). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. ISBN 978-2-9517408-9-1, pp. 2949-2956
Abstract:	This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain .hr and the Slovene top-level domain .si, and extrinsically on the English–Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English–Croatian, English–Finnish, English–Serbian and English–Slovene language pairs.
Sponsor(s):	This research is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (AbuMaTran).
URI:	http://hdl.handle.net/10045/135635
ISBN:	978-2-9517408-9-1
Language:	eng
Type:	info:eu-repo/semantics/conferenceObject
Rights:	Creative Commons Attribution 4.0 International License.
Peer reviewed:	yes
Publisher's version:	https://aclanthology.org/L16-1
Appears in collections:	INV - TRANSDUCENS - Comunicaciones a Congresos, Conferencias, etc. Investigaciones financiadas por la UE

Files in this item:

File	Description	Size	Format
Ljubesic_etal_Producing-Monolingual-and-Parallel-Web-Corpora-at-the-Same-Time.pdf		190.41 kB	Adobe PDF

This item is licensed under a Creative Commons License