Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/62549
Full metadata record
DC Field | Value | Language
dc.contributor | Transducens | es_ES
dc.contributor.author | Forcada, Mikel L. | -
dc.contributor.author | Esplà-Gomis, Miquel | -
dc.contributor.author | Pérez-Ortiz, Juan Antonio | -
dc.contributor.other | Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos | es_ES
dc.date.accessioned | 2017-02-03T07:58:07Z | -
dc.date.available | 2017-02-03T07:58:07Z | -
dc.date.issued | 2016 | -
dc.identifier.citation | Baltic Journal of Modern Computing. 2016, 4(2): 152-164 | es_ES
dc.identifier.issn | 2255-8942 (Print) | -
dc.identifier.issn | 2255-8950 (Online) | -
dc.identifier.uri | http://hdl.handle.net/10045/62549 | -
dc.description.abstract | Sentence-aligned web-crawled parallel text or bitext is frequently used to train statistical machine translation systems. To that end, web-crawled sentence-aligned bitext sets are sometimes made publicly available and distributed by translation technologies practitioners. Contrary to what may be commonly believed, distribution of web-crawled text is far from being free from legal implications, and may sometimes actually violate the usage restrictions. As the distribution and availability of sentence-aligned bitext is key to the development of statistical machine translation systems, this paper proposes an alternative: instead of copying and distributing copies of web content in the form of sentence-aligned bitext, one could distribute a legally safer stand-off annotation of web content, that is, files that identify where the aligned sentences are, so that end users can use this annotation to privately recrawl the bitexts. The paper describes and discusses the legal and technical aspects of this proposal, and outlines an implementation. | es_ES
dc.description.sponsorship | Funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) is acknowledged. | es_ES
dc.language | eng | es_ES
dc.publisher | University of Latvia | es_ES
dc.rights | Creative Commons Attribution-ShareAlike 4.0 International license | es_ES
dc.subject | Bitext | es_ES
dc.subject | Parallel text | es_ES
dc.subject | Stand-off annotation | es_ES
dc.subject | Legal issues | es_ES
dc.subject | Statistical machine translation | es_ES
dc.subject.other | Lenguajes y Sistemas Informáticos | es_ES
dc.title | Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution | es_ES
dc.type | info:eu-repo/semantics/article | es_ES
dc.peerreviewed | si | es_ES
dc.relation.publisherversion | http://www.bjmc.lu.lv/ | es_ES
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es_ES
dc.relation.projectID | info:eu-repo/grantAgreement/EC/FP7/324414 | es_ES
Appears in collections: INV - TRANSDUCENS - Artículos de Revistas
Investigaciones financiadas por la UE

Files in this item:
File | Description | Size | Format
2016_Forcada_etal_BalticJModernComputing.pdf | - | 216.07 kB | Adobe PDF


This item is licensed under a Creative Commons License.