Data selection for NMT using Infrequent n-gram Recovery

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/76087
Full metadata record
DC Field | Value | Language
dc.contributor.author | Parcheta, Zuzanna | -
dc.contributor.author | Sanchis-Trilles, Germán | -
dc.contributor.author | Casacuberta, Francisco | -
dc.date.accessioned | 2018-05-31T10:00:19Z | -
dc.date.available | 2018-05-31T10:00:19Z | -
dc.date.issued | 2018 | -
dc.identifier.citation | Parcheta, Zuzanna; Sanchis-Trilles, Germán; Casacuberta, Francisco. “Data selection for NMT using Infrequent n-gram Recovery”. In: Pérez-Ortiz, Juan Antonio, et al. (Eds.). Proceedings of the 21st Annual Conference of the European Association for Machine Translation: 28-30 May 2018, Universitat d'Alacant, Alacant, Spain, pp. 219-227 | es_ES
dc.identifier.isbn | 978-84-09-01901-4 | -
dc.identifier.uri | http://hdl.handle.net/10045/76087 | -
dc.description.abstract | Neural Machine Translation (NMT) has achieved promising results comparable with Phrase-Based Statistical Machine Translation (PBSMT). However, to train a neural translation engine, much more powerful machines are required than those required to develop translation engines based on PBSMT. One solution to reduce the training cost of NMT systems is the reduction of the training corpus through data selection (DS) techniques. There are many DS techniques applied in PBSMT which bring good results. In this work, we show that the data selection technique based on infrequent n-gram occurrence described in (Gascó et al., 2012) commonly used for PBSMT systems also works well for NMT systems. We focus our work on selecting data according to specific corpora using the previously mentioned technique. The specific-domain corpora used for our experiments are IT domain and medical domain. The DS technique significantly reduces the execution time required to train the model between 87% and 93%. Also, it improves translation quality by up to 2.8 BLEU points. The improvements are obtained with just a small fraction of the data that accounts for between 6% and 20% of the total data. | es_ES
dc.language | eng | es_ES
dc.publisher | European Association for Machine Translation | es_ES
dc.rights | © 2018 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND. | es_ES
dc.subject | Machine Translation | es_ES
dc.subject.other | Lenguajes y Sistemas Informáticos | es_ES
dc.title | Data selection for NMT using Infrequent n-gram Recovery | es_ES
dc.type | info:eu-repo/semantics/conferenceObject | es_ES
dc.peerreviewed | si | es_ES
dc.relation.publisherversion | http://eamt2018.dlsi.ua.es/proceedings-eamt2018.pdf | es_ES
dc.rights.accessRights | info:eu-repo/semantics/openAccess | es_ES
Appears in Collections: Congresos - EAMT2018 - Proceedings

Files in This Item:
File | Description | Size | Format
EAMT2018-Proceedings_24.pdf | | 1.56 MB | Adobe PDF


This item is licensed under a Creative Commons License.