Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

Sánchez-Cartagena, Víctor M.; Esplà-Gomis, Miquel; Pérez-Ortiz, Juan Antonio; Sánchez-Martínez, Felipe

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/138759

Item information
Title:	Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation
Authors:	Sánchez-Cartagena, Víctor M. \| Esplà-Gomis, Miquel \| Pérez-Ortiz, Juan Antonio \| Sánchez-Martínez, Felipe
Research group(s) or GITE:	Transducens
Center, Department or Service:	Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos
Keywords:	Machine translation \| Low-resource languages \| Data augmentation \| Multi-task learning
Publication date:	17-Nov-2023
Publisher:	IEEE
Citation:	IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024, 46(2): 837-850. https://doi.org/10.1109/TPAMI.2023.3333949
Abstract:	When the amount of parallel sentences available to train a neural machine translation system is scarce, a common practice is to generate new synthetic training samples from them. A number of approaches have been proposed to produce synthetic parallel sentences that are similar to those in the parallel data available. These approaches work under the assumption that non-fluent target-side synthetic training samples can be harmful and may deteriorate translation performance. Even so, in this paper we demonstrate that synthetic training samples with non-fluent target sentences can improve translation performance if they are used in a multilingual machine translation framework as if they were sentences in another language. We conducted experiments on ten low-resource and four high-resource translation tasks and found that this simple approach consistently improves translation performance as compared to state-of-the-art methods for generating synthetic training samples similar to those found in corpora. Furthermore, this improvement is independent of the size of the original training corpus, and the resulting systems are much more robust against domain shift and produce fewer hallucinations.
Sponsors:	This paper is part of the R+D+i project PID2021-127999NB-I00 funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund "A way to make Europe". The computational resources used were funded by the European Regional Development Fund through project ID-IFEDER/2020/003.
URI:	http://hdl.handle.net/10045/138759
ISSN:	0162-8828 (Print) \| 1939-3539 (Online)
DOI:	10.1109/TPAMI.2023.3333949
Language:	eng
Type:	info:eu-repo/semantics/article
Rights:	© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Peer reviewed:	yes
Publisher's version:	https://doi.org/10.1109/TPAMI.2023.3333949
Appears in collections:	INV - TRANSDUCENS - Artículos de Revistas
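
The idea summarized in the abstract, using synthetic samples with non-fluent target sentences in a multilingual translation setup as if they were sentences in another language, can be illustrated with a minimal Python sketch. The token-reversal transformation, the tag strings `<orig>` and `<rev>`, and all function names below are assumptions chosen for illustration; this record does not specify which transformations or tags the authors actually use.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def reverse_target(pair: SentencePair) -> SentencePair:
    """Hypothetical transformation: reverse the target token order,
    which yields a deliberately non-fluent target sentence."""
    source, target = pair
    return source, " ".join(reversed(target.split()))


def tag_source(pair: SentencePair, tag: str) -> SentencePair:
    """Prepend a tag token to the source, the usual way a multilingual
    system is told which 'language' it should produce."""
    source, target = pair
    return f"{tag} {source}", target


def build_training_set(
    corpus: List[SentencePair],
    original_tag: str = "<orig>",   # assumed tag for genuine samples
    synthetic_tag: str = "<rev>",   # assumed tag for synthetic samples
) -> List[SentencePair]:
    """Mix genuine pairs with synthetic non-fluent pairs, each marked
    with its own tag so the model treats them as separate tasks."""
    samples: List[SentencePair] = []
    for pair in corpus:
        samples.append(tag_source(pair, original_tag))
        samples.append(tag_source(reverse_target(pair), synthetic_tag))
    return samples


if __name__ == "__main__":
    corpus = [("ein kleines Haus", "a small house")]
    for source, target in build_training_set(corpus):
        print(f"{source}\t{target}")
```

At translation time, only the tag assumed for genuine samples would be prepended to the input, so the system is asked for fluent output while still having been trained on the additional synthetic task.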

Files in this item:

File	Description	Size	Format
Sanchez-Cartagena_etal_2023_IEEE-TPAMI_accepted.pdf	Accepted Manuscript (open access)	3.14 MB	Adobe PDF
Sanchez-Cartagena_etal_2023_IEEE-TPAMI_final.pdf	Final version (restricted access)	5.26 MB	Adobe PDF
