Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation

This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation, for which there is incomplete evidence in the literature. The study covers eight language pairs, different training corpus sizes, two architectures, and three types of annotation: dummy tags (with no linguistic information at all), part-of-speech tags, and morpho-syntactic description tags, which consist of part of speech and morphological features. These linguistic annotations are interleaved in the input or output streams as a single tag placed before each word. In order to measure the performance under each scenario, we use automatic evaluation metrics and perform automatic error classification. Our experiments show that, in general, source-language annotations are helpful and morpho-syntactic descriptions outperform part of speech for some language pairs. On the contrary, when words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics, even though the use of morpho-syntactic description tags improves the grammaticality of the output. We provide a detailed analysis of the reasons behind this result.

allows us to train the network to produce TL lemmas instead of surface forms (words as they appear in running texts). This strategy can reduce data sparseness but requires the use of an external morphological generator as a post-processing step (Tamchyna et al., 2017).
Despite the body of work published on this topic, no strategy has clearly emerged as the most appropriate method for integrating linguistic annotations into NMT. The literature mainly contains incomplete evidence. For instance, Yang et al. (2019) conclude that TL part-of-speech annotations boost translation quality with an ad-hoc architecture, but Wagner (2017) claims that TL morpho-syntactic description tags degrade translation quality when they are interleaved: it is not clear whether the difference between the two results is caused by the type of linguistic annotations or by the approach followed to integrate them. There are also contradictory results, such as those reported by Tamchyna et al. (2017), who claim that TL annotations are only useful when they are combined with lemmatisation, and Nadejde et al. (2017), who report positive results without lemmatisation. In addition, the influence of factors such as the size of the available training parallel corpus and the language typology has not been properly evaluated.
In this paper, we aim to clarify how linguistic annotations help NMT by carrying out systematic experiments with eight language pairs. We focus on an under-resourced scenario where linguistic annotations are likely to provide information that cannot be inferred from scarce training data. We analyse multiple factors, namely, language typology, the side that is annotated with linguistic information (SL, TL or both), the architecture of the NMT system, the training corpus size, and the type of information encoded in the tags. For the latter factor, we focus only on part-of-speech tags and morpho-syntactic description tags, since other types of annotation, such as CCG supertags (Steedman, 2000), are unlikely to be available for under-resourced languages. We train systems on linguistically annotated surface forms via interleaving, which does not require modifications to the neural network and allows us to easily compare different types of linguistic annotations and NMT architectures. In addition, an automatic error classification allows us to compare them qualitatively. A qualitative analysis has previously been performed only by Nadejde et al. (2017), but it covered only two language pairs with English as TL, and a single type of annotation.
The rest of the paper is organised as follows. The next section gives an overview of the process of interleaving linguistic annotations in the training data. Section 3 then describes the experimental settings, whereas Section 4 reports and discusses the results obtained. Section 5 presents automatic error classification results for all the systems evaluated, while Section 6 studies the reasons behind the poor performance of systems with TL morpho-syntactic description tags. The paper ends with some concluding remarks.

Interleaving in neural machine translation
The interleaving approach for integrating linguistic annotations into NMT (Nadejde et al., 2017) annotates each word with a single tag which is interleaved in the sentence before the word, i.e. introduced in the sentence as if it were another word. In our experiments, tags can represent either the part of speech (POS) of the word or its morpho-syntactic description (MSD). As corpora are pre-processed with BPE (Sennrich et al., 2016b), the tag is introduced just once, before the first sub-word unit. To study whether the effect of using tags is due to the fact that input and output sequences get longer and word boundaries are explicitly marked, rather than to the information provided by the tags, we also experimented with a dummy tag (DUM) that conveys no linguistic information at all and is the same for every word (Wagner, 2017). Interleaved TL tags are removed from the final translation generated by the system before computing the automatic evaluation metrics.
The example below shows the result of interleaving MSD tags in the English sentence It has happened before. The sentence contains a pronoun (PRON) followed by an auxiliary verb (AUX), a main verb (VERB), an adverb (ADV) and a punctuation mark (PUNCT). The analysis of the pronoun tells us that it is personal, nominative, neuter, singular, and 3rd-person. The symbol @@ acts as a sub-word unit separator.
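As a rough illustration of the interleaving scheme, the Python sketch below places one tag before the first sub-word unit of each word and removes the tags again from a system output. It is not the authors' pipeline: the function names, the toy BPE segmentation and the use of bare POS labels as tags are assumptions made for this example; only the @@ separator and the one-tag-per-word convention come from the text. An MSD tag would be handled in the same way, as a single token that combines the part of speech with its morphological features.

```python
def interleave(words, tags, segment):
    """Place one tag before the first sub-word unit of each word.

    words:   list of surface forms
    tags:    list of tags (POS, MSD or a dummy tag), one per word
    segment: hypothetical callable mapping a word to its BPE sub-word units
    """
    if len(words) != len(tags):
        raise ValueError("need exactly one tag per word")
    out = []
    for word, tag in zip(words, tags):
        out.append(tag)            # the tag is emitted once per word...
        out.extend(segment(word))  # ...followed by all sub-word units
    return out


def strip_tags(tokens, tagset):
    """Remove interleaved tags from an output sequence.

    Assumes (for illustration) that no real word coincides with a tag.
    """
    return [tok for tok in tokens if tok not in tagset]


# Toy segmentation: only "happened" is split into two sub-word units.
toy_bpe = {"happened": ["happ@@", "ened"]}
segment = lambda w: toy_bpe.get(w, [w])

words = ["It", "has", "happened", "before", "."]
pos = ["PRON", "AUX", "VERB", "ADV", "PUNCT"]

tagged = interleave(words, pos, segment)
untagged = strip_tags(tagged, set(pos))
# Undo the BPE segmentation by joining units marked with "@@".
restored = " ".join(untagged).replace("@@ ", "")
```

With these toy inputs, the tag VERB appears once before happ@@ and is not repeated before ened, which mirrors the behaviour described for BPE-segmented corpora.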

Experimental settings
We conducted experiments for the translation of English text into four languages, and vice versa. These languages, Czech (cs), German (de), Spanish (es) and Turkish (tr), belong to different language families and differ at the syntactic and morphological levels. German, Czech and Spanish are Indo-European languages: they are, respectively, Germanic, Slavic and Romance. Both German and Czech have declension and SVO sentence structure, except for subordinate clauses in German, which are SOV. Spanish has no declension and its sentence structure is SVO. Turkish is an agglutinative Turkic language with declension and SOV sentence structure. The morphological differences between these languages are reflected in the sparsity of the MSD tags: the number of unique tags in the interleaved training corpora ranges from a few hundred for English and Spanish to a few thousand for Czech, German and Turkish. 1
We simulated an under-resourced scenario by downsampling available parallel corpora for the selected language pairs. Downsampling has important advantages over using truly under-resourced language pairs: (i) we can choose languages from different families and evaluate them on standard, high-quality test sets; (ii) we can confirm whether conclusions hold for richer-resource scenarios by training on larger datasets for the same language pairs; and (iii) linguistic annotations can be obtained with the same state-of-the-art morphological analyser, minimising the potential distortions introduced by differences in morphological analyser technology and performance between the languages. The POS and MSD tags were obtained by means of the StanfordNLP tagger (Qi et al., 2018). In any case, the approaches described in this paper could be applied to truly under-resourced language pairs, as transfer learning makes it possible to obtain morphological analysers even from scarce morphologically annotated data (Kondratyuk, 2019).
Corpora. The training, development and test sets used belong to the news domain. For training, we used texts from the News Commentary v14 corpus, 2 except for Turkish, for which we used texts from the SETimes corpus (Tyers and Alperen, 2010). For development and testing we used evaluation sets from the WMT 2019 Conference on Machine Translation, each of which contains around 3,000 parallel sentences. 3 To see if the conclusions drawn on the under-resourced settings hold in a richer-resourced scenario, we trained English-German systems (in both directions) on the concatenation of the parallel data made available for the WMT 2017 shared task on news translation 4 plus the synthetic parallel data obtained through back-translation released by Sennrich et al. (2016a).
All corpora were tokenised and truecased with the Moses scripts 5 and parallel sentences longer than 100 words on either side were discarded. Table 1 provides information about the training corpora after their pre-processing. We trained translation models on these corpora and on random subsets of them containing 50k parallel sentences (except for the WMT training data). The token counts shown in Table 1 for the under-resourced scenario are similar to those listed in the OPUS collection (Tiedemann, 2012) for under-resourced language pairs such as English-Kurdish or English-Igbo; token counts for the 50k subsets match other pairs with even smaller resources available in OPUS, such as English-Kazakh.
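The length filtering and downsampling steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the scripts actually used: the function name, the whitespace-based word count (applied to already tokenised text) and the fixed random seed are all hypothetical choices.

```python
import random


def filter_and_sample(pairs, max_len=100, sample_size=50_000, seed=0):
    """Drop over-long sentence pairs, then draw a random subset.

    pairs: list of (source, target) strings, assumed to be tokenised
           so that splitting on whitespace yields words.
    """
    # Discard a pair if either side is longer than max_len words.
    kept = [
        (src, tgt)
        for src, tgt in pairs
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len
    ]
    if len(kept) <= sample_size:
        return kept
    # Sampling without replacement; the seed only makes the sketch repeatable.
    return random.Random(seed).sample(kept, sample_size)
```

Calling `filter_and_sample(corpus)` with the default arguments would yield a 50k-sentence subset comparable in spirit to the smaller training condition described above.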
Translation models. We tested the performance of the recurrent-neural-network encoder-decoder with attention (hereafter, recurrent; Bahdanau et al., 2015) and the Transformer (Vaswani et al., 2017) architectures when the different types of tags introduced in Section 2 are interleaved in the SL input sequence, in the TL output sequence, and in both of them. For each architecture, we also trained a baseline using no tags at all. To keep the experiments to a manageable size, the systems that included tags in both languages were not trained on the nine possible combinations of tag types (three in the SL and three in the TL). Instead, they were trained only on SL MSD and TL POS tags, which were those with the best general performance when used in isolation in the SL and in the TL, respectively. For the same reason, we only explored the SL MSD/TL POS tag combination for systems trained on large-scale WMT data. In order to determine the appropriate values for training hyper-parameters, a grid search over the number of BPE operations and the neural network sizes was carried out. The optimum hyper-parameter values for each language pair, training corpus size and architecture were obtained after training the baseline systems. These hyper-parameters were also used with the systems integrating linguistic annotations. Appendix A provides a detailed description of the training process.
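To make the size of the experimental grid concrete, the sketch below enumerates the tag configurations described above for the under-resourced setting: one baseline, three SL-only systems, three TL-only systems and the single SL MSD/TL POS combination. The data layout and names are illustrative assumptions, not part of the original set-up.

```python
TAG_TYPES = ("DUM", "POS", "MSD")
ARCHITECTURES = ("recurrent", "transformer")


def tag_configurations():
    """Return the tag set-ups trained for each architecture."""
    configs = [{"sl": None, "tl": None}]                    # baseline, no tags
    configs += [{"sl": t, "tl": None} for t in TAG_TYPES]   # SL tags only
    configs += [{"sl": None, "tl": t} for t in TAG_TYPES]   # TL tags only
    configs.append({"sl": "MSD", "tl": "POS"})              # both sides
    return configs


# One system per architecture and tag configuration,
# for a given language pair and training corpus size.
systems = [
    {"architecture": arch, **cfg}
    for arch in ARCHITECTURES
    for cfg in tag_configurations()
]
```

Under this reading, each language pair and corpus size involves eight systems per architecture instead of the sixteen that a full SL-by-TL grid plus baseline would require.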
Error classification. We followed the automatic error analysis strategy by Toral and Sánchez-Cartagena (2017), who used the tool Hjerson (Popović, 2011) to classify word errors into five categories: inflection, reordering, missing words, extra words and incorrect lexical choices. As it is difficult to automatically distinguish between the latter three categories (Popović and Ney, 2011), we grouped them into a single category named lexical errors. Hjerson works on the surface form and lemma of the words in the reference translations and MT outputs. The lemmas were also obtained with the StanfordNLP tagger.
Table 2 shows the BLEU (Papineni et al., 2002) scores obtained by the different systems. A score in bold means that the system outperforms the baseline (labelled as None) by a statistically significant margin. A bullet (•) next to the score of a system with interleaved POS or MSD tags means that it outperforms the system with DUM tags in the same language side (SL or TL) by a statistically significant margin. 6 A dagger (†) next to the score of a system with POS or MSD tags means that it outperforms the system with the opposite tag (either MSD or POS) in the same language side by a statistically significant margin. Statistical significance was assessed with paired bootstrap resampling (Koehn, 2004) (p = 0.05; 1,000 iterations).
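For readers unfamiliar with paired bootstrap resampling, the following sketch shows the general idea: test sets of the original size are drawn with replacement, both systems are scored on each resampled set, and the proportion of samples on which one system wins is recorded. The function name and the generic `metric` argument are assumptions; in practice the metric would be a corpus-level score such as BLEU computed by an external tool.

```python
import random


def paired_bootstrap(hyps_a, hyps_b, refs, metric, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    metric: callable taking (hypotheses, references) and returning a
            corpus-level score where higher is better (assumed interface).
    """
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; the same indices are
        # used for both systems, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins += 1
    return wins / n_samples


def exact_match(hyps, refs):
    """Toy stand-in metric: proportion of hypotheses identical to the reference."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)
```

With p = 0.05, system A would be considered significantly better than system B when the returned fraction is at least 0.95.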

Results and discussion
As the four languages paired with English are morphologically richer than English, we split the analysis of the results we describe next into two groups: translation into a TL morphologically richer than the SL (pairs with English as SL), and translation from a morphologically richer SL (pairs with English as TL). It is also worth mentioning that, in all the scenarios evaluated, when a system was trained with interleaved TL tags, the decoder alternately produced TL tags and surface forms at test time as expected.
Translation into a morphologically rich language. When the TL is morphologically richer than the SL, interleaved tags lead to higher BLEU scores, although the impact changes depending on the information encoded in the tag and the language where they are used (SL or TL). SL DUM tags are not very effective: they bring a statistically significant increase in BLEU only to 2 out of the 8 systems evaluated with the recurrent architecture, 7 and to none of the 8 Transformer systems. SL POS and MSD tags generally outperform DUM tags, as they contain information that helps to obtain a better representation of the SL sentence and break the grammatical ambiguity of English (Sennrich and Haddow, 2016). There is a statistically significant difference between SL POS and SL MSD tags in 6 out of the 16 systems evaluated, 8 and in 5 out of these 6 systems MSD tags outperform POS tags. For some language pairs and training corpus sizes, enriching the SL representation with information about number, verbal mood, etc. proves to be useful. We can find stronger differences between the different types of tags in the TL. TL DUM tags are useful for the recurrent English-German systems, in line with the findings by Wagner (2017), but their contribution to other language pairs and the Transformer architecture is less clear. The most relevant trend is that using only POS tags in the TL consistently outperforms the use of MSD tags: statistically significant differences are found in all TLs but Spanish, a Romance language which has the simplest morphology. This result is further investigated in Sections 5 and 6. Finally, combining SL MSD and TL POS tags leads to the highest scores.
6 Those systems trained with both SL MSD and TL POS tags could not be compared with systems with both SL DUM and TL DUM tags because the latter were not included in the experimental set-up in order to keep the experiments to a manageable size. Hence, their scores do not contain any bullet.
7 Four language pairs and two training corpus sizes.
8 Four language pairs, two training corpus sizes and two architectures.
Translation from a morphologically rich language. The effects of using SL tags when the SL is morphologically richer than the TL are similar to those observed in the opposite scenario: POS and MSD tags often outperform DUM tags. When statistically significant differences between POS and MSD tags are found, they favour MSD tags. Concerning TL tags, the systematic degradation observed for MSD tags is less frequent than in the opposite direction, and it is mainly concentrated in the smallest corpus size. An explanation could be that morphological information in English is less complex and easier to predict from the SL sentence. Finally, combining SL MSD tags and TL POS tags also leads to the highest scores.
Large-scale training data. The results for the English-German WMT large-scale training data, also depicted in Table 2, show a different picture. We can still observe that the use of interleaved tags brings a statistically significant improvement, but this only happens in the recurrent architecture. Transformer systems do not benefit from the interleaved linguistic annotations when the training corpus size is large. 9 A potential explanation is provided in the next section.
Figure 1: For language pairs with English as SL, relative changes in the number of errors for each error category, training corpus size and type of interleaved tag.
Main findings. In line with previous works (Nadejde et al., 2017; Wagner, 2017), the results analysed so far suggest that interleaved linguistic annotations are helpful both in the SL and the TL and they should be included in both languages in order to maximize performance. While morphological features can be useful in the SL, they should be avoided in the TL if it is morphologically rich. Even when large corpora are available, linguistic annotations can help to boost translation quality.

Error analysis
To better understand the results obtained, we computed the relative difference in the number of Hjerson errors between the systems with interleaved tags and the baseline; 10 a positive value means that the system made more errors than the baseline. As we did before, we split the results into two groups of language pairs: those with English as SL, depicted in Figure 1, and those with English as TL, depicted in Figure 2. In the remainder of this section, we analyse the results obtained and illustrate them with examples.
Figure 2: For language pairs with English as TL, relative changes in the number of errors for each error category, training corpus size and type of interleaved tag.
SL tags. SL POS and SL MSD tags systematically reduce lexical errors (green, empty squares and triangles are below the horizontal line). Reordering errors are also reduced with the exception of English-Czech, in which the TL has a relatively flexible word order. 11 Concerning inflection errors, there is no clear trend. As mentioned above, a possible explanation could be that SL tags help to obtain more accurate representations of the SL sentences; since inflection errors are related to modelling TL grammar rather than to representing the SL sentence, they are not reduced by interleaving SL tags. All these results are compatible with the evaluation metrics, which showed that SL tags generally improve translation quality. In the first example in Table 3, SL tags help to obtain a better representation of the SL sentence: the system is able to interpret that matters is acting as a noun and produces hace que las cosas sean simples (en: it makes things simple) instead of hace que los alemanes sean simples (en: it makes Germans simple).
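The relative difference in the number of errors used throughout this section can be read as follows. This is only one natural interpretation, since the exact definition is not reproduced here; the function name and the error counts in the usage example are made up for illustration.

```python
def relative_error_change(system_errors, baseline_errors):
    """Relative change in error counts per category with respect to the baseline.

    Both arguments map an error category to a number of errors.
    A positive value means the system made more errors than the baseline.
    """
    changes = {}
    for category, base in baseline_errors.items():
        if base == 0:
            raise ValueError(f"baseline has no {category} errors")
        changes[category] = (system_errors[category] - base) / base
    return changes


# Invented counts, purely to show the sign convention.
baseline = {"inflection": 200, "reordering": 100, "lexical": 1000}
tagged_system = {"inflection": 150, "reordering": 110, "lexical": 1000}
changes = relative_error_change(tagged_system, baseline)
```

In this toy case the tagged system makes a quarter fewer inflection errors and a tenth more reordering errors than the baseline, while lexical errors are unchanged.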
TL tags. Different error distributions can be observed depending on the information encoded in the TL tags. TL MSD tags systematically reduce inflection errors in both architectures (the blue, filled triangle is usually among the lowest points in the figure). The largest inflection error reductions occur with highly inflected TLs such as Czech and Turkish. TL POS tags, on the contrary, do not systematically reduce inflection errors. This suggests that the system with TL MSD tags uses the morphological features they encode (tense, number, etc.) to produce the correct inflected form according to the reference. In the second example in Table 3, the system using TL MSD tags generates the right inflected form of the German word literarische because it has predicted its dative case first.
However, the prediction of MSD tags with complex morphology (see Figure 1) also leads to an increase in lexical errors in comparison with the prediction of POS tags. It can be observed that TL MSD tags bring an increase in lexical errors over the baseline (note the green, filled triangles at the top of the figures), while the impact of the POS tags on lexical errors is less clear. As with inflection errors, the difference between the increases in lexical errors brought by MSD and POS tags is larger for Turkish and Czech, which are the two languages with the sparsest MSD tags. Turkish is a Turkic agglutinative language and Czech is a Slavic fusional language with seven cases and four genders. This is compatible with the automatic evaluation metrics: although using TL MSD tags leads to a more grammatical output, the increase in lexical errors makes the system produce translations that are overall less similar to the reference. Note that lexical errors are the most frequent ones. 12 The third example in Table 3 shows that the system with TL MSD tags translates the verb help as beitragen instead of helfen, the latter being a more precise translation in that context.
When SL MSD tags and TL POS tags are both interleaved, there is a general reduction in the three error categories as compared with the systems using only tags in one of the languages. This confirms that the advantages of SL and TL tags are complementary.
Differences between architectures. Finally, there is a noticeable difference in how the types of errors made by the systems change when interleaving TL tags in recurrent and Transformer architectures. Reordering errors consistently increase in Transformer systems, while they tend to decrease in recurrent systems. Moreover, TL DUM tags consistently increase the total number of translation errors when they are added to a Transformer system and the TL is highly inflected (observe the red, filled circle usually above the horizontal line in Figure 1), while their impact is not clear in recurrent systems. These two findings suggest that adding extra tokens to the TL stream is not the best way of introducing linguistic annotations in self-attention-based NMT systems. It could also explain the results for the large-scale WMT data, where only recurrent systems were able to take advantage of the linguistic annotations. This hypothesis is also compatible with the results reported on WMT data by Yang et al. (2019), who successfully leveraged TL linguistic annotations in Transformer systems using an ad-hoc architecture.

Analysing the effect of target language morphology
We compare the output of the systems interleaving TL POS and TL MSD tags in order to ascertain whether the increase in lexical errors is caused by the difficulty of predicting the more complex and sparse MSD tags, or by the conditioning of the prediction of surface forms on TL MSD tags. In the latter case, there is the risk that the system learns to strongly condition on tags and avoids generating new words (Tamchyna et al., 2017). This problem could be exacerbated by the sparsity of TL MSD tags because some of them may co-occur only with a few surface forms in the training corpus.
Figure panels: (a) English as SL; (b) English as TL.
We tried to answer this question by independently evaluating the prediction of tags and surface forms, and comparing the systems interleaving POS and MSD tags. The prediction of surface forms was evaluated by re-decoding the test set and forcing the system to choose the tags from the reference during beam search, whereas the prediction of tags was evaluated by forcing the surface forms from the reference. If the system really learned to strongly condition on tags, when a tag was observed together with only a few surface forms in the training corpus, low-frequency words would not be generated when translating the test set. To test this hypothesis we studied the surface form prediction accuracy for two subsets: infrequent words (frequency ≤ 10 in the training set) and out-of-vocabulary (OOV) words. Figure 3 shows the results for those language pairs with English as SL. It can be observed that there is a trade-off between tag and surface form prediction accuracy: MSD tags are more difficult to predict, but conditioning on them leads to better surface form prediction. On low-frequency and OOV words, MSD tags still outperform POS tags in terms of surface form prediction accuracy, although the difference between them is smaller.
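The split into infrequent and OOV words can be sketched as below. The sketch makes several assumptions that the text does not state: predictions and reference words are taken to be aligned position by position, OOV words are counted separately from infrequent ones (i.e. infrequent means a training frequency between 1 and 10), and all names are hypothetical.

```python
from collections import Counter


def accuracy_by_frequency(train_tokens, ref_tokens, pred_tokens, max_freq=10):
    """Surface form accuracy overall, on infrequent words and on OOV words."""
    counts = Counter(train_tokens)
    totals = {"all": 0, "infrequent": 0, "oov": 0}
    hits = {"all": 0, "infrequent": 0, "oov": 0}
    for ref, pred in zip(ref_tokens, pred_tokens):
        buckets = ["all"]
        freq = counts[ref]  # Counter returns 0 for unseen words
        if freq == 0:
            buckets.append("oov")
        elif freq <= max_freq:
            buckets.append("infrequent")
        for bucket in buckets:
            totals[bucket] += 1
            hits[bucket] += int(pred == ref)
    # Report None for empty subsets instead of dividing by zero.
    return {b: (hits[b] / totals[b] if totals[b] else None) for b in totals}
```

Comparing the three accuracies for a POS-tagged and an MSD-tagged system is the kind of contrast that the analysis above draws from Figure 3.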
For a fair comparison of both types of tags, we computed the part-of-speech prediction accuracy when predicting MSD tags. The results are depicted in the rows labelled as POS in Figure 4. For highly inflected TLs, those systems that predict MSD tags have consistently lower POS prediction accuracy than those that predict only POS tags. The difference is larger for recurrent systems. These results suggest that the difficulty of predicting together the part of speech and its morphological features is indeed one of the reasons behind the lexical degradation of systems using MSD tags. Sparseness of MSD tags seems to play an important role in this degradation: highly inflected TLs present the largest degradation.
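Computing part-of-speech accuracy for a system that predicts MSD tags amounts to comparing only the part-of-speech component of each tag. A minimal sketch follows, assuming that an MSD tag is a string beginning with the part of speech followed by features separated by a vertical bar; this tag layout and the function names are assumptions made for illustration.

```python
def pos_of(tag, sep="|"):
    """Return the part-of-speech component of a POS or MSD tag."""
    return tag.split(sep, 1)[0]


def tag_accuracy(pred_tags, ref_tags, pos_only=False):
    """Proportion of predicted tags matching the reference.

    With pos_only=True, morphological features are ignored and only
    the part of speech has to match.
    """
    if len(pred_tags) != len(ref_tags):
        raise ValueError("tag sequences must have the same length")
    if not ref_tags:
        raise ValueError("empty tag sequences")
    key = pos_of if pos_only else (lambda t: t)
    correct = sum(key(p) == key(r) for p, r in zip(pred_tags, ref_tags))
    return correct / len(ref_tags)
```

Calling `tag_accuracy(pred, ref, pos_only=True)` on the output of an MSD system gives a figure that can be compared directly with the accuracy of a system predicting POS tags only.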
To evaluate the impact of predicting morphological features regardless of the low part-of-speech accuracy of MSD tags, we re-computed surface form prediction accuracy by letting the beam search algorithm choose among those MSD tags with the part of speech in the reference. The results, depicted in the rows labelled as S.F. in Figure 4, show that, if the systems with interleaved MSD tags correctly predicted the part of speech, the surface form predictions would not be worse than those of systems with interleaved POS tags, either in general or for low-frequency and out-of-vocabulary words. Hence, errors in surface form prediction arising from strongly conditioning on sparse MSD tags do not seem to be the main cause behind the degradation of translation quality brought by TL MSD tags. Actually, when tags with the part of speech of the reference are chosen, conditioning on MSD tags outperforms conditioning on POS tags in terms of overall surface form prediction accuracy for language pairs with English as TL. For the other language pairs, the gain introduced by MSD tags is less clear. One possible reason could be that BPE segmentation does not allow the system to learn a general mapping between tags and word endings from the training data. Another explanation could be related to the fact that predicting the morphological gender for German, Czech and Spanish forces the tag prediction task to be aware of TL lexical information, preventing an optimum division of labour between tag and surface form predictions.
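The constraint applied during re-decoding can be pictured with the simplified sketch below, which picks the highest-scoring MSD tag among those whose part of speech matches the reference. It is a greedy, single-step stand-in for the constrained beam search described above, not a reimplementation of it; the tag layout (part of speech first, features separated by a vertical bar), the score dictionary and the function name are assumptions.

```python
def best_tag_with_reference_pos(tag_scores, ref_pos, sep="|"):
    """Choose the best-scoring tag whose part of speech equals ref_pos.

    tag_scores: dict mapping candidate tags to model scores
                (higher is better; hypothetical interface).
    Returns None when no candidate has the reference part of speech.
    """
    allowed = {
        tag: score
        for tag, score in tag_scores.items()
        if tag.split(sep, 1)[0] == ref_pos
    }
    if not allowed:
        return None
    return max(allowed, key=allowed.get)
```

In a real decoder the same filter would be applied to the candidate expansions of every hypothesis in the beam at each tag position, while surface forms would still be predicted freely.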
In conclusion, the prediction of TL morphological information needs to be factorised differently in order not to harm part-of-speech prediction. For instance, in the morphological analysis field, Chaudhary et al. (2019) and Straka et al. (2019) predict part of speech and each morphological attribute independently.

Concluding remarks
In this paper, we have studied the effects of using linguistic annotations of SL and TL words in under-resourced NMT by interleaving linguistic tags for different language pairs, architectures, training data sizes and types of linguistic information (part of speech and morpho-syntactic descriptions).
We have shown that both SL and TL linguistic annotations are useful, in line with previous works in the literature (Wagner, 2017). SL linguistic annotations lead to more accurate SL sentence representations, and for some language pairs, the use of morpho-syntactic descriptions (consisting of part of speech and morphological features) improves the representation obtained when only part-of-speech tags are used. Surprisingly, for highly inflected TLs, TL linguistic annotations are more useful if they simply consist of part-of-speech information. Using morpho-syntactic descriptions leads to an overall translation quality degradation in terms of automatic evaluation metrics, even though it improves the grammaticality of the output. We have also shown that predicting TL morpho-syntactic descriptions frequently results in wrong part-of-speech predictions. Hence, to optimize the use of TL morphological information in NMT, it is advisable to avoid the prediction of part-of-speech and morphological features together as monolithic tags.
The gain introduced by linguistic information encoded as interleaved tags scales to large data availability scenarios only for the recurrent architecture. This result, together with the conclusions of the automatic error analysis, suggests that adding extra tokens to the TL stream is not the optimum way of introducing additional linguistic information in self-attention-based NMT systems.
In summary, the use of morpho-syntactic descriptions in the SL and part-of-speech information in the TL, which can be easily obtained even in under-resourced scenarios, systematically improves translation quality when they are simply interleaved in training data as linguistic tags, even without using a morphological generator (Tamchyna et al., 2017), which could be error-prone for under-resourced languages, and without any kind of information about syntactic structures.

A Training details
The optimum training hyper-parameters were obtained by following the grid search process described next. At each step, we chose the hyper-parameters that maximised BLEU on the development set. Table 4 shows the optimum hyper-parameters for each language pair and training corpus size for the recurrent architecture, while Table 5 shows the same information for the Transformer architecture.
• First, we explored the optimum number of BPE operations among the following values: 5,000, 10,000, 20,000, and 40,000. The remaining hyper-parameters were set to the values recommended by  for the recurrent architecture and by Vaswani et al. (2017) for the Transformer architecture ("base" configuration), respectively.
• With the optimum number of BPE operations, we then tested if better results could be obtained with tied embeddings (Press and Wolf, 2017) in the decoder.
• Finally, we explored the following combinations of hyper-parameter values for each architecture with the best number of BPE operations and tied embedding configuration obtained in the previous steps.
For all systems trained, we applied label smoothing with a value of 0.1 and dropout of 0.1. Unlike Sennrich and Zhang (2019), we used neither a lexical model nor word dropout. The optimisation algorithm was Adam (Kingma and Ba, 2015) with the inverse square root learning rate decay (Vaswani et al., 2017, Sec. 5.3) and 8,000 warm-up iterations. Learning rates were initialised to 0.0004 for recurrent and to 0.0003 for Transformer. Training stopped after 10 validations without any perplexity improvement on the development corpus; validations were performed every 1,000 mini-batches. The model finally used is the one for which the best BLEU score was obtained on the development corpus. We used the same number of sentences per mini-batch for all the models trained for a given TL and corpus size; the number of sentences in each mini-batch ensures that, when SL and TL MSD tags are interleaved, the number of tokens is below 4,500.
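The learning rate schedule, stopping criterion and model selection described above can be sketched as follows. This is an illustration under assumptions rather than the training toolkit's actual behaviour: the stated learning rate is taken to be the peak value reached at the end of warm-up, which the text does not specify, and the function names are hypothetical.

```python
import math


def inv_sqrt_lr(step, peak_lr=0.0004, warmup=8000):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # guard against step 0
    return peak_lr * min(step / warmup, math.sqrt(warmup / step))


def should_stop(dev_perplexities, patience=10):
    """True once the last `patience` validations did not improve on the
    best development perplexity seen before them (lower is better)."""
    if len(dev_perplexities) <= patience:
        return False
    best_before = min(dev_perplexities[:-patience])
    return min(dev_perplexities[-patience:]) >= best_before


def select_checkpoint(dev_bleu_scores):
    """Index of the validation step with the highest development BLEU."""
    return max(range(len(dev_bleu_scores)), key=dev_bleu_scores.__getitem__)
```

Under this reading, the recurrent systems would reach a learning rate of 0.0004 at iteration 8,000 and fall back to half of that value by iteration 32,000; note that stopping is driven by perplexity while the final model is chosen by BLEU.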