Published as a conference paper at ICLR 2015
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Dzmitry Bahdanau
Jacobs University Bremen, Germany

KyungHyun Cho    Yoshua Bengio∗
Université de Montréal

arXiv:1409.0473v7 [cs.CL] 19 May 2016
ABSTRACT

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
1 INTRODUCTION
Neural machine translation is a newly emerging approach to machine translation, recently proposed by Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al. (2014b). Unlike the traditional phrase-based translation system (see, e.g., Koehn et al., 2003) which consists of many small sub-components that are tuned separately, neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation.

Most of the proposed neural machine translation models belong to a family of encoder–decoders (Sutskever et al., 2014; Cho et al., 2014a), with an encoder and a decoder for each language, or involve a language-specific encoder applied to each sentence whose outputs are then compared (Hermann and Blunsom, 2014). An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from the encoded vector. The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.

A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho et al. (2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.

In order to address this issue, we introduce an extension to the encoder–decoder model which learns to align and translate jointly. Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.

∗ CIFAR Senior Fellow
The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.

In this paper, we show that the proposed approach of jointly learning to align and translate achieves significantly improved translation performance over the basic encoder–decoder approach. The improvement is more apparent with longer sentences, but can be observed with sentences of any length. On the task of English-to-French translation, the proposed approach achieves, with a single model, a translation performance comparable, or close, to the conventional phrase-based system. Furthermore, qualitative analysis reveals that the proposed model finds a linguistically plausible (soft-)alignment between a source sentence and the corresponding target sentence.
2 BACKGROUND: NEURAL MACHINE TRANSLATION

From a probabilistic perspective, translation is equivalent to finding a target sentence y that maximizes the conditional probability of y given a source sentence x, i.e., arg max_y p(y | x). In neural machine translation, we fit a parameterized model to maximize the conditional probability of sentence pairs using a parallel training corpus. Once the conditional distribution is learned by a translation model, given a source sentence a corresponding translation can be generated by searching for the sentence that maximizes the conditional probability.

Recently, a number of papers have proposed the use of neural networks to directly learn this conditional distribution (see, e.g., Kalchbrenner and Blunsom, 2013; Cho et al., 2014a; Sutskever et al., 2014; Cho et al., 2014b; Forcada and Ñeco, 1997). This neural machine translation approach typically consists of two components, the first of which encodes a source sentence x and the second of which decodes it into a target sentence y. For instance, two recurrent neural networks (RNNs) were used by Cho et al. (2014a) and Sutskever et al. (2014) to encode a variable-length source sentence into a fixed-length vector and to decode the vector into a variable-length target sentence.

Despite being a quite new approach, neural machine translation has already shown promising results. Sutskever et al. (2014) reported that neural machine translation based on RNNs with long short-term memory (LSTM) units achieves close to the state-of-the-art performance of the conventional phrase-based machine translation system on an English-to-French translation task.¹ Adding neural components to existing translation systems, for instance, to score the phrase pairs in the phrase table (Cho et al., 2014a) or to re-rank candidate translations (Sutskever et al., 2014), has made it possible to surpass the previous state-of-the-art performance level.
2.1 RNN ENCODER–DECODER

Here, we describe briefly the underlying framework, called RNN Encoder–Decoder, proposed by Cho et al. (2014a) and Sutskever et al. (2014), upon which we build a novel architecture that learns to align and translate simultaneously.
In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors x = (x_1, ..., x_{T_x}), into a vector c.² The most common approach is to use an RNN such that

    h_t = f(x_t, h_{t-1})   and   c = q({h_1, ..., h_{T_x}}),    (1)

where h_t ∈ R^n is a hidden state at time t, and c is a vector generated from the sequence of the hidden states. f and q are some nonlinear functions. Sutskever et al. (2014) used an LSTM as f and q({h_1, ..., h_T}) = h_T, for instance.

¹ By the state-of-the-art performance, we mean the performance of the conventional phrase-based system without using any neural network-based component.
² Although most of the previous works (see, e.g., Cho et al., 2014a; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013) encode a variable-length input sentence into a fixed-length vector, this is not necessary, and it may even be beneficial to have a variable-length vector, as we will show later.
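For concreteness, a minimal numpy sketch of Eq. (1) follows. The tanh recurrence, the parameter names and the toy dimensions are illustrative assumptions, not the gated unit used in our experiments (see Appendix A); q is taken to be the last hidden state, as in Sutskever et al. (2014).

```python
import numpy as np

def rnn_encode(x_seq, W_xh, W_hh, b_h):
    """Eq. (1): h_t = f(x_t, h_{t-1}) with f = tanh,
    and c = q({h_1, ..., h_Tx}) chosen as the last state h_Tx."""
    h = np.zeros(W_hh.shape[0])            # h_0
    states = []
    for x_t in x_seq:                      # x_seq: word-embedding vectors x_1..x_Tx
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return states, states[-1]              # ({h_1, ..., h_Tx}, c)

# toy usage with hypothetical dimensions
embed_dim, hidden_dim, Tx = 4, 8, 5
rng = np.random.default_rng(0)
x_seq = [rng.normal(size=embed_dim) for _ in range(Tx)]
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
states, c = rnn_encode(x_seq, W_xh, W_hh, np.zeros(hidden_dim))
```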
The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, ..., y_{t'-1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

    p(y) = \prod_{t=1}^{T} p(y_t | {y_1, ..., y_{t-1}}, c),    (2)

where y = (y_1, ..., y_{T_y}). With an RNN, each conditional probability is modeled as

    p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c),    (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).
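The decomposition in Eqs. (2)–(3) can be sketched as follows; the tanh state update and the single softmax readout are simplifications of the gated unit and maxout layer actually used (see Appendix A), and all parameter names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev_emb, s_prev, c, params):
    """Eq. (3): s_t = f(s_{t-1}, y_{t-1}, c) and p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)."""
    W_ys, W_ss, W_cs, b_s, W_out, b_out = params
    s_t = np.tanh(W_ys @ y_prev_emb + W_ss @ s_prev + W_cs @ c + b_s)
    p_t = softmax(W_out @ np.concatenate([y_prev_emb, s_t, c]) + b_out)
    return s_t, p_t                        # p_t: distribution over the target vocabulary

def sequence_log_prob(y_embs, y_ids, bos_emb, s0, c, params):
    """Eq. (2): log p(y) = sum_t log p(y_t | y_1..y_{t-1}, c)."""
    s, y_prev, logp = s0, bos_emb, 0.0
    for emb, idx in zip(y_embs, y_ids):
        s, p = decoder_step(y_prev, s, c, params)
        logp += np.log(p[idx])
        y_prev = emb
    return logp
```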
3 LEARNING TO ALIGN AND TRANSLATE

In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searching through a source sentence while decoding a translation (Sec. 3.1).
3.1 DECODER: GENERAL DESCRIPTION

In a new model architecture, we define each conditional probability in Eq. (2) as:

    p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),    (4)

where s_i is an RNN hidden state for time i, computed by

    s_i = f(s_{i-1}, y_{i-1}, c_i).

It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i.

The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.

The context vector c_i is, then, computed as a weighted sum of these annotations h_i:

    c_i = \sum_{j=1}^{T_x} α_{ij} h_j.    (5)

The weight α_{ij} of each annotation h_j is computed by

    α_{ij} = exp(e_{ij}) / \sum_{k=1}^{T_x} exp(e_{ik}),    (6)

where

    e_{ij} = a(s_{i-1}, h_j)

is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.

[Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, ..., x_T).]
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, the alignment is not considered to be a latent variable. Instead, the alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through. This gradient can be used to train the alignment model as well as the whole translation model jointly.

We can understand the approach of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments. Let α_{ij} be a probability that the target word y_i is aligned to, or translated from, a source word x_j. Then, the i-th context vector c_i is the expected annotation over all the annotations with probabilities α_{ij}.

The probability α_{ij}, or its associated energy e_{ij}, reflects the importance of the annotation h_j with respect to the previous hidden state s_{i-1} in deciding the next state s_i and generating y_i. Intuitively, this implements a mechanism of attention in the decoder. The decoder decides which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.
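A minimal numpy sketch of Eqs. (4)–(6) is given below, assuming the single-hidden-layer feedforward alignment model a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j) described in Appendix A; the sizes and parameter names are illustrative only.

```python
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Score every annotation h_j against s_{i-1} (Eq. (6)), then return the
    expected annotation c_i (Eq. (5)) together with the weights alpha_i."""
    H = np.asarray(H)                                      # (Tx, annotation_dim)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                   # Eq. (6): softmax over source positions
    c = alpha @ H                                          # Eq. (5): weighted sum of annotations
    return c, alpha

# toy usage: 6 annotations of dimension 16, decoder state of dimension 10
rng = np.random.default_rng(1)
H = rng.normal(size=(6, 16))
s_prev = rng.normal(size=10)
W_a = rng.normal(scale=0.1, size=(32, 10))
U_a = rng.normal(scale=0.1, size=(32, 16))
v_a = rng.normal(scale=0.1, size=32)
c_i, alpha_i = attention_context(s_prev, H, W_a, U_a, v_a)
```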
3.2 ENCODER: BIDIRECTIONAL RNN FOR ANNOTATING SEQUENCES

The usual RNN, described in Eq. (1), reads an input sequence x in order starting from the first symbol x_1 to the last one x_{T_x}. However, in the proposed scheme, we would like the annotation of each word to summarize not only the preceding words, but also the following words. Hence, we propose to use a bidirectional RNN (BiRNN, Schuster and Paliwal, 1997), which has been successfully used recently in speech recognition (see, e.g., Graves et al., 2013).

A BiRNN consists of a forward and a backward RNN. The forward RNN \overrightarrow{f} reads the input sequence as it is ordered (from x_1 to x_{T_x}) and calculates a sequence of forward hidden states (\overrightarrow{h}_1, ..., \overrightarrow{h}_{T_x}). The backward RNN \overleftarrow{f} reads the sequence in the reverse order (from x_{T_x} to x_1), resulting in a sequence of backward hidden states (\overleftarrow{h}_1, ..., \overleftarrow{h}_{T_x}).

We obtain an annotation for each word x_j by concatenating the forward hidden state \overrightarrow{h}_j and the backward one \overleftarrow{h}_j, i.e., h_j = [\overrightarrow{h}_j^T ; \overleftarrow{h}_j^T]^T. In this way, the annotation h_j contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation h_j will be focused on the words around x_j. This sequence of annotations is used by the decoder and the alignment model later to compute the context vector (Eqs. (5)–(6)).

See Fig. 1 for the graphical illustration of the proposed model.
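A sketch of the annotation computation in this section, again with a plain tanh recurrence standing in for the gated units of the actual model (Appendix A); parameter names are illustrative.

```python
import numpy as np

def run_rnn(x_seq, W_xh, W_hh, b_h):
    """Simple tanh RNN returning the hidden state at every position."""
    h, states = np.zeros(W_hh.shape[0]), []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return states

def birnn_annotations(x_seq, fwd_params, bwd_params):
    """h_j = [forward h_j ; backward h_j]: the forward RNN reads x_1..x_Tx,
    the backward RNN reads x_Tx..x_1, and the two states are concatenated."""
    fwd = run_rnn(x_seq, *fwd_params)
    bwd = run_rnn(x_seq[::-1], *bwd_params)[::-1]   # realign to positions 1..Tx
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```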
4 EXPERIMENT SETTINGS

We evaluate the proposed approach on the task of English-to-French translation. We use the bilingual, parallel corpora provided by ACL WMT '14.³ As a comparison, we also report the performance of an RNN Encoder–Decoder which was proposed recently by Cho et al. (2014a). We use the same training procedures and the same dataset for both models.⁴

4.1 DATASET

WMT '14 contains the following English-French parallel corpora: Europarl (61M words), news commentary (5.5M), UN (421M) and two crawled corpora of 90M and 272.5M words respectively, totaling 850M words. Following the procedure described in Cho et al. (2014a), we reduce the size of the combined corpus to have 348M words using the data selection method by Axelrod et al. (2011).⁵ We do not use any monolingual data other than the mentioned parallel corpora, although it may be possible to use a much larger monolingual corpus to pretrain an encoder.

³ http://www.statmt.org/wmt14/translation-task.html
⁴ Implementations are available at https://github.com/lisa-groundhog/GroundHog.
⁵ Available online at http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/.
We concatenate news-test-2012 and news-test-2013 to make a development (validation) set, and evaluate the models on the test set (news-test-2014) from WMT '14, which consists of 3003 sentences not present in the training data.

After a usual tokenization⁶, we use a shortlist of the 30,000 most frequent words in each language to train our models. Any word not included in the shortlist is mapped to a special token ([UNK]). We do not apply any other special preprocessing, such as lowercasing or stemming, to the data.

[Figure 2: The BLEU scores of the generated translations on the test set with respect to the lengths of the sentences (x-axis: sentence length; y-axis: BLEU score; curves: RNNsearch-50, RNNsearch-30, RNNenc-50, RNNenc-30). The results are on the full test set, which includes sentences having unknown words to the models.]
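The shortlist mapping described above can be sketched as follows; the whitespace tokenizer is a stand-in for the Moses tokenization script we actually use, and the toy corpus is illustrative.

```python
from collections import Counter

def build_shortlist(tokenized_sentences, size=30000):
    """Return the `size` most frequent tokens."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}

def map_to_shortlist(sentence, shortlist, unk="[UNK]"):
    """Replace every out-of-shortlist token with the special [UNK] token."""
    return [tok if tok in shortlist else unk for tok in sentence]

# toy usage
corpus = [s.split() for s in ["the agreement was signed in august 1992 .",
                              "the man said it will change my future ."]]
shortlist = build_shortlist(corpus, size=10)
print(map_to_shortlist("the agreement was ratified .".split(), shortlist))
```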
4.2 MODELS

We train two types of models. The first one is an RNN Encoder–Decoder (RNNencdec, Cho et al., 2014a), and the other is the proposed model, to which we refer as RNNsearch. We train each model twice: first with the sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with the sentences of length up to 50 words (RNNencdec-50, RNNsearch-50).

The encoder and decoder of the RNNencdec have 1000 hidden units each.⁷ The encoder of the RNNsearch consists of forward and backward recurrent neural networks (RNNs), each having 1000 hidden units. Its decoder has 1000 hidden units. In both cases, we use a multilayer network with a single maxout (Goodfellow et al., 2013) hidden layer to compute the conditional probability of each target word (Pascanu et al., 2014).

We use a minibatch stochastic gradient descent (SGD) algorithm together with Adadelta (Zeiler, 2012) to train each model. Each SGD update direction is computed using a minibatch of 80 sentences. We trained each model for approximately 5 days.

Once a model is trained, we use a beam search to find a translation that approximately maximizes the conditional probability (see, e.g., Graves, 2012; Boulanger-Lewandowski et al., 2013). Sutskever et al. (2014) used this approach to generate translations from their neural machine translation model.

For more details on the architectures of the models and the training procedure used in the experiments, see Appendices A and B.
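A minimal sketch of such a beam search is given below; step_fn, the end-of-sentence id, the default beam width and the plain summed log-probability scoring are assumptions for illustration, not the exact procedure used in our experiments.

```python
import numpy as np

def beam_search(step_fn, s0, c, eos_id, beam_width=12, max_len=50):
    """step_fn(prev_token, state, c) -> (new_state, log-probabilities over the vocabulary).
    Keeps the beam_width best partial translations and returns the best completed one."""
    beams, completed = [([], 0.0, s0)], []        # (tokens, summed log-prob, decoder state)
    for _ in range(max_len):
        candidates = []
        for tokens, score, state in beams:
            prev = tokens[-1] if tokens else None
            state_new, log_p = step_fn(prev, state, c)
            for tok in np.argsort(log_p)[-beam_width:]:
                candidates.append((tokens + [int(tok)], score + log_p[tok], state_new))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (completed if cand[0][-1] == eos_id else beams).append(cand)
        if not beams:
            break
    return max(completed or beams, key=lambda b: b[1])[0]
```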
5 RESULTS

5.1 QUANTITATIVE RESULTS

In Table 1, we list the translation performances measured in BLEU score. It is clear from the table that in all the cases, the proposed RNNsearch outperforms the conventional RNNencdec. More importantly, the performance of the RNNsearch is as high as that of the conventional phrase-based translation system (Moses), when only the sentences consisting of known words are considered. This is a significant achievement, considering that Moses uses a separate monolingual corpus (418M words) in addition to the parallel corpora we used to train the RNNsearch and RNNencdec.

⁶ We used the tokenization script from the open-source machine translation package, Moses.
⁷ In this paper, by a 'hidden unit', we always mean the gated hidden unit (see Appendix A.1.1).
[Figure 3: Four sample alignments found by RNNsearch-50. The x-axis and y-axis of each plot correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight α_{ij} of the annotation of the j-th source word for the i-th target word (see Eq. (6)), in grayscale (0: black, 1: white). (a) an arbitrary sentence. (b–d) three randomly selected samples among the sentences without any unknown words and of length between 10 and 20 words from the test set. The sentence pairs shown are: (a) The agreement on the European Economic Area was signed in August 1992. / L'accord sur la zone économique européenne a été signé en août 1992. (b) It should be noted that the marine environment is the least known of environments. / Il convient de noter que l'environnement marin est le moins connu de l'environnement. (c) Destruction of the equipment means that Syria can no longer produce new chemical weapons. / La destruction de l'équipement signifie que la Syrie ne peut plus produire de nouvelles armes chimiques. (d) "This will change my future with my family," the man said. / "Cela va changer mon avenir avec ma famille", a dit l'homme.]
One of the motivations behind the proposed approach was the use of a fixed-length context vector in the basic encoder–decoder approach. We conjectured that this limitation may make the basic encoder–decoder approach underperform with long sentences. In Fig. 2, we see that the performance of RNNencdec dramatically drops as the length of the sentences increases. On the other hand, both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences. RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more. This superiority of the proposed model over the basic encoder–decoder is further confirmed by the fact that the RNNsearch-30 even outperforms RNNencdec-50 (see Table 1).
Table 1: BLEU scores of the trained models computed on the test set. The second and third columns show respectively the scores on all the sentences and on the sentences without any unknown word in themselves and in the reference translations. Note that RNNsearch-50⋆ was trained much longer, until the performance on the development set stopped improving. (◦) We disallowed the models to generate [UNK] tokens when only the sentences having no unknown words were evaluated (last column).

Model            All     No UNK◦
RNNencdec-30     13.93   24.19
RNNsearch-30     21.50   31.44
RNNencdec-50     17.82   26.71
RNNsearch-50     26.75   34.16
RNNsearch-50⋆    28.45   36.15
Moses            33.30   35.63
5.2 QUALITATIVE ANALYSIS

5.2.1 ALIGNMENT

The proposed approach provides an intuitive way to inspect the (soft-)alignment between the words in a generated translation and those in a source sentence. This is done by visualizing the annotation weights α_{ij} from Eq. (6), as in Fig. 3. Each row of a matrix in each plot indicates the weights associated with the annotations. From this we see which positions in the source sentence were considered more important when generating the target word.
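A plot of this kind can be produced directly from the matrix of weights; the sketch below assumes matplotlib and a toy weight matrix, with target words along the rows and source words along the columns as in Fig. 3.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_alignment(alpha, source_words, target_words):
    """alpha[i, j]: weight of the j-th source word when generating the i-th target word."""
    fig, ax = plt.subplots()
    ax.imshow(alpha, cmap="gray", vmin=0.0, vmax=1.0)   # 0: black, 1: white
    ax.set_xticks(range(len(source_words)))
    ax.set_xticklabels(source_words, rotation=90)
    ax.set_yticks(range(len(target_words)))
    ax.set_yticklabels(target_words)
    fig.tight_layout()
    return fig

# toy example: random weights standing in for learned alphas
src = ["The", "agreement", "was", "signed", ".", "<end>"]
tgt = ["L'", "accord", "a", "été", "signé", ".", "<end>"]
alpha = np.random.default_rng(2).dirichlet(np.ones(len(src)), size=len(tgt))
plot_alignment(alpha, src, tgt)
plt.show()
```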
We can see from the alignments in Fig. 3 that the alignment of words between English and French is largely monotonic. We see strong weights along the diagonal of each matrix. However, we also observe a number of non-trivial, non-monotonic alignments. Adjectives and nouns are typically ordered differently between French and English, and we see an example in Fig. 3 (a). From this figure, we see that the model correctly translates the phrase [European Economic Area] into [zone économique européenne]. The RNNsearch was able to correctly align [zone] with [Area], jumping over the two words ([European] and [Economic]), and then looked one word back at a time to complete the whole phrase [zone économique européenne].

The strength of the soft-alignment, as opposed to a hard-alignment, is evident, for instance, from Fig. 3 (d). Consider the source phrase [the man], which was translated into [l' homme]. Any hard alignment will map [the] to [l'] and [man] to [homme]. This is not helpful for translation, as one must consider the word following [the] to determine whether it should be translated into [le], [la], [les] or [l']. Our soft-alignment solves this issue naturally by letting the model look at both [the] and [man], and in this example, we see that the model was able to correctly translate [the] into [l']. We observe similar behaviors in all the presented cases in Fig. 3. An additional benefit of the soft alignment is that it naturally deals with source and target phrases of different lengths, without requiring a counter-intuitive way of mapping some words to or from nowhere ([NULL]) (see, e.g., Chapters 4 and 5 of Koehn, 2010).
5.2.2 LONG SENTENCES

As is clearly visible from Fig. 2, the proposed model (RNNsearch) is much better than the conventional model (RNNencdec) at translating long sentences. This is likely due to the fact that the RNNsearch does not require encoding a long sentence into a fixed-length vector perfectly, but only accurately encoding the parts of the input sentence that surround a particular word.

As an example, consider this source sentence from the test set:

    An admitting privilege is the right of a doctor to admit a patient to a hospital or a medical centre to carry out a diagnosis or a procedure, based on his status as a health care worker at a hospital.

The RNNencdec-50 translated this sentence into:

    Un privilège d'admission est le droit d'un médecin de reconnaître un patient à l'hôpital ou un centre médical d'un diagnostic ou de prendre un diagnostic en fonction de son état de santé.
The RNNencdec-50 correctly translated the source sentence until [a medical center]. However, from there on (underlined), it deviated from the original meaning of the source sentence. For instance, it replaced [based on his status as a health care worker at a hospital] in the source sentence with [en fonction de son état de santé] ("based on his state of health").

On the other hand, the RNNsearch-50 generated the following correct translation, preserving the whole meaning of the input sentence without omitting any details:

    Un privilège d'admission est le droit d'un médecin d'admettre un patient à un hôpital ou un centre médical pour effectuer un diagnostic ou une procédure, selon son statut de travailleur des soins de santé à l'hôpital.

Let us consider another sentence from the test set:

    This kind of experience is part of Disney's efforts to "extend the lifetime of its series and build new relationships with audiences via digital platforms that are becoming ever more important," he added.

The translation by the RNNencdec-50 is

    Ce type d'expérience fait partie des initiatives du Disney pour "prolonger la durée de vie de ses nouvelles et de développer des liens avec les lecteurs numériques qui deviennent plus complexes.

As with the previous example, the RNNencdec began deviating from the actual meaning of the source sentence after generating approximately 30 words (see the underlined phrase). After that point, the quality of the translation deteriorates, with basic mistakes such as the lack of a closing quotation mark.

Again, the RNNsearch-50 was able to translate this long sentence correctly:

    Ce genre d'expérience fait partie des efforts de Disney pour "prolonger la durée de vie de ses séries et créer de nouvelles relations avec des publics via des plateformes numériques de plus en plus importantes", a-t-il ajouté.
In conjunction with the quantitative results presented already, these qualitative observations confirm our hypotheses that the RNNsearch architecture enables far more reliable translation of long sentences than the standard RNNencdec model.

In Appendix C, we provide a few more sample translations of long source sentences generated by the RNNencdec-50, RNNsearch-50 and Google Translate, along with the reference translations.

6 RELATED WORK

6.1 LEARNING TO ALIGN
A similar approach of aligning an output symbol with an input symbol was proposed recently by Graves (2013) in the context of handwriting synthesis. Handwriting synthesis is a task where the model is asked to generate handwriting for a given sequence of characters. In his work, he used a mixture of Gaussian kernels to compute the weights of the annotations, where the location, width and mixture coefficient of each kernel were predicted from an alignment model. More specifically, his alignment was restricted to predict the location such that the location increases monotonically. The main difference from our approach is that, in (Graves, 2013), the modes of the weights of the annotations only move in one direction. In the context of machine translation, this is a severe limitation, as (long-distance) reordering is often needed to generate a grammatically correct translation (for instance, English-to-German).
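For contrast, a sketch of a Graves (2013)-style monotonic window over the annotations follows; the exact parameterization below (locations advanced by exponentiated predicted offsets) is an illustrative simplification of his handwriting-synthesis model, not a reimplementation of it.

```python
import numpy as np

def monotonic_gaussian_weights(kappa_prev, d_kappa, log_beta, log_alpha, Tx):
    """One step of a mixture-of-Gaussians window: each component's location kappa
    can only increase, so the attention window never moves backwards."""
    kappa = kappa_prev + np.exp(d_kappa)        # monotonic location update
    beta, alpha = np.exp(log_beta), np.exp(log_alpha)
    u = np.arange(Tx)[None, :]                  # annotation positions 0..Tx-1
    phi = (alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(axis=0)
    return kappa, phi                           # phi_j: weight of annotation h_j
```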
Our approach, on the other hand, requires computing the annotation weight of every word in the source sentence for each word in the translation. This drawback is not severe for translation, where most input and output sentences are only 15–40 words. However, this may limit the applicability of the proposed scheme to other tasks.
6.2 NEURAL NETWORKS FOR MACHINE TRANSLATION

Since Bengio et al. (2003) introduced a neural probabilistic language model which uses a neural network to model the conditional probability of a word given a fixed number of the preceding words, neural networks have been widely used in machine translation. However, the role of neural networks has been largely limited to simply providing a single feature to an existing statistical machine translation system or to re-ranking a list of candidate translations provided by an existing system.

For instance, Schwenk (2012) proposed using a feedforward neural network to compute the score of a pair of source and target phrases and to use the score as an additional feature in the phrase-based statistical machine translation system. More recently, Kalchbrenner and Blunsom (2013) and Devlin et al. (2014) reported the successful use of neural networks as a sub-component of an existing translation system. Traditionally, a neural network trained as a target-side language model has been used to rescore or rerank a list of candidate translations (see, e.g., Schwenk et al., 2006).

Although the above approaches were shown to improve the translation performance over the state-of-the-art machine translation systems, we are more interested in the more ambitious objective of designing a completely new translation system based on neural networks. The neural machine translation approach we consider in this paper is therefore a radical departure from these earlier works. Rather than using a neural network as a part of an existing system, our model works on its own and generates a translation from a source sentence directly.
7 CONCLUSION

The conventional approach to neural machine translation, called an encoder–decoder approach, encodes a whole input sentence into a fixed-length vector from which a translation will be decoded. We conjectured that the use of a fixed-length context vector is problematic for translating long sentences, based on a recent empirical study reported by Cho et al. (2014b) and Pouget-Abadie et al. (2014).

In this paper, we proposed a novel architecture that addresses this issue. We extended the basic encoder–decoder by letting a model (soft-)search for a set of input words, or their annotations computed by an encoder, when generating each target word. This frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word. This has a major positive impact on the ability of the neural machine translation system to yield good results on longer sentences. Unlike with the traditional machine translation systems, all of the pieces of the translation system, including the alignment mechanism, are jointly trained towards a better log-probability of producing correct translations.

We tested the proposed model, called RNNsearch, on the task of English-to-French translation. The experiment revealed that the proposed RNNsearch outperforms the conventional encoder–decoder model (RNNencdec) significantly, regardless of the sentence length, and that it is much more robust to the length of a source sentence. From the qualitative analysis, in which we investigated the (soft-)alignment generated by the RNNsearch, we were able to conclude that the model can correctly align each target word with the relevant words, or their annotations, in the source sentence as it generated a correct translation.

Perhaps more importantly, the proposed approach achieved a translation performance comparable to the existing phrase-based statistical machine translation. It is a striking result, considering that the proposed architecture, or the whole family of neural machine translation, has only been proposed as recently as this year. We believe the architecture proposed here is a promising step toward better machine translation and a better understanding of natural languages in general.

One of the challenges left for the future is to better handle unknown, or rare, words. This will be required for the model to be more widely used and to match the performance of current state-of-the-art machine translation systems in all contexts.
ACKNOWLEDGMENTS

The authors would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). We acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. Bahdanau thanks the support from Planet Intelligent Systems GmbH. We also thank Felix Hill, Bart van Merriënboer, Jean Pouget-Abadie, Coline Devin and Tae-Ho Kim.

REFERENCES

Axelrod, A., He, X., and Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 355–362. Association for Computational Linguistics.

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.