Sequence to Sequence Learning
`with Neural Networks
`
`Ilya Sutskever
`Google
`ilyasu@google.com
`
`Oriol Vinyals
`Google
`vinyals@google.com
`
`Quoc V. Le
`Google
`qvl@google.com
`
`Abstract
`
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on
difficult learning tasks. Although DNNs work well whenever large labeled training sets are
available, they cannot be used to map sequences to sequences. In this paper, we present a general
end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure.
Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector
of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the
vector. Our main result is that on an English to French translation task from the WMT’14 dataset,
the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where
the LSTM’s BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not
have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score
of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the
aforementioned SMT system, its BLEU score increased to 36.5, which is close to the previous best
result on this task. The LSTM also learned sensible phrase and sentence representations that are
sensitive to word order and are relatively invariant to the active and the passive voice. Finally,
we found that reversing the order of the words in all source sentences (but not target sentences)
improved the LSTM’s performance markedly, because doing so introduced many short term dependencies
between the source and the target sentence which made the optimization problem easier.
`
`1 Introduction
`
Deep Neural Networks (DNNs) are extremely powerful machine learning models that achieve excellent
performance on difficult problems such as speech recognition [13, 7] and visual object recognition
[19, 6, 21, 20]. DNNs are powerful because they can perform arbitrary parallel computation for a
modest number of steps. A surprising example of the power of DNNs is their ability to sort
N N-bit numbers using only 2 hidden layers of quadratic size [27]. So, while neural networks are
related to conventional statistical models, they learn an intricate computation. Furthermore, large
DNNs can be trained with supervised backpropagation whenever the labeled training set has enough
information to specify the network’s parameters. Thus, if there exists a parameter setting of a large
DNN that achieves good results (for example, because humans can solve the task very rapidly),
supervised backpropagation will find these parameters and solve the problem.
`
Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets
can be sensibly encoded with vectors of fixed dimensionality. This is a significant limitation, since
many important problems are best expressed with sequences whose lengths are not known a priori.
For example, speech recognition and machine translation are sequential problems. Likewise, question
answering can also be seen as mapping a sequence of words representing the question to a
sequence of words representing the answer. It is therefore clear that a domain-independent method
that learns to map sequences to sequences would be useful.

arXiv:1409.3215v3 [cs.CL] 14 Dec 2014

Petitioner, EX1014
IPR2024-01234
Hugging Face, Inc., v. FriendliAI Inc.

`
Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and
outputs is known and fixed. In this paper, we show that a straightforward application of the Long
Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems.
The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large
fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence
from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model
[28, 23, 30] except that it is conditioned on the input sequence. The LSTM’s ability to successfully
learn on data with long range temporal dependencies makes it a natural choice for this application
due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).
`
There have been a number of related attempts to address the general sequence to sequence learning
problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18],
who were the first to map the entire input sentence to a vector, and is related to Cho et al. [5],
although the latter was used only for rescoring hypotheses produced by a phrase-based system.
Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to
focus on different parts of their input, and an elegant variant of this idea was successfully
applied to machine translation by Bahdanau et al. [2]. Connectionist Temporal Classification is
another popular technique for mapping sequences to sequences with neural networks, but it assumes a
monotonic alignment between the inputs and the outputs [11].
`
`Figure 1: Our model reads an input sentence “ABC” and produces “WXYZ” as the output sentence. The
`model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the
`input sentence in reverse, because doing so introduces many short term dependencies in the data that make the
`optimization problem much easier.
`
`The main result of this work is the following. On the WMT’14 English to French translation task,
`we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep
`LSTMs (with 384M parameters and 8,000 dimensional state each) using a simple left-to-right beam-
`search decoder. This is by far the best result achieved by direct translation with large neural net-
`works. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81
`BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized
`whenever the reference translation contained a word not covered by these 80k. This result shows
`that a relatively unoptimized small-vocabulary neural network architecture which has much room
`for improvement outperforms a phrase-based SMT system.
`
`Finally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on
`the same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by
`3.2 BLEU points and is close to the previous best published result on this task (which is 37.0 [9]).
`
`Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other
`researchers with related architectures [26]. We were able to do well on long sentences because we
`reversed the order of words in the source sentence but not the target sentences in the training and test
`set. By doing so, we introduced many short term dependencies that made the optimization problem
`much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with
`long sentences. The simple trick of reversing the words in the source sentence is one of the key
`technical contributions of this work.
`
A useful property of the LSTM is that it learns to map an input sentence of variable length into
a fixed-dimensional vector representation. Given that translations tend to be paraphrases of the
source sentences, the translation objective encourages the LSTM to find sentence representations
that capture their meaning, as sentences with similar meanings are close to each other while
sentences with different meanings will be far apart. A qualitative evaluation supports this claim,
showing that our model is aware of word order and is fairly invariant to the active and passive voice.
`
`2 The model
`
The Recurrent Neural Network (RNN) [31, 28] is a natural generalization of feedforward neural
networks to sequences. Given a sequence of inputs $(x_1, \ldots, x_T)$, a standard RNN computes a
sequence of outputs $(y_1, \ldots, y_T)$ by iterating the following equations:

$$h_t = \mathrm{sigm}\left(W^{hx} x_t + W^{hh} h_{t-1}\right)$$
$$y_t = W^{yh} h_t$$

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the
outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose
input and output sequences have different lengths with complicated and non-monotonic relationships.
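
The recurrence above can be illustrated with a minimal pure-Python sketch. The toy dimensions, the weight values, and the helper names (`sigm`, `matvec`, `rnn_forward`) are assumptions made for illustration and are not taken from the paper.

```python
import math

def sigm(z):
    """Logistic sigmoid, the nonlinearity named in the recurrence."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_hx, W_hh, W_yh, h0=None):
    """Iterate h_t = sigm(W_hx x_t + W_hh h_{t-1}) and y_t = W_yh h_t."""
    h = h0 if h0 is not None else [0.0] * len(W_hh)
    ys = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_hx, x), matvec(W_hh, h))]
        h = [sigm(z) for z in pre]
        ys.append(matvec(W_yh, h))
    return ys, h

# Toy sizes (assumed): 2-dimensional inputs, 3 hidden units, 2 outputs.
W_hx = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
W_hh = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.0, 0.3]]
W_yh = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys, h_last = rnn_forward(xs, W_hx, W_hh, W_yh)
```

Note that the loop emits exactly one output per input, which is why this plain formulation needs the input-output alignment to be known in advance.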
`
`The simplest strategy for general sequence learning is to map the input sequence to a fixed-sized
`vector using one RNN, and then to map the vector to the target sequence with another RNN (this
`approach has also been taken by Cho et al. [5]). While it could work in principle since the RNN is
`provided with all the relevant information, it would be difficult to train the RNNs due to the resulting
`long term dependencies (figure 1) [14, 4, 16, 15]. However, the Long Short-Term Memory (LSTM)
`[16] is known to learn problems with long range temporal dependencies, so an LSTM may succeed
`in this setting.
The goal of the LSTM is to estimate the conditional probability
$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)$ where $(x_1, \ldots, x_T)$ is an input sequence and
$y_1, \ldots, y_{T'}$ is its corresponding output sequence whose length $T'$ may differ from $T$.
The LSTM computes this conditional probability by first obtaining the fixed-dimensional
representation $v$ of the input sequence $(x_1, \ldots, x_T)$ given by the last hidden state of the
LSTM, and then computing the probability of $y_1, \ldots, y_{T'}$ with a standard LSTM-LM formulation
whose initial hidden state is set to the representation $v$ of $x_1, \ldots, x_T$:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1}) \qquad (1)$$

In this equation, each $p(y_t \mid v, y_1, \ldots, y_{t-1})$ distribution is represented with a
softmax over all the words in the vocabulary. We use the LSTM formulation from Graves [10]. Note
that we require that each sentence ends with a special end-of-sentence symbol “<EOS>”, which enables
the model to define a distribution over sequences of all possible lengths. The overall scheme is
outlined in figure 1, where the shown LSTM computes the representation of “A”, “B”, “C”, “<EOS>” and
then uses this representation to compute the probability of “W”, “X”, “Y”, “Z”, “<EOS>”.
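
As a rough illustration of equation (1), the sketch below sums per-step log softmax probabilities over a target sequence that ends in “<EOS>”. The decoder is replaced by a hypothetical stand-in, `next_word_scores`, and the four-symbol vocabulary is invented for the example; neither comes from the paper.

```python
import math

EOS = "<EOS>"

def softmax(scores):
    """Turn a dict of word -> score into a dict of word -> probability."""
    m = max(scores.values())
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def sequence_log_prob(target, v, next_word_scores):
    """log p(y_1..y_T' | v) = sum over t of log p(y_t | v, y_1..y_{t-1}).

    `next_word_scores(v, prefix)` is a hypothetical stand-in for the decoder
    LSTM: it returns unnormalized scores over the vocabulary.
    """
    assert target[-1] == EOS, "every sentence must end with <EOS>"
    total = 0.0
    for t, word in enumerate(target):
        probs = softmax(next_word_scores(v, target[:t]))
        total += math.log(probs[word])
    return total

def toy_scores(v, prefix):
    """Toy decoder (assumed): equal scores for three words plus <EOS>."""
    return {"W": 0.0, "X": 0.0, "Y": 0.0, EOS: 0.0}

lp = sequence_log_prob(["W", "X", EOS], v=None, next_word_scores=toy_scores)
```

Because “<EOS>” is scored like any other word, sequences of different lengths compete within one distribution.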
Our actual models differ from the above description in three important ways. First, we used two
different LSTMs: one for the input sequence and another for the output sequence, because doing
so increases the number of model parameters at negligible computational cost and makes it natural to
train the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs
significantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found
it extremely valuable to reverse the order of the words of the input sentence. So for example, instead
of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ,
where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β,
and so on, a fact that makes it easy for SGD to “establish communication” between the input and the
output. We found this simple data transformation to greatly improve the performance of the LSTM.
`
`3 Experiments
`
We applied our method to the WMT’14 English to French MT task in two ways. We used it to
directly translate the input sentence without using a reference SMT system and we used it to rescore
the n-best lists of an SMT baseline. We report the accuracy of these translation methods, present
sample translations, and visualize the resulting sentence representation.
`
`3.1 Dataset details
`
`We used the WMT’14 English to French dataset. We trained our models on a subset of 12M sen-
`tences consisting of 348M French words and 304M English words, which is a clean “selected”
`subset from [29]. We chose this translation task and this specific training set subset because of the
`public availability of a tokenized training and test set together with 1000-best lists from the baseline
`SMT [29].
`
`As typical neural language models rely on a vector representation for each word, we used a fixed
`vocabulary for both languages. We used 160,000 of the most frequent words for the source language
`and 80,000 of the most frequent words for the target language. Every out-of-vocabulary word was
`replaced with a special “UNK” token.
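
The fixed-vocabulary step can be sketched as follows. The helper names `build_vocab` and `replace_oov` and the three-sentence corpus are assumptions for illustration; the paper only states the vocabulary sizes and the “UNK” replacement.

```python
from collections import Counter

UNK = "UNK"

def build_vocab(tokenized_sentences, max_size):
    """Keep the `max_size` most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}

def replace_oov(sentence, vocab):
    """Replace every out-of-vocabulary token with the special UNK token."""
    return [tok if tok in vocab else UNK for tok in sentence]

# Toy corpus (assumed); the paper uses 160,000 source and 80,000 target words.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(corpus, max_size=3)
mapped = replace_oov(["the", "dog", "sat", "down"], vocab)
```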
`
`3.2 Decoding and Rescoring
`
`The core of our experiments involved training a large deep LSTM on many sentence pairs. We
`trained it by maximizing the log probability of a correct translation T given the source sentence S,
`so the training objective is
`
$$\frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)$$

where $\mathcal{S}$ is the training set. Once training is complete, we produce translations by
finding the most likely translation according to the LSTM:

$$\hat{T} = \arg\max_{T} \; p(T \mid S) \qquad (2)$$
`
`We search for the most likely translation using a simple left-to-right beam search decoder which
`maintains a small number B of partial hypotheses, where a partial hypothesis is a prefix of some
`translation. At each timestep we extend each partial hypothesis in the beam with every possible
`word in the vocabulary. This greatly increases the number of the hypotheses so we discard all but
`the B most likely hypotheses according to the model’s log probability. As soon as the “<EOS>”
`symbol is appended to a hypothesis, it is removed from the beam and is added to the set of complete
`hypotheses. While this decoder is approximate, it is simple to implement. Interestingly, our system
`performs well even with a beam size of 1, and a beam of size 2 provides most of the benefits of beam
`search (Table 1).
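
The decoder described above can be sketched in a few lines. Here `next_log_probs` is a hypothetical stand-in for the LSTM's next-word log probabilities, and the `max_len` cap and the stopping rule (stop once the beam is empty) are assumptions, since the text does not specify them.

```python
import math

EOS = "<EOS>"

def beam_search(next_log_probs, beam_size, max_len=50):
    """Left-to-right beam search keeping the B most likely partial hypotheses.

    `next_log_probs(prefix)` returns a dict word -> log p(word | source, prefix).
    Returns the best complete (log probability, hypothesis) pair, or None.
    """
    beam = [(0.0, [])]      # (log probability, partial hypothesis)
    complete = []
    for _ in range(max_len):
        # Extend every partial hypothesis with every word in the vocabulary.
        candidates = []
        for score, prefix in beam:
            for word, lp in next_log_probs(prefix).items():
                candidates.append((score + lp, prefix + [word]))
        # Discard all but the B most likely hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, hyp in candidates[:beam_size]:
            if hyp[-1] == EOS:
                complete.append((score, hyp))   # leaves the beam
            else:
                beam.append((score, hyp))
        if not beam:
            break
    return max(complete, key=lambda c: c[0]) if complete else None

def toy_model(prefix):
    """Toy next-word distribution (assumed) where greedy search is suboptimal."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {EOS: 1.0},
    }
    probs = table.get(tuple(prefix), {EOS: 1.0})
    return {w: math.log(p) for w, p in probs.items()}

greedy = beam_search(toy_model, beam_size=1)
wider = beam_search(toy_model, beam_size=2)
```

In this toy distribution a beam of size 1 commits to “a” and ends with probability 0.3, while a beam of size 2 also keeps “b” and finds the complete hypothesis with probability 0.4.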
`
We also used the LSTM to rescore the 1000-best lists produced by the baseline system [29]. To
rescore an n-best list, we computed the log probability of every hypothesis with our LSTM and took
an even average of the baseline system’s score and the LSTM’s score.
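
A minimal sketch of this rescoring step follows. It assumes the baseline scores are log-scale values where higher is better, and `lstm_log_prob` is a hypothetical callable standing in for the trained model; the hypotheses and scores are invented.

```python
def rescore(nbest, lstm_log_prob):
    """Re-rank an n-best list by an even average of two scores.

    `nbest` is a list of (hypothesis, baseline_score) pairs.
    Returns (combined_score, hypothesis) pairs, best first.
    """
    rescored = [(0.5 * base + 0.5 * lstm_log_prob(hyp), hyp)
                for hyp, base in nbest]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored

# Toy 2-best list and toy LSTM scores (both assumed).
nbest = [("chat le", -1.5), ("le chat", -2.0)]
toy_lstm = {"le chat": -1.0, "chat le": -4.0}
ranked = rescore(nbest, toy_lstm.get)
```

Here the baseline prefers “chat le”, but the averaged score moves “le chat” to the top.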
`
`3.3 Reversing the Source Sentences
`
`While the LSTM is capable of solving problems with long term dependencies, we discovered that
`the LSTM learns much better when the source sentences are reversed (the target sentences are not
`reversed). By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test BLEU
`scores of its decoded translations increased from 25.9 to 30.6.
`
While we do not have a complete explanation for this phenomenon, we believe that it is caused by
the introduction of many short term dependencies to the dataset. Normally, when we concatenate a
source sentence with a target sentence, each word in the source sentence is far from its corresponding
word in the target sentence. As a result, the problem has a large “minimal time lag” [17]. By
reversing the words in the source sentence, the average distance between corresponding words in
the source and target language is unchanged. However, the first few words in the source language
are now very close to the first few words in the target language, so the problem’s minimal time lag is
greatly reduced. Thus, backpropagation has an easier time “establishing communication” between
the source sentence and the target sentence, which in turn results in substantially improved overall
performance.
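
The time-lag argument can be checked on a toy case. The sketch assumes that both sentences have n words and that word i of the source corresponds to word i of the target, which is a simplification of real translation pairs; the function names are hypothetical.

```python
def lags(n, reversed_source):
    """Distances between aligned words when source and target are concatenated.

    Toy assumption: both sentences have n words and word i of the source
    corresponds to word i of the target.
    """
    dists = []
    for i in range(n):
        src_pos = (n - 1 - i) if reversed_source else i
        tgt_pos = n + i
        dists.append(tgt_pos - src_pos)
    return dists

def reverse_source_sentence(src_tokens, tgt_tokens):
    """Reverse the source tokens only; the target is left unchanged."""
    return src_tokens[::-1], tgt_tokens

plain = lags(5, reversed_source=False)
flipped = lags(5, reversed_source=True)
pair = reverse_source_sentence(["a", "b", "c"], ["alpha", "beta", "gamma"])
```

For n = 5 the plain lags are all 5, while the reversed lags are 1, 3, 5, 7, 9: the mean stays at 5 but the smallest lag drops to 1.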
`
`Initially, we believed that reversing the input sentences would only lead to more confident predic-
`tions in the early parts of the target sentence and to less confident predictions in the later parts. How-
`ever, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs
`trained on the raw source sentences (see sec. 3.7), which suggests that reversing the input sentences
`results in LSTMs with better memory utilization.
`
`3.4 Training details
`
`We found that the LSTM models are fairly easy to train. We used deep LSTMs with 4 layers,
`with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary
`of 160,000 and an output vocabulary of 80,000. Thus the deep LSTM uses 8000 real numbers to
`represent a sentence. We found deep LSTMs to significantly outperform shallow LSTMs, where
`each additional layer reduced perplexity by nearly 10%, possibly due to their much larger hidden
`state. We used a naive softmax over 80,000 words at each output. The resulting LSTM has 384M
`parameters of which 64M are pure recurrent connections (32M for the “encoder” LSTM and 32M
`for the “decoder” LSTM). The complete training details are given below:
`
• We initialized all of the LSTM’s parameters with the uniform distribution between -0.08
and 0.08.
• We used stochastic gradient descent without momentum, with a fixed learning rate of 0.7.
After 5 epochs, we began halving the learning rate every half epoch. We trained our models
for a total of 7.5 epochs.
• We used batches of 128 sequences for the gradient and divided it by the size of the batch
(namely, 128).
• Although LSTMs tend to not suffer from the vanishing gradient problem, they can have
exploding gradients. Thus we enforced a hard constraint on the norm of the gradient [10,
25] by scaling it when its norm exceeded a threshold. For each training batch, we compute
$s = \|g\|_2$, where $g$ is the gradient divided by 128. If $s > 5$, we set $g = \frac{5g}{s}$.
• Different sentences have different lengths. Most sentences are short (e.g., length 20-30)
but some sentences are long (e.g., length > 100), so a minibatch of 128 randomly chosen
training sentences will have many short sentences and few long sentences, and as a result,
much of the computation in the minibatch is wasted. To address this problem, we made sure
that all sentences in a minibatch are roughly of the same length, yielding a 2x speedup.
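
Three of the training details above can be sketched as small helper functions. The function names are hypothetical, and two choices are assumptions: the first halving is taken to occur at epoch 5.0, and minibatches of similar length are formed by sorting on length, which is one simple way to do it and is not described in the text.

```python
import math

def clip_gradient(g, threshold=5.0):
    """Rescale g so its L2 norm is at most `threshold` (g = 5g/s when s > 5)."""
    s = math.sqrt(sum(x * x for x in g))
    if s > threshold:
        return [threshold * x / s for x in g]
    return list(g)

def learning_rate(epoch, base=0.7, start=5.0, interval=0.5):
    """Learning rate at a (possibly fractional) epoch.

    Assumption: the first halving takes effect at epoch 5.0, the next at 5.5,
    and so on. The text does not say exactly when the first halving happens.
    """
    if epoch < start:
        return base
    halvings = int((epoch - start) / interval) + 1
    return base * 0.5 ** halvings

def length_bucketed_batches(sentences, batch_size):
    """Group sentences of similar length by sorting on length, then chunking."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

clipped = clip_gradient([6.0, 8.0])          # norm 10, rescaled to norm 5
batches = length_bucketed_batches(
    [["a"] * 5, ["a"], ["a"] * 4, ["a"] * 2], batch_size=2)
```

In practice the batches would also be shuffled after bucketing so that training does not always see short sentences first; that step is omitted here.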
`
`3.5 Parallelization
`
A C++ implementation of deep LSTM with the configuration from the previous section on a single
GPU processes approximately 1,700 words per second. This was too slow for our
purposes, so we parallelized our model using an 8-GPU machine. Each layer of the LSTM was
executed on a different GPU and communicated its activations to the next GPU / layer as soon as
they were computed. Our models have 4 layers of LSTMs, each of which resides on a separate
GPU. The remaining 4 GPUs were used to parallelize the softmax, so each GPU was responsible
for multiplying by a 1000 × 20000 matrix. The resulting implementation achieved a speed of 6,300
(both English and French) words per second with a minibatch size of 128. Training took about ten
days with this implementation.
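
The sizes quoted in sections 3.4 and 3.5 can be cross-checked with simple arithmetic. The breakdown below is an assumption: it ignores bias terms and counts 4 gates per LSTM layer, each with input and recurrent weights. It is one decomposition that reproduces the stated 384M, 64M, 8000, and 1000 × 20000 figures, not a breakdown given in the paper.

```python
hidden = 1000          # cells per layer
embed = 1000           # word embedding dimension
layers = 4
src_vocab = 160_000
tgt_vocab = 80_000

# Assumed: 4 gates per layer, each with input and recurrent weights, no biases.
per_layer = 4 * (embed + hidden) * hidden
per_lstm = layers * per_layer              # encoder or decoder
lstm_total = 2 * per_lstm                  # encoder plus decoder

embeddings = (src_vocab + tgt_vocab) * embed
softmax_weights = tgt_vocab * hidden
total = lstm_total + embeddings + softmax_weights

# Sentence representation: hidden state and cell state for each layer.
sentence_repr = layers * hidden * 2

# Softmax split over the 4 remaining GPUs.
softmax_gpus = 4
shard_cols = tgt_vocab // softmax_gpus

# Throughput gain of the 8-GPU implementation over a single GPU.
speedup = 6300 / 1700

print(per_lstm, lstm_total, total, sentence_repr, shard_cols, round(speedup, 1))
```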
`
`3.6 Experimental Results
`
We used the cased BLEU score [24] to evaluate the quality of our translations. We computed our
BLEU scores using multi-bleu.pl (see footnote 1) on the tokenized predictions and ground truth. This
way of evaluating the BLEU score is consistent with [5] and [2], and reproduces the 33.3 score of [29].
However, if we evaluate the best WMT’14 system [9] (whose predictions can be downloaded from
statmt.org\matrix) in this manner, we get 37.0, which is greater than the 35.8 reported by
statmt.org\matrix.
`
The results are presented in tables 1 and 2. Our best results are obtained with an ensemble of LSTMs
that differ in their random initializations and in the random order of minibatches. While the decoded
translations of the LSTM ensemble do not outperform the best WMT’14 system, it is the first time
that a pure neural translation system outperforms a phrase-based SMT baseline on a large scale MT
task by a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is
within 0.5 BLEU points of the best WMT’14 result if it is used to rescore the 1000-best list of the
baseline system.

1 There are several variants of the BLEU score, and each variant is defined with a perl script.

Method                                          test BLEU score (ntst14)
Bahdanau et al. [2]                             28.45
Baseline System [29]                            33.30
Single forward LSTM, beam size 12               26.17
Single reversed LSTM, beam size 12              30.59
Ensemble of 5 reversed LSTMs, beam size 1       33.00
Ensemble of 2 reversed LSTMs, beam size 12      33.27
Ensemble of 5 reversed LSTMs, beam size 2       34.50
Ensemble of 5 reversed LSTMs, beam size 12      34.81

Table 1: The performance of the LSTM on WMT’14 English to French test set (ntst14). Note that
an ensemble of 5 LSTMs with a beam of size 2 is cheaper than a single LSTM with a beam of
size 12.

Method                                                                    test BLEU score (ntst14)
Baseline System [29]                                                      33.30
Cho et al. [5]                                                            34.54
Best WMT’14 result [9]                                                    37.0
Rescoring the baseline 1000-best with a single forward LSTM               35.61
Rescoring the baseline 1000-best with a single reversed LSTM              35.85
Rescoring the baseline 1000-best with an ensemble of 5 reversed LSTMs     36.5
Oracle Rescoring of the Baseline 1000-best lists                          ∼45

Table 2: Methods that use neural networks together with an SMT system on the WMT’14 English
to French test set (ntst14).
`
`3.7 Performance on long sentences
`
`We were surprised to discover that the LSTM did well on long sentences, which is shown quantita-
`tively in figure 3. Table 3 presents several examples of long sentences and their translations.
`
`3.8 Model Analysis
`
[Figure 2: two scatter-plot panels; axis tick values omitted. One panel plots the phrases
“Mary admires John”, “Mary is in love with John”, “Mary respects John”, “John admires Mary”,
“John is in love with Mary”, and “John respects Mary”. The other panel plots “I was given a card
by her in the garden”, “In the garden , she gave me a card”, “She gave me a card in the garden”,
“She was given a card by me in the garden”, “In the garden , I gave her a card”, and “I gave her
a card in the garden”.]
`
`Figure 2: The figure shows a 2-dimensional PCA projection of the LSTM hidden states that are obtained
`after processing the phrases in the figures. The phrases are clustered by meaning, which in these examples is
`primarily a function of word order, which would be difficult to capture with a bag-of-words model. Notice that
`both clusters have similar internal structure.
`
One of the attractive features of our model is its ability to turn a sequence of words into a vector
of fixed dimensionality. Figure 2 visualizes some of the learned representations. The figure clearly
shows that the representations are sensitive to the order of words, while being fairly insensitive to the
replacement of an active voice with a passive voice. The two-dimensional projections are obtained
using PCA.

Type        Sentence

Our model   Ulrich UNK , membre du conseil d’ administration du constructeur automobile Audi ,
            affirme qu’ il s’ agit d’ une pratique courante depuis des années pour que les téléphones
            portables puissent être collectés avant les réunions du conseil d’ administration afin qu’ ils
            ne soient pas utilisés comme appareils d’ écoute à distance .
Truth       Ulrich Hackenberg , membre du conseil d’ administration du constructeur automobile Audi ,
            déclare que la collecte des téléphones portables avant les réunions du conseil , afin qu’ ils
            ne puissent pas être utilisés comme appareils d’ écoute à distance , est une pratique courante
            depuis des années .

Our model   “ Les téléphones cellulaires , qui sont vraiment une question , non seulement parce qu’ ils
            pourraient potentiellement causer des interférences avec les appareils de navigation , mais
            nous savons , selon la FCC , qu’ ils pourraient interférer avec les tours de téléphone cellulaire
            lorsqu’ ils sont dans l’ air ” , dit UNK .
Truth       “ Les téléphones portables sont véritablement un problème , non seulement parce qu’ ils
            pourraient éventuellement créer des interférences avec les instruments de navigation , mais
            parce que nous savons , d’ après la FCC , qu’ ils pourraient perturber les antennes-relais de
            téléphonie mobile s’ ils sont utilisés à bord ” , a déclaré Rosenker .

Our model   Avec la crémation , il y a un “ sentiment de violence contre le corps d’ un être cher ” ,
            qui sera “ réduit à une pile de cendres ” en très peu de temps au lieu d’ un processus de
            décomposition “ qui accompagnera les étapes du deuil ” .
Truth       Il y a , avec la crémation , “ une violence faite au corps aimé ” ,
            qui va être “ réduit à un tas de cendres ” en très peu de temps , et non après un processus de
            décomposition , qui “ accompagnerait les phases du deuil ” .

Table 3: A few examples of long translations produced by the LSTM alongside the ground truth
translations. The reader can verify that the translations are sensible using Google translate.

[Figure 3: two line plots of BLEU score (y-axis, 20 to 40) comparing LSTM (34.8) with baseline (33.3).
Left x-axis: test sentences sorted by their length (marked 4, 7, 8, 12, 17, 22, 28, 35, 79).
Right x-axis: test sentences sorted by average word frequency rank (0 to 3500).]

Figure 3: The left plot shows the performance of our system as a function of sentence length, where the
x-axis corresponds to the test sentences sorted by their length and is marked by the actual sequence lengths.
There is no degradation on sentences with fewer than 35 words, and there is only a minor degradation on
the longest sentences. The right plot shows the LSTM’s performance on sentences with progressively more
rare words, where the x-axis corresponds to the test sentences sorted by their “average word frequency rank”.
`
`4 Related work
`
`There is a large body of work on applications of neural networks to machine translation. So far,
`the simplest and most effective way of applying an RNN-Language Model (RNNLM) [23] or a
`Feedforward Neural Network Language Model (NNLM) [3] to an MT task is by rescoring the n-
`best lists of a strong MT baseline [22], which reliably improves translation quality.
`
`More recently, researchers have begun to look into ways of including information about the source
`language into the NNLM. Examples of this work include Auli et al. [1], who combine an NNLM
`with a topic model of the input sentence, which improves rescoring performance. Devlin et al. [8]
`followed a similar approach, but they incorporated their NNLM into the decoder of an MT system
`and used the decoder’s alignment information to provide the NNLM with the most useful words in
`the input sentence. Their approach was highly successful and it achieved large improvements over
`their baseline.
`
Our work is closely related to Kalchbrenner and Blunsom [18], who were the first to map the input
sentence into a vector and then back to a sentence, although they map sentences to vectors using
convolutional neural networks, which lose the ordering of the words. Similarly to this work, Cho et
al. [5] used an LSTM-like RNN architecture to map sentences into vectors and back, although their
primary focus was on integrating their neural network into an SMT system. Bahdanau et al. [2] also
attempted direct translations with a neural network that used an attention mechanism to overcome
the poor performance on long sentences experienced by Cho et al. [5] and achieved encouraging
results. Likewise, Pouget-Abadie et al. [26] attempted to address the memory problem of Cho et
al. [5] by translating pieces of the source sentence in a way that produces smooth translations, which
is similar to a phrase-based approach. We suspect that they could achieve similar improvements by
simply training their networks on reversed source sentences.
`
End-to-end training is also the focus of Hermann et al. [12], whose model represents the inputs and
outputs by feedforward networks, and maps them to similar points in space. However, their approach
cannot generate translations directly: to get a translation, they need to look up the closest vector
in the pre-computed database of sentences, or to rescore a sentence.
`
`5 Conclusion
`
In this work, we showed that a large deep LSTM that has a limited vocabulary and makes
almost no assumption about problem structure can outperform a standard SMT-based system whose
vocabulary is unlimited on a large-scale MT task. The success of our simple LSTM-based approach
on MT suggests that it should do well on many other sequence learning problems, provided they
have enough training data.
`
`We were surprised by the extent of the improvement obtained by reversing the words in the source
`sentences. We conclude that it is important to find a problem encoding that has the greatest number
`of short term dependencies, as they make the learning problem much simpler. In particular, while
`we were unable to train a standard RNN on the non-reversed translation problem (shown in fig. 1),
`we believe that a standard RNN should be easily trainable when the source sentences are reversed
`(although we did not verify it experimentally).
`
`We were also surprised by the ability of the LSTM to correctly translate very long sentences. We
`were initially convinced that the LSTM would fail on long sentences due to its limited memory,
`and other researchers reported poor performance on long sentences with a model similar to ours
`[5, 2, 26]. And yet, LSTMs trained on the reversed dataset had little difficulty translating long
`sentences.
`
Most importantly, we demonstrated that a simple, straightforward, and relatively unoptimized
approach can outperform an SMT system, so further work will likely lead to even greater translation
accuracies. These results suggest that our approach will likely do well on other challenging sequence
to sequence problems.
`
`6 Acknowledgments
`
We thank Samy Bengio, Jeff Dean, Matthieu Devin, Geoffrey Hinton, Nal Kalchbrenner, Thang Luong,
Wolfgang Macherey, Rajat Monga, Vincent Vanhoucke, Peng Xu, Wojciech Zaremba, and the Google Brain
team for useful comments and discussions.
`
`References
`[1] M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent
`neural networks. In EMNLP, 2013.
`[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.
`arXiv preprint arXiv:1409.0473, 2014.
`[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. In Journal of
`Machine Learning Research, pages 1137–1155, 2003.
`[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult.
`IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
`[5] K. Cho, B. Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase represen-
`tations using RNN encoder-decoder for statistical machine translation. In Arxiv preprint arXiv:1406.1078,
`2014.
`[6] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification.
`In CVPR, 2012.
`[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large
`vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing - Special
`Issue on Deep Learning for Speech and Language Processing, 2012.
`[8] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robust neural network
`joint models for statistical machine translation. In ACL, 2014.
`[9] Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. Edinburgh’s phrase-based machine
`translation systems for wmt-14. In WMT, 2014.
`[10] A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint arXiv:1308.0850,
`2013.
[11] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling
unsegmented sequence data with recurrent neural networks. In ICML, 2006.
`[12] K. M. Hermann and P. Blunsom. Multilingual distributed representations without word alignment. In
`ICLR, 2014.
`[13] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen,
`T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE
`Signal Processing Magazine, 2012.
[14] S. Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Master’s thesis, Institut für
Informatik, Technische Universität, München, 1991.
`[15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty
`of learning long-term dependencies, 2001.
`[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
`[17] S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. 1997.
`[18] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, 2013.
`[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural
`networks. In NIPS, 2012.
`[20] Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, and A.Y. Ng. Building
`high-level features using large scale unsupervised learning. In ICML, 2012.
`[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
`Proceedings of the IEEE, 1998.
`[22] T. M
