Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

Yao-Hung Hubert Tsai¹  Shaojie Bai¹  Makoto Yamada³,⁴  Louis-Philippe Morency²  Ruslan Salakhutdinov¹
¹Machine Learning Department, ²Language Technologies Institute, Carnegie Mellon University
³Kyoto University  ⁴RIKEN AIP
{yaohungt, shaojieb, morency, rsalakhu}@cs.cmu.edu, myamada@i.kyoto-u.ac.jp
https://github.com/yaohungt/TransformerDissection

Abstract

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams. In this paper, we present a new formulation of attention via the lens of the kernel. To be more precise, we realize that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This new formulation gives us a better way to understand individual components of the Transformer's attention, such as the better way to integrate the positional embedding. Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention. As an example, we propose a new variant of Transformer's attention which models the input as a product of symmetric kernels. This approach achieves competitive performance with the current state-of-the-art model with less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.

1 Introduction

Transformer (Vaswani et al., 2017) is a relatively new architecture which outperforms traditional deep learning models such as Recurrent Neural Networks (RNNs) (Sutskever et al., 2014) and Temporal Convolutional Networks (TCNs) (Bai et al., 2018) for sequence modeling tasks across neural machine translation (Vaswani et al., 2017), language understanding (Devlin et al., 2018), sequence prediction (Dai et al., 2019), image generation (Child et al., 2019), video activity classification (Wang et al., 2018), music generation (Huang et al., 2018a), and multimodal sentiment analysis (Tsai et al., 2019a). Instead of performing recurrence (e.g., RNN) or convolution (e.g., TCN) over the sequences, Transformer is a feed-forward model that concurrently processes the entire sequence. At the core of the Transformer is its attention mechanism, which is proposed to integrate the dependencies between the inputs. There are up to three types of attention within the full Transformer model, as exemplified by the neural machine translation application (Vaswani et al., 2017): 1) Encoder self-attention considers the source sentence as input, generating a sequence of encoded representations, where each encoded token has a global dependency with other tokens in the input sequence. 2) Decoder self-attention considers the target sentence (e.g., the predicted target sequence for translation) as input, generating a sequence of decoded representations¹, where each decoded token depends on previous decoded tokens. 3) Decoder-encoder attention considers both encoded and decoded sequences, generating a sequence with the same length as the decoded sequence. It should be noted that some applications have only the decoder self-attention, such as sequence prediction (Dai et al., 2019). In all cases, the Transformer's attentions follow the same general mechanism.

At a high level, the attention can be seen as a weighted combination of the input sequence, where the weights are determined by the similarities between elements of the input sequence. We note that this operation is order-agnostic to the permutation in the input sequence (order is encoded with extra positional embedding (Vaswani et al., 2017; Shaw et al., 2018; Dai et al., 2019)).

¹The generated sequence can be regarded as a translated sequence (i.e., translating from the encoded sequence), where each generated token depends on all tokens in the encoded sequence.

The above observation inspires us to connect Transformer's attention to kernel learning (Scholkopf and Smola, 2001): they both concurrently and order-agnostically process all inputs by calculating the similarity between the inputs. Therefore, in this paper, we present a new formulation for Transformer's attention via the lens of kernel. To be more precise, the new formulation can be interpreted as a kernel smoother (Wasserman, 2006) over the inputs in a sequence, where the kernel measures how similar two different inputs are. The main advantage of connecting attention to kernel is that it opens up a new family of attention mechanisms that can relate to the well-established literature in kernel learning (Scholkopf and Smola, 2001). As a result, we develop a new variant of attention which simply considers a product of symmetric kernels when modeling non-positional and positional embedding.

Furthermore, our proposed formulation naturally highlights the main components of Transformer's attention, enabling a better understanding of this mechanism: recent variants of Transformers (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Lee et al., 2018; Wang et al., 2018; Tsai et al., 2019a) can be expressed through these individual components. Among all the components, we argue that the most important one is the construction of the kernel function. We empirically study multiple kernel forms and the ways to integrate positional embedding in neural machine translation (NMT) using the IWSLT'14 German-English (De-En) dataset (Edunov et al., 2017) and sequence prediction (SP) using the WikiText-103 dataset (Merity et al., 2016).

2 Attention

This section aims at providing an understanding of attention in Transformer via the lens of kernel. The inspiration for connecting the kernel (Scholkopf and Smola, 2001) and attention stems from the observation that both operations concurrently process all inputs and calculate the similarity between the inputs. We first introduce the background (i.e., the original formulation) of attention and then provide a new reformulation within the class of kernel smoothers (Wasserman, 2006). Next, we show that this new formulation allows us to explore a new family of attention while at the same time offering a framework to categorize previous attention variants (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Lee et al., 2018; Wang et al., 2018; Tsai et al., 2019a). Last, we present a new form of attention, which requires fewer parameters and empirically reaches competitive performance with the state-of-the-art models.

For notation, we use lowercase to represent a vector (e.g., x), bold lowercase to represent a matrix (e.g., x), a calligraphic letter to denote a space (e.g., X), and S to denote a set. To relate the notation to sequence-to-sequence learning (Vaswani et al., 2017), x represents a specific element of a sequence, x = [x_1, x_2, ⋯, x_T] denotes a sequence of features, S_x = {x_1, x_2, ⋯, x_T} represents the set with its elements being the features in sequence x, and we refer to the space of the set S_x as S.

2.1 Technical Background

Unlike recurrent computation (Sutskever et al., 2014) (i.e., RNNs) and temporal convolutional computation (Bai et al., 2018) (i.e., TCNs), Transformer's attention is an order-agnostic operation given the order in the inputs (Vaswani et al., 2017). Hence, in the presentation of this paper, we consider the inputs as a set instead of a sequence. When viewing a sequence as a set, we lose the temporal (positional) information in the inputs, which is often crucial for sequence modeling (Sutskever et al., 2014). As a result, Transformer (Vaswani et al., 2017) introduced positional embedding to indicate the positional relation for the inputs. Formally, a sequence x = [x_1, x_2, ⋯, x_T] defines each element as x_i = (f_i, t_i), with f_i ∈ F being the non-temporal feature at time i and t_i ∈ T being a temporal feature (which we also call the positional embedding). Note that f_i can be the word representation (in neural machine translation (Vaswani et al., 2017)), a frame in a video (in video activity recognition (Wang et al., 2018)), or a music unit (in music generation (Huang et al., 2018b)). t_i can be a mixture of sine and cosine functions (Vaswani et al., 2017) or parameters learned during back-propagation (Dai et al., 2019; Ott et al., 2019). The feature vectors are defined over a joint space X := (F × T). The resulting permutation-invariant set is: S_x = {x_1, x_2, ⋯, x_T} = {(f_1, t_1), (f_2, t_2), ⋯, (f_T, t_T)}.

Following the definition by Vaswani et al. (2017), we use queries (q) / keys (k) / values (v) to represent the inputs to the attention. To be more precise, x_{q/k/v} is used to denote a query/key/value datum in the query/key/value sequence x_{q/k/v} (x_{q/k/v} ∈ S_{x_{q/k/v}}), with S_{x_{q/k/v}} being its set representation. We note that the input sequences are the same (x_q = x_k) for self-attention and are different (x_q from the decoder and x_k from the encoder) for encoder-decoder attention.

Given the introduced notation, the attention mechanism in the original Transformer (Vaswani et al., 2017) can be presented as:

    Attention(x_q ; S_{x_k}) = softmax( x_q W_q (x_k W_k)^⊤ / √d_k ) x_k W_v        (1)

with x_q = f_q + t_q, x_k = f_k + t_k, W_{q/k/v} being the weights, and d_k being the feature dimension of x_k W_k. Decoder self-attention further introduces a mask to block the visibility of elements of S_{x_k} to x_q. In particular, decoder self-attention considers the decoded sequence as inputs (x_k = x_q), where the decoded token at time t is not allowed to access the future decoded tokens (i.e., tokens decoded at time greater than t). On the contrary, encoder self-attention and decoder-encoder attention consider no additional mask to Eq. (1).
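For reference, here is a minimal NumPy sketch of Eq. (1) for a single attention head, including the causal mask used by decoder self-attention; the weight shapes and random inputs are placeholders of ours, not the paper's trained configuration.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(xq, xk, Wq, Wk, Wv, causal=False):
    """Eq. (1): softmax(x_q W_q (x_k W_k)^T / sqrt(d_k)) x_k W_v, single head."""
    q, k, v = xq @ Wq, xk @ Wk, xk @ Wv
    dk = k.shape[-1]
    scores = q @ k.T / np.sqrt(dk)                          # query-key similarities
    if causal:
        # Decoder self-attention mask: position i may not attend to positions after i.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

# Toy usage (x plays the role of x_q = x_k = f + t for decoder self-attention).
rng = np.random.default_rng(0)
d_model, dk, T = 16, 8, 6
Wq, Wk, Wv = (rng.standard_normal((d_model, dk)) for _ in range(3))
x = rng.standard_normal((T, d_model))
out = attention(x, x, Wq, Wk, Wv, causal=True)              # shape (T, dk)
```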
Recent work (Shaw et al., 2018; Dai et al., 2019; Huang et al., 2018b; Child et al., 2019; Lee et al., 2018; Parmar et al., 2018; Tsai et al., 2019a) proposed modifications to the Transformer for the purpose of better modeling the inputs' positional relation (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019), appending additional keys in S_{x_k} (Dai et al., 2019), modifying the mask applied to Eq. (1) (Child et al., 2019), or applying the attention to distinct feature types (Lee et al., 2018; Parmar et al., 2018; Tsai et al., 2019a). These works adopt different designs of attention compared to the original form (Eq. (1)). In our paper, we aim at providing a unified view via the lens of kernel.

2.2 Reformulation via the Lens of Kernel

We now provide the intuition to reformulate Eq. (1) via the lens of kernel. First, the softmax function can be realized as a probability function for x_q observing the keys {x_k}s in S_{x_k} (S_{x_k} is the set representation of sequence x_k). The probability is determined by the dot product between x_q and x_k, with additional mappings W_q/W_k and scaling by √d_k; we note that the dot-product operation is an instance of a kernel function. We also introduce a set filtering function M(x_q, S_{x_k}) : X × S → S which returns a set with its elements that operate with (or are connected/visible to) x_q. The filtering function M(⋅,⋅) plays the role of the mask in decoder self-attention (Vaswani et al., 2017). Putting these altogether, we re-represent Eq. (1) in the following definition.

Definition 1. Given a non-negative kernel function k(⋅,⋅) : X × X → R+, a set filtering function M(⋅,⋅) : X × S → S, and a value function v(⋅) : X → Y, the Attention function taking the input of a query feature x_q ∈ X is defined as

    Attention(x_q ; M(x_q, S_{x_k})) = ∑_{x_k ∈ M(x_q, S_{x_k})} [ k(x_q, x_k) / ∑_{x_k′ ∈ M(x_q, S_{x_k})} k(x_q, x_k′) ] v(x_k).        (2)

Definition 1 is a class of linear smoothers (Wasserman, 2006) with kernel smoothing:

    ∑_{x_k ∈ M(x_q, S_{x_k})} [ k(x_q, x_k) / ∑_{x_k′ ∈ M(x_q, S_{x_k})} k(x_q, x_k′) ] v(x_k) = E_{p(x_k|x_q)} [ v(x_k) ],

where v(x_k) outputs the "values" and p(x_k|x_q) = k(x_q, x_k) / ∑_{x_k′ ∈ M(x_q, S_{x_k})} k(x_q, x_k′) is a probability function depending on k and M(x_q, S_{x_k}) when k(⋅,⋅) is always positive. In the prior work (Vaswani et al., 2017), k(x_q, x_k) = exp( ⟨x_q W_q, x_k W_k⟩ / √d_k ) and v(x_k) = x_k W_v.

Note that the kernel form k(x_q, x_k) in the original Transformer (Vaswani et al., 2017) is an asymmetric exponential kernel with additional mappings W_q and W_k (Wilson et al., 2016; Li et al., 2017)². The new formulation defines a larger space for composing attention by manipulating its individual components, and at the same time it is able to categorize different variants of attention in prior work (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Lee et al., 2018; Wang et al., 2018; Tsai et al., 2019a). In the following, we study these components by dissecting Eq. (2) into: 1) the kernel feature space X, 2) the kernel construction k(⋅,⋅), 3) the value function v(⋅), and 4) the set filtering function M(⋅,⋅).

²We note that the rigorous definition of a kernel function (Scholkopf and Smola, 2001) requires the kernel to be semi-positive definite and symmetric. In this paper, the discussion on kernels allows them to be non-semi-positive definite and asymmetric. In Section 3, we will examine kernels which are semi-positive definite and symmetric.

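To make the dissection concrete, the sketch below (our illustration, with hypothetical helper names) implements Eq. (2) directly from its components: a kernel k(⋅,⋅), a set filtering function M(⋅,⋅), and a value function v(⋅). Plugging in the exponential kernel, a causal filter, and v(x_k) = x_k W_v recovers the decoder self-attention of Eq. (1).

```python
import numpy as np

def kernel_smoother_attention(queries, keys, kernel, set_filter, value_fn):
    """Eq. (2): a kernel smoother over the keys returned by the set filtering function."""
    outputs = []
    for i, q in enumerate(queries):
        visible = set_filter(i, keys)                          # M(x_q, S_xk): keys visible to this query
        scores = np.array([kernel(q, k) for k in visible])     # k(x_q, x_k), assumed non-negative
        weights = scores / scores.sum()                        # normalize into p(x_k | x_q)
        outputs.append(sum(w * value_fn(k) for w, k in zip(weights, visible)))
    return np.stack(outputs)

# Components that recover the original decoder self-attention (illustrative random weights).
d_model, dk, T = 16, 8, 6
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d_model, dk)) for _ in range(3))

exp_kernel = lambda q, k: np.exp((q @ Wq) @ (k @ Wk) / np.sqrt(dk))    # asymmetric exponential kernel
causal_filter = lambda i, keys: keys[: i + 1]                          # decoder mask as set filtering
value_fn = lambda k: k @ Wv                                            # v(x_k) = x_k W_v

x = rng.standard_normal((T, d_model))                                  # x_i = f_i + t_i
out = kernel_smoother_attention(x, x, exp_kernel, causal_filter, value_fn)   # shape (T, dk)
```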
2.2.1 Kernel Feature Space X

In Eq. (2), to construct a kernel on X, the first thing is to identify the kernel feature space X. In addition to modeling sequences like word sentences (Vaswani et al., 2017) or music signals (Huang et al., 2018b), the Transformer can also be applied to images (Parmar et al., 2018), sets (Lee et al., 2018), and multimodal sequences (Tsai et al., 2019a). Due to the distinct data types, these applications admit various kernel feature spaces:

(i) Sequence Transformer (Vaswani et al., 2017; Dai et al., 2019):
    X := (F × T)
with F being the non-positional feature space and T being the positional embedding space of the position in the sequence.

(ii) Image Transformer (Parmar et al., 2018):
    X := (F × H × W)
with F being the non-positional feature space, H being the positional space of the height in an image, and W being the positional space of the width in an image.

(iii) Set Transformer (Lee et al., 2018) and Non-Local Neural Networks (Wang et al., 2018):
    X := (F)
with no positional information present.

(iv) Multimodal Transformer (Tsai et al., 2019a):
    X := (F^ℓ × F^v × F^a × T)
with F^ℓ representing the language feature space, F^v representing the vision feature space, F^a representing the audio feature space, and T representing the temporal indicator space.

For the rest of the paper, we will focus on the setting for the sequence Transformer, X = (F × T), and discuss the kernel construction on it.

2.2.2 Kernel Construction and the Role of Positional Embedding k(⋅,⋅)

The kernel construction on X = (F × T) has distinct designs in variants of Transformers (Vaswani et al., 2017; Dai et al., 2019; Huang et al., 2018b; Shaw et al., 2018; Child et al., 2019). Since the kernel feature space now considers a joint space, we will first discuss the kernel construction on F (the non-positional feature space) and then discuss how different variants integrate the positional embedding (with the positional feature space T) into the kernel.

Kernel construction on F. All the works considered the scaled asymmetric exponential kernel with the mappings W_q and W_k (Wilson et al., 2016; Li et al., 2017) for non-positional features f_q and f_k:

    k_exp(f_q, f_k) = exp( ⟨f_q W_q, f_k W_k⟩ / √d_k ).        (3)

Note that the usage of an asymmetric kernel is also common in various machine learning tasks (Yilmaz, 2007; Tsuda, 1999; Kulis et al., 2011), where it has been observed that the kernel form can be flexible and even non-valid (i.e., a kernel that is not symmetric and positive semi-definite). In Section 3, we show that a symmetric design of the kernel has similar performance for various sequence learning tasks, and we also examine different kernel choices (i.e., linear, polynomial, and rbf kernels).
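For reference, the following sketch (ours, not from the paper's code release) writes out the kernel forms on non-positional features that are compared later in Table 2: linear, polynomial, the exponential kernel of Eq. (3), and rbf; tying W_q = W_k, as in Section 3.2, makes each of them symmetric.

```python
import numpy as np

d_model, dk = 16, 8
rng = np.random.default_rng(0)
Wq = rng.standard_normal((d_model, dk))
Wk = Wq.copy()   # symmetric variant ties W_q = W_k; use an independent matrix for the asymmetric kernel

def linear_kernel(fq, fk):
    return (fq @ Wq) @ (fk @ Wk)                              # <f_q W_q, f_k W_k> (may be negative)

def polynomial_kernel(fq, fk):
    return ((fq @ Wq) @ (fk @ Wk)) ** 2                       # <f_q W_q, f_k W_k>^2

def exponential_kernel(fq, fk):
    return np.exp((fq @ Wq) @ (fk @ Wk) / np.sqrt(dk))        # Eq. (3)

def rbf_kernel(fq, fk):
    diff = fq @ Wq - fk @ Wk
    return np.exp(-(diff @ diff) / np.sqrt(dk))               # exp(-||f_q W_q - f_k W_k||^2 / sqrt(d_k))
```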
Kernel construction on X = (F × T). The designs for integrating the positional embeddings t_q and t_k are listed in the following.

(i) Absolute Positional Embedding (Vaswani et al., 2017; Dai et al., 2019; Ott et al., 2019): For the original Transformer (Vaswani et al., 2017), each t_i is represented by a vector with each dimension being a sine or cosine function. For learned positional embedding (Dai et al., 2019; Ott et al., 2019), each t_i is a learned parameter and is fixed for the same position across different sequences. These works define the feature space as the direct sum of its temporal and non-temporal spaces: X = F ⊕ T. Via the lens of kernel, the kernel similarity is defined as

    k(x_q, x_k) := k_exp(f_q + t_q, f_k + t_k).        (4)

(ii) Relative Positional Embedding in Transformer-XL (Dai et al., 2019): t_⋅ represents the indicator of the position in the sequence, and the kernel is chosen to be asymmetric, mixing sine and cosine functions:

    k(x_q, x_k) := k_exp(f_q, f_k) ⋅ k_{f_q}(t_q, t_k)        (5)

with k_{f_q}(t_q, t_k) being an asymmetric kernel with coefficients inferred by f_q:

    log k_{f_q}(t_q, t_k) = ∑_{p=0}^{⌊d_k/2⌋−1} [ c_{2p} sin( (t_q − t_k) / 10000^{2p/512} ) + c_{2p+1} cos( (t_q − t_k) / 10000^{2p/512} ) ],

where [c_0, ⋯, c_{d_k−1}] = f_q W_q W_R and W_R is a learned weight matrix. We refer readers to Dai et al. (2019) for more details.

(iii) Relative Positional Embedding of Shaw et al. (2018) and Music Transformer (Huang et al., 2018b): t_⋅ represents the indicator of the position in the sequence, and the kernel is modified to be indexed by a look-up table:

    k(x_q, x_k) := L_{t_q − t_k, f_q} ⋅ k_exp(f_q, f_k),        (6)

where L_{t_q − t_k, f_q} = exp( f_q W_q a_{t_q − t_k} ) with a_⋅ being a learnable matrix whose width is the length of the sequence. We refer readers to Shaw et al. (2018) for more details.

Dai et al. (2019) showed that integrating positional embedding via Eq. (5) is better than via Eq. (6), which in turn is better than via Eq. (4). We argue the reason is that, if viewing f_i and t_i as two distinct spaces (X := (F × T)), the direct sum x_i = f_i + t_i may not be optimal when considering the kernel score between x_q and x_k. In contrast, Eq. (5) represents the kernel as a product of two kernels (one for f_i and another for t_i), which is able to capture the similarities for both temporal and non-temporal components.
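The following sketch (ours, under the same toy setup as the earlier examples) contrasts the three integration styles as unnormalized kernel scores: the direct sum of Eq. (4), a generic product of a feature kernel and a positional kernel standing in for the specific k_{f_q} of Eq. (5), and the relative-position look-up of Eq. (6); the look-up table here is a simplified placeholder rather than the exact parameterization of Shaw et al. (2018).

```python
import numpy as np

d_model, dk, max_len = 16, 8, 32
rng = np.random.default_rng(0)
Wq, Wk = rng.standard_normal((2, d_model, dk))
a = rng.standard_normal((2 * max_len - 1, dk)) * 0.1    # look-up table indexed by relative offset

def k_exp(u, v):
    return np.exp((u @ Wq) @ (v @ Wk) / np.sqrt(dk))

def score_direct_sum(fq, tq, fk, tk):
    # Eq. (4): the positional embedding is added to the features before the kernel is applied.
    return k_exp(fq + tq, fk + tk)

def score_product(fq, tq, fk, tk):
    # Product of a feature kernel and a positional kernel (the structure behind Eq. (5)).
    return k_exp(fq, fk) * k_exp(tq, tk)

def score_lookup(fq, pos_q, fk, pos_k):
    # Eq. (6): the relative offset t_q - t_k indexes a learnable table a_{t_q - t_k}.
    L = np.exp((fq @ Wq) @ a[pos_q - pos_k + max_len - 1])
    return L * k_exp(fq, fk)
```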
2.2.3 Value Function v(⋅)

The current Transformers consider two different value function constructions:

(i) Original Transformer (Vaswani et al., 2017) and Sparse Transformer (Child et al., 2019):

    v(x_k) = v((f_k, t_k)) := (f_k + t_k) W_v.        (7)

(ii) Transformer-XL (Dai et al., 2019), Music Transformer (Huang et al., 2018b), and Self-Attention with Relative Positional Embedding (Shaw et al., 2018):

    v(x_k) = v((f_k, t_k)) := f_k W_v.        (8)

Comparing Eq. (7) to Eq. (8), Eq. (7) takes the positional embedding into account when constructing the value function. In Section 3, we empirically observe that constructing the value function with Eq. (8) consistently outperforms the construction with Eq. (7), which suggests that we do not need positional embedding in the value function.
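As a minimal sketch (ours), the two constructions differ only in whether the positional embedding enters the value projection:

```python
import numpy as np

d_model, dv = 16, 8
Wv = np.random.default_rng(0).standard_normal((d_model, dv))

def value_with_pe(fk, tk):
    return (fk + tk) @ Wv     # Eq. (7): the positional embedding enters the value

def value_without_pe(fk, tk):
    return fk @ Wv            # Eq. (8): only the non-positional feature enters the value
```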
2.2.4 Set Filtering Function M(⋅,⋅)

In Eq. (2), the set returned by the set filtering function M(x_q, S_{x_k}) defines how many keys and which keys are operating with x_q. In the following, we itemize the corresponding designs for the variants of Transformers:

(i) Encoder Self-Attention in the original Transformer (Vaswani et al., 2017): For each query x_q in the encoded sequence, M(x_q, S_{x_k}) = S_{x_k} contains the keys being all the tokens in the encoded sequence. Note that encoder self-attention considers x_q = x_k, with x_q being the encoded sequence.

(ii) Encoder-Decoder Attention in the original Transformer (Vaswani et al., 2017): For each query x_q in the decoded sequence, M(x_q, S_{x_k}) = S_{x_k} contains the keys being all the tokens in the encoded sequence. Note that encoder-decoder attention considers x_q ≠ x_k, with x_q being the decoded sequence and x_k being the encoded sequence.

(iii) Decoder Self-Attention in the original Transformer (Vaswani et al., 2017): For each query x_q in the decoded sequence, M(x_q, S_{x_k}) returns a subset of S_{x_k} (M(x_q, S_{x_k}) ⊂ S_{x_k}). Note that decoder self-attention considers x_q = x_k, with x_q being the decoded sequence. Since the decoded sequence is the output of the previous timesteps, the query at position i can only observe the keys being the tokens decoded at positions < i. For convenience, let us define S1 as the set returned from M(x_q, S_{x_k}) by the original Transformer (Vaswani et al., 2017), which we will use later.

(iv) Decoder Self-Attention in Transformer-XL (Dai et al., 2019): For each query x_q in the decoded sequence, M(x_q, S_{x_k}) returns a set containing S1 and additional memories (M(x_q, S_{x_k}) = S1 ∪ S_mem, so M(x_q, S_{x_k}) ⊃ S1), where S_mem refers to additional memories.

(v) Decoder Self-Attention in Sparse Transformer (Child et al., 2019): For each query x_q in the decoded sequence, M(x_q, S_{x_k}) returns a subset of S1 (M(x_q, S_{x_k}) ⊂ S1).

To compare these designs, we note that the computation time grows with the number of elements in M(x_q, S_{x_k}). For performance-wise comparisons, Transformer-XL (Dai et al., 2019) showed that the additional memories in M(x_q, S_{x_k}) are able to capture longer-term dependency than the original Transformer (Vaswani et al., 2017) and hence result in better performance. Sparse Transformer (Child et al., 2019) showed that, although it has much fewer elements in M(x_q, S_{x_k}), if the elements are carefully chosen, the attention can still reach the same performance as Transformer-XL (Dai et al., 2019).
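The set filtering choices above can be summarized as index sets; the sketch below is our illustration, and the strided pattern is a simple stand-in for Sparse Transformer's structured patterns rather than its exact design.

```python
def full_set(i, n):
    # (i)/(ii): every key in the (encoded) sequence is visible to the query.
    return set(range(n))

def causal_set(i, n):
    # (iii): S1 -- only keys at positions earlier than the query position i are visible.
    return set(range(i))

def causal_with_memory(i, n, n_mem):
    # (iv): Transformer-XL adds cached memory positions (indexed here by negative offsets).
    return {-m for m in range(1, n_mem + 1)} | causal_set(i, n)

def strided_sparse_set(i, n, stride=4):
    # (v): a carefully chosen subset of S1; here, local context plus every stride-th earlier position.
    local = set(range(max(0, i - stride), i))
    strided = {j for j in range(i) if (i - j) % stride == 0}
    return local | strided
```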
2.3 Exploring the Design of Attention

So far, we have seen how Eq. (2) connects to the variants of Transformers. By changing the kernel construction in Section 2.2.2, we can define a larger space for composing attention. In this paper, we present a new form of attention with a kernel that is 1) valid (i.e., a kernel that is symmetric and positive semi-definite) and 2) delicate in the sense of constructing a kernel on a joint space (i.e., X = (F × T)):

    k(x_q, x_k) := k_F(f_q, f_k) ⋅ k_T(t_q, t_k)
    with k_F(f_q, f_k) = exp( ⟨f_q W_F, f_k W_F⟩ / √d_k )
    and k_T(t_q, t_k) = exp( ⟨t_q W_T, t_k W_T⟩ / √d_k ),        (9)

where W_F and W_T are weight matrices. The new form considers a product of kernels, with the first kernel measuring similarity between non-temporal features and the second kernel measuring similarity between temporal features. Both kernels are symmetric exponential kernels. Note that t_i here is chosen as the mixture of sine and cosine functions as in the prior work (Vaswani et al., 2017; Ott et al., 2019). In our experiments, we find it reaching competitive performance compared to the current state-of-the-art design (Eq. (5) by Dai et al. (2019)). We fix the size of the weight matrices W_⋅ in Eq. (9) and Eq. (5), which means we save 33% of the attention parameters by moving from Eq. (5) to Eq. (9) (Eq. (5) has weights W_Q/W_K/W_R while Eq. (9) has weights W_F/W_T).
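A minimal sketch of the proposed kernel in Eq. (9) (ours; the weight shapes follow the earlier toy setup rather than the paper's training configuration). Because W_F and W_T are shared between query and key, each factor is a symmetric, positive semi-definite kernel, and only two projection matrices are needed instead of the three (W_Q, W_K, W_R) used by Eq. (5).

```python
import numpy as np

d_model, dk = 16, 8
rng = np.random.default_rng(0)
W_F = rng.standard_normal((d_model, dk))    # shared by query and key: symmetric kernel on F
W_T = rng.standard_normal((d_model, dk))    # shared by query and key: symmetric kernel on T

def k_eq9(fq, tq, fk, tk):
    """Eq. (9): product of two symmetric exponential kernels over the joint space (F x T)."""
    k_F = np.exp((fq @ W_F) @ (fk @ W_F) / np.sqrt(dk))
    k_T = np.exp((tq @ W_T) @ (tk @ W_T) / np.sqrt(dk))
    return k_F * k_T
```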
3 Experiments

By viewing the attention mechanism with Eq. (2), we aim at answering the following questions regarding the Transformer's design:

Q1. What is the suggested way to incorporate positional embedding in the kernel function?

Q2. What forms of kernel are recommended for the attention mechanism? Can we replace the asymmetric kernel with a symmetric version?

Q3. Is there any exception such that the attention mechanism is not order-agnostic with respect to inputs? If so, can we downplay the role of positional embedding?

Q4. Is positional embedding required in the value function?

We conduct experiments on neural machine translation (NMT) and sequence prediction (SP) tasks, since these two tasks are commonly chosen for studying Transformers (Vaswani et al., 2017; Dai et al., 2019). Note that NMT has three different types of attention (i.e., encoder self-attention, decoder-encoder attention, and decoder self-attention), while SP has only one type of attention (i.e., decoder self-attention). For the choice of datasets, we pick the IWSLT'14 German-English (De-En) dataset (Edunov et al., 2017) for NMT and the WikiText-103 dataset (Merity et al., 2016) for SP, as suggested by Edunov et al. (2017) and Dai et al. (2019). For fairness of comparison, we train five random initializations and report the test accuracy of the run with the highest validation score. We fix the position-wise operations in Transformer³ and only change the attention mechanism. Similar to prior work (Vaswani et al., 2017; Dai et al., 2019), we report BLEU score for NMT and perplexity for SP.

³The computation of Transformer can be categorized into position-wise and inter-position (i.e., the attention mechanism) operations. Position-wise operations include layer normalization, residual connections, and feed-forward mappings. We refer the readers to Vaswani et al. (2017) for more details.

3.1 Incorporating Positional Embedding

In order to find the best way to integrate positional embedding (PE), we study different PE incorporations in the kernel function k(⋅,⋅) in Eq. (2). Referring to Sections 2.2.2 and 2.3, we consider four cases: 1) PE as a direct sum in the feature space (see Eq. (4)), 2) PE as a look-up table (see Eq. (6)), 3) PE in a product kernel with an asymmetric kernel (see Eq. (5)), and 4) PE in a product kernel with a symmetric kernel (see Eq. (9)). We present the results in Table 1.

First, we see that having PE as a look-up table outperforms having PE as a direct sum in the feature space, especially for the SP task. Note that the look-up table is indexed by the relative position (i.e., t_q − t_k) instead of the absolute position. Second, we see that PE in the product kernel proposed by Dai et al. (2019) may not consistently outperform the other integration types (it has a lower BLEU score for NMT). Our proposed product kernel reaches the best result in NMT and is competitive with the best result in SP.

Table 1: Incorporating Positional Embedding (PE). NMT stands for neural machine translation on the IWSLT'14 De-En dataset (Edunov et al., 2017) and SP stands for sequence prediction on the WikiText-103 dataset (Merity et al., 2016). ↑ means higher is better and ↓ means lower is better.

Approach                        | PE Incorporation | Kernel Form                         | NMT (BLEU↑) | SP (Perplexity↓)
Vaswani et al. (2017) (Eq. (4)) | Direct-Sum       | k_exp(f_q + t_q, f_k + t_k)         | 33.98       | 30.97
Shaw et al. (2018) (Eq. (6))    | Look-up Table    | L_{t_q−t_k, f_q} ⋅ k_exp(f_q, f_k)  | 34.12       | 27.56
Dai et al. (2019) (Eq. (5))     | Product Kernel   | k_exp(f_q, f_k) ⋅ k_{f_q}(t_q, t_k) | 33.62       | 24.10
Ours (Eq. (9))                  | Product Kernel   | k_F(f_q, f_k) ⋅ k_T(t_q, t_k)       | 34.71       | 24.28

Table 2: Kernel Types. Other than manipulating the kernel choice for the non-positional features, we fix the configuration of Vaswani et al. (2017) for NMT and the configuration of Dai et al. (2019) for SP.

Type        | Kernel Form                          | NMT (BLEU↑) Asym. (W_q ≠ W_k) | NMT (BLEU↑) Sym. (W_q = W_k) | SP (Perplexity↓) Asym. (W_q ≠ W_k) | SP (Perplexity↓) Sym. (W_q = W_k)
Linear      | ⟨f_a W_q, f_b W_k⟩                   | not converge                  | not converge                 | not converge                       | not converge
Polynomial  | ⟨f_a W_q, f_b W_k⟩²                  | 32.72                         | 32.43                        | 25.91                              | 26.25
Exponential | exp( ⟨f_a W_q, f_b W_k⟩ / √d_k )     | 33.98                         | 33.78                        | 24.10                              | 24.01
RBF         | exp( −‖f_a W_q − f_b W_k‖² / √d_k )  | 34.26                         | 34.14                        | 24.13                              | 24.21

3.2 Kernel Types

To find the best kernel form for the attention mechanism, in addition to the exponential kernel (see Eq. (3)), we compare different kernel forms (i.e., linear, polynomial, and rbf kernels) for the non-positional features. We also provide results for changing the asymmetric kernel to a symmetric one by forcing W_q = W_k, so that the resulting kernel is a valid kernel (Scholkopf and Smola, 2001). The numbers are shown in Table 2. Note that, for fairness, other than manipulating the kernel choice for the non-positional features, we fix the configuration of Vaswani et al. (2017) for NMT and the configuration of Dai et al. (2019) for SP.

We first observe that the linear kernel does not converge for either NMT or SP. We argue the reason is that the linear kernel may take negative values and thus violates the assumption in kernel smoothing that the kernel score must be positive (Wasserman, 2006). Next, we observe that the kernels with an infinite-dimensional feature space (i.e., the exponential and rbf kernels) outperform the kernel with a finite feature space (i.e., the polynomial kernel). We also see that the rbf kernel performs best for NMT and the exponential kernel performs best for SP. We conclude that the choice of kernel matters for the design of attention in Transformer. Furthermore, we see little performance difference when comparing the asymmetric to the symmetric kernel. In the experiments, we fix the size of W_⋅ in the kernel, and thus adopting the symmetric kernel benefits us by saving parameters.

3.3 Order-Invariance in Attention

The need for positional embedding (PE) in the attention mechanism is based on the argument that the attention mechanism is an order-agnostic (or, permutation equivariant) operation (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019). However, we show that, for decoder self-attention, the operation is not order-agnostic. For clarification, we are not attacking the claim made by the prior work (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019), but we aim at providing a new look at the order-invariance problem when considering the attention mechanism with masks (masks refer to the set filtering function in our kernel formulation). In other words, previous work did not consider the mask between queries and keys when discussing the order-invariance problem (Pérez et al., 2019).

Table 3: Order-Invariance in Attention. To save space, we denote Encoder Self-Attention / Encoder-Decoder Attention / Decoder Self-Attention as A/B/C. Note that SP only has decoder self-attention.

NMT:
Approach                | Positional Embedding | NMT (BLEU↑)
Ours (Eq. (9))          | In A/B/C             | 34.71
Ours (Eq. (9))          | In A/B               | 34.49
No Positional Embedding | none                 | 14.47

SP:
Approach                        | Positional Embedding | SP (Perplexity↓)
Vaswani et al. (2017) (Eq. (4)) | In C                 | 30.97
Ours (Eq. (9))                  | In C                 | 24.28
No Positional Embedding         | none                 | 30.92

Table 4: Positional Embedding in Value Function. I: value function considering positional embedding, v(x_k) := (f_k + t_k) W_V (Eq. (7)); II: value function considering no positional embedding, v(x_k) := f_k W_V (Eq. (8)).

Approach                        | NMT (BLEU↑) I | NMT (BLEU↑) II | SP (Perplexity↓) I | SP (Perplexity↓) II
Vaswani et al. (2017) (Eq. (4)) | 33.98         | 34.02          | 30.97              | 30.50
Shaw et al. (2018) (Eq. (6))    | 34.04         | 34.12          | 27.56              | 27.45
Dai et al. (2019) (Eq. (5))     | 33.32         | 33.62          | 24.18              | 24.10
Ours (Eq. (9))                  | 34.60         | 34.71          | 24.42              | 24.28

To put it formally, we first present the definition by Lee et al. (2018) for a permutation equivariant function:

Definition 2. Denote Π as the set of all permutations over [n] = {1, ⋯, n}. A function func : X^n → Y^n is permutation equivariant iff for any permutation π ∈ Π, func(πx) = πfunc(x).
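As a quick numerical illustration of Definition 2 (ours, reusing the toy single-head attention from the earlier sketches and ignoring positional embedding), unmasked self-attention commutes with a permutation of its inputs, while the masked (decoder) variant does not:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dk, T = 16, 8, 6
Wq, Wk, Wv = (rng.standard_normal((d_model, dk)) for _ in range(3))
x = rng.standard_normal((T, d_model))        # non-positional features only (no positional embedding)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, causal):
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(dk)
    if causal:
        scores = np.where(np.triu(np.ones((len(x), len(x)), dtype=bool), k=1), -np.inf, scores)
    return softmax(scores) @ (x @ Wv)

perm = np.roll(np.arange(T), 1)              # a fixed non-identity permutation
# Encoder self-attention: permuting the inputs permutes the outputs (permutation equivariant).
print(np.allclose(self_attention(x[perm], causal=False), self_attention(x, causal=False)[perm]))   # True
# Decoder self-attention: the mask is tied to positions, so equivariance breaks (Proposition 1).
print(np.allclose(self_attention(x[perm], causal=True), self_attention(x, causal=True)[perm]))     # False
```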
Lee et al. (2018) showed that the standard attention (encoder self-attention (Vaswani et al., 2017; Dai et al., 2019)) is permutation equivariant. Here, we present the non-permutation-equivariant problem of the decoder self-attention:

Proposition 1. Decoder self-attention (Vaswani et al., 2017; Dai et al., 2019) is not permutation equivariant.

To proceed with the proof, we need the following definition and propositions.

Definition 3. Denote Π as the set of all permutations over [n] = {1, ⋯, n} and S^π_{x_k} as performing permutation π over S_{x_k}. Attention(x_q; S_{x_k}) is said to be permutation equivariant w.r.t. S_{x_k} if and only if for any π ∈ Π, Attention(x_q; S^π_{x_k}) = Attention(x_q; S_{x_k}).

Proposition 2. Attention with the set filtering function M(x_q, S_{x_k}) = S_{x_k} is permutation equivariant w.r.t. S_{x_k}.

Proof. It is easy to show that if M(x_q, S_{x_k}) = S_{x_k}, Eq. (2) remains unchanged for any permutation π performed on S_{x_k}. ∎

Proposition 3. Attention with the set difference S_{x_k} ∖ M(x_q, S_{x_k}) ≠ φ is not permutation equivariant w.r.t. S_{x_k}.

Proof. First, suppose that x̂ ∈ S_{x_k} ∖ M(x_q, S_{x_k}). Then, we construct a permutation π such that x̂ ∈ M(x_q, S^π_{x_k}). It is obvious that Eq. (2) changes after this permutation, and thus Attention(x_q ; M(x_q, S_{x_k})) is not permutation equivariant w.r.t. S_{x_k}. ∎

Proof. [Proof for Proposition 1] Decoder self-attention has S_{x_k} ∖ M(x_q, S_{x_k}) ≠ φ for x_q ∼ S_{x_k}. Hence, showing Attention(x_q; S_{x_k})
