`Transformer’s Attention via the Lens of Kernel
`
Yao-Hung Hubert Tsai1, Shaojie Bai1, Makoto Yamada3,4,
Louis-Philippe Morency2, Ruslan Salakhutdinov1
1Machine Learning Department, 2Language Technologies Institute, Carnegie Mellon University
3Kyoto University, 4RIKEN AIP
`{yaohungt, shaojieb, morency, rsalakhu}@cs.cmu.edu, myamada@i.kyoto-u.ac.jp
`https://github.com/yaohungt/TransformerDissection
`
`Abstract
`
Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams. In this paper, we present a new formulation of attention via the lens of the kernel. To be more precise, we realize that the attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This new formulation gives us a better way to understand individual components of the Transformer's attention, such as a better way to integrate positional embeddings. Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention. As an example, we propose a new variant of Transformer's attention which models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model with less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
`
`1 Introduction
`
Transformer (Vaswani et al., 2017) is a relatively new architecture which outperforms traditional deep learning models such as Recurrent Neural Networks (RNNs) (Sutskever et al., 2014) and Temporal Convolutional Networks (TCNs) (Bai et al., 2018) on sequence modeling tasks across neural machine translation (Vaswani et al., 2017), language understanding (Devlin et al., 2018), sequence prediction (Dai et al., 2019), image generation (Child et al., 2019), video activity classification (Wang et al., 2018), music generation (Huang et al., 2018a), and multimodal sentiment analysis (Tsai et al., 2019a). Instead of performing recurrence (e.g., RNN) or convolution (e.g., TCN) over the sequences, Transformer is a feed-forward model that concurrently processes the entire sequence. At the core of the Transformer is its attention mechanism, which is proposed to integrate the dependencies between the inputs. There are up to three types of attention within the full Transformer model, as exemplified by the neural machine translation application (Vaswani et al., 2017): 1) Encoder self-attention considers the source sentence as input, generating a sequence of encoded representations, where each encoded token has a global dependency with the other tokens in the input sequence. 2) Decoder self-attention considers the target sentence (e.g., the predicted target sequence for translation) as input, generating a sequence of decoded representations1, where each decoded token depends on previously decoded tokens. 3) Decoder-encoder attention considers both encoded and decoded sequences, generating a sequence with the same length as the decoded sequence. It should be noted that some applications have only decoder self-attention, such as sequence prediction (Dai et al., 2019). In all cases, the Transformer's attentions follow the same general mechanism.
At a high level, the attention can be seen as a weighted combination of the input sequence, where the weights are determined by the similarities between elements of the input sequence. We note that this operation is order-agnostic to permutations of the input sequence (order is encoded with extra positional embedding (Vaswani et al., 2017; Shaw et al., 2018; Dai et al., 2019)).

1The generated sequence can be regarded as a translated sequence (i.e., translating from the encoded sequence), where each generated token depends on all tokens in the encoded sequence.
`
`
The above observation inspires us to connect Transformer's attention to kernel learning (Scholkopf and Smola, 2001): they both concurrently and order-agnostically process all inputs by calculating the similarity between the inputs. Therefore, in this paper, we present a new formulation of Transformer's attention via the lens of kernel. To be more precise, the new formulation can be interpreted as a kernel smoother (Wasserman, 2006) over the inputs in a sequence, where the kernel measures how similar two different inputs are. The main advantage of connecting attention to kernel is that it opens up a new family of attention mechanisms that can relate to the well-established literature in kernel learning (Scholkopf and Smola, 2001). As a result, we develop a new variant of attention which simply considers a product of symmetric kernels when modeling the non-positional and positional embedding.
Furthermore, our proposed formulation naturally highlights the main components of Transformer's attention, enabling a better understanding of this mechanism: recent variants of Transformers (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Lee et al., 2018; Wang et al., 2018; Tsai et al., 2019a) can be expressed through these individual components. Among all the components, we argue that the most important one is the construction of the kernel function. We empirically study multiple kernel forms and ways to integrate positional embedding in neural machine translation (NMT) using the IWSLT'14 German-English (De-En) dataset (Edunov et al., 2017) and sequence prediction (SP) using the WikiText-103 dataset (Merity et al., 2016).
`
`2 Attention
`
This section aims at providing an understanding of attention in the Transformer via the lens of kernel. The inspiration for connecting the kernel (Scholkopf and Smola, 2001) and attention comes from the observation that both operations concurrently process all inputs and calculate the similarity between the inputs. We first introduce the background (i.e., the original formulation) of attention and then provide a new reformulation within the class of kernel smoothers (Wasserman, 2006). Next, we show that this new formulation allows us to explore a new family of attention while at the same time offering a framework to categorize previous attention variants (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Lee et al., 2018; Wang et al., 2018; Tsai et al., 2019a). Last, we present a new form of attention, which requires fewer parameters and empirically reaches performance competitive with the state-of-the-art models.
`
For notation, we use lowercase to represent a vector (e.g., x), bold lowercase to represent a matrix (e.g., x), a calligraphic letter to denote a space (e.g., X), and S to denote a set. To relate the notation to sequence-to-sequence learning (Vaswani et al., 2017), x represents a specific element of a sequence, x = [x1, x2, ⋯, xT] denotes a sequence of features, Sx = {x1, x2, ⋯, xT} represents the set with its elements being the features in sequence x, and we refer to the space of the set Sx as S.
`
`2.1 Technical Background
`
Unlike recurrent computation (Sutskever et al., 2014) (i.e., RNNs) and temporal convolutional computation (Bai et al., 2018) (i.e., TCNs), Transformer's attention is an order-agnostic operation given the order in the inputs (Vaswani et al., 2017). Hence, in the presentation of this paper, we consider the inputs as a set instead of a sequence. When viewing a sequence as a set, we lose the temporal (positional) information in the inputs, which is often crucial for sequence modeling (Sutskever et al., 2014). As a result, the Transformer (Vaswani et al., 2017) introduced positional embedding to indicate the positional relation of the inputs. Formally, a sequence x = [x1, x2, ⋯, xT] defines each element as xi = (fi, ti), with fi ∈ F being the non-temporal feature at time i and ti ∈ T being a temporal feature (or, as we call it, the positional embedding).
`Note that fi can be the word representation (in
`neural machine translation (Vaswani et al., 2017)),
`a frame in a video (in video activity recogni-
`tion (Wang et al., 2018)), or a music unit (in music
`generation (Huang et al., 2018b)). ti can be a mix-
`ture of sine and cosine functions (Vaswani et al.,
`2017) or parameters that can be learned dur-
`ing back-propagation (Dai et al., 2019; Ott et al.,
2019). The feature vectors are defined over a joint space X := (F × T). The resulting permutation-invariant set is: Sx = {x1, x2, ⋯, xT} = {(f1, t1), (f2, t2), ⋯, (fT, tT)}.
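To make the joint feature space X = (F × T) concrete, the sketch below builds the sinusoidal positional embedding described above and pairs it with token features. It is our own illustration rather than the authors' code; the helper name sinusoidal_pe and the toy dimensions are assumptions.

```python
import numpy as np

def sinusoidal_pe(T, d):
    """Sinusoidal positional embedding t_i (Vaswani et al., 2017):
    even dimensions use sin, odd dimensions use cos, with geometrically spaced frequencies."""
    positions = np.arange(T)[:, None]                 # (T, 1)
    dims = np.arange(0, d, 2)[None, :]                # (1, d/2) even dimension indices
    angles = positions / np.power(10000.0, dims / d)  # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Each element x_i = (f_i, t_i): a non-positional feature f_i paired with its positional embedding t_i.
T, d = 6, 8
f = np.random.randn(T, d)     # stand-in for word (or frame / music-unit) features
t = sinusoidal_pe(T, d)
S_x = list(zip(f, t))         # the permutation-invariant set S_x = {(f_1, t_1), ..., (f_T, t_T)}
```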
`
Following the definition by Vaswani et al. (2017), we use queries (q) / keys (k) / values (v) to represent the inputs for the attention. To be more precise, x{q/k/v} is used to denote a query/key/value element in the query/key/value sequence x{q/k/v} (x{q/k/v} ∈ Sx{q/k/v}), with Sx{q/k/v} being its set representation. We note that the input sequences are the same (xq = xk) for self-attention and are different (xq from the decoder and xk from the encoder) for encoder-decoder attention.
Given the introduced notation, the original attention mechanism in the Transformer (Vaswani et al., 2017) can be presented as:

$$\mathrm{Attention}(x_q\,;\, S_{x_k}) = \mathrm{softmax}\!\left(\frac{x_q W_q (x_k W_k)^\top}{\sqrt{d_k}}\right) x_k W_v \quad (1)$$

with xq = fq + tq, xk = fk + tk, Wq/k/v being the weights, and dk being the feature dimension of xkWk. Decoder self-attention further introduces a mask to block the visibility of elements in Sxk to xq. Particularly, decoder self-attention considers the decoded sequence as inputs (xk = xq), where the decoded token at time t is not allowed to access the future decoded tokens (i.e., tokens decoded at times greater than t). On the contrary, encoder self-attention and decoder-encoder attention consider no additional mask for Eq. (1).
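For concreteness, the following is a minimal numpy sketch of Eq. (1); it is our own reading of the formula (the array shapes, the softmax helper, and the causal-mask flag are assumptions), covering both the unmasked case and the decoder self-attention mask.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_eq1(xq, xk, Wq, Wk, Wv, causal=False):
    """Eq. (1): softmax(xq Wq (xk Wk)^T / sqrt(dk)) xk Wv, with an optional decoder-style mask."""
    q, k, v = xq @ Wq, xk @ Wk, xk @ Wv
    dk = k.shape[-1]
    scores = q @ k.T / np.sqrt(dk)          # pairwise similarity between queries and keys
    if causal:                              # block keys at positions later than the query
        Tq, Tk = scores.shape
        visible = np.arange(Tk)[None, :] <= np.arange(Tq)[:, None]
        scores = np.where(visible, scores, -np.inf)
    return softmax(scores) @ v

# Toy self-attention over a length-5 sequence with feature size 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                         # here xq = xk = f + t
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = attention_eq1(x, x, Wq, Wk, Wv, causal=True)   # shape (5, 16)
```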
Recent work (Shaw et al., 2018; Dai et al., 2019; Huang et al., 2018b; Child et al., 2019; Lee et al., 2018; Parmar et al., 2018; Tsai et al., 2019a) proposed modifications to the Transformer for the purpose of better modeling the inputs' positional relation (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019), appending additional keys in Sxk (Dai et al., 2019), modifying the mask applied to Eq. (1) (Child et al., 2019), or applying the attention to distinct feature types (Lee et al., 2018; Parmar et al., 2018; Tsai et al., 2019a). These works adopt different designs of attention compared to the original form (Eq. (1)). In our paper, we aim at providing a unified view via the lens of kernel.
`
2.2 Reformulation via the Lens of Kernel

We now provide the intuition to reformulate Eq. (1) via the lens of kernel. First, the softmax function can be realized as a probability function for xq observing the keys {xk}s in Sxk (Sxk is the set representation of sequence xk). The probability is determined by the dot product between xq and xk with additional mappings Wq/Wk and scaling by √dk; we note that the dot-product operation is an instance of a kernel function. We also introduce a set filtering function M(xq, Sxk) : X × S → S which returns a set with its elements that operate with (or are connected/visible to) xq. The filtering function M(·,·) plays the role of the mask in decoder self-attention (Vaswani et al., 2017). Putting these altogether, we re-represent Eq. (1) into the following definition.

Definition 1. Given a non-negative kernel function k(·,·) : X × X → R+, a set filtering function M(·,·) : X × S → S, and a value function v(·) : X → Y, the Attention function taking the input of a query feature xq ∈ X is defined as

$$\mathrm{Attention}\big(x_q\,;\, M(x_q, S_{x_k})\big) = \sum_{x_k \in M(x_q, S_{x_k})} \frac{k(x_q, x_k)}{\sum_{x_{k'} \in M(x_q, S_{x_k})} k(x_q, x_{k'})}\, v(x_k). \quad (2)$$

Definition 1 describes a class of linear smoothers (Wasserman, 2006) with kernel smoothing:

$$\sum_{x_k \in M(x_q, S_{x_k})} \frac{k(x_q, x_k)}{\sum_{x_{k'} \in M(x_q, S_{x_k})} k(x_q, x_{k'})}\, v(x_k) = \mathbb{E}_{p(x_k \mid x_q)}\big[v(x_k)\big],$$

where v(xk) outputs the "values" and p(xk | xq) = k(xq, xk) / Σ_{xk' ∈ M(xq, Sxk)} k(xq, xk') is a probability function that depends on k and M(xq, Sxk) when k(·,·) is always positive. In the prior work (Vaswani et al., 2017), k(xq, xk) = exp(⟨xqWq, xkWk⟩ / √dk) and v(xk) = xkWv.

Note that the kernel form k(xq, xk) in the original Transformer (Vaswani et al., 2017) is an asymmetric exponential kernel with additional mappings Wq and Wk (Wilson et al., 2016; Li et al., 2017)2. The new formulation defines a larger space for composing attention by manipulating its individual components, and at the same time it is able to categorize different variants of attention in prior work (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Lee et al., 2018; Wang et al., 2018; Tsai et al., 2019a). In the following, we study these components by dissecting Eq. (2) into: 1) the kernel feature space X, 2) the kernel construction k(·,·), 3) the value function v(·), and 4) the set filtering function M(·,·).

2We note that the rigorous definition of a kernel function (Scholkopf and Smola, 2001) requires the kernel to be positive semi-definite and symmetric. In this paper, the discussion of kernels allows them to be non-positive-semi-definite and asymmetric. In Section 3, we will examine kernels that are positive semi-definite and symmetric.
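The sketch below is our own rendering of Definition 1, not reference code: attention written as a kernel smoother in which the kernel k(·,·), the value function v(·), and the set filtering function M(·,·) are pluggable components (all function names here are ours).

```python
import numpy as np

def attention_def1(xq, keys, kernel, value_fn, set_filter):
    """Eq. (2): weighted average of v(xk), with weights k(xq, xk) normalized over M(xq, S_xk)."""
    visible = set_filter(xq, keys)                     # M(xq, S_xk): the keys this query may see
    w = np.array([kernel(xq, xk) for xk in visible])   # non-negative kernel scores
    p = w / w.sum()                                    # smoother weights p(xk | xq)
    return sum(p_i * value_fn(xk) for p_i, xk in zip(p, visible))

# Illustrative component choices (recovering softmax-style attention when the kernel is exponential):
dk = 4
kernel = lambda a, b: np.exp(a @ b / np.sqrt(dk))      # exponential kernel
value_fn = lambda xk: xk                               # identity value function
set_filter = lambda xq, keys: keys                     # encoder-style: every key is visible

keys = [np.random.randn(dk) for _ in range(5)]
query = np.random.randn(dk)
smoothed = attention_def1(query, keys, kernel, value_fn, set_filter)
```

Swapping the kernel, the value function, or the set filter reproduces the variants discussed in Sections 2.2.1 to 2.2.4.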
2.2.1 Kernel Feature Space X

In Eq. (2), to construct a kernel on X, the first thing is to identify the kernel feature space X.
`In addition to modeling sequences like
`word sentences (Vaswani et al., 2017) or music
`signals (Huang et al., 2018b),
`the Transformer
`can also be applied to images (Parmar et al.,
`2018), sets (Lee et al., 2018), and multimodal se-
`quences (Tsai et al., 2019a). Due to distinct data
`types, these applications admit various kernel fea-
`ture space:
`
(i) Sequence Transformer (Vaswani et al., 2017; Dai et al., 2019): X := (F × T), with F being the non-positional feature space and T being the positional embedding space of the position in the sequence.

(ii) Image Transformer (Parmar et al., 2018): X := (F × H × W), with F being the non-positional feature space, H being the positional space of the height in an image, and W being the positional space of the width in an image.

(iii) Set Transformer (Lee et al., 2018) and Non-Local Neural Networks (Wang et al., 2018): X := (F), with no positional information present.

(iv) Multimodal Transformer (Tsai et al., 2019a): X := (Fℓ × Fv × Fa × T), with Fℓ representing the language feature space, Fv representing the vision feature space, Fa representing the audio feature space, and T representing the temporal indicator space.

For the rest of the paper, we will focus on the setting for the sequence Transformer, X = (F × T), and discuss the kernel construction on it.
`
2.2.2 Kernel Construction and the Role of Positional Embedding k(·,·)

The kernel construction on X = (F × T) has distinct designs in variants of Transformers (Vaswani et al., 2017; Dai et al., 2019; Huang et al., 2018b; Shaw et al., 2018; Child et al., 2019). Since the kernel feature space is now a joint space, we will first discuss the kernel construction on F (the non-positional feature space) and then discuss how different variants integrate the positional embedding (with the positional feature space T) into the kernel.
Kernel construction on F. All these works considered the scaled asymmetric exponential kernel with the mappings Wq and Wk (Wilson et al., 2016; Li et al., 2017) for non-positional features fq and fk:

$$k_{\exp}(f_q, f_k) = \exp\!\left(\frac{\langle f_q W_q,\, f_k W_k\rangle}{\sqrt{d_k}}\right). \quad (3)$$
`
Note that the usage of asymmetric kernels is also common in various machine learning tasks (Yilmaz, 2007; Tsuda, 1999; Kulis et al., 2011), where it has been observed that the kernel form can be flexible and even non-valid (i.e., a kernel that is not symmetric and positive semi-definite). In Section 3, we show that a symmetric design of the kernel has similar performance on various sequence learning tasks, and we also examine different kernel choices (i.e., linear, polynomial, and RBF kernels).
`
Kernel construction on X = (F × T). The designs for integrating the positional embeddings tq and tk are listed in the following.

(i) Absolute Positional Embedding (Vaswani et al., 2017; Dai et al., 2019; Ott et al., 2019): For the original Transformer (Vaswani et al., 2017), each ti is represented by a vector whose dimensions are sine or cosine functions. For learned positional embedding (Dai et al., 2019; Ott et al., 2019), each ti is a learned parameter and is fixed for the same position across different sequences. These works define the feature space as the direct sum of its temporal and non-temporal space: X = F ⊕ T. Via the lens of kernel, the kernel similarity is defined as

$$k(x_q, x_k) := k_{\exp}(f_q + t_q,\, f_k + t_k). \quad (4)$$

(ii) Relative Positional Embedding in Transformer-XL (Dai et al., 2019): t represents the indicator of the position in the sequence, and the kernel is chosen to be an asymmetric kernel mixing sine and cosine functions:

$$k(x_q, x_k) := k_{\exp}(f_q, f_k)\cdot k_{f_q}(t_q, t_k) \quad (5)$$

with k_{fq}(tq, tk) being an asymmetric kernel with coefficients inferred by fq:

$$\log k_{f_q}(t_q, t_k) = \sum_{p=0}^{\lfloor d_k/2\rfloor - 1} c_{2p}\,\sin\!\Big(\frac{t_q - t_k}{10000^{2p/512}}\Big) + c_{2p+1}\,\cos\!\Big(\frac{t_q - t_k}{10000^{2p/512}}\Big)$$

with [c0, ⋯, c_{dk−1}] = fq Wq WR, where WR is a learned weight matrix. We refer readers to Dai et al. (2019) for more details.
`
(iii) Relative Positional Embedding of Shaw et al. (2018) and Music Transformer (Huang et al., 2018b): t· represents the indicator of the position in the sequence, and the kernel is modified to be indexed by a look-up table:

$$k(x_q, x_k) := L_{t_q - t_k,\, f_q}\cdot k_{\exp}(f_q, f_k), \quad (6)$$

where L_{tq−tk, fq} = exp(fq Wq a_{tq−tk}), with a· being a learnable matrix whose width is the length of the sequence. We refer readers to Shaw et al. (2018) for more details.
`Dai et al. (2019) showed that the way to inte-
`grate positional embedding is better through Eq.
`(5) than through Eq. (6) and is better through Eq.
`(6) than through Eq.
`(4). We argue the reason
`is that if viewing fi and ti as two distinct spaces
`
`X ∶= (F ×T ), the direct sum xi = fi + ti may
`
`not be optimal when considering the kernel score
`between xq and xk. In contrast, Eq. (5) represents
`the kernel as a product of two kernels (one for fi
`and another for ti), which is able to capture the
`similarities for both temporal and non-temporal
`components.
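The difference between the direct-sum integration (Eq. (4)) and the product-kernel integration (the form of Eq. (5)/(9)) can be seen directly in the unnormalized scores. The sketch below is a simplified illustration that drops the Wq/Wk mappings for brevity; the variable names are ours.

```python
import numpy as np

dk = 8
rng = np.random.default_rng(1)
fq, fk = rng.normal(size=dk), rng.normal(size=dk)   # non-positional features
tq, tk = rng.normal(size=dk), rng.normal(size=dk)   # positional embeddings
k_exp = lambda a, b: np.exp(a @ b / np.sqrt(dk))

# Direct sum (Eq. (4)): one kernel on f + t. Expanding the inner product shows the score
# also contains the cross terms <fq, tk> and <tq, fk>, which mix content with position.
score_direct_sum = k_exp(fq + tq, fk + tk)

# Product kernel (Eq. (5)/(9) form): one kernel for content, one for position; no cross terms.
score_product = k_exp(fq, fk) * k_exp(tq, tk)

print(score_direct_sum, score_product)
```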
`
`2.2.3 Value Function v(⋅)
`
The current Transformers consider two different value function constructions:

(i) Original Transformer (Vaswani et al., 2017) and Sparse Transformer (Child et al., 2019):

v(xk) = v((fk, tk)) := (fk + tk) Wv.   (7)

(ii) Transformer-XL (Dai et al., 2019), Music Transformer (Huang et al., 2018b), and Self-Attention with Relative Positional Embedding (Shaw et al., 2018):

v(xk) = v((fk, tk)) := fk Wv.   (8)
`
Comparing Eq. (7) to Eq. (8), Eq. (7) takes the positional embedding into account when constructing the value function. In Section 3, we empirically observe that constructing the value function with Eq. (8) consistently outperforms the construction with Eq. (7), which suggests that we do not need positional embedding in the value function.
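As a minimal illustration of the two choices (our own sketch; the weight shape is an assumption), the value functions differ only in whether the positional embedding tk enters v(·):

```python
import numpy as np

def value_eq7(fk, tk, Wv):
    """Eq. (7): the positional embedding enters the value, v(xk) = (fk + tk) Wv."""
    return (fk + tk) @ Wv

def value_eq8(fk, tk, Wv):
    """Eq. (8): the positional embedding is dropped, v(xk) = fk Wv."""
    return fk @ Wv

# Table 4 compares exactly these two constructions across attention variants.
```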
`
2.2.4 Set Filtering Function M(·,·)

In Eq. (2), the set returned by the set filtering function M(xq, Sxk) defines how many keys and which keys operate with xq. In the following, we itemize the corresponding designs for the variants of Transformers:
(i) Encoder Self-Attention in the original Transformer (Vaswani et al., 2017): For each query xq in the encoded sequence, M(xq, Sxk) = Sxk contains the keys being all the tokens in the encoded sequence. Note that encoder self-attention considers xq = xk with xq being the encoded sequence.

(ii) Encoder-Decoder Attention in the original Transformer (Vaswani et al., 2017): For each query xq in the decoded sequence, M(xq, Sxk) = Sxk contains the keys being all the tokens in the encoded sequence. Note that encoder-decoder attention considers xq ≠ xk with xq being the decoded sequence and xk being the encoded sequence.

(iii) Decoder Self-Attention in the original Transformer (Vaswani et al., 2017): For each query xq in the decoded sequence, M(xq, Sxk) returns a subset of Sxk (M(xq, Sxk) ⊂ Sxk). Note that decoder self-attention considers xq = xk with xq being the decoded sequence. Since the decoded sequence is the output of previous timesteps, the query at position i can only observe the keys being the tokens decoded at positions < i. For convenience, let us define S1 as the set returned by M(xq, Sxk) in the original Transformer (Vaswani et al., 2017), which we will use later.

(iv) Decoder Self-Attention in Transformer-XL (Dai et al., 2019): For each query xq in the decoded sequence, M(xq, Sxk) returns a set containing S1 and additional memories (M(xq, Sxk) = S1 + Smem, M(xq, Sxk) ⊃ S1). Smem refers to the additional memories.

(v) Decoder Self-Attention in Sparse Transformer (Child et al., 2019): For each query xq in the decoded sentence, M(xq, Sxk) returns a subset of S1 (M(xq, Sxk) ⊂ S1).
`
To compare the differences among the various designs, we note that the computation time is proportional to the number of elements in M(xq, Sxk). For performance-wise comparisons, Transformer-XL (Dai et al., 2019) showed that the additional memories in M(xq, Sxk) are able to capture longer-term dependency than the original Transformer (Vaswani et al., 2017) and hence result in better performance. Sparse Transformer (Child et al., 2019) showed that, although it has much fewer elements in M(xq, Sxk), if the elements are carefully chosen, the attention can still reach the same performance as Transformer-XL (Dai et al., 2019).
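To summarize designs (i)-(v), the index-based sketches below describe the corresponding set filtering functions M(·,·). They are our own paraphrase of the text; in particular, the stride pattern used for the Sparse Transformer case is only a stand-in for its carefully chosen subset.

```python
def full_filter(query_idx, key_indices):
    """(i)/(ii): every key is visible (encoder self-attention, encoder-decoder attention)."""
    return list(key_indices)

def causal_filter(query_idx, key_indices):
    """(iii): decoder self-attention -- only keys at positions < query position (the set S1)."""
    return [j for j in key_indices if j < query_idx]

def causal_with_memory_filter(query_idx, key_indices, memory_indices):
    """(iv): Transformer-XL style -- S1 plus additional cached memory positions (Smem)."""
    return list(memory_indices) + causal_filter(query_idx, key_indices)

def sparse_filter(query_idx, key_indices, stride=4):
    """(v): a Sparse-Transformer-flavoured subset of S1 (illustrative local + strided pattern)."""
    return [j for j in causal_filter(query_idx, key_indices)
            if j % stride == 0 or query_idx - j <= stride]
```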
`
`2.3 Exploring the Design of Attention
`
`So far, we see how Eq. (2) connects to the vari-
`ants of Transformers. By changing the kernel con-
`struction in Section 2.2.2, we can define a larger
`space for composing attention. In this paper, we
`present a new form of attention with a kernel that
`is 1) valid (i.e., a kernel that is symmetric and pos-
`itive semi-definite) and 2) delicate in the sense of
`
`constructing a kernel on a joint space (i.e., X =
`(F ×T )):
`k(xq, xk)∶= kFfq, fk⋅ kTtq, tk
`with kF(fq, fk) = exp⟨fqWF , fkWF⟩
`√dk
`
`and kT(tq, tk) = exp⟨tqWT , tkWT⟩
`√dk
`,
`
`(9)
`
`where WF and WT are weight matrices. The new
`form considers product of kernels with the first
`kernel measuring similarity between non-temporal
`features and the second kernel measuring simi-
`larity between temporal features. Both kernels
`are symmetric exponential kernel. Note that ti
`here is chosen as the mixture of sine and co-
`sine functions as in the prior work (Vaswani et al.,
`2017; Ott et al., 2019).
In our experiments, we find it reaches competitive performance compared to the current state-of-the-art design (Eq. (5) by Dai et al. (2019)). We fix the size of the weight matrices W· in Eq. (9) and Eq. (5), which means we save 33% of the attention parameters when moving from Eq. (5) to Eq. (9) (Eq. (5) has weights WQ/WK/WR, while Eq. (9) has weights WF/WT).
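A sketch of the proposed kernel in Eq. (9) is given below, with WF shared between query and key features and WT shared between query and key positions, which is what makes each factor symmetric. This is our own illustration under assumed shapes; it can be plugged into the attention_def1 sketch above as the kernel argument.

```python
import numpy as np

dk = 8
rng = np.random.default_rng(2)
WF = rng.normal(size=(dk, dk))   # single map for non-positional features (shared by query and key)
WT = rng.normal(size=(dk, dk))   # single map for positional embeddings (shared by query and key)

def k_F(fq, fk):                 # exp(<fq WF, fk WF> / sqrt(dk)), symmetric in its arguments
    return np.exp((fq @ WF) @ (fk @ WF) / np.sqrt(dk))

def k_T(tq, tk):                 # exp(<tq WT, tk WT> / sqrt(dk)), symmetric in its arguments
    return np.exp((tq @ WT) @ (tk @ WT) / np.sqrt(dk))

def k_eq9(xq, xk):               # Eq. (9): product of the two symmetric exponential kernels
    (fq, tq), (fk, tk) = xq, xk
    return k_F(fq, fk) * k_T(tq, tk)
```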
`
`3 Experiments
`
By viewing the attention mechanism through Eq. (2), we aim at answering the following questions regarding the Transformer's designs:
`
`Q1. What is the suggested way for incorporating
`positional embedding in the kernel function?
`
`Q2. What forms of kernel are recommended to
`choose in the attention mechanism? Can we re-
`place the asymmetric kernel with the symmetric
`version?
`
`Q3. Is there any exception that the attention mech-
`anism is not order-agnostic with respect to inputs?
`If so, can we downplay the role of positional em-
`bedding?
`
`Q4.
`Is positional embedding required in value
`function?
`We conduct experiments on neural machine
`translation (NMT) and sequence prediction (SP)
`tasks since these two tasks are commonly chosen
`for studying Transformers (Vaswani et al., 2017;
`Dai et al., 2019). Note that NMT has three
`different types of attentions (e.g., encoder self-
`attention, decoder-encoder attention, decoder self-
`attention) and SP has only one type of atten-
`tion (e.g., decoder self-attention). For the choice
`of datasets, we pick IWSLT’14 German-English
`(De-En) dataset (Edunov et al., 2017) for NMT
`and WikiText-103 dataset (Merity et al., 2016) for
`SP as suggested by Edunov et al. (Edunov et al.,
`2017) and Dai et al. (Dai et al., 2019). For fairness
`of comparisons, we train five random initializa-
`tions and report test accuracy with the highest val-
`idation score. We fix the position-wise operations
`in Transformer3 and only change the attention
`mechanism. Similar to prior work (Vaswani et al.,
`2017; Dai et al., 2019), we report BLEU score for
`NMT and perplexity for SP.
`
`3.1 Incorporating Positional Embedding
`
`In order to find the best way to integrate positional
`embedding (PE), we study different PE incorpora-
`
`tion in the kernel function k(⋅,⋅) in Eq. (2). Refer-
`
`ring to Sections 2.2.2 and 2.3, we consider four
`cases: 1) PE as direct sum in the feature space
`(see Eq.
`(4)), 2) PE as a look-up table (see Eq.
`(6)), 3) PE in product kernel with asymmetric ker-
`nel (see Eq. (5)), and 4) PE in product kernel with
`symmetric kernel (see Eq. (9)). We present the
`results in Table 1.
`First, we see that by having PE as a look-up
`
`3The computation of Transformer can be categorized into
`position-wise and inter-positions (i.e., the attention mecha-
`nism) operations. Position-wise operations include layer nor-
`malization, residual connection, and feed-forward mapping.
`We refer the readers to Vaswani et al. (Vaswani et al., 2017)
`for more details.
`
`
`
`
Table 1: Incorporating Positional Embedding (PE). NMT stands for neural machine translation on the IWSLT'14 De-En dataset (Edunov et al., 2017) and SP stands for sequence prediction on the WikiText-103 dataset (Merity et al., 2016). ↑ means higher is better and ↓ means lower is better.

Approach | PE Incorporation | Kernel Form | NMT (BLEU↑) | SP (Perplexity↓)
Vaswani et al. (2017) (Eq. (4)) | Direct-Sum | kexp(fq + tq, fk + tk) | 33.98 | 30.97
Shaw et al. (2018) (Eq. (6)) | Look-up Table | L_{tq−tk, fq} · kexp(fq, fk) | 34.12 | 27.56
Dai et al. (2019) (Eq. (5)) | Product Kernel | kexp(fq, fk) · k_{fq}(tq, tk) | 33.62 | 24.10
Ours (Eq. (9)) | Product Kernel | kF(fq, fk) · kT(tq, tk) | 34.71 | 24.28
`
Table 2: Kernel Types. Other than manipulating the kernel choice of the non-positional features, we fix the configuration by Vaswani et al. (2017) for NMT and the configuration by Dai et al. (2019) for SP.

Type | Kernel Form | NMT (BLEU↑) Asym. (Wq ≠ Wk) | NMT (BLEU↑) Sym. (Wq = Wk) | SP (Perplexity↓) Asym. (Wq ≠ Wk) | SP (Perplexity↓) Sym. (Wq = Wk)
Linear | ⟨faWq, fbWk⟩ | not converge | not converge | not converge | not converge
Polynomial | ⟨faWq, fbWk⟩² | 32.72 | 32.43 | 25.91 | 26.25
Exponential | exp(⟨faWq, fbWk⟩ / √dk) | 33.98 | 33.78 | 24.10 | 24.01
RBF | exp(−‖faWq − fbWk‖² / √dk) | 34.26 | 34.14 | 24.13 | 24.21
`
table, it outperforms having PE as a direct sum in the feature space, especially for the SP task. Note that the look-up table is indexed by the relative position (i.e., tq − tk) instead of the absolute position. Second, we see that PE in the product kernel proposed by Dai et al. (2019) does not consistently outperform the other integration types (it has a lower BLEU score for NMT). Our proposed product kernel reaches the best result in NMT and is competitive with the best result in SP.
`
`3.2 Kernel Types
`
`To find the best kernel form in the attention
`mechanism, in addition to the exponential kernel
`(see Eq. (3)), we compare different kernel forms
`(i.e., linear, polynomial, and rbf kernel) for the
`non-positional features. We also provide the re-
`sults for changing asymmetric to the symmetric
`
`kernel, when forcing Wq = Wk, so that the result-
`
`ing kernel is a valid kernel (Scholkopf and Smola,
`2001). The numbers are shown in Table 2. Note
`that, for fairness, other than manipulating the ker-
`nel choice of the non-positional features, we fix
`the configuration by Vaswani et al. (Vaswani et al.,
`2017) for NMT and the configuration by Dai et
`al. (Dai et al., 2019) for SP.
We first observe that the linear kernel does not converge for either NMT or SP. We argue the reason is that the linear kernel may have negative values and thus violates the assumption in the kernel smoother that the kernel score must be positive (Wasserman, 2006). Next, we observe that kernels with an infinite-dimensional feature space (i.e., the exponential and RBF kernels) outperform the kernel with a finite-dimensional feature space (i.e., the polynomial kernel). The RBF kernel performs best for NMT and the exponential kernel performs best for SP. We conclude that the choice of kernel matters for the design of attention in the Transformer. Also, we see little performance difference when comparing the asymmetric to the symmetric kernel. In the experiments, we fix the size of W· in the kernel, and thus adopting the symmetric kernel benefits us by saving parameters.
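For reference, the four non-positional kernel forms compared in Table 2 can be written as below. This is our own sketch (names and shapes are assumptions); the symmetric variants in Table 2 simply tie the two maps by passing Wk = Wq.

```python
import numpy as np

def linear_kernel(fa, fb, Wq, Wk):
    return (fa @ Wq) @ (fb @ Wk)                 # can be negative, which breaks the kernel smoother

def polynomial_kernel(fa, fb, Wq, Wk):
    return ((fa @ Wq) @ (fb @ Wk)) ** 2

def exponential_kernel(fa, fb, Wq, Wk):
    dk = Wk.shape[1]
    return np.exp((fa @ Wq) @ (fb @ Wk) / np.sqrt(dk))

def rbf_kernel(fa, fb, Wq, Wk):
    dk = Wk.shape[1]
    diff = fa @ Wq - fb @ Wk
    return np.exp(-(diff @ diff) / np.sqrt(dk))
```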
`
`3.3 Order-Invariance in Attention
`
The need for positional embedding (PE) in the attention mechanism is based on the argument that the attention mechanism is an order-agnostic (or,
`permutation equivariant) operation (Vaswani et al.,
`2017; Shaw et al., 2018; Huang et al., 2018b;
`Dai et al., 2019; Child et al., 2019). However, we
`show that, for decoder self-attention, the opera-
`tion is not order-agnostic. For clarification, we
`are not attacking the claim made by the prior
`work (Vaswani et al., 2017; Shaw et al., 2018;
`Huang et al., 2018b; Dai et al., 2019; Child et al.,
`2019), but we aim at providing a new look at the
`
`
`
`
Table 3: Order-Invariance in Attention. To save space, we denote Encoder Self-Attention / Encoder-Decoder Attention / Decoder Self-Attention as A/B/C. Note that SP only has decoder self-attention.

Approach | Positional Embedding | NMT (BLEU↑)
Ours (Eq. (9)) | In A/B/C | 34.71
Ours (Eq. (9)) | In A/B | 34.49
No Positional Embedding | none | 14.47

Approach | Positional Embedding | SP (Perplexity↓)
Vaswani et al. (2017) (Eq. (4)) | In C | 30.97
Ours (Eq. (9)) | In C | 24.28
No Positional Embedding | none | 30.92
`
Table 4: Positional Embedding in Value Function. I: value function considering positional embedding (Eq. (7)), v(xk) := (fk + tk)WV / II: value function considering no positional embedding (Eq. (8)), v(xk) := fkWV.

Approach | NMT (BLEU↑) I | NMT (BLEU↑) II | SP (Perplexity↓) I | SP (Perplexity↓) II
Vaswani et al. (2017) (Eq. (4)) | 33.98 | 34.02 | 30.97 | 30.50
Shaw et al. (2018) (Eq. (6)) | 34.04 | 34.12 | 27.56 | 27.45
Dai et al. (2019) (Eq. (5)) | 33.32 | 33.62 | 24.18 | 24.10
Ours (Eq. (9)) | 34.60 | 34.71 | 24.42 | 24.28
`
`order-invariance problem when considering the at-
`tention mechanism with masks (masks refer to the
`set filtering function in our kernel formulation). In
`other words, previous work did not consider the
`mask between queries and keys when discussing
`the order-invariance problem (Pérez et al., 2019).
`To put it formally, we first present the definition
by Lee et al. (2018) of a permutation equivariant function:
`
Definition 2. Denote Π as the set of all permutations over [n] = {1, ⋯, n}. A function func : X^n → Y^n is permutation equivariant iff for any permutation π ∈ Π, func(πx) = π func(x).
`
Lee et al. (2018) showed that the standard attention (encoder self-attention (Vaswani et al., 2017; Dai et al., 2019)) is permutation equivariant. Here, we present the non-permutation-equivariant problem for decoder self-attention:
`
Proposition 1. Decoder self-attention (Vaswani et al., 2017; Dai et al., 2019) is not permutation equivariant.
`
To proceed with the proof, we need the following definition and propositions.
`
Definition 3. Denote Π as the set of all permutations over [n] = {1, ⋯, n} and Sπxk as performing permutation π over Sxk. Attention(xq; Sxk) is said to be permutation equivariant w.r.t. Sxk if and only if for any π ∈ Π, Attention(xq; Sπxk) = Attention(xq; Sxk).

Proposition 2. Attention with the set filtering function M(xq, Sxk) = Sxk is permutation equivariant w.r.t. Sxk.

Proof. It is easy to show that if M(xq, Sxk) = Sxk, Eq. (2) remains unchanged for any permutation π performed on Sxk. ∎

Proposition 3. Attention with the set difference Sxk ∖ M(xq, Sxk) ≠ φ is not permutation equivariant w.r.t. Sxk.

Proof. First, suppose that x̂ ∈ Sxk ∖ M(xq, Sxk). Then, we construct a permutation π such that x̂ ∈ M(xq, Sπxk). It is obvious that Eq. (2) changes after this permutation, and thus Attention(xq; M(xq, Sxk)) is not permutation equivariant w.r.t. Sxk. ∎

Proof. [Proof for Proposition 1]
xq ∼ Sxk. Hence, showing Attention(xq; Sxk)