Sampled Softmax with Random Fourier Features

Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, and Sanjiv Kumar

Google Research
New York, NY 10011
{ankitsrawat, chenjiecao, felixyu, theertha, sanjivk}@google.com

January 1, 2020
Abstract

The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the loss gradient based on these classes, known as the sampled softmax method. However, the sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is again expensive to compute. Therefore, a widely employed practical approach involves sampling from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and the work on kernel-based sampling, we propose the Random Fourier Softmax (RF-softmax) method that utilizes the powerful Random Fourier Features to enable more efficient and accurate sampling from an approximate softmax distribution. We show that RF-softmax leads to low bias in estimation in terms of both the full softmax distribution and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically with the number of classes.
1 Introduction
The cross entropy loss based on the softmax function is widely used in multi-class classification tasks such as natural language processing [1], image classification [2], and recommendation systems [3]. In multi-class classification, given an input x ∈ X, the goal is to predict its class t ∈ {1, 2, ..., n}, where n is the number of classes. Given an input feature x, the model (often a neural network) first computes an input embedding h ∈ R^d and then the raw scores or logits for classes o = (o_1, ..., o_n) as the product of the input embedding h and the class embeddings c_1, ..., c_n ∈ R^d,

o_i = τ h^T c_i.   (1)

Here, τ is often referred to as the (inverse) temperature parameter of softmax. Given the logits, the probability that the model assigns to the i-th class is computed using the full softmax function

p_i = e^{o_i} / Z,   (2)

where Z = Σ_{i=1}^n e^{o_i} is called the partition function. The distribution in (2) is commonly referred to as the softmax distribution. Given a training set, the model parameters are estimated by minimizing
an empirical risk over the training set, where the empirical risk is defined by the cross-entropy loss based on the softmax function, or the full softmax loss. Let t ∈ [n] denote the true class for the input x; then the full softmax loss is defined as¹

L(x, t) := − log p_t = −o_t + log Z.   (3)

One typically employs first order optimization methods to train neural network models. This requires computing the gradient of the loss with respect to the model parameters θ during each iteration,

∇_θ L(x, t) = −∇_θ o_t + Σ_{i=1}^n (e^{o_i} / Z) · ∇_θ o_i = −∇_θ o_t + E_{s∼p}[∇_θ o_s],   (4)

where the expectation is taken over the softmax distribution (cf. (2)). As evident from (4), computing the gradient of the full softmax loss takes O(dn) time due to the contributions from all n classes. Therefore, training a model using the full softmax loss becomes prohibitively expensive in settings where a large number of classes are involved. To this end, various approaches have been proposed for efficient training. This includes different modified loss functions: hierarchical softmax [5] partitions the classes into a tree based on class similarities, allowing for O(d log n) training and inference time; spherical softmax [6, 7] replaces the exponential function by a quadratic function, enabling an efficient algorithm to compute the updates of the output weights irrespective of the output size. Efficient hardware-specific implementations of softmax are also being actively studied [8].
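The identity in (4) can be checked directly on a toy example. The sketch below is our own illustration (not part of the paper): it takes the gradient with respect to the logits, for which ∇_o o_s is simply the standard basis vector e_s, and compares the expectation form in (4) against a finite-difference gradient of the loss in (3).

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, d, t = 8, 10.0, 4, 3                     # tiny example; t is the true class
h = rng.normal(size=d); h /= np.linalg.norm(h)
C = rng.normal(size=(n, d)); C /= np.linalg.norm(C, axis=1, keepdims=True)
o = tau * C @ h                                  # logits, cf. (1)

def full_softmax_loss(o):
    return -o[t] + np.log(np.exp(o).sum())       # cf. (3)

# Gradient w.r.t. the logits written as in (4): -grad(o_t) + E_{s~p}[grad(o_s)],
# where the gradient of o_s w.r.t. the logit vector is the basis vector e_s.
p = np.exp(o) / np.exp(o).sum()                  # softmax distribution, cf. (2)
grad_eq4 = -np.eye(n)[t] + p @ np.eye(n)         # = -e_t + sum_s p_s * e_s

# Compare against a central finite-difference gradient of the loss in (3).
eps = 1e-6
grad_fd = np.array([(full_softmax_loss(o + eps * np.eye(n)[i]) -
                     full_softmax_loss(o - eps * np.eye(n)[i])) / (2 * eps)
                    for i in range(n)])
assert np.allclose(grad_eq4, grad_fd, atol=1e-5)
```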
1.1 Sampled softmax

A popular approach to speed up the training of the full softmax loss is sampled softmax: instead of including all classes during each iteration, a small random subset of the n classes is considered, where each negative class is sampled with some probability. Formally, let the number of sampled classes during each iteration be m, with class i being picked with probability q_i. Let N_t ≜ [n]\{t} be the set of negative classes. Assuming that s_1, ..., s_m ∈ N_t denote the sampled class indices, following [9], we define the adjusted logits o′ = {o′_1, o′_2, ..., o′_{m+1}} such that o′_1 = o_t and, for i ∈ [m],

o′_{i+1} = o_{s_i} − log(m q_{s_i}).   (5)

Accordingly, we define the sampled softmax distribution as p′_i = e^{o′_i} / Z′, where Z′ = Σ_{j=1}^{m+1} e^{o′_j}. The sampled softmax loss corresponds to the cross entropy loss with respect to the sampled softmax distribution:

L′(x, t) = − log p′_t = −o_t + log Z′.   (6)

Here, we note that adjusting the logits for the sampled negative classes by their expected number of occurrences in (5) ensures that Z′ is an unbiased estimator of Z [9]. Since L′(x, t) depends only on m + 1 classes, the computational cost is reduced from O(dn) to O(dm) as compared to the full softmax loss in (3).
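As a concrete illustration of (5) and (6), here is a small self-contained sketch (our own code, with a uniform sampling distribution as a stand-in for q); it also checks empirically that Z′ is an unbiased estimator of Z, as noted above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t = 1000, 20, 0               # n classes, m sampled negatives, true class t
o = rng.normal(size=n)              # logits (stand-ins for tau * h^T c_i)

neg = np.delete(np.arange(n), t)    # N_t = [n] \ {t}
q = np.full(n - 1, 1.0 / (n - 1))   # a simple sampling distribution (uniform here)

def sampled_softmax_loss(o, t, neg, q, m, rng):
    idx = rng.choice(len(neg), size=m, p=q)                      # s_1, ..., s_m
    s = neg[idx]
    o_adj = np.concatenate(([o[t]], o[s] - np.log(m * q[idx])))  # cf. (5)
    Zp = np.exp(o_adj).sum()                                     # Z'
    return -o[t] + np.log(Zp), Zp                                # cf. (6)

Z = np.exp(o).sum()
Zp_samples = [sampled_softmax_loss(o, t, neg, q, m, rng)[1] for _ in range(20000)]
print(Z, np.mean(Zp_samples))       # the two should be close: E[Z'] = Z
```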
In order to realize the training with the full softmax loss, one would like the gradient of the sampled softmax loss to be an unbiased estimator of the gradient of the full softmax loss², i.e.,

E[∇_θ L′] = ∇_θ L,   (7)

where the expectation is taken over the sampling distribution q. As it turns out, the sampling distribution plays a crucial role in ensuring the unbiasedness of ∇_θ L′. Bengio and Senécal [9] show that (7) holds if the sampling distribution is the full softmax distribution itself, i.e., q_i ∝ e^{o_i} (cf. (2)). However, sampling from the softmax distribution itself is again computationally expensive: one needs to compute the partition function Z during each iteration, which is again an O(dn) operation since Z depends both on the current model parameters and the input. As a feasible alternative, one usually samples from a distribution which does not depend on the current model parameters or the input. Common choices are uniform, log-uniform, or the global prior of classes [10, 1]. However, since these distributions are far from the full softmax distribution, they can lead to significantly worse solutions. Various approaches have been proposed to improve negative sampling. For example, a separate model can be used to track the distribution of softmax in language modeling tasks [9]. One can also use an LSH algorithm to find the approximate nearest classes in the embedding space, which in turn helps in sampling from the softmax distribution efficiently [11]. Quadratic kernel softmax [12] uses a kernel-based sampling method and a quadratic approximation of the softmax function to draw each sample in sublinear time. Similarly, the Gumbel trick has been proposed to sample from the softmax distribution in sublinear time [13]. The partition function can also be written in a double-sum formulation to enable an unbiased sampling algorithm for SGD [14, 15].

Among other training approaches based on sampled losses, Noise Contrastive Estimation (NCE) and its variants avoid computing the partition function [16], and (semi-)hard negative sampling [17, 4, 18] selects the negatives that most violate the current objective function. Hyvärinen [19] proposes minimization of the Fisher divergence (a.k.a. score matching) to avoid computation of the partition function Z. However, in our setting, the partition function depends on the input embedding h, which changes during training. Thus, while calculating the score function (taking the derivative of Z with respect to (h, c)), the partition function has a non-trivial contribution, which makes this approach inapplicable to our setting. We also note the existence of MCMC based approaches in the literature (see, e.g., [20]) for sampling classes with a distribution that is close to the softmax distribution. Such methods do not come with precise computational complexity guarantees.

¹ The results of this paper generalize to a multi-label setting by using multi-label to multi-class reductions [4].
² Since it is clear from the context, in what follows, we denote L(x, t) and L′(x, t) by L and L′, respectively.
1.2 Our contributions

Theory. Despite a large body of work on improving the quality of sampled softmax, developing a theoretical understanding of the performance of sampled softmax has not received much attention. Blanc and Rendle [12] show that the full softmax distribution is the only distribution that provides an unbiased estimate of the true gradient ∇_θ L. However, it is not clear how different sampling distributions affect the bias ∇_θ L − E[∇_θ L′]. In this paper, we address this issue and characterize the bias of the gradient for a generic sampling distribution (cf. Section 2).

Algorithm. In Section 3, guided by our analysis and recognizing the practical appeal of kernel-based sampling [12], we propose Random Fourier softmax (RF-softmax), a new kernel-based sampling method for settings with normalized embeddings. RF-softmax employs the powerful Random Fourier Features [21] and guarantees a small bias of the gradient estimate. Furthermore, the complexity of sampling one class for RF-softmax is O(D log n), where D denotes the number of random features used in RF-softmax. In contrast, assuming that d denotes the embedding dimension, the full softmax and the prior kernel-based sampling method (Quadratic-softmax [12]) incur O(dn) and O(d² log n) computational cost to generate one sample, respectively. In practice, D can be two orders of magnitude smaller than d² while achieving similar or better performance. As a result, RF-softmax has two desirable features: 1) better accuracy due to lower bias and 2) computational efficiency due to low sampling cost.
Experiments. We conduct experiments on widely used NLP and extreme classification datasets to demonstrate the utility of the proposed RF-softmax method (cf. Section 4).

2 Gradient bias of sampled softmax

The goal of sampled softmax is to obtain a computationally efficient estimate of the true gradient ∇_θ L (cf. (4)) of the full softmax loss (cf. (3)) with small bias. In this section we develop a theoretical understanding of how different sampling distributions affect the bias of the gradient. To the best of our knowledge, this is the first result of this kind.

For the cross entropy loss based on the sampled softmax (cf. (6)), the training algorithm employs the following estimate of ∇_θ L:
∇_θ L′ = −∇_θ o_t + ( e^{o_t} ∇_θ o_t + Σ_{i∈[m]} (e^{o_{s_i}} / (m q_{s_i})) ∇_θ o_{s_i} ) / ( e^{o_t} + Σ_{i∈[m]} e^{o_{s_i}} / (m q_{s_i}) ).   (8)

The following result bounds the bias of the estimate ∇_θ L′. Without loss of generality, we work with the sampling distributions that assign strictly positive probability to each negative class, i.e., q_i > 0 ∀ i ∈ N_t.

Theorem 1. Let ∇_θ L′ (cf. (8)) be the estimate of ∇_θ L based on m negative classes s_1, ..., s_m, drawn according to the sampling distribution q. We further assume that the gradients of the logits ∇_θ o_i have their coordinates bounded³ by M. Then, the bias of ∇_θ L′ satisfies

LB ≤ E[∇_θ L′] − ∇_θ L ≤ UB,   (9)

with

LB ≜ −( (M / (m Z²)) Σ_{k∈N_t} e^{o_k} | Z_t − e^{o_k}/q_k | + o(1/m) ) · 1,   (10)

UB ≜ ( UB1 + o(1/m) ) · g + ( UB2 + o(1/m) ) · 1,   (11)

where

UB1 ≜ ( Σ_{j∈N_t} e^{2 o_j}/q_j − Z_t² ) / (m Z³),
UB2 ≜ (2M/m) · max_{i,i′∈N_t} | e^{o_i}/q_i − e^{o_{i′}}/q_{i′} | · ( 1/Z² + (1/Z³) Σ_{j∈N_t} e^{2 o_j}/q_j ),

and Z_t ≜ Σ_{j∈N_t} e^{o_j}, g ≜ Σ_{j∈N_t} e^{o_j} ∇_θ o_j, and 1 is the all one vector.

The proof of Theorem 1 is presented in Appendix A. Theorem 1 captures the effect of the underlying sampling distribution q on the bias of the gradient estimate in terms of three (closely related) quantities:
Σ_{j∈N_t} e^{2 o_j}/q_j,   max_{j,j′∈N_t} | e^{o_j}/q_j − e^{o_{j′}}/q_{j′} |,   and   Σ_{k∈N_t} e^{o_k} | Σ_{j∈N_t} e^{o_j} − e^{o_k}/q_k |.   (12)

Ideally, we would like to pick a sampling distribution for which all these quantities are as small as possible. Since q is a probability distribution, it follows from the Cauchy-Schwarz inequality that

Σ_j e^{2 o_j}/q_j = ( Σ_j q_j ) · ( Σ_j e^{2 o_j}/q_j ) ≥ ( Σ_j e^{o_j} )².   (13)

³ This assumption naturally holds in most practical implementations, where each of the gradient coordinates or the norm of the gradient is clipped by a threshold.
If q_j ∝ e^{o_j}, then (13) is attained (equivalently, Σ_j e^{2 o_j}/q_j is minimized). In particular, this implies that UB1 in (11) disappears. Furthermore, for such a distribution we have q_j = e^{o_j} / Σ_{i∈N_t} e^{o_i}. This implies that both UB2 and LB disappear for such a distribution as well. This guarantees a small bias of the gradient estimate ∇_θ L′.
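To make these quantities concrete, the following small sketch (our own; the logits are random placeholders) evaluates Σ_j e^{2 o_j}/q_j, whose minimum possible value by (13) is (Σ_j e^{o_j})² = Z_t², together with the spread max_{i,i′} |e^{o_i}/q_i − e^{o_{i′}}/q_{i′}|, for a uniform q and for q_j ∝ e^{o_j}:

```python
import numpy as np

rng = np.random.default_rng(2)
o = rng.normal(size=1000)                 # logits of the negative classes
w = np.exp(o)
Zt = w.sum()

def quantities(q):
    ratio = w / q                          # e^{o_j} / q_j
    return (w * ratio).sum(), ratio.max() - ratio.min()

q_uniform = np.full_like(w, 1.0 / len(w))
q_softmax = w / Zt                         # q_j proportional to e^{o_j}

print("lower bound (Z_t^2):", Zt ** 2)
print("uniform q:          ", quantities(q_uniform))   # both quantities are large
print("q ~ e^{o_j}:        ", quantities(q_softmax))   # sum hits Z_t^2, spread is 0
```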
Since sampling exactly from the distribution q such that q_j ∝ e^{o_j} is computationally expensive, one has to resort to other distributions that incur smaller computational cost. However, to ensure small bias of the gradient estimate ∇_θ L′, Theorem 1 and the accompanying discussion suggest that it is desirable to employ a distribution that ensures that e^{o_j}/q_j is as close to 1 as possible for each j ∈ N_t and all possible values of the logit o_j. In other words, we are interested in those sampling distributions that provide a tight uniform multiplicative approximation of e^{o_j} in a computationally efficient manner.

This motivates our main contribution in the next section, where we rely on kernel-based sampling methods to efficiently implement a distribution that uniformly approximates the softmax distribution.
3 Random Fourier Softmax (RF-softmax)

In this section, guided by the conclusion of Section 2, we propose Random Fourier Softmax (RF-softmax), a new sampling method that employs Random Fourier Features to tightly approximate the full softmax distribution. RF-softmax falls under the broader class of kernel-based sampling methods, which are amenable to efficient implementation. Before presenting RF-softmax, we briefly describe kernel-based sampling and an existing method based on quadratic kernels [12].
3.1 Kernel-based sampling and Quadratic-softmax

Given a kernel K : R^d × R^d → R, the input embedding h ∈ R^d, and the class embeddings c_1, ..., c_n ∈ R^d, kernel-based sampling selects the class i with probability q_i = K(h, c_i) / Σ_{j=1}^n K(h, c_j). Note that if K(h, c_i) = exp(o_i) = exp(τ h^T c_i), this amounts to directly sampling from the softmax distribution. Blanc and Rendle [12] show that if the kernel can be linearized by a mapping φ : R^d → R^D such that K(h, c_i) = φ(h)^T φ(c_i), sampling one point from the distribution takes only O(D log n) time by a divide-and-conquer algorithm. We briefly review the algorithm in this section. Under the linearization assumption, the sampling distribution takes the following form:

q_i = K(h, c_i) / Σ_{j=1}^n φ(h)^T φ(c_j) = φ(h)^T φ(c_i) / ( φ(h)^T Σ_{j=1}^n φ(c_j) ).

The idea is to organize the classes in a binary tree with individual classes at the leaves. We then sample along a path on this tree recursively until we reach a single class. Each sampling step takes O(D) time as we can pre-compute Σ_{j∈S} φ(c_j), where S is any subset of classes. Similarly, when the embedding of a class changes, the cost of updating all Σ_{j∈S} φ(c_j) along the path between the root and this class is again O(D log n).

Note that we pre-compute Σ_{j∈[n]} φ(c_j) for the root node. Now, suppose the left child and the right child of the root node divide all the classes into two disjoint sets S_1 and S_2, respectively. In this case, we pre-compute Σ_{j∈S_1} φ(c_j) and Σ_{j∈S_2} φ(c_j) for the left child and the right child of the root node, respectively. First, the probability of the sampled class being in S_1 is

q_{S_1} = φ(h)^T Σ_{j∈S_1} φ(c_j) / ( φ(h)^T Σ_{j∈S_1} φ(c_j) + φ(h)^T Σ_{j∈S_2} φ(c_j) ).   (14)
Method   Quadratic [12]   Random Fourier [21]            Random Maclaurin [22]
D        256²             100       1000      256²       256²
MSE      2.8e-3           2.6e-3    2.7e-4    5.5e-6      8.8e-2

Table 1: Mean squared error (MSE) of approximating a kernel exp(τ h^T c). h and c are randomly sampled from the USPS dataset (d = 256). The data is normalized, i.e., ||h||_2 = 1, ||c||_2 = 1. For Quadratic, we assume the form is α · (h^T c)² + β and solve α and β in a linear system to get the optimal MSE. In practice, with fixed α and β as in [12], the MSE will be larger. Random Fourier has much lower MSE with the same D, and much smaller D with similar MSE. Also note that Random Fourier and Random Maclaurin are unbiased.
And the probability of the sampled class being in S_2 is 1 − q_{S_1}. We then recursively sample along a path until a single class is reached, i.e., we reach a leaf node. After updating the embedding of the sampled class, we recursively trace up the tree to update the sums stored on all the nodes until we reach the root node.
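The procedure above can be sketched in a few lines (our own simplified implementation, not the authors' code; it assumes a generic feature map φ whose induced kernel values are non-negative and, for brevity, a number of classes that is a power of two):

```python
import numpy as np

def build_tree(phi_c):
    """phi_c: (n, D) array of class features phi(c_j); n is a power of two here.
    Node v of the implicit binary tree stores the sum of phi(c_j) over its leaves."""
    n, D = phi_c.shape
    tree = np.zeros((2 * n, D))
    tree[n:] = phi_c                                   # leaves live at positions n..2n-1
    for v in range(n - 1, 0, -1):                      # internal nodes
        tree[v] = tree[2 * v] + tree[2 * v + 1]
    return tree

def sample_class(tree, phi_h, rng):
    """Walk from the root to a leaf; O(D log n) work per sample."""
    n = tree.shape[0] // 2
    v = 1
    while v < n:
        # Mass of each subtree under the linearized kernel; cf. (14). The clamp only
        # guards against tiny negative values caused by approximate feature maps.
        left = max(phi_h @ tree[2 * v], 0.0)
        right = max(phi_h @ tree[2 * v + 1], 0.0)
        go_left = rng.random() < left / (left + right)
        v = 2 * v if go_left else 2 * v + 1
    return v - n                                        # class index of the reached leaf

def update_class(tree, i, new_phi_ci):
    """After class i's embedding changes, refresh the O(log n) sums on its root path."""
    n = tree.shape[0] // 2
    v = n + i
    tree[v] = new_phi_ci
    v //= 2
    while v >= 1:
        tree[v] = tree[2 * v] + tree[2 * v + 1]
        v //= 2
```

Drawing one sample and refreshing the sums after an embedding update each touch O(log n) nodes with O(D) work per node, which is where the O(D log n) cost quoted above comes from.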
Given the efficiency of kernel-based sampling, Blanc and Rendle [12] propose a sampled softmax approach which utilizes the quadratic kernel to define the sampling distribution:

K_quad(h, c_i) = α · (h^T c_i)² + 1.   (15)

Note that the quadratic kernel can be explicitly linearized by the mapping φ(z) = [√α · (z ⊗ z), 1] (a short numerical check of this mapping follows the list below). This implies that D = O(d²); consequently, sampling one class takes O(d² log n) time. Despite the promising results of Quadratic-softmax, it has the following caveats:

• The quadratic kernel with α = 100, the value used in [12], does not give a tight multiplicative approximation of the exponential kernel e^{o_j}. Thus, according to the analysis in Section 2, this results in a gradient estimate with large bias.
• The O(d²) computational cost can be prohibitive for models with large embedding dimensions.
• Since a quadratic function is a poor approximation of the exponential function for negative numbers, it is used in [12] to approximate a modified absolute softmax loss function instead, where absolute values of the logits serve as input to the softmax function.
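As a quick sanity check of the linearization mentioned above (our own snippet; the value of α and the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, d = 100.0, 16
h, c = rng.normal(size=d), rng.normal(size=d)

def phi_quad(z, alpha):
    # phi(z) = [sqrt(alpha) * (z ⊗ z), 1], so the feature dimension is d^2 + 1
    return np.concatenate((np.sqrt(alpha) * np.outer(z, z).ravel(), [1.0]))

assert np.isclose(phi_quad(h, alpha) @ phi_quad(c, alpha),
                  alpha * (h @ c) ** 2 + 1.0)       # K_quad(h, c), cf. (15)
```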
Next, we present a novel kernel-based sampling method that addresses all of these shortcomings.
3.2 RF-softmax

Given the analysis in Section 2 and the low computational cost of linearized kernel-based sampling methods, our goal is to come up with a linearizable kernel (better than quadratic) that provides a good uniform multiplicative approximation of the exponential kernel K(h, c) = e^o = e^{τ h^T c}. More concretely, we would like to find a nonlinear map φ(·) : R^d → R^D such that the error between K(h, c) and K̂(h, c) = φ(h)^T φ(c) is small for all values of h and c.

The Random Maclaurin Features [22] seem to be an obvious choice here since they provide an unbiased estimator of the exponential kernel. However, due to the rank deficiency of the produced features, they require a large D in order to achieve a small mean squared error [23, 24]. We verify that this is indeed a poor choice in Table 1.

In contrast, the Random Fourier Features (RFF) [21] are much more compact [23]. Moreover, these features and their extensions are also theoretically better understood (see, e.g., [25, 26, 27]). This leads to the natural question: Can we use RFF to approximate the exponential kernel? However, this approach faces a major challenge at the outset: RFF only works for positive definite shift-invariant kernels such as the Gaussian kernel, while the exponential kernel is not shift-invariant.

A key observation is that when the input embedding h and the class embedding c are normalized, the exponential kernel becomes the Gaussian kernel (up to a multiplicative constant):

e^{τ h^T c} = e^τ · e^{−τ ||h − c||² / 2}.   (16)

We note that normalized embeddings are widely used in practice to improve training stability and model quality [28, 29, 30]. In particular, normalization attains improved performance as long as τ (cf. (1)) is large enough to ensure that the output of softmax can cover (almost) the entire range (0, 1). In Section 4, we verify that restricting ourselves to normalized embeddings does not hurt and, in fact, improves the final performance.
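For completeness, the identity in (16) is immediate to verify: with ||h||_2 = ||c||_2 = 1 we have ||h − c||² = ||h||² + ||c||² − 2 h^T c = 2 − 2 h^T c, and therefore

e^τ · e^{−τ ||h − c||² / 2} = e^τ · e^{−τ (2 − 2 h^T c) / 2} = e^τ · e^{−τ + τ h^T c} = e^{τ h^T c}.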
For the Gaussian kernel K(x − y) = e^{−ν ||x − y||² / 2} with temperature parameter ν, the D-dimensional RFF map takes the following form:

φ(u) = (1/√D) [ cos(w_1^T u), ..., cos(w_D^T u), sin(w_1^T u), ..., sin(w_D^T u) ],   (17)

where w_1, ..., w_D ∼ N(0, ν I). The RFF map provides an unbiased approximation of the Gaussian kernel [26, Lemma 1]:

e^{−ν ||x − y||² / 2} ≈ φ(x)^T φ(y).   (18)

Now, given an input embedding h, if we sample class i with probability q_i ∝ exp(−τ ||c_i − h||² / 2), then it follows from (16) that our sampling distribution is the same as the softmax distribution. Therefore, with normalized embeddings, one can employ kernel-based sampling to realize the sampled softmax such that class i is sampled with the probability

q_i ∝ φ(c_i)^T φ(h).   (19)

We refer to this method as Random Fourier Softmax (RF-softmax). RF-softmax costs O(D log n) to sample one class (cf. Section 3.1). Note that computing the nonlinear map takes O(Dd) time with the classic Random Fourier Features. One can easily use the structured orthogonal random features (SORF) technique [26] to reduce this complexity to O(D log d) with even lower approximation error. Since the embedding dimension d is typically on the order of hundreds and we consider large n with d ≪ n, the overall complexity of RF-softmax is O(D log n).

As shown in Table 1, the RFF map approximates the exponential kernel with a mean squared error that is orders of magnitude smaller than that of the quadratic map with the same mapping dimension D. The use of RFF raises interesting implementation-related challenges in terms of selecting the temperature parameter ν to realize low-bias sampling, which we address in the following subsection.
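A minimal sketch of the sampler defined by (17)-(19) follows (our own code, not the authors' implementation; it uses the plain RFF map rather than SORF, assumes unit-norm embeddings, and, for clarity, normalizes q over all n classes instead of using the O(D log n) tree of Section 3.1; all constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, D, m = 64, 5000, 1024, 20
tau = 10.0
nu = 4.0            # kernel parameter of the RFF map; its choice is discussed in Section 3.3

h = rng.normal(size=d); h /= np.linalg.norm(h)
C = rng.normal(size=(n, d)); C /= np.linalg.norm(C, axis=1, keepdims=True)

# RFF map for the Gaussian kernel exp(-nu * ||x - y||^2 / 2), cf. (17)-(18):
# rows of W are drawn from N(0, nu * I) so that E[phi(x) @ phi(y)] equals the kernel.
W = np.sqrt(nu) * rng.normal(size=(D, d))
def phi(U):
    proj = U @ W.T
    return np.concatenate((np.cos(proj), np.sin(proj)), axis=-1) / np.sqrt(D)

# RF-softmax sampling distribution q_i proportional to phi(c_i)^T phi(h), cf. (19).
# Slightly negative scores can occur for finite D; clip them before normalizing.
scores = phi(C) @ phi(h)
q = np.clip(scores, 1e-12, None)
q /= q.sum()

negatives = rng.choice(n, size=m, p=q)     # m sampled classes for one training step
print(negatives)
```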
3.3 Analysis and discussions

Recall that the discussion following Theorem 1 implies that, for low-bias gradient estimates, the sampling distribution q_i needs to form a tight multiplicative approximation of the softmax distribution p_i, where p_i ∝ exp(o_i) = exp(τ h^T c_i). In the following result, we quantify the quality of the proposed RF-softmax based on the ratio |p_i / q_i|. The proof of the result is presented in Appendix B.
Theorem 2. Given the ℓ2-normalized input embedding h and ℓ2-normalized class embeddings {c_1, ..., c_n}, let o_i = τ h^T c_i be the logit associated with the class i. Let q_i denote the probability of sampling the i-th class based on D-dimensional Random Fourier Features, i.e., q_i = (1/C) · φ(c_i)^T φ(h), where C is the normalizing constant. Then, as long as e^{2ν} ≤ γ √D / (√d log D), the following holds with probability at least 1 − O(1/D²):

e^{(τ − ν) h^T c_i} · (1 − 2γ) ≤ ( 1 / Σ_{i∈N_t} e^{o_i} ) · | e^{o_i} / q_i | ≤ e^{(τ − ν) h^T c_i} · (1 + 4γ),   (20)

where γ and ρ are positive constants.

Remark 1. With large enough D, we may invoke Theorem 2 with ν = τ and

γ = a′ · ( e^{2τ} · ρ · √d log D / √D ),

where a′ > 1 is a fixed constant. In this case, since γ = o_D(1), it follows from (20) that q_i ∝ (1 ± o_D(1)) · p_i for each i ∈ N_t. In particular, at D = ∞, we have q_i ∝ p_i.

We now combine Theorem 1 and Theorem 2 to obtain the following corollary, which characterizes the bias of the gradient estimate for RF-softmax in the regime where D is large. The proof is presented in Appendix C.

Corollary 1. Let q_i = (1/C) · φ(c_i)^T φ(h) denote the probability of sampling the i-th class under RF-softmax. Then, with high probability, for large enough D, selecting ν = τ ensures that

−o_D(1) · 1 ≤ E[∇_θ L′] − ∇_θ L ≤ ( o_D(1) + o(1/m) ) · g + ( o_D(1) + o(1/m) ) · 1,

where g ≜ Σ_{j∈N_t} e^{o_j} ∇_θ o_j and 1 is the all one vector. Here, the expectation is taken over the sampling distribution q.
Note that Theorem 2 and Remark 1 highlight an important design issue in the implementation of the RF-softmax approach. The ability of q to approximate p (as stated in (20)) degrades as the difference |τ − ν| increases. Therefore, one would like to pick ν to be as close to τ as possible (ideally exactly equal to τ). However, for a fixed dimensional feature map φ, the approximation guarantee in (20) holds only for those values of ν such that e^{2ν} ≤ γ √D / (√d log D). Therefore, the dimension of the feature map dictates which Gaussian kernels we can utilize in the proposed RF-softmax approach. On the other hand, the variance of the kernel approximation of Random Fourier Features grows with ν [26].

Additionally, in order to work with normalized embeddings, it is necessary to select a reasonably large value for the temperature parameter⁴ τ. Therefore, choosing ν = τ in this case will result in a larger variance of the estimated kernel.

Remark 2. As a trade-off between bias and variance, while approximating the exponential kernel with a limited D and large τ, ν should be set to a value smaller than τ.

In Section 4, we explore different choices for the value of ν and confirm that some ν < τ achieves the best empirical performance.
⁴ This provides a wide enough range for the logit values, thus ensuring the underlying model has sufficient expressive power. The typical choice ranges from 5 to 30.
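To get a concrete feel for this trade-off, one can compare the RF-softmax distribution against the full softmax distribution while sweeping ν at a fixed D. The sketch below is our own probe on random unit-norm embeddings (the values of τ, D, d and n are arbitrary); Remark 2 predicts that, when D is limited, the distance is minimized by some ν strictly smaller than τ.

```python
import numpy as np

def tv_distance_rf_vs_softmax(nu, tau=10.0, D=1024, d=64, n=5000, seed=0):
    """Total variation distance between the full softmax distribution p and the
    RF-softmax distribution q built with kernel parameter nu (a rough probe only)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=d); h /= np.linalg.norm(h)
    C = rng.normal(size=(n, d)); C /= np.linalg.norm(C, axis=1, keepdims=True)
    p = np.exp(tau * C @ h); p /= p.sum()                # cf. (2)

    W = np.sqrt(nu) * rng.normal(size=(D, d))            # w_i ~ N(0, nu * I), cf. (17)
    phi = lambda U: np.concatenate((np.cos(U @ W.T), np.sin(U @ W.T)),
                                   axis=-1) / np.sqrt(D)
    q = np.clip(phi(C) @ phi(h), 1e-12, None); q /= q.sum()   # cf. (19)
    return 0.5 * np.abs(p - q).sum()

for nu in (2.0, 4.0, 6.0, 8.0, 10.0):
    print(nu, tv_distance_rf_vs_softmax(nu))
```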
4 Experiments

In this section, we experimentally evaluate the proposed RF-softmax and compare it with several baselines using simple neural networks, to validate the utility of the proposed sampling method and the accompanying theoretical analysis. We show that, in terms of the final model quality, RF-softmax with computational cost O(D log n) (D ≪ d²) performs on par with the sampled softmax approach based on the full softmax distribution, which incurs a computational cost of O(dn). We also show that RF-softmax outperforms Quadratic-softmax [12], which uses a quadratic function to approximate the exponential kernel and has computational cost O(d² log n). In addition, we highlight the effect of different design choices regarding the underlying RFF map φ (cf. (17)) on the final performance of RF-softmax.

4.1 Datasets and models

NLP datasets. PennTreeBank [31] is a popular benchmark for NLP tasks with a vocabulary of size 10,000. We train a language model using an LSTM, where the normalized output of the LSTM serves as the input embedding. Bnews [32] is another NLP dataset. For this dataset, we select the most frequent 64,000 words as the vocabulary. Our model architecture for Bnews is the same as the one used for PennTreeBank, with more parameters. We fix the embedding dimension to d = 200 for PennTreeBank and d = 512 for Bnews.
Extreme classification datasets. We test the proposed method on three classification datasets with a large number of classes [33]. For each data point that is defined by a v-dimensional sparse feature vector, we first map the data point to a 128-dimensional vector by using a v × 128 matrix. Once normalized, this vector serves as the input embedding h. For i ∈ [n], class i is mapped to a 128-dimensional normalized vector c_i. Following the convention of extreme classification literature [34, 35, 4], we report precision at k (prec@k) on the test set.

Recall that the proposed RF-softmax draws samples with computational complexity O(D log n). We compare it with the following baselines.

• Full, the full softmax loss, which has computational complexity O(dn).
• Exp, the sampled softmax approach with the full softmax distribution (cf. (2)) as the sampling distribution. This again has computational complexity O(dn) to draw one sample.
• Uniform, the sampled softmax approach that draws negative classes uniformly at random, amounting to O(1) computational complexity to draw one sample.
• Quadratic, the sampled softmax approach based on a quadratic kernel (cf. (15)). We follow the implementation of [12] and use α o² + 1 with α = 100. Since α o² + 1 is a better approximation of e^{|o|} than it is of e^o, [12] proposes to use the absolute softmax function p̃_i = e^{τ |o_i|} / Σ_{j∈[n]} e^{τ |o_j|} while computing the cross entropy loss. We employ this modified loss function for the quadratic kernel as it gives better results as compared to the full softmax loss in (3). The cost of drawing one sample with this method is O(d² log n).

Here, we refer to 1/√τ as the temperature of softmax (cf. (2)). We set this parameter to 0.3 as it leads to the best performance for the Full baseline. This is a natural choice given that sampled softmax aims at approximating this loss function with a small subset of (sampled) negative classes.
4.2 Experimental results

Wall time. We begin with verifying that RF-softmax indeed incurs a low sampling cost. In Table 2, we compare the wall time that different model-dependent sampling methods take to compute the sampled softmax loss (cf. (6)) for different sets of parameters.

# classes (n)   Method            Wall time
10,000          Exp               1.4 ms
                Quadratic         6.5 ms
                Rff (D = 50)      0.5 ms
                Rff (D = 200)     0.6 ms
                Rff (D = 500)     1.2 ms
                Rff (D = 1,000)   1.4 ms
500,000         Exp               32.3 ms
                Quadratic         8.2 ms
                Rff (D = 50)      1.6 ms
                Rff (D = 200)     1.7 ms
                Rff (D = 500)     2.0 ms
                Rff (D = 1,000)   2.4 ms

Table 2: Comparison of wall time for computing the sampled softmax loss for different model-dependent sampling methods. (Batch size = 10, m = 10 and d = 64.)
Normalized vs. unnormalized embeddings. To justify our choice of working with normalized embeddings, we ran experiments on Full with and without embedding normalization for both PennTreeBank and AmazonCat-13k. On PennTreeBank, after 10 epochs, the unnormalized version has a much worse validation perplexity of 126, compared to 120 for the normalized version. On AmazonCat-13k, both versions have a prec@1 of 87%.

Next, we discuss the key design choices for RF-softmax: the choice of the Gaussian sampling kernel defined by ν (cf. (16)), and the dimension of the RFF map D.
Figure 1: Validation perplexity for RF-softmax on PennTreeBank with m = 100, D = 1024 and varying values of T (from 0.3 to 0.8).

Figure 2: Validation perplexity for RF-softmax on PennTreeBank with m = 100 and varying D (128, 256, 512, 1024), compared with Full.

Figure 3: Comparison of RF-softmax with other baselines (Full, Exp, Quadratic, Uniform) on PennTreeBank with m = 100 and validation perplexity as the metric.
Effect of the parameter ν. As discussed in Section 3.3, for a finite D, we should choose ν < τ as a trade-off between the variance and the bias. Figure 1 shows the performance of the proposed RF-softmax method on PennTreeBank for different values of ν. In particular, we vary T = 1/√ν, as it defines the underlying RFF map (cf. (18)). The best performance is attained at T = 0.5. This choice of ν < τ is in line with our discussion in Section 3.3. We use this same setting in the remaining experiments.
Effect of D. The accuracy of the approximation of the Gaussian kernel using the RFF map improves as we increase the dimension of the map D. Figure 2 demonstrates the performance of the proposed RF-softmax for varying D.
Dataset                 Method      prec@1   prec@3   prec@5
AmazonCat-13k           Exp         0.87     0.76     0.62
(n = 13,330,            Uniform     0.83     0.69     0.55
v = 203,882)            Quadratic   0.84     0.74     0.60
                        Rff         0.87     0.75     0.61
Delicious-200k          Exp         0.42     0.38     0.37
(n = 205,443,           Uniform     0.36     0.34     0.32
v = 782,585)            Quadratic   0.40     0.36     0.34
                        Rff         0.41     0.37     0.36
WikiLSHTC               Exp         0.58     0.37     0.29
(n = 325,056,           Uniform     0.47     0.29     0.22
v = 1,617,899)          Quadratic   0.57     0.37     0.28
                        Rff         0.56     0.35     0.26

Table 3: Comparison among sampled softmax methods on extreme classification datasets. We report the metrics based on the same number of training iterations for all methods.
