`
`Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, and Sanjiv Kumar
`
`Google Research
`New York, NY 10011
`{ankitsrawat, chenjiecao, felixyu, theertha, sanjivk}@google.com.
`
`January 1, 2020
`
`Abstract
`The computational cost of training with softmax cross entropy loss grows linearly with the
`number of classes. For the settings where a large number of classes are involved, a common
`method to speed up training is to sample a subset of classes and utilize an estimate of the loss
`gradient based on these classes, known as the sampled softmax method. However, the sampled
`softmax provides a biased estimate of the gradient unless the samples are drawn from the
`exact softmax distribution, which is again expensive to compute. Therefore, a widely employed
`practical approach involves sampling from a simpler distribution in the hope of approximating
`the exact softmax distribution. In this paper, we develop the first theoretical understanding of
`the role that different sampling distributions play in determining the quality of sampled softmax.
`Motivated by our analysis and the work on kernel-based sampling, we propose the Random
`Fourier Softmax (RF-softmax) method that utilizes the powerful Random Fourier Features to
`enable more efficient and accurate sampling from an approximate softmax distribution. We show
`that RF-softmax leads to low bias in estimation in terms of both the full softmax distribution
`and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically
`with the number of classes.
`
1 Introduction
`
The cross entropy loss based on the softmax function is widely used in multi-class classification tasks
`such as natural language processing [1], image classification [2], and recommendation systems [3].
`In multi-class classification, given an input x ∈ X , the goal is to predict its class t ∈ {1, 2, . . . , n},
`where n is the number of classes. Given an input feature x, the model (often a neural network) first
`computes an input embedding h ∈ Rd and then the raw scores or logits for classes o = (o1, . . . , on)
`as the product of the input embedding h and the class embeddings c1, . . . , cn ∈ Rd,
`
o_i = \tau\, h^\top c_i.    (1)

Here, τ is often referred to as the (inverse) temperature parameter of softmax. Given the logits, the
probability that the model assigns to the i-th class is computed using the full softmax function

p_i = e^{o_i}/Z,    (2)

where Z = \sum_{i=1}^{n} e^{o_i} is called the partition function. The distribution in (2) is commonly referred to
as the softmax distribution. Given a training set, the model parameters are estimated by minimizing
`
`
`
`
`an empirical risk over the training set, where the empirical risk is defined by the cross-entropy loss
`based on softmax function or the full softmax loss. Let t ∈ [n] denote the true class for the input x,
then the full softmax loss is defined as¹

L(x, t) := -\log p_t = -o_t + \log Z.    (3)

One typically employs first order optimization methods to train neural network models. This
requires computing the gradient of the loss with respect to the model parameters θ during each
iteration:

\nabla_{\theta} L(x, t) = -\nabla_{\theta} o_t + \sum_{i=1}^{n} \frac{e^{o_i}}{Z}\,\nabla_{\theta} o_i = -\nabla_{\theta} o_t + \mathbb{E}_{s \sim p}\big[\nabla_{\theta} o_s\big],    (4)
`where the expectation is taken over the softmax distribution (cf. (2)). As evident from (4), computing
`the gradient of the full softmax loss takes O(dn) time due to the contributions from all n classes.
`Therefore, training a model using the full softmax loss becomes prohibitively expensive in the settings
`where a large number of classes are involved. To this end, various approaches have been proposed for
`efficient training. This includes different modified loss functions: hierarchical softmax [5] partitions
`the classes into a tree based on class similarities, allowing for O(d log n) training and inference time;
spherical softmax [6, 7] replaces the exponential function by a quadratic function, enabling an efficient
algorithm to compute updates of the output weights irrespective of the output size. Efficient
`hardware-specific implementations of softmax are also being actively studied [8].
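
To make the cost concrete, the following sketch (our illustration in NumPy, not code from the paper) computes the full softmax loss (3) and its gradient with respect to the logits; every step touches all n class embeddings, which is the source of the O(dn) cost per example.

    import numpy as np

    def full_softmax_loss_and_grad(h, C, t, tau=1.0):
        # h: (d,) input embedding; C: (n, d) class embeddings; t: true class; tau: inverse temperature in (1)
        o = tau * (C @ h)              # logits o_i = tau * h^T c_i -- an O(dn) product
        o = o - o.max()                # shift for numerical stability; the softmax is unchanged
        e = np.exp(o)
        p = e / e.sum()                # softmax distribution (2)
        loss = -np.log(p[t])           # L(x, t) = -o_t + log Z
        grad_o = p.copy()              # dL/do_i = p_i for i != t ...
        grad_o[t] -= 1.0               # ... and p_t - 1 for the true class
        return loss, grad_o

    rng = np.random.default_rng(0)
    n, d = 10_000, 64
    C = rng.normal(size=(n, d))
    h = rng.normal(size=d)
    loss, grad_o = full_softmax_loss_and_grad(h, C, t=3, tau=1.0)
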
`
`1.1 Sampled softmax
A popular approach to speed up training with the full softmax loss is sampled softmax: instead
of including all classes during each iteration, a small random subset of the n classes is considered, where
`each negative class is sampled with some probability. Formally, let the number of sampled classes
during each iteration be m, with class i being picked with probability q_i. Let N_t ≜ [n]\{t} be the
set of negative classes. Assuming that s_1, . . . , s_m ∈ N_t denote the sampled class indices, following
[9], we define the adjusted logits o′ = {o′_1, o′_2, . . . , o′_{m+1}} such that o′_1 = o_t and, for i ∈ [m],

o'_{i+1} = o_{s_i} - \log(m\, q_{s_i}).    (5)

Accordingly, we define the sampled softmax distribution as p′_i = e^{o′_i}/Z′, where Z′ = \sum_{j=1}^{m+1} e^{o′_j}. The
sampled softmax loss corresponds to the cross entropy loss with respect to the sampled softmax
distribution:

L'(x, t) = -\log p'_t = -o_t + \log Z'.    (6)
Here, we note that adjusting the logits for the sampled negative classes using their expected number
of occurrences in (5) ensures that Z′ is an unbiased estimator of Z [9]. Since L′(x, t) depends only
on m + 1 classes, the computational cost is reduced from O(dn) to O(dm) as compared to the full
softmax loss in (3).
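
The estimator is straightforward to implement; a minimal NumPy sketch (ours, with a caller-supplied sampling distribution q and sampling with replacement assumed) is:

    import numpy as np

    def sampled_softmax_loss(o, t, q, m, rng):
        # o: (n,) full logits; t: true class; q: (n,) sampling distribution; m: number of negatives
        n = len(o)
        q_neg = q.copy()
        q_neg[t] = 0.0
        q_neg /= q_neg.sum()                     # restrict q to the negative classes N_t
        s = rng.choice(n, size=m, p=q_neg)       # sampled negatives s_1, ..., s_m
        adj = o[s] - np.log(m * q_neg[s])        # adjusted logits o'_{i+1} = o_{s_i} - log(m q_{s_i})  (5)
        logits = np.concatenate(([o[t]], adj))   # o'_1 = o_t
        logits -= logits.max()                   # numerical stability; does not change -log p'_t
        p_prime = np.exp(logits) / np.exp(logits).sum()
        return -np.log(p_prime[0])               # L'(x, t) = -o_t + log Z'  (6)

Only the true class and the m sampled classes enter the computation, which is where the O(dm) cost comes from once the corresponding logits are formed.
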
`In order to realize the training with the full softmax loss, one would like the gradient of the
sampled softmax loss to be an unbiased estimator of the gradient of the full softmax loss², i.e.,

\mathbb{E}\big[\nabla_{\theta} L'\big] = \nabla_{\theta} L,    (7)
`
¹The results of this paper generalize to a multi-label setting by using multi-label to multi-class reductions [4].
²Since it is clear from the context, in what follows, we denote L(x, t) and L′(x, t) by L and L′, respectively.
`
`
`
`
`where the expectation is taken over the sampling distribution q. As it turns out, the sampling
distribution plays a crucial role in ensuring the unbiasedness of ∇_θL′. Bengio and Senécal [9] show
that (7) holds if the sampling distribution is the full softmax distribution itself, i.e., q_i ∝ e^{o_i} (cf. (2)).
`However, sampling from the softmax distribution itself is again computationally expensive: one
`needs to compute the partition function Z during each iteration, which is again an O(dn) operation
`since Z depends both on the current model parameters and the input. As a feasible alternative, one
`usually samples from a distribution which does not depend on the current model parameters and the
`input. Common choices are uniform, log-uniform, or the global prior of classes [10, 1]. However, since
`these distributions are far from the full softmax distribution, they can lead to significantly worse
`solutions. Various approaches have been proposed to improve negative sampling. For example, a
`separate model can be used to track the distribution of softmax in language modeling tasks [9]. One
`can also use an LSH algorithm to find the approximate nearest classes in the embedding space which
`in turn helps in sampling from the softmax distribution efficiently [11]. Quadratic kernel softmax
`[12] uses a kernel-based sampling method and quadratic approximation of the softmax function
`to draw each sample in sublinear time. Similarly, the Gumbel trick has been proposed to sample
`from the softmax distribution in sublinear time [13]. The partition function can also be written in a
`double-sum formulation to enable an unbiased sampling algorithm for SGD [14, 15].
`Among other training approaches based on sampled losses, Noise Contrastive Estimation (NCE)
`and its variants avoid computing the partition function [16], and (semi-)hard negative sampling
`[17, 4, 18] selects the negatives that most violate the current objective function. Hyvärinen [19]
`proposes minimization of Fisher divergence (a.k.a. score matching) to avoid computation of the
`partition function Z. However, in our setting, the partition function depends on the input embedding
`h, which changes during the training. Thus, while calculating the score function (taking derivative
`of Z with respect to (h, c)), the partition function has a non-trivial contribution which makes this
`approach inapplicable to our setting. We also note the existence of MCMC based approaches in
`the literature (see, e.g., [20]) for sampling classes with a distribution that is close to the softmax
`distribution. Such methods do not come with precise computational complexity guarantees.
`
`1.2 Our contributions
`
`Theory. Despite a large body of work on improving the quality of sampled softmax, developing a
`theoretical understanding of the performance of sampled softmax has not received much attention.
`Blanc and Rendle [12] show that the full softmax distribution is the only distribution that provides
an unbiased estimate of the true gradient ∇_θL. However, it is not clear how different sampling
distributions affect the bias ∇_θL − E[∇_θL′]. In this paper, we address this issue and characterize
`the bias of the gradient for a generic sampling distribution (cf. Section 2).
`
`Algorithm. In Section 3, guided by our analysis and recognizing the practical appeal of kernel-based
`sampling [12], we propose Random Fourier softmax (RF-softmax), a new kernel-based sampling
`method for the settings with normalized embeddings. RF-softmax employs the powerful Random
`Fourier Features [21] and guarantees small bias of the gradient estimate. Furthermore, the complexity
`of sampling one class for RF-softmax is O(D log n), where D denotes the number of random features
`used in RF-softmax. In contrast, assuming that d denotes the embedding dimension, the full softmax
and the prior kernel-based sampling method (Quadratic-softmax [12]) incur O(dn) and O(d² log n)
computational cost to generate one sample, respectively. In practice, D can be two orders of
magnitude smaller than d² while achieving similar or better performance. As a result, RF-softmax has
`two desirable features: 1) better accuracy due to lower bias and 2) computational efficiency due to
`low sampling cost.
`
`
`
`
`Experiments. We conduct experiments on widely used NLP and extreme classification datasets to
`demonstrate the utility of the proposed RF-softmax method (cf. Section 4).
`
`2 Gradient bias of sampled softmax
`
`The goal of sampled softmax is to obtain a computationally efficient estimate of the true gradient
∇_θL (cf. (4)) of the full softmax loss (cf. (3)) with small bias. In this section, we develop a theoretical
`understanding of how different sampling distributions affect the bias of the gradient. To the best of
`our knowledge, this is the first result of this kind.
`For the cross entropy loss based on the sampled softmax (cf. (6)), the training algorithm employs
the following estimate of ∇_θL:

\nabla_{\theta} L' = -\nabla_{\theta} o_t + \frac{e^{o_t}\,\nabla_{\theta} o_t + \sum_{i\in[m]} \frac{e^{o_{s_i}}}{m\,q_{s_i}}\,\nabla_{\theta} o_{s_i}}{e^{o_t} + \sum_{i\in[m]} \frac{e^{o_{s_i}}}{m\,q_{s_i}}}.    (8)

The following result bounds the bias of the estimate ∇_θL′. Without loss of generality, we work
with sampling distributions that assign strictly positive probability to each negative class, i.e.,
q_i > 0 for all i ∈ N_t.

Theorem 1. Let ∇_θL′ (cf. (8)) be the estimate of ∇_θL based on m negative classes s_1, . . . , s_m,
drawn according to the sampling distribution q. We further assume that the gradients of the logits
∇_θo_i have their coordinates bounded³ by M. Then, the bias of ∇_θL′ satisfies

\mathrm{LB} \;\le\; \mathbb{E}\big[\nabla_{\theta} L'\big] - \nabla_{\theta} L \;\le\; \mathrm{UB},    (9)

with

\mathrm{LB} \triangleq -\Big( \frac{M}{m Z^{2}} \sum_{k\in N_t} \Big| Z_t - \frac{e^{o_k}}{q_k} \Big| + o\big(\tfrac{1}{m}\big) \Big)\cdot \mathbf{1},    (10)

\mathrm{UB} \triangleq \Big( \mathrm{UB}_1 + o\big(\tfrac{1}{m}\big) \Big)\cdot g + \Big( \mathrm{UB}_2 + o\big(\tfrac{1}{m}\big) \Big)\cdot \mathbf{1},    (11)

where

\mathrm{UB}_1 \triangleq \frac{\sum_{j\in N_t} e^{2o_j}/q_j - Z_t^{2}}{m Z^{3}}, \qquad
\mathrm{UB}_2 \triangleq \frac{2M}{m}\,\max_{i,i'\in N_t} \Big| \frac{e^{o_i}}{q_i} - \frac{e^{o_{i'}}}{q_{i'}} \Big| \cdot \Big( \frac{1}{Z^{2}} + \frac{\sum_{j\in N_t} e^{2o_j}/q_j}{Z^{3}} \Big),

and Z_t ≜ \sum_{j\in N_t} e^{o_j}, g ≜ \sum_{j\in N_t} e^{o_j}\,\nabla_{\theta} o_j, and 1 is the all-ones vector.
`j∈Nt
`j∈Nt
`The proof of Theorem 1 is presented in Appendix A. Theorem 1 captures the effect of the
underlying sampling distribution q on the bias of the gradient estimate in terms of three (closely related)
`quantities:
`
\sum_{j\in N_t} \frac{e^{2o_j}}{q_j}, \qquad \max_{j,j'\in N_t}\Big|\frac{e^{o_j}}{q_j} - \frac{e^{o_{j'}}}{q_{j'}}\Big|, \qquad\text{and}\qquad \sum_{k\in N_t}\Big|\sum_{j\in N_t} e^{o_j} - \frac{e^{o_k}}{q_k}\Big|.    (12)

Ideally, we would like to pick a sampling distribution for which all these quantities are as small as
possible. Since q is a probability distribution, it follows from the Cauchy-Schwarz inequality that

\sum_{j} \frac{e^{2o_j}}{q_j} = \Big(\sum_{j} q_j\Big)\cdot\Big(\sum_{j} \frac{e^{2o_j}}{q_j}\Big) \;\ge\; \Big(\sum_{j} e^{o_j}\Big)^{2}.    (13)
`
³This assumption naturally holds in most practical implementations, where each of the gradient coordinates
or the norm of the gradient is clipped by a threshold.
`
`
`
`
If q_j ∝ e^{o_j}, then (13) is attained (equivalently, \sum_{j} e^{2o_j}/q_j is minimized). In particular, this implies that
UB_1 in (11) disappears. Furthermore, for such a distribution we have q_j = e^{o_j}/\sum_{i\in N_t} e^{o_i}. This implies
that both UB_2 and LB disappear for such a distribution as well. This guarantees a small bias of the
gradient estimate ∇_θL′.
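
As a quick check of this claim, substituting q_j = e^{o_j}/Z_t into the quantities appearing in Theorem 1 gives

\sum_{j\in N_t}\frac{e^{2o_j}}{q_j} = Z_t \sum_{j\in N_t} e^{o_j} = Z_t^{2}, \qquad \frac{e^{o_j}}{q_j} = Z_t \ \ \text{for every } j\in N_t,

so the numerator of UB_1 vanishes, and both the max term in UB_2 and the term |Z_t − e^{o_k}/q_k| in LB are identically zero.
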
Since sampling exactly from the distribution q such that q_j ∝ e^{o_j} is computationally expensive,
one has to resort to other distributions that incur smaller computational cost. However, to ensure
small bias of the gradient estimate ∇_θL′, Theorem 1 and the accompanying discussion suggest that
it is desirable to employ a distribution that ensures that e^{o_j}/q_j is as close to 1 as possible for each
j ∈ N_t and all possible values of the logit o_j. In other words, we are interested in those sampling
distributions that provide a tight uniform multiplicative approximation of e^{o_j} in a computationally
efficient manner.
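
The effect of the sampling distribution is easy to observe numerically. The sketch below (our illustration with synthetic logits; it measures the gradient with respect to the logits rather than the full model parameters) estimates E[∇L′] by Monte Carlo for a uniform q and for q ∝ e^{o}, and compares both against the exact gradient of the full softmax loss; the uniform distribution typically shows a much larger bias.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, trials = 1000, 20, 5000
    o = rng.normal(scale=2.0, size=n)             # synthetic logits
    t = 0                                         # true class
    p = np.exp(o - o.max())
    p /= p.sum()
    full_grad = p.copy()
    full_grad[t] -= 1.0                           # exact gradient of (3) w.r.t. the logits

    def avg_sampled_grad(q):
        # Monte-Carlo average of the gradient of L' (6) w.r.t. the logits under sampling distribution q
        q = q.copy()
        q[t] = 0.0
        q /= q.sum()
        acc = np.zeros(n)
        for _ in range(trials):
            s = rng.choice(n, size=m, p=q)
            adj = o[s] - np.log(m * q[s])                 # adjusted logits (5)
            z_prime = np.exp(o[t]) + np.exp(adj).sum()    # Z'
            g = np.zeros(n)
            g[t] = -1.0 + np.exp(o[t]) / z_prime
            np.add.at(g, s, np.exp(adj) / z_prime)        # repeated samples accumulate
            acc += g
        return acc / trials

    for name, q in [("uniform", np.ones(n)), ("softmax", np.exp(o - o.max()))]:
        bias = np.abs(avg_sampled_grad(q) - full_grad).max()
        print(f"{name:>8} q: max coordinate of |E[grad L'] - grad L| = {bias:.4f}")
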
`This motivates our main contribution in the next section, where we rely on kernel-based sampling
`methods to efficiently implement a distribution that uniformly approximates the softmax distribution.
`
`3 Random Fourier Softmax (RF-Softmax)
`
`In this section, guided by the conclusion in Section 2, we propose Random Fourier Softmax (RF-
softmax), a new sampling method that employs Random Fourier Features to tightly approximate
`the full softmax distribution. RF-softmax falls under the broader class of kernel-based sampling
`methods which are amenable to efficient implementation. Before presenting RF-softmax, we briefly
`describe the kernel-based sampling and an existing method based on quadratic kernels [12].
`
`3.1 Kernel-based sampling and Quadratic-softmax
Given a kernel K : R^d × R^d → R, the input embedding h ∈ R^d, and the class embeddings
c_1, . . . , c_n ∈ R^d, kernel-based sampling selects the class i with probability q_i = K(h, c_i)/\sum_{j=1}^{n} K(h, c_j). Note
that if K(h, c_i) = exp(o_i) = exp(τ h^T c_i), this amounts to directly sampling from the softmax
distribution. Blanc and Rendle [12] show that if the kernel can be linearized by a mapping
φ : R^d → R^D such that K(h, c_i) = φ(h)^T φ(c_i), sampling one point from the distribution takes only
O(D log n) time by a divide-and-conquer algorithm. We briefly review the algorithm in this section.
Under the linearization assumption, the sampling distribution takes the following form:

q_i = \frac{K(h, c_i)}{\sum_{j=1}^{n} \phi(h)^\top \phi(c_j)} = \frac{\phi(h)^\top \phi(c_i)}{\phi(h)^\top \sum_{j=1}^{n} \phi(c_j)}.

The idea is to organize the classes in a binary tree with individual classes at the leaves. We then
sample along a path on this tree recursively until we reach a single class. Each sampling step takes
O(D) time as we can pre-compute \sum_{j\in S} \phi(c_j), where S is any subset of classes. Similarly, when the
embedding of a class changes, the cost of updating all \sum_{j\in S} \phi(c_j) along the path between the root
and this class is again O(D log n).

Note that we pre-compute \sum_{j\in[n]} \phi(c_j) for the root node. Now, suppose the left child and the
right child of the root node divide all the classes into two disjoint sets S_1 and S_2, respectively. In
this case, we pre-compute \sum_{j\in S_1} \phi(c_j) and \sum_{j\in S_2} \phi(c_j) for the left child and the right child
of the root node, respectively. First, the probability of the sampled class being in S_1 is

q_{S_1} = \frac{\phi(h)^\top \sum_{j\in S_1} \phi(c_j)}{\phi(h)^\top \sum_{j\in S_1} \phi(c_j) + \phi(h)^\top \sum_{j\in S_2} \phi(c_j)}.    (14)
`
`
Method                  D       MSE
Quadratic [12]          256²    2.8e-3
Random Fourier [21]     100     2.6e-3
Random Fourier [21]     1,000   2.7e-4
Random Fourier [21]     256²    5.5e-6
Random Maclaurin [22]   256²    8.8e-2

Table 1: Mean squared error (MSE) of approximating a kernel exp(τ h^T c). h and c are randomly sampled
from the USPS dataset (d = 256). The data is normalized, i.e., ‖h‖₂ = 1, ‖c‖₂ = 1. For Quadratic, we
assume the form α · (h^T c_i)² + β and solve for α and β in a linear system to get the optimal MSE. In practice,
with fixed α and β as in [12], the MSE will be larger. Random Fourier has a much lower MSE with the same D,
and a much smaller D with similar MSE. Also note that Random Fourier and Random Maclaurin are unbiased.
`
And the probability of the sampled class being in S_2 is 1 − q_{S_1}. We then recursively sample along a
path until a single class is reached, i.e., we reach a leaf node. After updating the embedding of the
sampled class, we recursively trace up the tree to update the sums stored on all the nodes until we
reach the root node.
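
A compact sketch of this divide-and-conquer sampler (our illustration of the scheme described above, using an array-based complete binary tree; any linearizing feature map φ can be plugged in, and the class count is padded to a power of two with zero feature vectors):

    import numpy as np

    class KernelTreeSampler:
        # Draws class i with probability proportional to phi(h)^T phi(c_i) in O(D log n) per sample.
        def __init__(self, phi_c):
            # phi_c: (n, D) array whose rows are phi(c_1), ..., phi(c_n)
            self.n, self.D = phi_c.shape
            self.size = 1
            while self.size < self.n:
                self.size *= 2
            # node_sums[v] holds the sum of phi(c_j) over the leaves under node v;
            # leaves live at indices size .. 2*size - 1
            self.node_sums = np.zeros((2 * self.size, self.D))
            self.node_sums[self.size:self.size + self.n] = phi_c
            for v in range(self.size - 1, 0, -1):
                self.node_sums[v] = self.node_sums[2 * v] + self.node_sums[2 * v + 1]

        def sample(self, phi_h, rng):
            # Walk from the root to a leaf, branching with probability (14) at every internal node.
            v = 1
            while v < self.size:
                left = max(phi_h @ self.node_sums[2 * v], 0.0)       # clamp tiny negatives that an
                right = max(phi_h @ self.node_sums[2 * v + 1], 0.0)  # approximate feature map may produce
                total = left + right
                p_left = left / total if total > 0 else 0.5
                v = 2 * v if rng.random() < p_left else 2 * v + 1
            return v - self.size                                     # leaf index = class index

        def update(self, i, new_phi_ci):
            # After class i's embedding changes, refresh the O(log n) partial sums on its leaf-to-root path.
            v = self.size + i
            delta = new_phi_ci - self.node_sums[v]
            while v >= 1:
                self.node_sums[v] += delta
                v //= 2

With the RFF map of Section 3.2 as φ, one call to sample draws one negative class; combined with update after each parameter step, the per-sample cost stays O(D log n).
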
Given the efficiency of the kernel-based sampling, Blanc and Rendle [12] propose a sampled
softmax approach which utilizes the quadratic kernel to define the sampling distribution:
K_{quad}(h, c_i) = \alpha \cdot (h^\top c_i)^{2} + 1.    (15)

Note that the quadratic kernel can be explicitly linearized by the mapping φ(z) = [√α · (z ⊗ z), 1].
This implies that D = O(d²); consequently, sampling one class takes O(d² log n) time. Despite the
promising results of Quadratic-softmax, it has the following caveats:
`• The quadratic kernel with α = 100, the value used in [12], does not give a tight multiplicative
approximation of the exponential kernel e^{o_j}. Thus, according to the analysis in Section 2, this
`results in a gradient estimate with large bias.
• The O(d²) computational cost can be prohibitive for models with large embedding dimensions.
`• Since a quadratic function is a poor approximation of the exponential function for negative
`numbers, it is used to approximate a modified absolute softmax loss function in [12] instead,
`where absolute values of logits serve as input to the softmax function.
`
`Next, we present a novel kernel-based sampling method that addresses all of these shortcomings.
`
`3.2 RF-softmax
`Given the analysis in Section 2 and low computational cost of the linearized kernel-based sampling
`methods, our goal is to come up with a linearizable kernel (better than quadratic) that provides a
good uniform multiplicative approximation of the exponential kernel K(h, c) = e^{o} = e^{τ h^T c}. More
concretely, we would like to find a nonlinear map φ(·) : R^d → R^D such that the error between K(h, c)
and K̂(h, c) = φ(h)^T φ(c) is small for all values of h and c.
The Random Maclaurin Features [22] seem to be an obvious choice here since they provide an
unbiased estimator of the exponential kernel. However, due to the rank deficiency of the produced
features, they require a large D in order to achieve a small mean squared error [23, 24]. We verify that
`this is indeed a poor choice in Table 1.
In contrast, the Random Fourier Features (RFF) [21] are much more compact [23]. Moreover, these
`features and their extensions are also theoretically better understood (see, e.g., [25, 26, 27]). This
`leads to the natural question: Can we use RFF to approximate the exponential kernel? However, this
`
`
`
`
`approach faces a major challenge at the outset: RFF only works for positive definite shift-invariant
`kernels such as the Gaussian kernel, while the exponential kernel is not shift-invariant.
`A key observation is that when the input embedding h and the class embedding c are normalized,
`the exponential kernel becomes the Gaussian kernel (up to a multiplicative constant):
`
e^{\tau h^\top c} = e^{\tau}\, e^{-\tau \|h - c\|^{2}/2}.    (16)
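
This identity is immediate from the normalization: for unit-norm h and c,

\|h - c\|_2^{2} = \|h\|_2^{2} + \|c\|_2^{2} - 2\,h^\top c = 2 - 2\,h^\top c
\quad\Longrightarrow\quad
e^{-\tau \|h - c\|_2^{2}/2} = e^{-\tau (1 - h^\top c)} = e^{-\tau}\, e^{\tau h^\top c}.
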
`
`We note that normalized embeddings are widely used in practice to improve training stability and
model quality [28, 29, 30]. In particular, normalization attains improved performance as long as τ (cf. (1)) is
large enough to ensure that the output of softmax can cover (almost) the entire range (0, 1). In
`Section 4, we verify that restricting ourselves to the normalized embeddings does not hurt and, in
`fact, improves the final performance.
For the Gaussian kernel K(x − y) = e^{-\nu \|x - y\|^{2}/2} with temperature parameter ν, the D-dimensional
RFF map takes the following form:

\phi(u) = \frac{1}{\sqrt{D}}\,\big[\cos(w_1^\top u), \ldots, \cos(w_D^\top u), \sin(w_1^\top u), \ldots, \sin(w_D^\top u)\big],    (17)

where w_1, . . . , w_D ∼ N(0, I/ν). The RFF map provides an unbiased approximation of the Gaussian
kernel [26, Lemma 1]:

e^{-\nu \|x - y\|^{2}/2} \approx \phi(x)^\top \phi(y).    (18)
Now, given an input embedding h, if we sample class i with probability q_i ∝ exp(−τ‖c_i − h‖²/2),
then it follows from (16) that our sampling distribution is the same as the softmax distribution.
Therefore, with normalized embeddings, one can employ the kernel-based sampling to realize the
sampled softmax such that class i is sampled with the probability

q_i ∝ φ(c_i)^T φ(h).    (19)
We refer to this method as Random Fourier Softmax (RF-softmax). RF-softmax costs O(D log n) to
sample one point (cf. Section 3.1). Note that computing the nonlinear map takes O(Dd) time with
the classic Random Fourier Features. One can easily use the structured orthogonal random feature
(SORF) technique [26] to reduce this complexity to O(D log d) with even lower approximation error.
Since the embedding dimension d is typically on the order of hundreds and we consider large n with
d ≪ n, the overall complexity of RF-softmax is O(D log n).
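
A minimal NumPy sketch of the RFF map (17) and of the resulting sampling weights (19) follows; this is our illustration rather than the paper's implementation, the constants are arbitrary, and the frequency scale is chosen so that E[φ(x)^T φ(y)] = exp(−ν‖x − y‖²/2) for the stated Gaussian kernel.

    import numpy as np

    def make_rff_map(d, D, nu, rng):
        # Returns phi: R^d -> R^{2D} with E[phi(x)^T phi(y)] = exp(-nu * ||x - y||^2 / 2).
        W = rng.normal(scale=np.sqrt(nu), size=(D, d))   # random frequencies w_1, ..., w_D
        def phi(u):
            proj = W @ u
            return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)
        return phi

    rng = np.random.default_rng(0)
    d, D, n, nu = 128, 512, 5_000, 4.0
    phi = make_rff_map(d, D, nu, rng)

    # Normalized input and class embeddings (synthetic here).
    C = rng.normal(size=(n, d))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    h = rng.normal(size=d)
    h /= np.linalg.norm(h)

    phi_C = np.stack([phi(c) for c in C])           # phi(c_i), precomputed once and stored in the tree
    scores = np.clip(phi_C @ phi(h), 1e-12, None)   # clamp tiny negatives from the approximation
    q = scores / scores.sum()                       # RF-softmax sampling distribution (19)

In the full method, the dense product over all classes above is replaced by the O(D log n) tree sampler of Section 3.1, and W can be generated with SORF to cut the cost of applying φ to O(D log d).
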
As shown in Table 1, the RFF map approximates the exponential kernel with a mean squared
error that is orders of magnitude smaller than that of the quadratic map with the same mapping
dimension D. The use of RFF raises interesting implementation-related challenges in terms of
selecting the temperature parameter ν to realize low-bias sampling, which we address in the
`following subsection.
`
`3.3 Analysis and discussions
`Recall that the discussion following Theorem 1 implies that, for low-bias gradient estimates, the
sampling distribution q_i needs to form a tight multiplicative approximation of the softmax distribution
p_i, where p_i ∝ exp(o_i) = exp(τ h^T c_i). In the following result, we quantify the quality of the proposed
RF-softmax based on the ratio |p_i/q_i|. The proof of the result is presented in Appendix B.
`
`
`
`
Theorem 2. Given the ℓ₂-normalized input embedding h and ℓ₂-normalized class embeddings
{c_1, . . . , c_n}, let o_i = τ h^T c_i be the logits associated with the class i. Let q_i denote the probability of
sampling the i-th class based on D-dimensional Random Fourier Features, i.e., q_i = \frac{1}{C} \cdot \phi(c_i)^\top \phi(h),
where C is the normalizing constant. Then, as long as e^{2\nu} \le \frac{\gamma}{\rho} \cdot \frac{\sqrt{D}}{\sqrt{d}\,\log D}, the following holds with
probability at least 1 − O(1/D²):

e^{(\tau - \nu) h^\top c_i}\,(1 - 2\gamma) \;\le\; \frac{1}{\sum_{i\in N_t} e^{o_i}} \cdot \Big|\frac{e^{o_i}}{q_i}\Big| \;\le\; e^{(\tau - \nu) h^\top c_i}\,(1 + 4\gamma),    (20)

where γ and ρ are positive constants.

Remark 1. With large enough D, we may invoke Theorem 2 with ν = τ and

\gamma = a' \cdot \Big( e^{2\tau} \cdot \rho \cdot \frac{\sqrt{d}\,\log D}{\sqrt{D}} \Big),

where a′ > 1 is a fixed constant. In this case, since γ = o_D(1), it follows from (20) that q_i ∝
(1 ± o_D(1)) · p_i, for each i ∈ N_t. In particular, at D = ∞, we have q_i ∝ p_i.

We now combine Theorem 1 and Theorem 2 to obtain the following corollary, which characterizes
the bias of the gradient estimate for RF-softmax in the regime where D is large. The proof is
presented in Appendix C.

Corollary 1. Let q_i = \frac{1}{C} \cdot \phi(c_i)^\top \phi(h) denote the probability of sampling the i-th class under
RF-softmax. Then, with high probability, for large enough D, selecting ν = τ ensures that

-o_D(1)\cdot\mathbf{1} \;\le\; \mathbb{E}\big[\nabla_{\theta} L'\big] - \nabla_{\theta} L \;\le\; \big(o_D(1) + o(1/m)\big)\cdot g + \big(o_D(1) + o(1/m)\big)\cdot\mathbf{1},

where g ≜ \sum_{j\in N_t} e^{o_j}\,\nabla_{\theta} o_j and 1 is the all-ones vector. Here, the expectation is taken over the
sampling distribution q.
`
Note that Theorem 2 and Remark 1 highlight an important design issue in the implementation
of the RF-softmax approach. The ability of q to approximate p (as stated in (20)) degrades as the
difference |τ − ν| increases. Therefore, one would like to pick ν to be as close to τ as possible (ideally
exactly equal to τ). However, for the fixed dimensional feature map φ, the approximation guarantee
in (20) holds only for those values of ν such that e^{2ν} ≤ (γ/ρ) · √D/(√d log D). Therefore, the dimension of the
feature map dictates which Gaussian kernels we can utilize in the proposed RF-softmax approach.
On the other hand, the variance of the kernel approximation of Random Fourier Features grows with
ν [26].

Additionally, in order to work with the normalized embeddings, it is necessary to select a reasonably
large value for the temperature parameter⁴ τ. Therefore, choosing ν = τ in this case will result in
larger variance of the estimated kernel.
`
Remark 2. As a trade-off between bias and variance, while approximating the exponential kernel
with a limited D and a large τ, ν should be set to a value smaller than τ.
`
`In Section 4, we explore different choices for the value of ν and confirm that some ν < τ achieves
`the best empirical performance.
`
⁴This provides a wide enough range for the logit values, thus ensuring that the underlying model has sufficient
expressive power. The typical choice ranges from 5 to 30.
`
`
`
`
`4 Experiments
`
`In this section, we experimentally evaluate the proposed RF-softmax and compare it with several
`baselines using simple neural networks to validate the utility of the proposed sampling method and
`the accompanying theoretical analysis. We show that, in terms of the final model quality, RF-softmax
with computational cost O(D log n) (D ≪ d²) performs on par with the sampled softmax approach
based on the full softmax distribution, which incurs a computational cost of O(dn). We also show
`that RF-softmax outperforms Quadratic-softmax [12] that uses the quadratic function to approximate
`the exponential kernel and has computational cost O(d2 log n). In addition, we highlight the effect
`of different design choices regarding the underlying RFF map φ (cf. (17)) on the final performance
`of RF-softmax.
`
`4.1 Datasets and models
`NLP datasets. PennTreeBank [31] is a popular benchmark for NLP tasks with a vocabulary of
`size 10, 000. We train a language model using LSTM, where the normalized output of the LSTM
`serves as the input embedding. Bnews [32] is another NLP dataset. For this dataset, we select the
`most frequent 64, 000 words as the vocabulary. Our model architecture for Bnews is the same as
`the one used for PennTreeBank with more parameters. We fix the embedding dimension to be
`d = 200 for PennTreeBank and d = 512 for Bnews.
`
`Extreme classification datasets. We test the proposed method on three classification datasets
`with a large number of classes [33]. For each data point that is defined by a v-dimensional sparse
`feature vector, we first map the data point to a 128-dimensional vector by using a v × 128 matrix.
`Once normalized, this vector serves as the input embedding h. For i ∈ [n], class i is mapped
`to a 128-dimensional normalized vector ci. Following the convention of extreme classification
`literature [34, 35, 4], we report precision at k (prec@k) on the test set.
`Recall that the proposed RF-softmax draws samples with computational complexity O(D log n).
`We compare it with the following baselines.
`• Full, the full softmax loss, which has computational complexity O(dn).
`• Exp, the sampled softmax approach with the full softmax distribution (cf. (2)) as the sampling
`distribution. This again has computational complexity O(dn) to draw one sample.
`• Uniform, the sampled softmax approach that draws negative classes uniformly at random,
`amounting to O(1) computational complexity to draw one sample.
• Quadratic, the sampled softmax approach based on a quadratic kernel (cf. (15)). We follow the
implementation of [12] and use αo² + 1 with α = 100. Since αo² + 1 is a better approximation
of e^{|o|} than it is of e^{o}, [12] proposes to use the absolute softmax function p̃_i = e^{τ|o_i|}/\sum_{j\in[n]} e^{τ|o_j|} while
computing the cross entropy loss. We employ this modified loss function for the quadratic kernel
as it gives better results as compared to the full softmax loss in (3). The cost of drawing one
sample with this method is O(d² log n).
Here, we refer to 1/√τ as the temperature of softmax (cf. (2)). We set this parameter to 0.3 as
it leads to the best performance for the Full baseline. This is a natural choice given that sampled
softmax aims at approximating this loss function with a small subset of (sampled) negative classes.
`
`4.2 Experimental results
Wall time. We begin by verifying that RF-softmax indeed incurs a low sampling cost. In Table 2,
we compare the wall time that different model-dependent sampling methods take to compute the
`
`
`
`
# classes (n)   Method            Wall time
10,000          Exp               1.4 ms
10,000          Quadratic         6.5 ms
10,000          Rff (D = 50)      0.5 ms
10,000          Rff (D = 200)     0.6 ms
10,000          Rff (D = 500)     1.2 ms
10,000          Rff (D = 1,000)   1.4 ms
500,000         Exp               32.3 ms
500,000         Quadratic         8.2 ms
500,000         Rff (D = 50)      1.6 ms
500,000         Rff (D = 200)     1.7 ms
500,000         Rff (D = 500)     2.0 ms
500,000         Rff (D = 1,000)   2.4 ms

Table 2: Comparison of wall time for computing the sampled softmax loss for different model-dependent sampling
methods. (Batch size = 10, m = 10, and d = 64.)
`
`sampled softmax loss (cf. (6)) for different sets of parameters.
`
`Normalized vs. unnormalized embeddings. To justify our choice of working with normalized
`embeddings, we ran experiments on Full with and without embedding normalization for both
`PennTreeBank and AmazonCat-13k. On PennTreeBank, after 10 epochs, the unnormalized
version has a much worse validation perplexity of 126, compared to 120 for the normalized version.
On AmazonCat-13k, both versions achieve a prec@1 of 87%.
`Next, we discuss the key design choices for RF-softmax: the choice of the Gaussian sampling
`kernel defined by ν (cf. (16)); and the dimension of the RFF map D.
`
Figure 1: Validation perplexity for RF-softmax on PennTreeBank with m = 100, D = 1024, and varying values of T.

Figure 2: Validation perplexity for RF-softmax on PennTreeBank with m = 100 and varying D.

Figure 3: Comparison of RF-softmax with other baselines on PennTreeBank with m = 100 and validation perplexity as the metric.
`
Effect of the parameter ν. As discussed in Section 3.3, for a finite D, we should choose ν < τ
as a trade-off between the variance and the bias. Figure 1 shows the performance of the proposed
RF-softmax method on PennTreeBank for different values of ν. In particular, we vary T = 1/√ν as
it defines the underlying RFF map (cf. (18)). The best performance is attained at T = 0.5. This
choice of ν < τ is in line with our discussion in Section 3.3. We use this same setting in the remaining
experiments.
`
`Effect of D. The accuracy of the approximation of the Gaussian kernel using the RFF map improves
`
[Figures 1–3 plot validation perplexity (y-axis) against training epochs (x-axis). Legends are as follows. Figure 1: T ∈ {0.3, 0.4, 0.5, 0.6, 0.7, 0.8}; Figure 2: Full and D ∈ {128, 256, 512, 1024}; Figure 3: Full, Exp, RF-softmax (D = 1024), Quadratic, Uniform.]
`
`
`
Dataset                                     Method      prec@1   prec@3   prec@5
AmazonCat-13k (n = 13,330, v = 203,882)     Exp         0.87     0.76     0.62
                                            Uniform     0.83     0.69     0.55
                                            Quadratic   0.84     0.74     0.60
                                            Rff         0.87     0.75     0.61
Delicious-200k (n = 205,443, v = 782,585)   Exp         0.42     0.38     0.37
                                            Uniform     0.36     0.34     0.32
                                            Quadratic   0.40     0.36     0.34
                                            Rff         0.41     0.37     0.36
WikiLSHTC (n = 325,056, v = 1,617,899)      Exp         0.58     0.37     0.29
                                            Uniform     0.47     0.29     0.22
                                            Quadratic   0.57     0.37     0.28
                                            Rff         0.56     0.35     0.26

Table 3: Comparison among sampled softmax methods on extreme classification datasets. We report the
metrics based on the same number of training iterations for all methods.
`
`as we increase the dimension of the map D. Figure 2 demonstrates the performance of the proposed
`RF-softm