`Probability Estimation in Arithmetic and
`Adaptive-Huffman Entropy Coders
`
`Donald L. Duttweiler, Fellow, IEEE, and Christodoulos Chamzas, Senior Member, IEEE
`
`Abstract—Entropy coders, such as Huffman and arithmetic
`coders, achieve compression by exploiting nonuniformity in the
`probabilities under which a random variable to be coded takes
`on its possible values. Practical realizations generally require
`running adaptive estimates of these probabilities. An analysis
`of the relationship between estimation quality and the resulting
`coding efficiency suggests a particular scheme, dubbed scaled-
`count, for obtaining such estimates. It can optimally balance
`estimation accuracy against a need for rapid response to changing
`underlying statistics. When the symbols being coded are from a
`binary alphabet, simple hardware and software implementations
`requiring almost no computation are possible.
`A scaled-count adaptive probability estimator of the type de-
scribed in this paper is used in the arithmetic coder of the JBIG
`and JPEG image coding standards.
`
`I. INTRODUCTION
`
ENTROPY coders achieve compression by exploiting
`nonuniformity in the probabilities under which a signal
`to be compressed takes on its allowed values. Practical
`implementations generally require adaptive, on-line estimation
`of these probabilities either because sufficient statistical
`knowledge of the signal is lacking or because its statistics
`are time varying. How best to make such estimates is a
`question of much practical importance.
`We became interested in this problem from our involvement
`with two standards groups. The joint bilevel image group
`(JBIG) is under the auspices of both CCITT and ISO and
`is chartered to develop a standard for progressively coding
`bilevel (two-tone or black-white) images [1]. The joint pho-
`tographic experts group (JPEG) is under the same auspices
`and is chartered to develop a standard for coding photographic
`(grey-scale or color) images [2], [3]. Both of these groups have
`finished their work and specifications are available [4], [5].
`The JBIG algorithm and the JPEG algorithm (in some of
`its parameterizations) use a common arithmetic coder [6]—[8].
`Although our work was strongly focused on these two image
`coding applications for arithmetic coders, it is clear that much
`of what was eventually learned has application beyond just
`image coding and even for forms of entropy coding other than
`arithmetic, such as adaptive Huffman.
In the next section of this paper, we review adaptive entropy coding and develop some useful analytical relationships between the quality of probability estimation and the resulting coding inefficiency. The analysis strongly suggests a particular structure for adaptive probability estimation. This structure, which is dubbed scaled-count, makes it possible to achieve any desired balance between the steady-state quality of probability estimation and the speed of response to changing underlying statistics. Section III presents this structure and addresses implementation issues. In some environments (those of JBIG and JPEG among them), both software and hardware implementations become extremely simple, requiring almost no computation.

Manuscript received May 31, 1992; revised November 26, 1993. The associate editor coordinating the review of this paper and approving it for publication was Prof. Michel Barlaud. D. L. Duttweiler is with AT&T Bell Laboratories, Holmdel, NJ 07760 USA. C. Chamzas is with Democritus University of Thrace, Xanthi, Greece. IEEE Log Number 9408197.

Fig. 1. General model for entropy encoding and decoding.
`
`II. ENTROPY CODING REVIEW AND ANALYSIS
`
`A. General Framework
Let X(0), X(1), ... be a random process taking on at each time n one value from a finite alphabet of possible values. Since the particular values taken on will just serve as labels in what follows, they can, without loss of any generality, be taken as 0, 1, ..., K - 1, where K denotes the alphabet size.
Fig. 1 shows a general model for one-symbol-at-a-time entropy encoding and decoding. The function of the block labeled "Model" is to provide a vector¹

\hat{P}(n) = [\hat{P}_0(n), \ldots, \hat{P}_{K-1}(n)]^T    (1)

of estimates P̂_k(n) of the probability that X(n) will take on the value k. It may use any of the observed values X(0), ..., X(n - 1) in making that estimate. The unit delay block labeled "D" emphasizes that X(n) itself may not be used in forming P̂(n). If it were to be used, the entropy decoder could not generate a tracking P̂(n).

¹ Throughout this paper, vectors are shown in bold-face type. Vector components are separated by commas and grouped by square brackets. A superscript T denotes vector transposition.

Ideally, the block labeled "entropy mapper" places

\log_2\bigl(1/\hat{P}_k(n)\bigr)    (2)

bits of information on its output stream whenever a symbol X(n) having an instance value k is coded. If the true probabilities P_k(n) of X(n) taking on its various values k are highly
`
`nonuniform with some symbols much more likely than others,
`then the average of the number of bits the mapper uses to code
`a symbol can be well under log2 K, and there is substantial
`advantage to be gained with entropy coding.
`Entropy coders are always lossless. The sequence X(0),
`X(1), . . . is recovered exactly by the decoder. This is generally
`the type of coding desired if the sequence X(0), X(1), ...
`represents a text file in a computer system. It is also common
`for the coding of bilevel images (facsimile) to be done
`losslessly. In coding other types of images and in coding audio,
`some coding distortion is often allowed. Even when coding of
`this sort is being done, however, lossless coding often becomes
`a part of a larger structure in which some intermediate values,
`which are invertible to the original signal only with distortion,
`are themselves coded losslessly. The "baseline" JPEG system
`provides one example of this.
`The name "entropy mapper" is somewhat awkward and
`is nonstandard terminology. The same is true even more so
`for the decoding counterpart labeled "entropy unmapper." The
`reason for coining these terms here is to make it possible to
`carefully distinguish these mapping functions from the total
`entropy encoding and decoding functions. The term "entropy
`encoder" will be reserved for the combination of a probability
`estimator and an entropy mapper. "Entropy decoder" will
`always mean the combination of a probability estimator and
`an entropy unmapper.
`The well-known Huffman scheme [9] can attain the ideal
`of (2) whenever all the Pk (n) happen to equal inverse powers
`of two but, in general, falls somewhat short. The arithmetic
`mapper used in arithmetic coders does not suffer from this
`"breakage" problem and achieves the ideal.
`When the alphabet size K is large, Huffman mapping often
`can come acceptably close to the ideal and can be a viable
`choice for entropy mapping. With binary alphabets, it cannot.
`The breakage problem is unacceptably severe in the interesting
`situation where one of the symbol probabilities is small and
`the other is almost one. If the natural alphabet is binary and it
`is desired to use Huffman mapping, the standard way to solve
`the breakage problem is to recast the original problem into an
`equivalent one with a larger alphabet by grouping consecutive
`symbols X (n) together. With an arithmetic mapper there is
`no need to resort to an artifice like this, and indeed, there
`are significant implementation advantages with the binary
`alphabet.
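As a rough numerical illustration of the breakage problem and of symbol grouping, consider a binary source with symbol probabilities 0.9 and 0.1. The short Python sketch below is ours and uses illustrative values only; it compares the entropy with the expected Huffman code length for single symbols and for pairs of symbols.

```python
# Sketch: entropy versus expected Huffman code length for a skewed binary
# source, coded one symbol at a time and then in pairs.  The probability 0.9
# is an illustrative value, not taken from the paper or the standards.
import heapq
from math import log2

def entropy(probs):
    return sum(p * log2(1.0 / p) for p in probs if p > 0.0)

def huffman_expected_length(probs):
    # The expected codeword length equals the sum of the probabilities of the
    # internal nodes formed while merging (a standard Huffman identity).
    heap = sorted(probs)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, a + b)
        total += a + b
    return total

single = [0.9, 0.1]
pairs = [a * b for a in single for b in single]               # i.i.d. pairs of symbols
print(entropy(single), huffman_expected_length(single))       # about 0.47 vs 1.0 bit/symbol
print(entropy(pairs) / 2, huffman_expected_length(pairs) / 2)  # about 0.47 vs about 0.65 bit/symbol
```

Coding pairs already recovers much of the loss; an arithmetic mapper avoids the loss without enlarging the alphabet.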
Conceptually, an arithmetic coder maps the sequence X(0), X(1), ... to a real number on the unit interval [0.0, 1.0).
`What is transmitted in lieu of the sequence X(0), X(1), . . .
`is a binary encoding of this real number. The particular
`real number to be transmitted is determined recursively as
`shown in Fig. 2 for an example with a binary alphabet and
`X(0), X(1), . . . beginning with the particular instance values
`0, 1, 0.
`In general, at any given time, there exists a "current coding
`interval." Initially, it is the unit interval [0.0, 1.0). At time
`n, it is divided into K subintervals with the kth having a
`size proportional to Pk(n). The current coding interval at time
`n +1 is that subinterval associated with the instance k = X(n)
`actually occurring.
`
Fig. 2. Interval subdivision example with a binary alphabet (symbols to be coded: 0, 1, 0).
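The recursion is easy to state in code. The following minimal Python sketch is ours: it assumes a fixed estimate P̂_0 = 0.6 (an illustrative value), whereas a real coder updates its estimates adaptively and works in finite precision.

```python
# Minimal sketch of recursive interval subdivision for a binary alphabet.
# The fixed estimate p0 is an assumed, illustrative value.
from math import log2

def subdivide(symbols, p0):
    """Return the coding interval [low, high) after coding the given symbols."""
    low, high = 0.0, 1.0
    for s in symbols:
        split = low + p0 * (high - low)   # subintervals sized in proportion to the estimates
        low, high = (low, split) if s == 0 else (split, high)
    return low, high

low, high = subdivide([0, 1, 0], p0=0.6)      # the instance values of Fig. 2
print(low, high, log2(1.0 / (high - low)))    # final interval and the bits needed to identify it
```

Transmitting any number inside the final interval identifies the sequence, and the number of bits needed to do so is essentially the sum of the per-symbol ideal bit counts of (2).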
`
`The idea of coding by recursive interval subdivision is an
`old idea that is arguably an "obvious" and "immediate" appli-
`cation of the usual equations and theory developed in infor-
`mation theory texts. Abramson [10] credits Elias with having
`conceived it soon after Shannon's seminal paper [11] on infor-
`mation theory. Nonetheless, arithmetic coding was initially no
more than an intellectual curiosity because a straightforward
`application of the recursive-interval-subdivision idea requires
`infinite precision arithmetic. What has changed recently is that
`it has been realized that it is possible to implement recursive
`interval subdivision with finite precision arithmetic and in a
`way allowing the "pipelining" of the encoder and decoder
`[12]—[14]. With a pipelined encoder and decoder, the encoder
`can begin to put out the bits of the compressed data stream
`before seeing the entire (possibly infinite) input sequence
`X(0), X(1), .. ., and the decoder can begin to put out decoded
`symbols before seeing the entire (possibly infinite) sequence
`of compressed bits. The existence of finite-precision imple-
`mentations that can be pipelined is crucial for the practical
`application of arithmetic coders, but detail about how these
`properties are achieved is inconsequential and unnecessary
`for the purposes of this paper. Some additional discussion of
`arithmetic coding appears later in Section III-F, but it is only
`to the minimal depth needed to illustrate the points central
`to this paper. Readers interested in more detail on arithmetic
`coding are referred to [7] and [8].
`
`B. What is Desired From the Model?
For notational convenience, let X[i:j] denote the string X(i), ..., X(j). The block labeled "Model" in Fig. 1 is free to use any of the X[0:n-1] in forming its estimate P̂(n). Let

\Psi(n) = \{i : X(i) \text{ is used in forming } \hat{P}(n)\}    (3)

denote the set of time indices of the symbols actually used in forming P̂(n).
The conditional entropy

H(\hat{P}(n) \mid X(i), i \in \Psi(n)) = \sum_{k=0}^{K-1} P_k(n) \log_2\bigl(1/\hat{P}_k(n)\bigr)    (4)

is the average number of bits that will be used to code X(n) given X(i), i ∈ Ψ(n). In this equation,

P_k(n) = \Pr\{X(n) = k \mid X(i), i \in \Psi(n)\}    (5)
`
`
is the true conditional probability of an occurrence of symbol k given the particular pattern X(i), i ∈ Ψ(n), of past symbols that has occurred. The conditional expected bit count provided by (4) is minimized if

\hat{P}(n) = P(n).    (6)

Therefore, not surprisingly, one thing desired from the model for efficient coding is accurate estimates P̂(n).
`More is needed, however. A significant entropy coding
`advantage is only achievable if the probability distribution
`P(n) is nonuniform with one or possibly a few of the symbols
taken on by X(n) much more likely than the others. Thus, the other requirement for good modeling is a conditioning set Ψ(n) allowing good prediction so that, at least as an average over past patterns X(i), i ∈ Ψ(n), the probability estimate P̂(n) is highly nonuniform with one (or, less desirably, a few) symbols predicted to occur with high likelihood.
The unconditional entropy

H(\hat{P}(n) \mid \Psi(n)) = E[H(\hat{P}(n) \mid X(i), i \in \Psi(n))]    (7)

where the expectation is over the possible patterns of the conditioning symbols, is the expected bit count to code symbol n using the probability estimate P̂(n) depending on past symbols X(i), i ∈ Ψ(n). If P̂(n) = P(n) so that the coding is the best possible with the conditioning indices Ψ(n), the expected bit count for coding X(n) is

H(\Psi(n)) = E[H(P(n) \mid X(i), i \in \Psi(n))] = E\left[\sum_{k=0}^{K-1} P_k(n) \log_2\bigl(1/P_k(n)\bigr)\right].    (8)
`
It can be shown (see, for example, pp. 105-111 of [10] or almost any discussion on equivocation in an information theory text) that if

\Psi_1(n) \subset \Psi_2(n)    (9)

then

H(\Psi_1(n)) \ge H(\Psi_2(n)).    (10)

Hence, the best model sets

\Psi(n) = [0, n-1]    (11)

to use all the past information and then makes the probability estimate P̂(n) = P(n).

From a theoretical standpoint, the discussion thus far says all there is to say about modeling. From a practical standpoint, it says nothing. For most real-world applications, statistical knowledge about the source is usually nowhere near sufficient to begin to calculate

\hat{P}_k(n) = P_k(n) = \Pr\{X(n) = k \mid X(i), i \in \Psi(n)\}    (12)

and even if it were, the calculation would in all likelihood be hopelessly complex.

C. Approximating Excess Bit Count

Before specializing the structure of Fig. 1 to something allowing practical realization, it is useful to proceed a little further within the general framework and derive a broadly applicable approximation to excess bit count.

The conditional entropy difference

D(\hat{P}(n) \mid X(i), i \in \Psi(n)) = H(\hat{P}(n) \mid X(i), i \in \Psi(n)) - H(P(n) \mid X(i), i \in \Psi(n)) = \sum_{k=0}^{K-1} P_k(n) \log_2\bigl(P_k(n)/\hat{P}_k(n)\bigr)    (13)

is the expected excess bit count due to error in probability estimation when X(i), i ∈ Ψ(n), has been observed. If we assume that the estimate P̂_k(n) is at least close to P_k(n) so that P̂_k(n)/P_k(n) is close to one and the approximation

\ln x \approx x - 1    (14)

with "ln" denoting the natural logarithm is valid, then (13) becomes

D(\hat{P}(n) \mid X(i), i \in \Psi(n)) \approx \frac{1}{\ln 2} \sum_{k=0}^{K-1} P_k(n)\left(\frac{P_k(n)}{\hat{P}_k(n)} - 1\right) = \frac{1}{\ln 2} \sum_{k=0}^{K-1} \bigl(P_k(n) - \hat{P}_k(n)\bigr)\,\frac{1}{1 + \bigl(\hat{P}_k(n) - P_k(n)\bigr)/P_k(n)}.    (15)

The further approximation

\frac{1}{1 + \epsilon} \approx 1 - \epsilon    (16)

for small ε, and an assumption that P̂_k(n) is sane in the sense that

\sum_{k=0}^{K-1} \hat{P}_k(n) = 1    (17)

leads to

D(\hat{P}(n) \mid X(i), i \in \Psi(n)) \approx \frac{1}{\ln 2} \sum_{k=0}^{K-1} \frac{\bigl(\hat{P}_k(n) - P_k(n)\bigr)^2}{P_k(n)}.    (18)

Let

Q(\hat{P}(n) \mid X(i), i \in \Psi(n)) = \frac{D(\hat{P}(n) \mid X(i), i \in \Psi(n))}{H(P(n) \mid X(i), i \in \Psi(n))}    (19)

denote the fractional (or normalized) excess bit count due to estimation error, that is, the expected coding inefficiency due to estimation error when X(i), i ∈ Ψ(n), has been observed.

A simple formula for this coding inefficiency Q(P̂(n) | X(i), i ∈ Ψ(n)) can be derived under the assumption that one symbol, say, k_max(n), is highly likely to occur, and all the others are unlikely. In the case of a binary alphabet, this will always be true whenever there is anything to be gained from entropy coding. With a multisymbol alphabet, it is a
`
`
somewhat restrictive assumption since entropy coding would be advantageous if a few (as opposed to just one) of the symbols were likely and all the others unlikely. However, additional structure in multisymbol applications will often justify this assumption.

Assuming the symbol k_max(n) is highly likely and all the others unlikely,

H(P(n) \mid X(i), i \in \Psi(n)) \approx (1 - P_{\max}(n)) \log_2\bigl(1/(1 - P_{\max}(n))\bigr) = \frac{1}{\ln 2} A(1 - P_{\max}(n))    (20)

where

P_{\max}(n) = P_{k_{\max}(n)}(n)    (21)

is the probability of the dominating symbol, and for convenience,

A(q) = q \ln(1/q).    (22)

In words, (20) approximates the entropy H(P(n) | X(i), i ∈ Ψ(n)) by the information in the fact that the expected symbol does not occur times the probability of that event. The true entropy also adds in the probability of the expected symbol times its information and the entropy needed to distinguish exactly which of the unlikely events did occur, but these two components of the total entropy are comparatively much smaller.

Using (20) in (19) finally gives

Q(\hat{P}(n) \mid X(i), i \in \Psi(n)) \approx \frac{1}{A(1 - P_{\max}(n))} \sum_{k=0}^{K-1} \frac{\bigl(\hat{P}_k(n) - P_k(n)\bigr)^2}{P_k(n)}.    (23)
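A small numerical check may help make (13), (18), and (19) concrete. The Python sketch below is ours, and the true and estimated probabilities in it are purely illustrative.

```python
# Sketch: evaluating the exact excess bit count (13), the quadratic
# approximation (18), and the fractional inefficiency (19) for K = 2.
# The probability values are illustrative only.
from math import log, log2

p     = [0.95, 0.05]    # assumed true probabilities P_k(n)
p_hat = [0.94, 0.06]    # assumed (slightly wrong) estimates

d_exact  = sum(pk * log2(pk / qk) for pk, qk in zip(p, p_hat))               # (13)
d_approx = sum((qk - pk) ** 2 / pk for pk, qk in zip(p, p_hat)) / log(2)     # (18)
h        = sum(pk * log2(1.0 / pk) for pk in p)                              # true entropy
print(d_exact, d_approx)   # excess bits per symbol, exact and approximate
print(d_exact / h)         # fractional inefficiency Q of (19)
```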
`
`D. Adaptive Probability Estimation
`Fig. 3 shows an entropy encoding structure that is much
`more restrictive than that of Fig. 1, but it has the important
`advantage of not requiring any a priori statistical knowledge
`about the source for the sequence X(0), X(1), . . . . The de-
`coding counterpart, which is similar, is not shown.
Fig. 3. Adaptive probability estimation for entropy encoding and decoding.

Fig. 4. Simplified adaptive probability estimation with only one context.

With this adaptive entropy encoder, the conditioning index set Ψ(n) has the special form

\Psi(n) = \{i : i = n - j \text{ with } j \in \Omega\}    (24)

where Ω is a set of positive integers. In other words, the conditioning set Ψ(n) at all times n is always past symbols at particular offsets relative to the current symbol.² The conditioning set Ω is often referred to as a template because it specifies a pattern of past symbols to use in predicting the current one. It is important to note that the template is fixed³ and not dependent on n. In image coding, the template is usually chosen as a few pixels that are near and causally (in a raster-scan sense) related to a given pixel to be coded [16]. In text coding, the template is usually one or two of the preceding characters [7].

The "context extractor" outputs a unique integer T(n) for each possible instance of conditioning values {X(n - i), i ∈ Ω}. For example, if Ω = {1, 2}, one possible way to define the context T(n) is by

T(n) = X(n-1)K^0 + X(n-2)K^1.    (25)

The exact manner in which particular instances of conditioning values are mapped to an integer is inconsequential, as the context T(n) just serves as a labeling in what follows.
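For concreteness, a minimal Python sketch of (25) follows. The helper name, the template, and the data are ours and illustrative only; a JBIG-style template would instead use nearby pixels from the current and previous raster lines.

```python
# Sketch of the context extractor of (25): map the conditioning values
# {X(n - j), j in Omega} to a unique integer T(n).
def context(history, n, template=(1, 2), K=2):
    t = 0
    for j, offset in enumerate(template):
        t += history[n - offset] * (K ** j)
    return t

x = [0, 1, 1, 0, 1]          # an illustrative binary sequence X(0), ..., X(4)
print(context(x, n=4))       # T(4) = X(3)*K**0 + X(2)*K**1 = 0 + 2 = 2
```

Each distinct value of T(n) would then index its own set of occurrence counters in the frequency observer.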
`The "frequency observer" notes the context T(n) and, based
`on this, makes a probability estimate P(n). An obvious way
`in which it might do this adaptively is by maintaining counts
`of how often each particular instance k of X(n) has occurred
`when the context has been T(n). Estimated probabilities are
`roughly (but preferably not exactly, as will be discussed at
`length later) proportional to occurrence frequencies.
`Such a strategy implicitly assumes at least "local" ergodicity
`so that averages over past time are good predictors of ensemble
`averages. The qualifier "local" is added because an advantage
`of considerable practical importance for running (that is, on-
`line or adaptive) probability estimation like this is an ability
`to track slowly varying input statistics. This, too, will be the
`subject of much further discussion.
`Context is a powerful idea. It is essential for efficient coding
`to use it in the JBIG and JPEG environments as well as the
`text coding environment. In spite of the power and utility
`of the context idea, the remainder of the analysis in this
`paper assumes that there is only one context. The reason for
`doing an analysis under this unrealistic condition is simply
`that the extension of all results back to the more interesting
`multicontext environment is trivial. Carrying the necessary
`notation for context throughout the analysis adds nothing but
unneeded clutter. Assuming a single context application, the
`model of Fig. 3 reduces to that of Fig. 4.
`
² For small n, Ψ(n) may reference nonexistent symbols. There are many reasonable ways of patching this up, and we will ignore this uninteresting detail here.
³ In the JBIG application, it was found to be advantageous to infrequently allow Ω to change [15], but this kind of detail is best ignored here.
`
`
`III. SCALED-COUNT PROBABILITY ESTIMATION
`
`A. Bayesian Estimation
Assume the sequence X(0), X(1), ... is a sequence of independent, identically distributed random variables, each taking on the value k with probability R_k. If the vector

R = [R_0, \ldots, R_{K-1}]^T    (26)

of symbol probabilities is itself a random variable with a priori probability density f_0(r), then the a posteriori probability density f_n(r) of the random variable R given an observation of X(0), ..., X(n - 1) is computable by Bayes' formula. In particular,

f_n(r) = \Pr\{R \mid X[0:n-1]\} = \frac{\Pr\{R, X[0:n-1]\}}{\Pr\{X[0:n-1]\}} = \frac{\Pr\{X[0:n-1] \mid r\}\, f_0(r)}{\int \Pr\{X[0:n-1] \mid r\}\, f_0(r)\, dr}.    (27)

The expected value

\int r\, f_n(r)\, dr    (28)

is the Bayesian estimate of R.
Analytical evaluation of these equations is possible for some particular a priori densities f_0(r). One that is well known is the uniform density

f_0(r) = \delta\left(1 - \sum_{k=0}^{K-1} r_k\right)    (29)

where the Dirac delta function δ(·) simply forces the probabilities to add to one and lowers the effective dimensionality of the probability density from K to K - 1. With a uniform a priori density,

\hat{R}(n) = \frac{C(n) + \mathbf{1}}{n + K}    (30)

and

f_n(r) = \alpha(n)\, r_0^{C_0(n)} \cdots r_{K-1}^{C_{K-1}(n)}\, \delta\left(1 - \sum_{k=0}^{K-1} r_k\right)    (31)

where

C(n) = [C_0(n), \ldots, C_{K-1}(n)]^T    (32)

\mathbf{1} = [1, \ldots, 1]^T    (33)

\alpha(n) = \frac{\Gamma(n + K)}{\Gamma(C_0(n) + 1) \cdots \Gamma(C_{K-1}(n) + 1)}    (34)

\Gamma(x) = \int_0^\infty \lambda^{x-1} e^{-\lambda}\, d\lambda, \quad x > 0    (35)

and the C_k(n) are counts of the number of occurrences of the instance k in X[0:n-1]. The Gamma function Γ(x) is a standard mathematical function that extends the idea of factorial to noninteger numbers. Some of its properties that are useful here are

\Gamma(1) = 1    (36)

\Gamma(x + 1) = x\Gamma(x), \quad x > 0    (37)

\Gamma(x) = (x - 1)!, \quad x \ge 1 \text{ and integer}.    (38)

The estimate R̂ of (30) makes intuitive sense. If n is much larger than K, it becomes simply a ratio of observed frequencies. The "+1" in the numerator and the "+K" in the denominator bias early estimates of R toward equiprobable.

Another a priori density that allows analytical evaluation, and one that leads to a class of estimates that, at least for the JBIG and JPEG applications, has been found quite powerful, is

f_0(r) = \alpha(0)\, r_0^{\Delta - 1} \cdots r_{K-1}^{\Delta - 1}\, \delta\left(1 - \sum_{k=0}^{K-1} r_k\right)    (39)

where Δ > 0 is a real number that is a free parameter and the normalizing constant α(0) will be given later. This particular prior has also been described by Zandi and Langdon [17]. If Δ = 1, this density becomes the uniform density already discussed. With Δ < 1, the a priori density f_0(r) tends to favor symbol probability vectors R near the edges rather than the center of the K-dimensional unit hypercube. Such a density is of interest because the symbol probability vectors it favors are just those for which entropy coding is rewarding. There are, of course, many other such a priori densities favoring edges, and there is no argument for considering one over them other than it happening to lead to analytically tractable calculations.

Under the a priori density (39),

\hat{R}(n) = \frac{C(n) + \Delta \mathbf{1}}{n + K\Delta}    (40)

and

f_n(r) = \alpha(n)\, r_0^{C_0(n) + \Delta - 1} \cdots r_{K-1}^{C_{K-1}(n) + \Delta - 1}\, \delta\left(1 - \sum_{k=0}^{K-1} r_k\right)    (41)

where

\alpha(n) = \frac{\Gamma(n + K\Delta)}{\Gamma(C_0(n) + \Delta) \cdots \Gamma(C_{K-1}(n) + \Delta)}.    (42)

With Δ < 1 (the interesting parameterization), the Bayesian estimate (40) is quicker to jump to conclusions about the underlying probability density. As an example, suppose K = 2, and X(0) happens to be 1. Equation (30) (or, equivalently, (40) with Δ = 1) estimates

\hat{P}_1(1) = 2/3    (43)

whereas with very small Δ, (40) has already concluded that

\hat{P}_1(1) \approx 1.    (44)

Rashness of this sort empirically has been found desirable for the JBIG and JPEG applications, for which the optimal value of Δ seems to be about 0.4.
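A short Python sketch shows how the estimate (40) behaves for a binary alphabet. The function name and the small value of Δ used below are ours and illustrative only.

```python
# Sketch of the Bayesian estimate (40) for a binary alphabet (K = 2).
# delta = 1.0 reproduces the uniform-prior estimate (30); 0.05 is illustrative.
def estimate(counts, delta):
    total = sum(counts) + delta * len(counts)
    return [(c + delta) / total for c in counts]

counts = [0, 1]                      # X(0) = 1 has been observed
print(estimate(counts, delta=1.0))   # [1/3, 2/3], as in (43)
print(estimate(counts, delta=0.05))  # roughly [0.05, 0.95]: nearly certain, as in (44)
```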
`
`
B. Balancing Estimation Accuracy Against Estimation Speed

For this section, the sequence X(0), X(1), ... is again independent and identically distributed as in the last section, but there is no longer any need to conceptualize the symbol probability vector r as having been first determined by some underlying random process. It is best now to simply think of r as fixed but unknown. We further assume that an estimate

\hat{P}(n) = \frac{C(n) + \Delta \mathbf{1}}{n + K\Delta}    (45)

has been chosen for estimating the probabilities under which X(n) takes on its possible values, but the reason for having made this particular choice is immaterial. In particular, the fact that this estimate has a Bayesian interpretation is inconsequential. None of the assumed a priori densities or calculated a posteriori densities of the last section are assumed or used in what follows.

Assuming either that the alphabet is binary or that it is multilevel with one symbol dominating, the conditional fractional increase in bit count due to probability error is given by (23) as

Q(\hat{P}(n) \mid X(i), i \in \Psi(n)) \approx \frac{1}{A(1 - r_{\max})} \sum_{k=0}^{K-1} \frac{\bigl(\hat{P}_k(n) - r_k\bigr)^2}{r_k}    (46)

where r_max denotes the largest r_k. Averaging over all instances of the conditioning set gives the average fractional increase in bit count as

Q(\hat{P}(n)) \approx \frac{1}{A(1 - r_{\max})} \sum_{k=0}^{K-1} \frac{E\bigl[(\hat{P}_k(n) - r_k)^2\bigr]}{r_k}.    (47)

The mean-square error required in (47) is readily calculated. Toward this end, define

m(n) = E[\hat{P}(n)] = \frac{nr + \Delta \mathbf{1}}{n + K\Delta}.    (48)

The difference

b(n) = m(n) - r = \frac{\Delta(\mathbf{1} - Kr)}{n + K\Delta}    (49)

between the expected value of P̂(n) and its desired value r is the estimation bias. Bayesian estimates are notorious for being biased. However, the bias is small and goes to 0 with n. It is interesting that it happens to be smaller with small Δ as well.

The mean square of interest in (47) is evaluated as

E\bigl[(\hat{P}_k(n) - r_k)^2\bigr] = E\bigl[(\hat{P}_k(n) - m_k(n))^2\bigr] + b_k^2(n) = \frac{n}{(n + K\Delta)^2}\, r_k(1 - r_k) + \frac{\Delta^2}{(n + K\Delta)^2}\, (1 - Kr_k)^2.    (50)

For large n or small Δ,

E\bigl[(\hat{P}_k(n) - r_k)^2\bigr] \approx \frac{r_k(1 - r_k)}{n}.    (51)

Hence,

Q(n) \approx \frac{1}{A(1 - r_{\max})} \sum_{k=0}^{K-1} \frac{r_k(1 - r_k)}{n r_k} = \frac{K - 1}{nA(1 - r_{\max})}.    (52)

The probability r_max of the most likely symbol is not known but can be estimated as P̂_max(n). Thus,

Q(n) \approx \frac{K - 1}{nA(1 - \hat{P}_{\max}(n))} \approx \frac{K - 1}{nA\bigl(\sum_{k \ne k_{\max}} C_k(n)/n\bigr)}.    (53)

If the alphabet is binary, (53) becomes amazingly simple:

Q(n) \approx \frac{1}{nA(C_{\min}(n)/n)} = \frac{1}{C_{\min}(n) \ln(n/C_{\min}(n))} \approx \frac{1}{C_{\min}(n) \ln(1/r_{\min})}    (54)

where C_min(n) is the smallest count at time n, and r_min is the probability of the least likely symbol.
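The following small Python sketch, which is ours, evaluates (54) and the number of minority-symbol observations needed to reach a target inefficiency; the numerical values are illustrative.

```python
# Sketch: the binary-alphabet inefficiency estimate (54) and the number of
# minority-symbol observations needed for a target inefficiency Q*.
from math import log

def inefficiency(c_min, r_min):
    return 1.0 / (c_min * log(1.0 / r_min))           # Q(n) of (54)

def minority_count_needed(q_star, r_min):
    return 1.0 / (q_star * log(1.0 / r_min))          # invert (54) for C_min

print(inefficiency(c_min=8, r_min=0.01))              # about 0.027 (2.7% excess bits)
print(minority_count_needed(q_star=0.02, r_min=0.01)) # about 11 occurrences of the rare symbol
```

At r_min = 0.01, those roughly 11 occurrences correspond to on the order of a thousand total symbols, which is the point developed in the next subsection.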
`
`C. Scaled-Count Iteration
`Suppose a binary environment and that some particular
`fractional increase Q* has been deemed acceptable. To achieve
it, 1/(Q* ln(1/r_min)) of the less likely symbols must be observed. Collecting this many samples of the less likely event will require comparatively few total observations when this event is only moderately rare but comparatively many when it is extremely rare. Hence, the averaging window needed to estimate small probabilities increases as the disparity, or skew, in r_min and r_max increases.
In many applications for entropy coders, the underlying statistics of the process being coded change with time. This is definitely true in the JBIG and JPEG applications. It is extremely desirable that the probability estimate track this time variability as much as is possible consistent with maintaining some minimal penalty in increased fractional bit count due to estimation error. Equation (54) and the discussion of the preceding paragraph suggest the following modification to standard Bayesian estimation to achieve this goal [18].
Initialize for 0 ≤ k < K:

C_k(0) = 0.    (55)

Estimate for 0 ≤ k < K:

\hat{P}_k(n) = \frac{C_k(n) + \Delta}{\sum_{i=0}^{K-1} (C_i(n) + \Delta)}.    (56)

Iterate for 0 ≤ k < K and n ≥ 1:

\tilde{C}_k(n) = \begin{cases} C_k(n-1) + 1, & \text{if } X(n-1) = k \\ C_k(n-1), & \text{otherwise} \end{cases}    (57)

\tilde{C}_{\min}(n) = \min\{\tilde{C}_0(n), \ldots, \tilde{C}_{K-1}(n)\}    (58)

C_k(n) = \begin{cases} \tilde{C}_k(n), & \text{if } \tilde{C}_{\min}(n) \le C^*_{\min} \\ \beta(\tilde{C}_k(n) + \Delta) - \Delta, & \text{otherwise} \end{cases}    (59)
`
`
where

\beta = \frac{C^*_{\min} + \Delta}{\tilde{C}_{\min}(n) + \Delta}.    (60)

We refer to an estimator like that defined by the iteration (55)-(60) as a scaled-count estimator. The basic idea is to mimic a Bayesian estimator until the smallest count exceeds some threshold C*_min. The exact numerical value of C*_min is a design parameter controlling the balance between estimation accuracy and estimation speed. When the minimum count attempts to become greater than C*_min, all the counts are scaled back to bring the count for the least likely symbol just back to C*_min.

There are many possible variations on this theme. The "Δ's" in (59) and (60) cause the scaling of counts to be such that the probability estimates (56) are unaffected. Alternatively, the "Δ's" in (59) and (60) could be dropped, in which case count proportionality rather than probability-estimate proportionality would be maintained. Theoretically, the scaling shown perhaps makes more sense, but as a practical matter, the choice is likely to be inconsequential.

The scaling β of (60) drops the minimum count by 1. It could be made smaller so that the minimum count can be dropped back even more. A value of 0.5 might make implementation easier in some instances [19].

Notice that the counts C_k(n) are initially all integers, but once scaling has been done, some become noninteger. Practical realizations will require rounding off the scalings to the precision available for representation. Much more discussion of this kind of problem appears following this section. Practical realizations will also have a limit on how large a number can be represented. One fix is to also scale back counts whenever the largest count attempts to exceed this limit.

The theoretical justification for the scaled-count estimator of (55)-(60) is much stronger with a binary alphabet than a multisymbol alphabet. In the multisymbol case, the awkward assumption that one probability dominates is needed to reach (23). That equation, in fact, suggests that instead of scaling to keep the smallest count under some threshold, scaling should be done to keep the sum of all the counts but the largest under some threshold. The shown criterion involving just the minimum count is easier to implement and, for us, has more intuitive appeal. Experimental data from a real multisymbol application would be helpful in choosing between the two alternatives. Our practical experience is only with binary alphabets.

In practice, setting C*_min is best done by experimentation. Equation (54) nicely quantifies the cost to coding efficiency associated with noisy estimation of true probabilities. If this were the only criterion, large values for C*_min would clearly be best. The other side, though, is tracking ability, and although it is clear that smaller values for C*_min will allow better tracking of changing underlying statistics, there is little to be said quantitatively. The exact nature of the time variability is crucial and not something amenable to analysis.
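A compact software realization of (55)-(60) for a binary alphabet can be sketched as follows. The class name and the parameter values are ours and illustrative only (Δ near 0.4 and C*_min in roughly the 4 to 16 range are the values the text reports as good for image coding); the table-driven realizations mentioned in the abstract avoid this arithmetic entirely.

```python
# Sketch of the scaled-count iteration (55)-(60).  Parameter values are
# illustrative choices, not normative values from any standard.
class ScaledCountEstimator:
    def __init__(self, K=2, delta=0.45, c_min_star=8):
        self.delta = delta
        self.c_min_star = c_min_star
        self.counts = [0.0] * K                               # (55)

    def estimate(self):
        """Probability estimates (56) from the current counts."""
        total = sum(self.counts) + self.delta * len(self.counts)
        return [(c + self.delta) / total for c in self.counts]

    def update(self, symbol):
        """One iteration of (57)-(60) after observing `symbol`."""
        self.counts[symbol] += 1.0                            # (57)
        c_min = min(self.counts)                              # (58)
        if c_min > self.c_min_star:                           # (59): scale everything back
            beta = (self.c_min_star + self.delta) / (c_min + self.delta)   # (60)
            self.counts = [beta * (c + self.delta) - self.delta
                           for c in self.counts]

est = ScaledCountEstimator()
for s in [1, 1, 1, 0, 1, 1, 1, 1, 0, 1] * 20:   # an illustrative, mildly skewed sequence
    est.update(s)
print(est.estimate())    # close to the 0.2 / 0.8 split of the input
```

Because every scaling multiplies all of the (C_k + Δ) by the same β, the estimates (56) are unchanged at the instant of scaling; what changes is the weight of old observations relative to new ones, which is what bounds the effective averaging window.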
`
D. Markov Implementation

A Markov chain interpretation of the scaled-count equations exists whenever the counts C_k are quantized after scaling and limits are placed on how large counts may grow. Any real implementation necessarily satisfies both these conditions.

Fig. 5 helps to explain the Markov chain interpretation. In drawing this figure, we made the following additional particular assumptions:
• The alphabet is binary.
• The quantization is to integers.
• C*_min = 2.
• Any count is also scaled whenever it exceeds C*_max = 6, and
• Δ = 0.5.

A numerical value of 2 for C*_min is slightly smaller than numbers we found optimal for image coding, which were generally in the 4 to 16 range. A numerical value of 6 for the allowed maximum is ridiculously small, by many orders of magnitude. Its virtue here is that it is large enough to illustrate key ideas to come, yet small enough that a plot like that in Fig. 5 stays manageable.

Fig. 5. Markov interpretation for the scaled-count probability estimator.
`The solid squares are the Markov states, and the solid square
`in the lower left corner is the initial state. The observed
sequence X(0), X(1), . . . dictates the transitions between
`these states. If X(n) is 0, the transition is nominally 1 unit
`to the right. If it is 1, it is one unit up. Occasionally, such
`a transition would take the state to a point in Co, C1 space
`outside the region of allowed states. Such transitions are shown
`by dotted arrows to open squares, these squares representing
`what we call "virtual" states. Any transition to a virtual state
`is immediately ma