IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 4, NO. 3, MARCH 1995

Probability Estimation in Arithmetic and Adaptive-Huffman Entropy Coders

Donald L. Duttweiler, Fellow, IEEE, and Christodoulos Chamzas, Senior Member, IEEE
Abstract—Entropy coders, such as Huffman and arithmetic coders, achieve compression by exploiting nonuniformity in the probabilities under which a random variable to be coded takes on its possible values. Practical realizations generally require running adaptive estimates of these probabilities. An analysis of the relationship between estimation quality and the resulting coding efficiency suggests a particular scheme, dubbed scaled-count, for obtaining such estimates. It can optimally balance estimation accuracy against a need for rapid response to changing underlying statistics. When the symbols being coded are from a binary alphabet, simple hardware and software implementations requiring almost no computation are possible.

A scaled-count adaptive probability estimator of the type described in this paper is used in the arithmetic coder of the JBIG and JPEG image coding standards.

I. INTRODUCTION

ENTROPY coders achieve compression by exploiting nonuniformity in the probabilities under which a signal to be compressed takes on its allowed values. Practical implementations generally require adaptive, on-line estimation of these probabilities, either because sufficient statistical knowledge of the signal is lacking or because its statistics are time varying. How best to make such estimates is a question of much practical importance.

We became interested in this problem through our involvement with two standards groups. The joint bilevel image group (JBIG) is under the auspices of both CCITT and ISO and is chartered to develop a standard for progressively coding bilevel (two-tone or black-white) images [1]. The joint photographic experts group (JPEG) is under the same auspices and is chartered to develop a standard for coding photographic (grey-scale or color) images [2], [3]. Both of these groups have finished their work and specifications are available [4], [5].

The JBIG algorithm and the JPEG algorithm (in some of its parameterizations) use a common arithmetic coder [6]-[8]. Although our work was strongly focused on these two image coding applications for arithmetic coders, it is clear that much of what was eventually learned has application beyond just image coding and even for forms of entropy coding other than arithmetic, such as adaptive Huffman.
In the next section of this paper, we review adaptive entropy coding and develop some useful analytical relationships between the quality of probability estimation and the resulting coding inefficiency. The analysis strongly suggests a particular structure for adaptive probability estimation. This structure, which is dubbed scaled-count, makes it possible to achieve any desired balance between the steady-state quality of probability estimation and the speed of response to changing underlying statistics. Section III presents this structure and addresses implementation issues. In some environments (those of JBIG and JPEG among them), both software and hardware implementations become extremely simple, requiring almost no computation.

Manuscript received May 31, 1992; revised November 26, 1993. The associate editor coordinating the review of this paper and approving it for publication was Prof. Michel Barlaud.
D. L. Duttweiler is with AT&T Bell Laboratories, Holmdel, NJ 07760 USA.
C. Chamzas is with Democritus University of Thrace, Xanthi, Greece.
IEEE Log Number 9408197.

Fig. 1. General model for entropy encoding and decoding.

II. ENTROPY CODING REVIEW AND ANALYSIS

A. General Framework
Let X(0), X(1), ... be a random process taking on at each time n one value from a finite alphabet of possible values. Since the particular values taken on will just serve as labels in what follows, they can, without loss of any generality, be taken as 0, 1, ..., K - 1, where K denotes the alphabet size.

Fig. 1 shows a general model for one-symbol-at-a-time entropy encoding and decoding. The function of the block labeled "Model" is to provide a vector

    \hat{P}(n) = [\hat{P}_0(n), \ldots, \hat{P}_{K-1}(n)]^T    (1)

of estimates \hat{P}_k(n) of the probability that X(n) will take on the value k. It may use any of the observed values X(0), ..., X(n - 1) in making that estimate. The unit-delay block labeled "D" emphasizes that X(n) itself may not be used in forming \hat{P}(n). If it were to be used, the entropy decoder could not generate a tracking \hat{P}(n). (Throughout this paper, vectors are shown in boldface type. Vector components are separated by commas and grouped by square brackets. A superscript T denotes vector transposition.)

Ideally, the block labeled "entropy mapper" places

    \log_2(1/\hat{P}_k(n))    (2)

bits of information on its output stream whenever a symbol X(n) having an instance value k is coded. If the true probabilities P_k(n) of X(n) taking on its various values k are highly
nonuniform, with some symbols much more likely than others, then the average of the number of bits the mapper uses to code a symbol can be well under \log_2 K, and there is substantial advantage to be gained with entropy coding.

Entropy coders are always lossless. The sequence X(0), X(1), ... is recovered exactly by the decoder. This is generally the type of coding desired if the sequence X(0), X(1), ... represents a text file in a computer system. It is also common for the coding of bilevel images (facsimile) to be done losslessly. In coding other types of images and in coding audio, some coding distortion is often allowed. Even when coding of this sort is being done, however, lossless coding often becomes a part of a larger structure in which some intermediate values, which are invertible to the original signal only with distortion, are themselves coded losslessly. The "baseline" JPEG system provides one example of this.

The name "entropy mapper" is somewhat awkward and is nonstandard terminology. The same is true even more so for the decoding counterpart labeled "entropy unmapper." The reason for coining these terms here is to make it possible to carefully distinguish these mapping functions from the total entropy encoding and decoding functions. The term "entropy encoder" will be reserved for the combination of a probability estimator and an entropy mapper. "Entropy decoder" will always mean the combination of a probability estimator and an entropy unmapper.

The well-known Huffman scheme [9] can attain the ideal of (2) whenever all the \hat{P}_k(n) happen to equal inverse powers of two but, in general, falls somewhat short. The arithmetic mapper used in arithmetic coders does not suffer from this "breakage" problem and achieves the ideal.

When the alphabet size K is large, Huffman mapping often can come acceptably close to the ideal and can be a viable choice for entropy mapping. With binary alphabets, it cannot. The breakage problem is unacceptably severe in the interesting situation where one of the symbol probabilities is small and the other is almost one. If the natural alphabet is binary and it is desired to use Huffman mapping, the standard way to solve the breakage problem is to recast the original problem into an equivalent one with a larger alphabet by grouping consecutive symbols X(n) together. With an arithmetic mapper there is no need to resort to an artifice like this, and indeed, there are significant implementation advantages with the binary alphabet.
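To make the severity of the breakage concrete, consider a hypothetical binary source with symbol probabilities 0.95 and 0.05. The per-symbol ideal implied by (2) averages to

    0.95 \log_2(1/0.95) + 0.05 \log_2(1/0.05) \approx 0.29 \text{ bits/symbol},

yet a Huffman mapper driven one binary symbol at a time can never spend less than one bit per symbol, more than three times the ideal. An arithmetic mapper, by contrast, approaches the 0.29-bit figure directly.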
Conceptually, an arithmetic coder maps the sequence X(0), X(1), ... to a real number on the unit interval [0.0, 1.0). What is transmitted in lieu of the sequence X(0), X(1), ... is a binary encoding of this real number. The particular real number to be transmitted is determined recursively as shown in Fig. 2 for an example with a binary alphabet and X(0), X(1), ... beginning with the particular instance values 0, 1, 0.

In general, at any given time, there exists a "current coding interval." Initially, it is the unit interval [0.0, 1.0). At time n, it is divided into K subintervals, with the kth having a size proportional to \hat{P}_k(n). The current coding interval at time n + 1 is that subinterval associated with the instance k = X(n) actually occurring.
Fig. 2. Interval subdivision example with a binary alphabet (symbols to be coded: 0, 1, 0).
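As an illustration only, and not the finite-precision, pipelined coder of [12]-[14], the following Python sketch carries out the recursive interval subdivision just described for a binary alphabet. It uses exact rational arithmetic and a hypothetical fixed estimate of 2/3 for the probability of symbol 0; the Fig. 2 sequence 0, 1, 0 is used as input.

```python
from fractions import Fraction

def subdivide(low, high, p0, symbols):
    """Conceptual (infinite-precision) interval subdivision for a binary
    alphabet.  p0 is the estimated probability of symbol 0; the interval
    [low, high) is split with the lower part assigned to symbol 0."""
    for x in symbols:
        split = low + (high - low) * p0
        if x == 0:
            high = split          # keep the lower subinterval
        else:
            low = split           # keep the upper subinterval
    return low, high

# The Fig. 2 example: code the symbols 0, 1, 0 with a hypothetical fixed
# estimate P(0) = 2/3.  Any number inside the final interval identifies the
# sequence to a decoder that mirrors the same subdivision.
low, high = subdivide(Fraction(0), Fraction(1), Fraction(2, 3), [0, 1, 0])
print(low, high)   # final coding interval for the sequence 0, 1, 0
```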
The idea of coding by recursive interval subdivision is an old idea that is arguably an "obvious" and "immediate" application of the usual equations and theory developed in information theory texts. Abramson [10] credits Elias with having conceived it soon after Shannon's seminal paper [11] on information theory. Nonetheless, arithmetic coding was initially no more than an intellectual curiosity because a straightforward application of the recursive-interval-subdivision idea requires infinite precision arithmetic. What has changed recently is that it has been realized that it is possible to implement recursive interval subdivision with finite precision arithmetic and in a way allowing the "pipelining" of the encoder and decoder [12]-[14]. With a pipelined encoder and decoder, the encoder can begin to put out the bits of the compressed data stream before seeing the entire (possibly infinite) input sequence X(0), X(1), ..., and the decoder can begin to put out decoded symbols before seeing the entire (possibly infinite) sequence of compressed bits. The existence of finite-precision implementations that can be pipelined is crucial for the practical application of arithmetic coders, but detail about how these properties are achieved is inconsequential and unnecessary for the purposes of this paper. Some additional discussion of arithmetic coding appears later in Section III-F, but it is only to the minimal depth needed to illustrate the points central to this paper. Readers interested in more detail on arithmetic coding are referred to [7] and [8].
B. What is Desired From the Model?

For notational convenience, let X[i:j] denote the string X(i), ..., X(j). The block labeled "Model" in Fig. 1 is free to use any of X[0:n-1] in forming its estimate \hat{P}(n). Let

    \Psi(n) = \{ i : X(i) \text{ is used in forming } \hat{P}(n) \}    (3)

denote the set of time indices of the symbols actually used in forming \hat{P}(n).

The conditional entropy

    H(\hat{P}(n) \mid X(i), i \in \Psi(n)) = \sum_{k=0}^{K-1} P_k(n) \log_2(1/\hat{P}_k(n))    (4)

is the average number of bits that will be used to code X(n) given X(i), i \in \Psi(n). In this equation

    P_k(n) = \Pr\{X(n) = k \mid X(i), i \in \Psi(n)\}    (5)
is the true conditional probability of an occurrence of symbol k given the particular pattern X(i), i \in \Psi(n), of past symbols that has occurred. The conditional expected bit count provided by (4) is minimized if

    \hat{P}(n) = P(n).    (6)

Therefore, not surprisingly, one thing desired from the model for efficient coding is accurate estimates \hat{P}(n).

More is needed, however. A significant entropy coding advantage is only achievable if the probability distribution P(n) is nonuniform, with one or possibly a few of the symbols taken on by X(n) much more likely than the others. Thus, the other requirement for good modeling is a conditioning set \Psi(n) allowing good prediction, so that at least as an average over past patterns X(i), i \in \Psi(n), the probability estimate \hat{P}(n) is highly nonuniform with one (or, less desirably, a few) symbols predicted to occur with high likelihood.

The unconditional entropy

    H(\hat{P}(n) \mid \Psi(n)) = E[H(\hat{P}(n) \mid X(i), i \in \Psi(n))]    (7)

is the expected bit count to code symbol n using the probability estimate \hat{P}(n), depending on past symbols X(i), i \in \Psi(n). If \hat{P}(n) = P(n), so that the coding is the best possible with the conditioning indices \Psi(n), the expected bit count for coding X(n) is

    H(\Psi(n)) = E[H(P(n) \mid X(i), i \in \Psi(n))]
               = E\left[\sum_{k=0}^{K-1} P_k(n) \log_2(1/P_k(n))\right].    (8)
It can be shown (see, for example, pp. 105-111 of [10] or almost any discussion of equivocation in an information theory text) that if

    \Psi_1(n) \subset \Psi_2(n)    (9)

then

    H(\Psi_1(n)) \geq H(\Psi_2(n)).    (10)

Hence, the best model sets

    \Psi(n) = [0, n-1]    (11)

to use all the past information and then makes the probability estimate \hat{P}(n) = P(n).

From a theoretical standpoint, the discussion thus far says all there is to say about modeling. From a practical standpoint, it says nothing. For most real-world applications, statistical knowledge about the source is usually nowhere near sufficient to begin to calculate

    \hat{P}_k(n) = P_k(n) = \Pr\{X(n) = k \mid X(i), i \in \Psi(n)\}    (12)

and even if it were, the calculation would in all likelihood be hopelessly complex.

C. Approximating Excess Bit Count

Before specializing the structure of Fig. 1 to something allowing practical realization, it is useful to proceed a little further within the general framework and derive a broadly applicable approximation to excess bit count.

The conditional entropy difference

    D(\hat{P}(n) \mid X(i), i \in \Psi(n))
        = H(\hat{P}(n) \mid X(i), i \in \Psi(n)) - H(P(n) \mid X(i), i \in \Psi(n))
        = \sum_{k=0}^{K-1} P_k(n) \log_2(P_k(n)/\hat{P}_k(n))    (13)

is the expected excess bit count due to error in probability estimation when X(i), i \in \Psi(n), has been observed. If we assume that the estimate \hat{P}_k(n) is at least close to P_k(n), so that P_k(n)/\hat{P}_k(n) is close to one and the approximation

    \ln x \approx x - 1    (14)

with "ln" denoting the natural logarithm is valid, then (13) becomes

    D(\hat{P}(n) \mid X(i), i \in \Psi(n))
        \approx \frac{1}{\ln 2} \sum_{k=0}^{K-1} P_k(n) \left( \frac{P_k(n)}{\hat{P}_k(n)} - 1 \right)
        = \frac{1}{\ln 2} \sum_{k=0}^{K-1} (P_k(n) - \hat{P}_k(n)) \frac{1}{1 + \frac{\hat{P}_k(n) - P_k(n)}{P_k(n)}}.    (15)

The further approximation

    \frac{1}{1 + \epsilon} \approx 1 - \epsilon    (16)

for small \epsilon, and an assumption that \hat{P}_k(n) is sane in the sense that

    \sum_{k=0}^{K-1} \hat{P}_k(n) = 1    (17)

leads to

    D(\hat{P}(n) \mid X(i), i \in \Psi(n)) \approx \frac{1}{\ln 2} \sum_{k=0}^{K-1} \frac{(\hat{P}_k(n) - P_k(n))^2}{P_k(n)}.    (18)

Let

    Q(\hat{P}(n) \mid X(i), i \in \Psi(n)) = \frac{D(\hat{P}(n) \mid X(i), i \in \Psi(n))}{H(P(n) \mid X(i), i \in \Psi(n))}    (19)

denote the fractional (or normalized) excess bit count due to estimation error, that is, the expected coding inefficiency due to estimation error when X(i), i \in \Psi(n), has been observed.

A simple formula for this coding inefficiency Q(\hat{P}(n) \mid X(i), i \in \Psi(n)) can be derived under the assumption that one symbol, say k_{max}(n), is highly likely to occur, and all the others are unlikely. In the case of a binary alphabet, this will always be true whenever there is anything to be gained from entropy coding. With a multisymbol alphabet, it is a
somewhat restrictive assumption, since entropy coding would be advantageous if a few (as opposed to just one) of the symbols were likely and all the others unlikely. However, additional structure in multisymbol applications will often justify this assumption.

Fig. 3. Adaptive probability estimation for entropy encoding and decoding.

Fig. 4. Simplified adaptive probability estimation with only one context.

Assuming the symbol k_{max}(n) is highly likely and all the others unlikely,

    H(P(n) \mid X(i), i \in \Psi(n)) \approx (1 - P_{max}(n)) \log_2(1/(1 - P_{max}(n)))
                                     = \frac{1}{\ln 2} A(1 - P_{max}(n))    (20)

where

    P_{max}(n) = P_{k_{max}(n)}(n)    (21)

is the probability of the dominating symbol, and for convenience

    A(q) = q \ln(1/q).    (22)
In words, (20) approximates the entropy H(P(n) \mid X(i), i \in \Psi(n)) by the information in the event that the expected symbol does not occur, multiplied by the probability of that event. The true entropy also adds in the probability of the expected symbol times its information and the entropy needed to distinguish exactly which of the unlikely events did occur, but these two components of the total entropy are comparatively much smaller.

Using (20) in (19) finally gives

    Q(\hat{P}(n) \mid X(i), i \in \Psi(n)) \approx \frac{1}{A(1 - P_{max}(n))} \sum_{k=0}^{K-1} \frac{(\hat{P}_k(n) - P_k(n))^2}{P_k(n)}.    (23)
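The quality of these approximations is easy to check numerically. The short Python sketch below compares the exact excess bit count of (13) with the quadratic approximation of (18) and forms the normalized inefficiency of (19) for a hypothetical binary example; the probability values used are illustrative only.

```python
import math

def exact_excess_bits(p, phat):
    """Exact expected excess bits per symbol, eq. (13): sum_k P_k log2(P_k / Phat_k)."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, phat))

def approx_excess_bits(p, phat):
    """Quadratic approximation of eq. (18): (1/ln 2) sum_k (Phat_k - P_k)^2 / P_k."""
    return sum((qk - pk) ** 2 / pk for pk, qk in zip(p, phat)) / math.log(2)

# Hypothetical binary example: true P = (0.95, 0.05), estimate Phat = (0.93, 0.07).
p, phat = [0.95, 0.05], [0.93, 0.07]
d_exact, d_approx = exact_excess_bits(p, phat), approx_excess_bits(p, phat)

# Normalized inefficiency Q of eq. (19): excess bits relative to the ideal entropy.
h_ideal = sum(pk * math.log2(1 / pk) for pk in p)
print(d_exact, d_approx, d_exact / h_ideal)
```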
D. Adaptive Probability Estimation

Fig. 3 shows an entropy encoding structure that is much more restrictive than that of Fig. 1, but it has the important advantage of not requiring any a priori statistical knowledge about the source for the sequence X(0), X(1), .... The decoding counterpart, which is similar, is not shown.

With this adaptive entropy encoder, the conditioning index set \Psi(n) has the special form

    \Psi(n) = \{ i : i = n - j \text{ with } j \in \Omega \}    (24)

where \Omega is a set of positive integers. In other words, the conditioning set \Psi(n) at all times n always consists of past symbols at particular offsets relative to the current symbol.² The conditioning set \Omega is often referred to as a template because it specifies a pattern of past symbols to use in predicting the current one. It is important to note that the template is fixed³ and not dependent on n. In image coding, the template is usually chosen as a few pixels that are near and causally (in a raster scan sense) related to a given pixel to be coded [16]. In text coding, the template is usually one or two of the preceding characters [7].

The "context extractor" outputs a unique integer T(n) for each possible instance of conditioning values \{X(n - i), i \in \Omega\}. For example, if \Omega = \{1, 2\}, one possible way to define the context T(n) is by

    T(n) = X(n-1)K^0 + X(n-2)K^1.    (25)

The exact manner in which particular instances of conditioning values are mapped to an integer is inconsequential, as the context T(n) just serves as a labeling in what follows.

The "frequency observer" notes the context T(n) and, based on this, makes a probability estimate \hat{P}(n). An obvious way in which it might do this adaptively is by maintaining counts of how often each particular instance k of X(n) has occurred when the context has been T(n). Estimated probabilities are roughly (but preferably not exactly, as will be discussed at length later) proportional to occurrence frequencies.
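As a concrete, hypothetical illustration of the context extractor of (25) and a simple counting frequency observer, the Python sketch below keeps per-context occurrence counts and forms estimates proportional to count plus a small constant (the constant, the initial-history padding, and the input sequence are arbitrary choices for the sketch; the constant anticipates the estimators of Section III).

```python
class CountingEstimator:
    """Toy frequency observer: per-context occurrence counts for a K-ary
    alphabet, with estimates proportional to count + delta."""
    def __init__(self, K, num_contexts, delta=0.5):
        self.K, self.delta = K, delta
        self.counts = [[0] * K for _ in range(num_contexts)]

    def estimate(self, context):
        c = self.counts[context]
        total = sum(c) + self.K * self.delta
        return [(ck + self.delta) / total for ck in c]

    def update(self, context, symbol):
        self.counts[context][symbol] += 1

def context_of(history, K):
    """Context per (25) with template {1, 2}: the two most recent symbols."""
    return history[-1] * K**0 + history[-2] * K**1

# Hypothetical use on a binary source.
est = CountingEstimator(K=2, num_contexts=4)
history = [0, 0]                     # arbitrary padding for small n (see footnote 2)
for x in [0, 0, 1, 0, 0, 1, 1, 0]:
    t = context_of(history, 2)
    p_hat = est.estimate(t)          # would drive the entropy mapper
    est.update(t, x)
    history.append(x)
```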
Such a strategy implicitly assumes at least "local" ergodicity, so that averages over past time are good predictors of ensemble averages. The qualifier "local" is added because an advantage of considerable practical importance for running (that is, on-line or adaptive) probability estimation like this is an ability to track slowly varying input statistics. This, too, will be the subject of much further discussion.

Context is a powerful idea. It is essential for efficient coding to use it in the JBIG and JPEG environments as well as the text coding environment. In spite of the power and utility of the context idea, the remainder of the analysis in this paper assumes that there is only one context. The reason for doing an analysis under this unrealistic condition is simply that the extension of all results back to the more interesting multicontext environment is trivial. Carrying the necessary notation for context throughout the analysis adds nothing but unneeded clutter. Assuming a single-context application, the model of Fig. 3 reduces to that of Fig. 4.
² For small n, \Psi(n) may reference nonexistent symbols. There are many reasonable ways of patching this up, and we will ignore this uninteresting detail here.
³ In the JBIG application, it was found to be advantageous to infrequently allow \Omega to change [15], but this kind of detail is best ignored here.
III. SCALED-COUNT PROBABILITY ESTIMATION

A. Bayesian Estimation

Assume the sequence X(0), X(1), ... is a sequence of independent, identically distributed random variables, each taking on the value k with probability R_k. If the vector

    R = [R_0, \ldots, R_{K-1}]^T    (26)

of symbol probabilities is itself a random variable with a priori probability density f_0(r), then the a posteriori probability density f_n(r) of the random variable R given an observation of X(0), ..., X(n-1) is computable by Bayes' formula. In particular,

    f_n(r) = \Pr\{R \mid X[0:n-1]\}
           = \frac{\Pr\{R, X[0:n-1]\}}{\Pr\{X[0:n-1]\}}
           = \frac{\Pr\{X[0:n-1] \mid r\} f_0(r)}{\int \Pr\{X[0:n-1] \mid r\} f_0(r)\, dr}.    (27)

The expected value

    \hat{R}(n) = \int r f_n(r)\, dr    (28)

is the Bayesian estimate of R.

Analytical evaluation of these equations is possible for some particular a priori densities f_0(r). One that is well known is the uniform density

    f_0(r) = \delta\left(1 - \sum_{k=0}^{K-1} r_k\right)    (29)

where the Dirac delta function \delta(\cdot) simply forces the probabilities to add to one and lowers the effective dimensionality of the probability density from K to K - 1. With a uniform a priori density
    \hat{R}(n) = \frac{C(n) + 1}{n + K}    (30)

and

    f_n(r) = \alpha(n)\, r_0^{C_0(n)} \cdots r_{K-1}^{C_{K-1}(n)}\, \delta\left(1 - \sum_{k=0}^{K-1} r_k\right)    (31)

where

    C(n) = [C_0(n), \ldots, C_{K-1}(n)]^T    (32)

    1 = [1, \ldots, 1]^T    (33)

    \alpha(n) = \frac{\Gamma(n + K)}{\Gamma(C_0(n) + 1) \cdots \Gamma(C_{K-1}(n) + 1)}    (34)

    \Gamma(x) = \int_0^\infty \lambda^{x-1} e^{-\lambda}\, d\lambda, \quad x > 0    (35)

and the C_k(n) are counts of the number of occurrences of the instance k in X[0:n-1]. The Gamma function \Gamma(x) is a standard mathematical function that extends the idea of factorial to noninteger numbers. Some of its properties that are useful here are

    \Gamma(1) = 1    (36)

    \Gamma(x+1) = x\Gamma(x), \quad x > 0    (37)

    \Gamma(x) = (x-1)!, \quad x \geq 1 \text{ and integer}.    (38)

The estimate \hat{R} of (30) makes intuitive sense. If n is much larger than K, it becomes simply a ratio of observed frequencies. The "+1" in the numerator and the "+K" in the denominator bias early estimates of R toward equiprobable.

Another a priori density that allows analytical evaluation, and one that leads to a class of estimates that at least for the JBIG and JPEG applications has been found quite powerful, is

    f_0(r) = \alpha(0)\, r_0^{\Delta-1} \cdots r_{K-1}^{\Delta-1}\, \delta\left(1 - \sum_{k=0}^{K-1} r_k\right)    (39)

where \Delta > 0 is a real number that is a free parameter and the normalizing constant \alpha(0) will be given later. This particular prior has also been described by Zandi and Langdon [17]. If \Delta = 1, this density becomes the uniform density already discussed. With \Delta < 1, the a priori density f_0(r) tends to favor symbol probability vectors R near the edges rather than the center of the K-dimensional unit hypercube. Such a density is of interest because the symbol probability vectors it favors are just those for which entropy coding is rewarding. There are, of course, many other such a priori densities favoring edges, and there is no argument for considering one over them other than it happening to lead to analytically tractable calculations.

Under the a priori density (39),

    \hat{R}(n) = \frac{C(n) + \Delta 1}{n + K\Delta}    (40)

and

    f_n(r) = \alpha(n)\, r_0^{C_0(n)+\Delta-1} \cdots r_{K-1}^{C_{K-1}(n)+\Delta-1}\, \delta\left(1 - \sum_{k=0}^{K-1} r_k\right)    (41)

where

    \alpha(n) = \frac{\Gamma(n + K\Delta)}{\Gamma(C_0(n) + \Delta) \cdots \Gamma(C_{K-1}(n) + \Delta)}.    (42)

With \Delta < 1 (the interesting parameterization), the Bayesian estimate (40) is quicker to jump to conclusions about the underlying probability density. As an example, suppose K = 2, and X(0) happens to be 1. Equation (30) (or, equivalently, (40) with \Delta = 1) estimates

    \hat{P}_1(1) = 2/3    (43)

whereas with very small \Delta, (40) has already concluded that

    \hat{P}_1(1) \approx 1.    (44)

Rashness of this sort empirically has been found desirable for the JBIG and JPEG applications, for which the optimal value of \Delta seems to be about 0.4.
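A direct transcription of the estimate (40), which reduces to (30) when \Delta = 1, is a one-line computation. The Python sketch below reproduces the K = 2 example of (43) and (44); the small \Delta value used in the second call is arbitrary.

```python
def bayesian_estimate(counts, delta=1.0):
    """Bayesian symbol-probability estimate under the prior of (39):
    Rhat_k = (C_k + delta) / (n + K*delta); delta = 1 gives the
    uniform-prior estimate of (30)."""
    n, K = sum(counts), len(counts)
    return [(c + delta) / (n + K * delta) for c in counts]

# The example of (43)-(44): binary alphabet, one observation X(0) = 1.
counts = [0, 1]
print(bayesian_estimate(counts, delta=1.0))   # [1/3, 2/3], as in (43)
print(bayesian_estimate(counts, delta=0.01))  # P1 already close to 1, as in (44)
```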
B. Balancing Estimation Accuracy Against Estimation Speed

For this section, the sequence X(0), X(1), ... is again independent and identically distributed as in the last section, but there is no longer any need to conceptualize the symbol probability vector r as having been first determined by some underlying random process. It is best now to simply think of r as fixed but unknown. We further assume that an estimate

    \hat{P}(n) = \frac{C(n) + \Delta 1}{n + K\Delta}    (45)

has been chosen for estimating the probabilities under which X(n) takes on its possible values, but the reason for having made this particular choice is immaterial. In particular, the fact that this estimate has a Bayesian interpretation is inconsequential. None of the assumed a priori densities or calculated a posteriori densities of the last section are assumed or used in what follows.

Assuming either that the alphabet is binary or that it is multilevel with one symbol dominating, the conditional fractional increase in bit count due to probability error is given by (23) as
    Q(\hat{P}(n) \mid X(n-i), i \in \Omega) \approx \frac{1}{A(1 - r_{max})} \sum_{k=0}^{K-1} \frac{(\hat{P}_k(n) - r_k)^2}{r_k}    (46)

where r_{max} denotes the largest r_k. Averaging over all instances of the conditioning set gives the average fractional increase in bit count as

    Q(\hat{P}(n)) \approx \frac{1}{A(1 - r_{max})} \sum_{k=0}^{K-1} \frac{E[(\hat{P}_k(n) - r_k)^2]}{r_k}.    (47)

The mean-square error required in (47) is readily calculated. Toward this end, define

    m(n) = E[\hat{P}(n)] = \frac{nr + \Delta 1}{n + K\Delta}.    (48)

The difference

    b(n) = m(n) - r = \frac{\Delta(1 - Kr)}{n + K\Delta}    (49)

between the expected value of \hat{P}(n) and its desired value r is the estimation bias. Bayesian estimates are notorious for being biased. However, the bias is small and goes to 0 with n. It is interesting that it happens to be smaller with small \Delta as well.

The mean-square of interest in (47) is evaluated as

    E[(\hat{P}_k(n) - r_k)^2] = E[(\hat{P}_k(n) - m_k(n))^2] + b_k^2(n)
                             = \frac{n}{(n + K\Delta)^2} r_k(1 - r_k) + \frac{\Delta^2}{(n + K\Delta)^2} (1 - Kr_k)^2.    (50)

For large n or small \Delta,

    E[(\hat{P}_k(n) - r_k)^2] \approx \frac{r_k(1 - r_k)}{n}.    (51)
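Equations (50) and (51) are straightforward to verify by simulation. The sketch below, with illustrative parameter values only, draws repeated length-n iid samples, forms the estimate (45), and compares the empirical mean-square error for one symbol with the predictions of (50) and (51).

```python
import random

def mse_of_estimate(r, n, delta, trials=20000, k=1, seed=0):
    """Empirical E[(Phat_k(n) - r_k)^2] for the estimator of (45) on an iid
    binary source with symbol probabilities r."""
    rng = random.Random(seed)
    K = len(r)
    acc = 0.0
    for _ in range(trials):
        c_k = sum(1 for _ in range(n) if rng.random() < r[k])  # count of symbol k
        p_hat_k = (c_k + delta) / (n + K * delta)              # eq. (45), component k
        acc += (p_hat_k - r[k]) ** 2
    return acc / trials

# Illustrative values: r = (0.9, 0.1), n = 100, delta = 0.4.
r, n, delta, k, K = [0.9, 0.1], 100, 0.4, 1, 2
predicted = (n * r[k] * (1 - r[k]) + delta**2 * (1 - K * r[k])**2) / (n + K * delta)**2  # eq. (50)
print(mse_of_estimate(r, n, delta), predicted, r[k] * (1 - r[k]) / n)  # empirical, (50), (51)
```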
Hence,

    Q(n) \approx \frac{1}{A(1 - r_{max})} \sum_{k=0}^{K-1} \frac{r_k(1 - r_k)}{n r_k} \approx \frac{K - 1}{n A(1 - r_{max})}.    (52)

The probability r_{max} of the most likely symbol is not known but can be estimated as \hat{P}_{max}(n). Thus,

    Q(n) \approx \frac{K - 1}{n A(1 - \hat{P}_{max}(n))} \approx \frac{K - 1}{n A\left(\sum_{k \neq k_{max}} C_k(n)/n\right)}.    (53)

If the alphabet is binary, (53) becomes amazingly simple:

    Q(n) \approx \frac{1}{n A(C_{min}(n)/n)} = \frac{1}{C_{min}(n) \ln(n/C_{min}(n))} \approx \frac{1}{C_{min}(n) \ln(1/r_{min})}    (54)

where C_{min}(n) is the smallest count at time n, and r_{min} is the probability of the least likely symbol.
C. Scaled-Count Iteration

Suppose a binary environment and that some particular fractional increase Q^* has been deemed acceptable. To achieve it, 1/(Q^* \ln(1/r_{min})) of the less likely symbols must be observed. Collecting this many samples of the less likely event will require comparatively few total observations when this event is only moderately rare, but comparatively many when it is extremely rare. Hence, the averaging window needed to estimate small probabilities increases as the disparity, or skew, between r_{min} and r_{max} increases.
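As an illustrative calculation (the numbers are hypothetical), suppose a 1% coding inefficiency Q^* = 0.01 is acceptable and the rare symbol has probability r_{min} = 0.05. Then, by (54), roughly

    \frac{1}{Q^* \ln(1/r_{min})} = \frac{1}{0.01 \times \ln 20} \approx 33

occurrences of the rare symbol are needed, corresponding to an averaging window of about 33/0.05, or roughly 670 symbols in total. If instead r_{min} = 0.005, the same target requires about 19 rare-symbol occurrences but a window of roughly 3800 symbols.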
In many applications for entropy coders, the underlying statistics of the process being coded change with time. This is definitely true in the JBIG and JPEG applications. It is extremely desirable that the probability estimate track this time variability as much as possible, consistent with maintaining some minimal penalty in increased fractional bit count due to estimation error. Equation (54) and the discussion of the preceding paragraph suggest the following modification to standard Bayesian estimation to achieve this goal [18].

Initialize for 0 \leq k < K:

    C_k(0) = 0.    (55)

Estimate for 0 \leq k < K:

    \hat{P}_k(n) = \frac{C_k(n) + \Delta}{\sum_{i=0}^{K-1} (C_i(n) + \Delta)}.    (56)

Iterate for 0 \leq k < K and n \geq 1:

    \tilde{C}_k(n) = \begin{cases} C_k(n-1) + 1, & \text{if } X(n-1) = k \\ C_k(n-1), & \text{otherwise} \end{cases}    (57)

    \tilde{C}_{min} = \min\{\tilde{C}_0(n), \ldots, \tilde{C}_{K-1}(n)\}    (58)

    C_k(n) = \begin{cases} \tilde{C}_k(n), & \text{if } \tilde{C}_{min} \leq C^*_{min} \\ \beta(\tilde{C}_k(n) + \Delta) - \Delta, & \text{otherwise} \end{cases}    (59)
where

    \beta = \frac{C^*_{min} + \Delta}{\tilde{C}_{min} + \Delta}.    (60)
We refer to an estimator like that defined by the iteration (55)-(60) as a scaled-count estimator. The basic idea is to mimic a Bayesian estimator until the smallest count exceeds some threshold C^*_{min}. The exact numerical value of C^*_{min} is a design parameter controlling the balance between estimation accuracy and estimation speed. When the minimum count attempts to become greater than C^*_{min}, all the counts are scaled back to bring the count for the least likely symbol just back to C^*_{min}.
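A compact floating-point rendering of the iteration (55)-(60) is sketched below in Python. It is a sketch only, ignoring the rounding and count-ceiling issues discussed next, and the parameter values in the usage example (including the point at which the source statistics change) are illustrative.

```python
class ScaledCountEstimator:
    """Sketch of the scaled-count iteration (55)-(60) for a K-ary alphabet.
    Counts are kept as floats; real implementations quantize them."""
    def __init__(self, K, delta=0.4, c_min_star=8.0):
        self.K, self.delta, self.c_min_star = K, delta, c_min_star
        self.counts = [0.0] * K                       # eq. (55)

    def estimate(self):
        total = sum(self.counts) + self.K * self.delta
        return [(c + self.delta) / total for c in self.counts]   # eq. (56)

    def observe(self, symbol):
        self.counts[symbol] += 1.0                    # eq. (57)
        c_min = min(self.counts)                      # eq. (58)
        if c_min > self.c_min_star:
            # eq. (59)-(60): scale so the smallest count returns to C*_min
            # and the estimates of (56) are unchanged.
            beta = (self.c_min_star + self.delta) / (c_min + self.delta)
            self.counts = [beta * (c + self.delta) - self.delta for c in self.counts]

# Illustrative use on a binary source whose statistics drift over time.
import random
rng = random.Random(1)
est = ScaledCountEstimator(K=2, delta=0.4, c_min_star=8.0)
for n in range(2000):
    p1 = 0.05 if n < 1000 else 0.30                   # hypothetical change in statistics
    x = 1 if rng.random() < p1 else 0
    p_hat = est.estimate()                            # estimate used to code X(n)
    est.observe(x)
```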
There are many possible variations on this theme. The "\Delta's" in (59) and (60) cause the scaling of counts to be such that the probability estimates (56) are unaffected. Alternatively, the "\Delta's" in (59) and (60) could be dropped, in which case count proportionality rather than probability-estimate proportionality would be maintained. Theoretically, the scaling shown perhaps makes more sense, but as a practical matter, the choice is likely to be inconsequential.

The scaling \beta of (60) drops the minimum count by 1. It could be made smaller so that the minimum count can be dropped back even more. A value of 0.5 might make implementation easier in some instances [19].

Notice that the counts C_k(n) are initially all integers, but once scaling has been done, some become noninteger. Practical realizations will require rounding off the scalings to the precision available for representation. Much more discussion of this kind of problem appears following this section.

Practical realizations will also have a limit on how large a number can be represented. One fix is to also scale back counts whenever the largest count attempts to exceed this limit.
The theoretical justification for the scaled-count estimator of (55)-(60) is much stronger with a binary alphabet than with a multisymbol alphabet. In the multisymbol case, the awkward assumption that one probability dominates is needed to reach (23). That equation, in fact, suggests that instead of scaling to keep the smallest count under some threshold, scaling should be done to keep the sum of all the counts but the largest under some threshold. The criterion shown, involving just the minimum count, is easier to implement and, for us, has more intuitive appeal. Experimental data from a real multisymbol application would be helpful in choosing between the two alternatives. Our practical experience is only with binary alphabets.

In practice, setting C^*_{min} is best done by experimentation. Equation (54) nicely quantifies the cost to coding efficiency associated with noisy estimation of true probabilities. If this were the only criterion, large values for C^*_{min} would clearly be best. The other side, though, is tracking ability, and although it is clear that smaller values for C^*_{min} will allow better tracking of changing underlying statistics, there is little to be said quantitatively. The exact nature of the time variability is crucial and not something amenable to analysis.
D. Markov Implementation

A Markov chain interpretation of the scaled-count equations exists whenever the counts C_k are quantized after scaling and limits are placed on how large counts may grow. Any real implementation necessarily satisfies both these conditions.

Fig. 5 helps to explain the Markov chain interpretation. In drawing this figure, we made the following additional particular assumptions:

  • The alphabet is binary.
  • The quantization is to integers.
  • C^*_{min} = 2.
  • Any count is also scaled back whenever it exceeds the allowed maximum, here 6.
  • \Delta = 0.5.

Fig. 5. Markov interpretation for the scaled-count probability estimator.
A numerical value of 2 for C^*_{min} is slightly smaller than numbers we found optimal for image coding, which were generally in the 4 to 16 range. A numerical value of 6 for the allowed maximum is ridiculously small, by many orders of magnitude. Its virtue here is that it is large enough to illustrate key ideas to come, yet small enough that a plot like that in Fig. 5 stays manageable.
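Because the quantized count pair (C_0, C_1) can take only finitely many values under these assumptions, the entire estimator can be precomputed as a small state-transition table, which is the sense in which almost no computation is needed per symbol. The Python sketch below enumerates the states reachable from (0, 0) under the toy parameters above. The handling of the count ceiling here (reusing the same \Delta-preserving scaling to pull the largest count back to 6) is one reasonable choice for the sketch, not necessarily the one used in the standards.

```python
def next_state(c0, c1, symbol, c_min_star=2, c_max=6, delta=0.5):
    """One transition of the toy scaled-count Markov chain: increment the
    count of the observed symbol, rescale per (59)-(60) if the minimum count
    exceeds C*_min, rescale again (assumed handling) if either count exceeds
    c_max, and finally quantize to integers."""
    c = [c0 + (symbol == 0), c1 + (symbol == 1)]
    if min(c) > c_min_star:
        beta = (c_min_star + delta) / (min(c) + delta)
        c = [beta * (x + delta) - delta for x in c]
    if max(c) > c_max:                                  # count-ceiling fix (assumption)
        beta = (c_max + delta) / (max(c) + delta)
        c = [beta * (x + delta) - delta for x in c]
    return tuple(round(x) for x in c)                   # integer quantization

# Enumerate all states reachable from the initial state (0, 0).
states, frontier = {(0, 0)}, [(0, 0)]
while frontier:
    s = frontier.pop()
    for symbol in (0, 1):
        t = next_state(*s, symbol)
        if t not in states:
            states.add(t)
            frontier.append(t)
print(len(states), sorted(states))
```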
The solid squares are the Markov states, and the solid square in the lower left corner is the initial state. The observed sequence X(0), X(1), ... dictates the transitions between these states. If X(n) is 0, the transition is nominally one unit to the right. If it is 1, it is one unit up. Occasionally, such a transition would take the state to a point in C_0, C_1 space outside the region of allowed states. Such transitions are shown by dotted arrows to open squares, these squares representing what we call "virtual" states. Any transition to a virtual state is immediately ma
