`
ENTROPY AND CODING TECHNIQUES

2.1 INFORMATION AND ENTROPY
`
A binary digit, or "bit," b, takes one of the values b = 0 or b = 1.
`A single bit has the ability to convey a certain amount of information
`- the information corresponding to the outcome of a binary decision, or
`"event," such as a coin toss. If we have N bits, then we can identify the
`outcomes of N binary decisions.
`
Intuitively, the average amount of information associated with a binary decision depends upon prior knowledge which we have concerning the likelihoods of the possible outcomes. For example, there is little informative value to including snow conditions in the weather report during summer - in common parlance, the result is a foregone conclusion. By contrast, the binary events which convey most information on average are those which are equally likely. Similarly, the N-bit sequences which convey most information are those for which each bit has equally likely outcomes, regardless of the outcomes of the other bits in the sequence - loosely speaking, these are "entirely random" sequences of bits.
`
Source coding is the art of mapping each possible output from a given information source to a sequence of binary digits called "code bits." Ideally, the mapping has the property that the code bits are "entirely random," i.e., statistically independent, taking values of 0 and 1 with equal probability. In this way, the code bits convey the maximum possible amount of information. Then, provided the mapping is invertible, we can identify the number of code bits with the amount of information in the original source output.
`
`
The above concepts were formalized in the pioneering work of Claude Shannon [130]. A quantity known as "entropy" is defined in terms of the statistical properties of the information source. The entropy represents a lower bound on the average number of bits required to represent the source output. Moreover, it is possible to approach this lower bound arbitrarily closely. In fact, practical coding algorithms can achieve average bit rates which are extremely close to the entropy in many applications, and when they do so, the code bits must be entirely random.
`
2.1.1 MATHEMATICAL PRELIMINARIES

RANDOM VARIABLES AND VECTORS
Let X denote a random variable. Associated with the random variable is a set of possible outcomes, known as the alphabet, $\mathcal{A}_X$. The outcome of the random variable is denoted x, and is one of the elements of $\mathcal{A}_X$. A random variable is said to be discrete if its alphabet is finite or at most countably infinite. That is, we can enumerate the elements of the alphabet,

$$\mathcal{A}_X = \{a_0, a_1, a_2, \ldots\}$$
In this case, the statistical properties of the random variable are described by its probability mass function (PMF)

$$f_X(x) \triangleq P(X = x) \quad \text{for each } x \in \mathcal{A}_X$$

In words, $f_X(x)$ is the probability of the outcome X = x. By contrast, a continuous random variable has uncountably many outcomes, e.g. $\mathcal{A}_X = \mathbb{R}$, the set of all real numbers. In this chapter we will be concerned exclusively with discrete alphabets. As an example, we model binary decisions as random variables whose alphabets have only two entries, usually written $\mathcal{A}_X = \{0, 1\}$. Binary random variables play a special role in coding.
The notion of a random variable is trivially extended to random vectors, X, with alphabet, $\mathcal{A}_X$, and PMF, $f_X(\mathbf{x})$, for each vector, $\mathbf{x} \in \mathcal{A}_X$. An m-dimensional random vector is a collection of m random variables, usually taken as a column vector,

$$X = \begin{pmatrix} X_0 \\ X_1 \\ \vdots \\ X_{m-1} \end{pmatrix}$$

The PMF, $f_X(\mathbf{x})$, is sometimes written longhand as

$$f_X(\mathbf{x}) \equiv f_{X_0, X_1, \ldots, X_{m-1}}(x_0, x_1, \ldots, x_{m-1})$$
`
`
It denotes the probability that $X_0 = x_0$, $X_1 = x_1$, ..., and $X_{m-1} = x_{m-1}$ simultaneously. For this reason, it is often called the joint PMF, or joint distribution, for the m random variables.
From the joint distribution of a collection of m random variables, we can obtain the "marginal" distribution of any one of the random variables, $X_i$, as

$$f_{X_i}(x) = \sum_{\mathbf{x} \,\ni\, x_i = x} f_X(\mathbf{x})$$
`
`INDEPENDENCE AND CONDITIONAL PMF'S
We say that two random variables are statistically independent, or simply independent, if their joint distribution is separable; i.e.,

$$f_{X_0, X_1}(x_0, x_1) = f_{X_0}(x_0)\, f_{X_1}(x_1)$$

That is, the probability that both $X_0 = x_0$ and $X_1 = x_1$ is the product of the two marginal probabilities. As suggested by the introductory comments above, the notion of statistical independence plays an important role in coding.
We define the conditional distribution of $X_1$, given $X_0$, by

$$f_{X_1|X_0}(x_1, x_0) \triangleq \frac{f_{X_0,X_1}(x_0, x_1)}{f_{X_0}(x_0)}$$

The function, $f_{X_1|X_0}(\cdot, x_0)$, is interpreted as a modified PMF for $X_1$, where the modification is to reflect the fact that the outcome $X_0 = x_0$ is already known. If the two random variables are statistically independent, we expect that the outcome of $X_0$ has no bearing on the distribution of $X_1$ and indeed we find that

$$f_{X_1|X_0}(x_1, x_0) = f_{X_1}(x_1) \quad \text{if and only if } X_1, X_0 \text{ are independent}$$
`
We note that the marginal distribution of $X_0$ and the conditional distribution of $X_1$, given $X_0$, together are equivalent to the joint distribution of $X_1$ and $X_0$. More generally, we write $f_{X_n|X_{n-1},\ldots,X_0}(x_n, \ldots, x_0)$ for the conditional distribution of $X_n$, given $X_0$ through $X_{n-1}$. The joint distribution of all m random variables of an m-dimensional random vector, X, may be recovered from

$$f_X(\mathbf{x}) = f_{X_0}(x_0)\, f_{X_1|X_0}(x_1, x_0) \cdots f_{X_{m-1}|X_{m-2},\ldots,X_0}(x_{m-1}, \ldots, x_0) \tag{2.1}$$

and the random variables are said to be mutually independent if

$$f_X(\mathbf{x}) = f_{X_0}(x_0)\, f_{X_1}(x_1) \cdots f_{X_{m-1}}(x_{m-1})$$
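These definitions are simple to exercise numerically. The following minimal sketch (the joint PMF values are purely illustrative) computes the marginals and a conditional PMF from the joint distribution of two binary random variables and tests for independence.

```python
import numpy as np

# Illustrative joint PMF f_{X0,X1}(x0, x1); rows index x0, columns index x1.
f_joint = np.array([[0.40, 0.10],
                    [0.20, 0.30]])

# Marginal distributions, obtained by summing the joint PMF over the other variable.
f_x0 = f_joint.sum(axis=1)            # f_{X0}(x0) = [0.5, 0.5]
f_x1 = f_joint.sum(axis=0)            # f_{X1}(x1) = [0.6, 0.4]

# Conditional PMF f_{X1|X0}(x1, x0) = f_{X0,X1}(x0, x1) / f_{X0}(x0).
f_x1_given_x0 = f_joint / f_x0[:, None]

# X0 and X1 are independent if and only if the joint PMF is separable.
independent = np.allclose(f_joint, np.outer(f_x0, f_x1))

print("f_X1|X0 =\n", f_x1_given_x0)
print("independent:", independent)    # False for these illustrative values
```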
`
`
`EXPECTATION
The expectation of a random variable, X, is denoted E[X] and defined by

$$E[X] \triangleq \sum_{x \in \mathcal{A}_X} x f_X(x)$$
It represents the statistical average or mean of the random variable X. Here, for the first time, we are concerned with the algebraic properties of random variables. More generally, let $g(\cdot)$ be any function. We may define $Y = g(X)$ to be the random variable whose outcomes are $y = g(x)$ whenever the outcome of X is x. Consequently, the distribution of Y may be found from

$$f_Y(y) = \sum_{x \,\ni\, g(x) = y} f_X(x)$$
`
It is readily shown that the expectation of the new random variable, Y, satisfies

$$E[Y] = E[g(X)] = \sum_{y \in \mathcal{A}_Y} y f_Y(y) = \sum_{x \in \mathcal{A}_X} g(x) f_X(x) \tag{2.2}$$
`
Given two random variables, $X_0$ and $X_1$, we may define conditional expectations in the most obvious way as

$$E[X_1 \mid X_0 = x_0] \triangleq \sum_{x \in \mathcal{A}_{X_1}} x f_{X_1|X_0}(x, x_0)$$

and for any function, $g(\cdot)$, we have

$$E[g(X_1) \mid X_0 = x_0] = \sum_{x \in \mathcal{A}_{X_1}} g(x) f_{X_1|X_0}(x, x_0)$$
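As a quick check of equation (2.2), the following minimal sketch (with a purely illustrative PMF) evaluates E[g(X)] both directly over the alphabet of X and via the induced distribution of Y = g(X).

```python
# Illustrative PMF for X over the alphabet {0, 1, 2}, and g(x) = x**2.
f_X = {0: 0.5, 1: 0.3, 2: 0.2}
g = lambda x: x ** 2

# Right-hand side of (2.2): average g(x) over the alphabet of X.
E_gX = sum(g(x) * p for x, p in f_X.items())

# Left-hand side of (2.2): build f_Y for Y = g(X), then average over A_Y.
f_Y = {}
for x, p in f_X.items():
    f_Y[g(x)] = f_Y.get(g(x), 0.0) + p
E_Y = sum(y * p for y, p in f_Y.items())

assert abs(E_gX - E_Y) < 1e-12   # both equal 0.3*1 + 0.2*4 = 1.1
```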
`
`RANDOM PROCESSES
We conclude this section by introducing the concept of a discrete random process, denoted $\{X_n\}$. A random process is nothing but a sequence of individual random variables, $X_n$, $n \in \mathbb{Z}$, all having a common alphabet, $\mathcal{A}_X$. The key distinction from a random vector is that there are infinitely many random variables. The statistics are summarized by the vector PMF's, $f_{X_{i:j}}(\cdot)$, for all $i < j \in \mathbb{Z}$, where we use the notation, $X_{i:j}$, to refer to the (j - i)-dimensional random vector formed from the elements, $X_k$, $i \le k < j$, of the random process.
The random process, $\{X_n\}$, is said to be stationary if the vector PMF's satisfy

$$f_{X_{i:i+m}} = f_{X_{0:m}} \quad \text{for all } i, m \in \mathbb{Z},\ m > 0$$
`
`
That is, all collections of m consecutive random variables from the process have exactly the same joint distribution. Thus, a stationary random process is characterized by the PMF's, $f_{X_{0:m}}$, for each m = 1, 2, .... Alternatively, from equation (2.1) we see that stationary random processes are characterized by the marginal distribution, $f_{X_0} \equiv f_X$, together with the sequence of conditional distributions, $f_{X_m|X_{0:m}}$, for m = 1, 2, ....
In most applications we find that the conditional distributions satisfy

$$f_{X_m|X_{0:m}} = f_{X_m|X_{m-p:m}}, \quad \text{for all } m \ge p \tag{2.3}$$

for a sufficiently large value of the parameter, p. That is, the conditional distribution of $X_m$, given $X_0$ through $X_{m-1}$, is actually a function of only the p most recent random variables, $X_{m-p}$ through $X_{m-1}$. We say that $X_m$ is "conditionally independent" of $X_0$ through $X_{m-p-1}$. Conditional independence is a phenomenon which we usually expect to encounter in the information sources which we model using random processes. Indeed, statistical dependencies among samples taken from natural physical phenomena such as images and audio are generally of a local nature. For stationary processes, conditional independence means that the entire process is described by a finite number of conditional PMF's

$$f_{X_0},\ f_{X_1|X_0},\ f_{X_2|X_{0:2}},\ \ldots,\ f_{X_p|X_{0:p}}$$
`
These are called Markov random processes with parameter p. A Markov-1 random process is entirely described by $f_X$ and $f_{X_1|X_0}$. If p = 0, all elements of the random process are statistically independent with identical distribution, $f_X$. Such a random process is said to be IID (Independent and Identically Distributed). It is also said to be "memoryless."
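To make this concrete, the following minimal sketch simulates a Markov-1 binary source; the marginal and conditional PMF values used here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Markov-1 binary source: marginal f_X0 and conditional f_{X1|X0}.
# cond[x_prev] is the PMF of the next symbol given the previous outcome.
f_x0 = np.array([0.5, 0.5])
cond = np.array([[0.9, 0.1],     # f_{X1|X0}(., 0)
                 [0.3, 0.7]])    # f_{X1|X0}(., 1)

def simulate(n):
    """Draw n outcomes from the Markov-1 process defined above."""
    x = [int(rng.choice(2, p=f_x0))]
    for _ in range(n - 1):
        x.append(int(rng.choice(2, p=cond[x[-1]])))
    return x

print(simulate(20))
```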
Stationary random processes with conditional independence properties (i.e. Markov processes) play an extremely important role in coding, precisely because they are described by a finite number of conditional PMF's. By observing the outcomes of the random process over a finite period of time, we can hope to estimate these conditional PMF's and use these estimates to code future outcomes of the random process. In this way, we need not necessarily have any a priori knowledge concerning the statistics in order to effectively code the source output. Adaptive coders are based on this principle.
The technical condition required to enable estimation of the relevant PMF's from a finite number of outcomes is "ergodicity." To be more precise, suppose we observe the outcomes of random variables $X_0$ through $X_{M-1}$. For each m-dimensional vector, $\mathbf{y}$, let $K_{\mathbf{y},M}$ denote the number of occurrences of $\mathbf{y}$ as a "sub-string" of the observed sequence,
$x_{0:M}$; i.e.,

$$K_{\mathbf{y},M} = \left\| \{\, i \mid 0 \le i < M - m \ \text{and}\ x_{i:i+m} = \mathbf{y} \,\} \right\|$$

It is natural to estimate the conditional PMF's according to

$$\hat{f}_{X_m|X_{0:m}}(x, \mathbf{y}) = \frac{K_{(\mathbf{y},x),M} + \delta}{\sum_{z \in \mathcal{A}_X} \left( K_{(\mathbf{y},z),M} + \delta \right)}$$

where $\delta$ is a small offset (e.g. $\delta = 1$), included to avoid undefined or ill-conditioned estimates when M is small, and $(\mathbf{y}, x)$ denotes the vector formed by appending x to $\mathbf{y}$. If the random process is ergodic, these estimates will converge to the actual conditional PMF's as M increases.
Most random processes encountered in practice are ergodic. At least, practical coding algorithms are based on the assumption that the underlying random process is ergodic, so that we can estimate the statistics through observation.
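The estimation rule above is straightforward to implement. The following minimal sketch (purely illustrative; the observed sequence is made up) counts sub-strings of the observation and forms the delta-offset estimates of an order-m conditional PMF.

```python
from collections import Counter

def estimate_conditional_pmf(samples, m, alphabet, delta=1.0):
    """Estimate f_{X_m | X_{0:m}} from an observed sequence, as sketched above.

    Counts occurrences of each (context, symbol) pair of length m + 1 and
    normalizes with the small offset delta to avoid ill-conditioned estimates.
    """
    counts = Counter(tuple(samples[i:i + m + 1]) for i in range(len(samples) - m))
    contexts = {tuple(samples[i:i + m]) for i in range(len(samples) - m)}
    pmf = {}
    for context in contexts:
        total = sum(counts[context + (z,)] + delta for z in alphabet)
        pmf[context] = {z: (counts[context + (z,)] + delta) / total for z in alphabet}
    return pmf

# Illustrative usage: estimate the order-1 conditional PMF of a binary sequence.
seq = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
print(estimate_conditional_pmf(seq, m=1, alphabet=(0, 1)))
```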
`
2.1.2 THE CONCEPT OF ENTROPY

ENTROPY OF A RANDOM VARIABLE
The entropy of a random variable, X, is defined as

$$H(X) \triangleq -\sum_{x \in \mathcal{A}_X} f_X(x) \log_2 f_X(x)$$

We shall find the following equivalent expression more convenient and intuitive:

$$H(X) = E[-\log_2 f_X(X)]$$

To clarify this expression, define the function, $h_X(\cdot)$, by

$$h_X(x) \triangleq -\log_2 f_X(x)$$
`
As with any function, then, we may define the random variable, $Y = h_X(X)$, and apply equation (2.2) to see that the entropy of X is the expectation of the new random variable, Y; i.e.,

$$H(X) = E[h_X(X)]$$

As we shall see, the quantity $h_X(x)$ may be interpreted as the amount of information associated with the event X = x and then the entropy may be interpreted as the average (or expected) amount of information conveyed by the outcome of the random variable, X.
Entropy measures information in units of bits. The precise connection between entropy, as defined above, and the average number of bits
`required to code the outcome of the random variable will be explored
`in Sections 2.1.3 and 2.1.4. For the moment, however, it is instructive
`to reflect on the intuitive appeal of the definition, as suggested by the
`following properties:
`
1. $h_X(x)$ is a strictly decreasing function of the likelihood, $f_X(x)$. This agrees with the notion that a highly unlikely event carries considerable information when it occurs. For example, the appearance of snow in summer is a newsworthy event in most cities.
`
2. The entropy is bounded below by 0. Moreover, H(X) = 0 occurs if and only if X has a deterministic outcome, say $X = x_0$. That is,

$$f_X(x) = \begin{cases} 1 & \text{if } x = x_0 \\ 0 & \text{if } x \ne x_0 \end{cases}$$
`
3. For random variables with finite alphabets, the entropy is bounded above by $H(X) \le \log_2 \|\mathcal{A}_X\|$. Moreover, the upper bound occurs if and only if X has a uniform distribution; i.e., $f_X(x) = \frac{1}{\|\mathcal{A}_X\|}$, $\forall x \in \mathcal{A}_X$. Of particular interest is the case in which the alphabet consists of B-bit numbers, $\mathcal{A}_X = \{0, 1, \ldots, 2^B - 1\}$. Then $H(X) \le B$ with equality if and only if all $2^B$ outcomes are equally likely. Put another way, an information source whose outcomes are represented with B-bit numbers has an entropy of at most B bits, where this maximum occurs if and only if the B bits are "entirely random."
`
Example 2.1 Let X be a binary random variable with $f_X(0) = p$ and $f_X(1) = 1 - p$. Figure 2.1 plots H(X) as a function of the parameter, p. The figure clearly indicates the fact that the entropy is zero when X is deterministic (p = 0 or p = 1) and is maximized when the two outcomes are equally likely (p = 1/2).
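The behaviour plotted in Figure 2.1 is easy to reproduce. The following minimal sketch evaluates the binary entropy function at a few values of p and confirms that the maximum of one bit occurs at p = 1/2.

```python
import math

def binary_entropy(p):
    """H(X) in bits for a binary random variable with f_X(0) = p."""
    if p in (0.0, 1.0):
        return 0.0            # a deterministic outcome conveys no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.9, 1.0):
    print(f"p = {p:4.2f}   H(X) = {binary_entropy(p):.4f} bits")
# The maximum, H(X) = 1 bit, occurs at p = 0.5.
```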
`
JOINT AND CONDITIONAL ENTROPY
The definition of entropy extends naturally to random vectors so that

$$H(X) \triangleq E[-\log_2 f_X(X)] = E[h_X(X)]$$

In fact, since the alphabet of the random vector is discrete, we can enumerate its elements, $\mathcal{A}_X = \{a_0, a_1, \ldots\}$, and define a random variable, K, such that K = k whenever $X = a_k$. Since the outcomes of K and X convey exactly the same information we must insist that H(X) = H(K). The above definition then follows immediately.

We sometimes use the longhand expression

$$H(X) \equiv H(X_0, X_1, \ldots, X_{m-1})$$
`
`1······
`
`.
`
`H(X)
`
`o
`
`p=
`
`Figure 2.1. Entropy of a binary random variable, X, as a function of p = Ix (0).
`
where X is an m-dimensional random vector with elements, $X_0$ through $X_{m-1}$. We may also call this the joint entropy of the m random variables since it is a function of their joint PMF.
The conditional entropy of X given Y is defined by

$$H(X \mid Y) \triangleq -\sum_{y \in \mathcal{A}_Y} f_Y(y) \sum_{x \in \mathcal{A}_X} f_{X|Y}(x, y) \log_2 f_{X|Y}(x, y)$$
$$= -\sum_{y \in \mathcal{A}_Y} \sum_{x \in \mathcal{A}_X} f_{X,Y}(x, y) \log_2 f_{X|Y}(x, y)$$
$$= E[-\log_2 f_{X|Y}(X, Y)]$$
$H(X \mid Y)$ may be interpreted as the average additional information we receive from the outcome of X given that the outcome of Y is already known. This interpretation follows directly from the interpretation of H(X) as the average amount of information we receive from the outcome of X.¹ To see this, observe that

$$H(X, Y) = E[-\log_2 f_{X,Y}(X, Y)]$$
$$= E[-\log_2 (f_Y(Y)\, f_{X|Y}(X, Y))]$$
$$= E[-\log_2 f_Y(Y)] + E[-\log_2 f_{X|Y}(X, Y)]$$
$$= H(Y) + H(X \mid Y) \tag{2.4}$$
`
¹ Although the properties of H(X) suggest an interpretation as a measure of information, the connection with information will be concretely established in Section 2.1.3.
`
`
Thus, $H(X \mid Y)$ indeed represents the extra information, $H(X, Y) - H(Y)$. When X and Y are independent random variables we find that

$$H(X \mid Y) = E[-\log_2 f_{X|Y}(X, Y)] = E[-\log_2 f_X(X)] = H(X)$$

which agrees with the fact that the outcome of Y has no bearing on the information conveyed by X.
An important property of the conditional entropy is summarized by the following theorem.

Theorem 2.1 Let X be a random variable. Let Y be a random vector with elements $Y_0$ through $Y_{m-1}$ and let Y' be a random vector consisting of any subset (possibly empty) of these elements. Then

$$H(X \mid Y) \le H(X \mid Y')$$

with equality if and only if $f_{X|Y} = f_{X|Y'}$; i.e., if and only if X is conditionally independent of the elements in Y which are missing from Y'.

Corollary 2.2 For random variables, X and Y, $H(X \mid Y) \le H(X)$, with equality if and only if X and Y are independent.

Corollary 2.3 From equation (2.4), we also have $H(X, Y) \le H(X) + H(Y)$, with equality if and only if X and Y are independent.
`
These results have considerable intuitive appeal. If we know the outcomes of some collection of random variables, $Y_0$ through $Y_{m-1}$, which are not statistically independent of X, then this reduces the uncertainty of X and hence the amount of additional information conveyed by the outcome of X. As m increases, the uncertainty in X and hence the conditional entropy, $H(X \mid Y)$, continues to decrease so long as each new random variable $Y_m$ provides some new information about X, which is not already present in the other random variables, $Y_0$ through $Y_{m-1}$.
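Equation (2.4) and Corollary 2.2 can be checked numerically. The following minimal sketch (the joint PMF is purely illustrative) computes H(X), H(X|Y) and H(X, Y) from a small joint distribution and verifies both relations.

```python
import numpy as np

# Illustrative joint PMF f_{X,Y}(x, y); rows index x, columns index y.
f_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

def entropy(p):
    """Entropy, in bits, of a PMF supplied as an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

f_x = f_xy.sum(axis=1)                      # marginal of X
f_y = f_xy.sum(axis=0)                      # marginal of Y

# H(X|Y) from the definition: weight each conditional PMF f_{X|Y}(., y) by f_Y(y).
H_x_given_y = sum(f_y[j] * entropy(f_xy[:, j] / f_y[j]) for j in range(f_xy.shape[1]))

H_x, H_y, H_xy = entropy(f_x), entropy(f_y), entropy(f_xy)

assert H_x_given_y <= H_x + 1e-12                 # Corollary 2.2
assert abs(H_xy - (H_y + H_x_given_y)) < 1e-12    # equation (2.4)
print(f"H(X) = {H_x:.4f},  H(X|Y) = {H_x_given_y:.4f},  H(X,Y) = {H_xy:.4f} bits")
```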
Equation (2.4) is easily generalized to expand the entropy of any random vector, X, as

$$H(X) = E[-\log_2 f_X(X)]$$
$$= E[-\log_2 (f_{X_0}(X_0) \cdots f_{X_{m-1}|X_{m-2},\ldots,X_0}(X_{m-1}, \ldots, X_0))]$$
$$= H(X_0) + H(X_1 \mid X_0) + \cdots + H(X_{m-1} \mid X_{m-2}, \ldots, X_0) \tag{2.5}$$
`
`
Combining this expansion with Theorem 2.1, each conditional entropy being at most the corresponding unconditional entropy, we obtain $H(X) \le H(X_0) + H(X_1) + \cdots + H(X_{m-1})$, where equality holds if and only if the random variables, $X_i$, are all independent. As we shall see in Section 2.1.4, this expansion is particularly useful in coding.
`
`ENTROPY RATE
`Let {Xn } be a discrete stationary random process. Since the random
`process has infinite extent, the total amount of information conveyed by
`the outcome of the random process will usually be infinite. In fact, for
`Markov processes it must be either infinite or zero. Thus, it is more
`meaningful to introduce the notion of an "information rate" for random
`processes. A close analogy is the characterization of stationary random
`processes by their power rather than their energy in the study of linear
`systems.
We begin by defining the m-th order entropy of the random process by

$$H^{(m)}(\{X_n\}) \triangleq \frac{1}{m} H(X_{0:m}) = \frac{1}{m} H(X_{i:i+m}), \quad \text{for all } i \in \mathbb{Z}$$

Thus, the 1st order entropy is simply the entropy of any given random variable, say $X_0$ (they all have the same distribution), taken individually from the process. The 2nd order entropy is half the joint entropy of any pair of consecutive random variables from the process. From Corollary 2.3,

$$H^{(2)}(\{X_n\}) \le \frac{1}{2}\left(H(X_0) + H(X_1)\right) = H^{(1)}(\{X_n\})$$
In fact, from Theorem 2.1, we see that

$$m H^{(m)}(\{X_n\}) = H(X_0) + H(X_1 \mid X_{0:1}) + \cdots + H(X_{m-1} \mid X_{0:m-1}) \tag{2.6}$$
$$\ge m H(X_{m-1} \mid X_{0:m-1})$$

and hence

$$(m+1)\, H^{(m+1)}(\{X_n\}) = m H^{(m)}(\{X_n\}) + H(X_m \mid X_{0:m})$$
$$\le m H^{(m)}(\{X_n\}) + H(X_{m-1} \mid X_{0:m-1})$$
$$\le (m+1)\, H^{(m)}(\{X_n\})$$

So $H^{(m)}(\{X_n\})$ is a monotonically decreasing function of m. Since it is bounded below by 0, the sequence must converge and we define the entropy rate of the random process to be

$$H(\{X_n\}) \triangleq \lim_{m \to \infty} H^{(m)}(\{X_n\}) \tag{2.7}$$
`
`
This quantity is interpreted as the average rate at which the random process conveys information as the outcomes $X_n = x_n$ are discovered one by one.

For a Markov-p source process, equation (2.7) may be recast as

$$H(\{X_n\}) = H(X_p \mid X_{0:p}) \tag{2.8}$$
`
This very simple form follows from equation (2.6) and the conditional independence property of the random process, according to which

$$H^{(m)}(\{X_n\}) = \frac{1}{m}\left(H(X_0) + H(X_1 \mid X_{0:1}) + \cdots + H(X_{p-1} \mid X_{0:p-1})\right) + \frac{m-p}{m}\, H(X_p \mid X_{0:p})$$
$$\longrightarrow H(X_p \mid X_{0:p}) \quad \text{as } m \to \infty$$
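For a concrete feel for equation (2.8), the following minimal sketch (with purely illustrative transition probabilities) computes the entropy rate of a Markov-1 binary source and compares it with the first order entropy; the gap reflects the statistical dependence between consecutive symbols.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Illustrative Markov-1 binary source: P[x_prev] is the PMF of the next
# symbol given the previous one (same conditional PMF as the earlier sketch).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary marginal pi solves pi = pi P; take the left eigenvector for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

# Entropy rate of a Markov-1 process, per equation (2.8) with p = 1:
# H({X_n}) = H(X_1 | X_0) = sum_x pi(x) * H( f_{X1|X0}(., x) ).
rate = sum(pi[x] * entropy(P[x]) for x in range(2))

print(f"stationary distribution: {pi}")                       # [0.75 0.25]
print(f"entropy rate:        {rate:.4f} bits/symbol")
print(f"first order entropy: {entropy(pi):.4f} bits/symbol")   # larger than the rate
```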
`
`2.1.3
`
`SHANNON'S NOISELESS SOURCE
`CODING THEOREM
In Section 2.1.2 we defined a quantity called entropy, having properties which we would expect of a measure of information. The value of entropy, however, as a tool for understanding and developing practical compression algorithms, arises from a rigorous connection between these definitions and fundamental bounds on coding performance. This connection was first established by Shannon's "noiseless source coding theorem" [130]. The essence of this theorem is that the entropy rate of a random process provides a lower bound on the average number of bits which must be spent in coding each of its outcomes and also that this bound may be approached arbitrarily closely as the complexity of the coding scheme is allowed to grow without bound. Due to the importance of this result, we choose to reproduce Shannon's proof here, for the simple case of a stationary, memoryless random process.
Let $\{X_n\}$ be an IID random process, each element having distribution $f_X$ and entropy H(X). The entropy rate, $H(\{X_n\})$, in this case, is identical to H(X). For ease of expression we shall sometimes refer to the individual source outcomes, $X_0, X_1, \ldots$, as symbols. Then, we assess the information rate of the random process as the average number of bits per symbol required to represent the source output over a period of m consecutive symbols, in the limit as m becomes very large. Specifically, we construct a code which maps outcomes of the random vector, $X_{0:m}$, to L-bit codewords. This is a "fixed length" code since each block of m symbols is represented by the same number of code-bits, L. Codes of this form are known as (m, L) codes. The ratio, $\frac{L}{m}$, represents the average number of bits spent coding each symbol. The idea is essentially to show
that in the limit as m becomes very large, this ratio $\frac{L}{m}$ may be made arbitrarily close to H(X).

There are a total of $\|\mathcal{A}_X\|^m$ possible m-dimensional outcomes, $x_{0:m}$, and so it is clearly impossible to represent all outcomes perfectly unless $2^L \ge \|\mathcal{A}_X\|^m$; i.e., $\frac{L}{m} \ge \log_2 \|\mathcal{A}_X\|$. But we know that $H(X) \le \log_2 \|\mathcal{A}_X\|$, attaining this maximum value only when $f_X$ is the uniform distribution, with all outcomes equally likely. Thus, in order to establish a connection between entropy and the bit-rate of a fixed length code, we will need to admit the possibility that the coded representation might not be exact. Let $P_e(m, L)$ denote the probability that our L-bit code does not represent the random vector, $X_{0:m}$, exactly. The idea behind the noiseless source coding theorem is to show that $P_e(m, L)$ may be made arbitrarily small as m grows, provided the code-rate $\frac{L}{m} > H(X)$.
Theorem 2.4 Let $\{X_n\}$ be a discrete IID random process having finite entropy, H(X), and consider fixed length (m, L) codes, associating m-element outcome vectors, $x_{0:m}$, to L-bit codewords. Only one outcome vector may be associated with each codeword, so let $P_e(m, L)$ denote the probability of the event that $X_{0:m}$ has no associated codeword. Then, by making m sufficiently large, the error probability, $P_e(m, L)$, may be made arbitrarily small, so long as the code-rate satisfies

$$\frac{L}{m} > H(X)$$

Conversely, the error probability, $P_e(m, L)$, tends to 1 as $m \to \infty$ for codes having

$$\frac{L}{m} < H(X)$$
Proof. Consider the random variable, $h_{X_{0:m}}(X_{0:m})$, which we defined to be $-\log_2 f_{X_{0:m}}(X_{0:m})$. Since the elements of the random vector, $X_{0:m}$, are all independent, $f_{X_{0:m}}$ is separable and we obtain

$$h_{X_{0:m}}(X_{0:m}) = -\log_2 \prod_{i=0}^{m-1} f_X(X_i) = \sum_{i=0}^{m-1} h_X(X_i)$$

So $h_{X_{0:m}}(X_{0:m})$ is a sum of the IID random variables, $h_X(X_i)$. According to the weak law of large numbers, $\frac{1}{m}\sum_{i=0}^{m-1} h_X(X_i)$ converges to $E[h_X(X)] = H(X)$, as $m \to \infty$. Specifically, for any $\delta > 0$, let $\varepsilon(m, \delta)$ denote the probability

$$\varepsilon(m, \delta) = P\left( \left| \frac{1}{m} \sum_{i=0}^{m-1} h_X(X_i) - E[h_X(X)] \right| > \delta \right) = P\left( \left| \frac{1}{m} h_{X_{0:m}}(X_{0:m}) - H(X) \right| > \delta \right) \tag{2.9}$$
`
`
Then the weak law of large numbers states that

$$\lim_{m \to \infty} \varepsilon(m, \delta) = 0$$

Equivalently, let $\mathcal{T}(m, \delta)$ be the set of outcomes, $x_{0:m}$, for which

$$\left| \frac{1}{m} h_{X_{0:m}}(x_{0:m}) - H(X) \right| \le \delta$$

Then $\varepsilon(m, \delta) = P(X_{0:m} \notin \mathcal{T}(m, \delta)) \to 0$ as $m \to \infty$. For small $\delta$ and large m so that $\varepsilon(m, \delta)$ is very small, we may think of $\mathcal{T}(m, \delta)$ as the set of "typical" outcomes. The idea is to assign codewords only to these typical outcomes, since the probability of anything else becomes vanishingly small as m grows. For each $x_{0:m} \in \mathcal{T}(m, \delta)$, we have

$$H(X) - \delta \le \frac{1}{m} h_{X_{0:m}}(x_{0:m}) \le H(X) + \delta$$
`
and hence the probability of each typical outcome is bounded by

$$2^{-m(H(X)+\delta)} \le f_{X_{0:m}}(x_{0:m}) \le 2^{-m(H(X)-\delta)} \tag{2.10}$$

Letting $\delta$ become very small, the typical outcomes all have essentially the same likelihood, so that if we assign codewords only to the typical outcomes, the resulting L-bit codewords will be uniformly distributed, or "entirely random."

Using equation (2.10), we see that

$$P(X_{0:m} \in \mathcal{T}(m, \delta)) = \sum_{x_{0:m} \in \mathcal{T}(m, \delta)} f_{X_{0:m}}(x_{0:m}) \ge 2^{-m(H(X)+\delta)}\, \|\mathcal{T}(m, \delta)\|$$

So the number of typical outcomes is bounded above by

$$\|\mathcal{T}(m, \delta)\| \le \frac{1 - \varepsilon(m, \delta)}{2^{-m(H(X)+\delta)}} \le 2^{m(H(X)+\delta)}$$

It follows that so long as we select $L \ge m(H(X) + \delta)$, we can represent all of the typical outcomes with a distinct codeword and then the probability of error, $P_e(m, L)$, must be at most $\varepsilon(m, \delta)$, which tends to 0 as $m \to \infty$. This proves the first statement of the theorem, since $\delta > 0$ is arbitrary.
To prove the converse statement, we use equation (2.10) again to obtain a lower bound for the number of typical outcomes; i.e.,

$$\|\mathcal{T}(m, \delta)\| \ge \frac{1 - \varepsilon(m, \delta)}{2^{-m(H(X)-\delta)}}$$

Suppose that $L \le m(H(X) - 2\delta)$. Let T denote the number of elements from $\mathcal{T}(m, \delta)$ which are represented in this code. Then T satisfies

$$\frac{T}{\|\mathcal{T}(m, \delta)\|} \le \frac{2^L}{\|\mathcal{T}(m, \delta)\|} \le \frac{2^{-m\delta}}{1 - \varepsilon(m, \delta)}$$

So the fraction of typical outcomes which can be represented tends to 0 as $m \to \infty$, whenever the code rate is less than H(X). This suggests the validity of the
second statement of the theorem. To make the proof rigorous, observe that the total probability associated with the T elements of $\mathcal{T}(m, \delta)$ which are represented by the code is at most

$$2^L \cdot 2^{-m(H(X)-\delta)} \le 2^{-m\delta}$$

and the total probability of all other outcomes is $\varepsilon(m, \delta)$, so

$$P_e(m, L) \ge 1 - \varepsilon(m, \delta) - 2^{-m\delta} \to 1 \qquad \blacksquare$$
Several points are worth noting concerning the noiseless source coding theorem. Firstly, for finite length codes, fixed length coding is incapable of guaranteeing that all source outcomes will be represented exactly. The solution to this dilemma is variable length codes, which are examined next. Despite this obstacle, the noiseless source coding theorem does indeed establish a strong connection between entropy and coding. The entropy clearly partitions the set of code rates into two classes. So long as the code-rate exceeds the entropy, we can make sure that the entire message is coded without error with arbitrarily high confidence; if the code-rate is less than the entropy, long messages will contain errors with probability approaching 1.
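The convergence that drives the proof can be observed directly. The following minimal sketch (with purely illustrative values of p and delta) evaluates the quantity ε(m, δ) of equation (2.9) exactly for an IID binary source, using the fact that the sample information rate depends only on the number of ones in the block.

```python
from math import exp, lgamma, log, log2

# Illustrative IID binary source with f_X(1) = p.
p = 0.2
H = -p * log2(p) - (1 - p) * log2(1 - p)     # entropy H(X), about 0.722 bits
delta = 0.05

def log_binom_pmf(m, k):
    """log P(k ones in m IID draws), computed in log space to avoid overflow."""
    return (lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)
            + k * log(p) + (m - k) * log(1 - p))

def epsilon(m):
    """P(| -(1/m) log2 f_{X_{0:m}}(X_{0:m}) - H(X) | > delta), as in (2.9)."""
    total = 0.0
    for k in range(m + 1):
        sample_rate = -(k * log2(p) + (m - k) * log2(1 - p)) / m
        if abs(sample_rate - H) > delta:
            total += exp(log_binom_pmf(m, k))
    return total

for m in (10, 100, 1000, 10000):
    print(f"m = {m:6d}   epsilon = {epsilon(m):.4f}")
# epsilon(m, delta) shrinks toward 0: for large m the "typical" outcomes
# capture essentially all of the probability.
```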
As noted in the proof of the theorem, reliable codes whose rate approaches the entropy have the property that their codewords all occur with equal likelihood. That is, the L-bit sequences are "entirely random." Recall that we began Section 2.1 with the claim that the representation of source outcomes with entirely random sequences of bits is the goal of source coding. This is perhaps the most important observation arising from the noiseless source coding theorem.

Shannon's original result has been extended over the years to random processes satisfying a variety of technical conditions. For more general random processes than the simple memoryless processes considered above, the key difficulty is to demonstrate convergence of $\varepsilon(m, \delta)$, as defined by equation (2.9). This is known as the entropy-ergodic property. Shannon himself extended the result to Markov processes, while extensions to more general ergodic random processes were developed by McMillan [107] and extended by Breiman [29, 30] and others. The more general result is often known as the Shannon-McMillan-Breiman theorem, or the asymptotic equipartition (AEP) theorem.
`
2.1.4 ELIAS CODING
`As mentioned above, fixed length codes cannot generally guarantee
`lossless coding. In this section, we consider variable length codes. It is
`most instructive to describe a particular coding algorithm, whose ability
to approach the entropy rate of a stationary Markov random process can be demonstrated rather easily. The algorithm is not practical as it stands since its implementation requires infinite precision arithmetic. Nevertheless, it is the basis for a family of highly efficient practical coding techniques, known collectively as arithmetic coding. Indeed, one member of this family is at the heart of the JPEG2000 image compression standard (see Section 12.1). Practical arithmetic coding is the subject of Section 2.3. P. Elias is usually credited with conceiving the algorithm shortly after Shannon's original publication on information theory.
`
`MAPPING OUTCOMES TO INTERVALS
Let $\{X_n\}$ be a stationary random process. To begin, we will restrict ourselves to memoryless processes, as in Section 2.1.3. In this case, we hope to be able to code the outcomes of the random process at an average rate of H(X) bits per symbol.

Following the notation developed above, we denote the first n outcomes of the random process by the vector, $x_{0:n}$. The algorithm is best understood as associating each such length n prefix of the source sequence with a unique interval on the real line,

$$[c_n, c_n + a_n) \subseteq [0, 1)$$

such that the length of this interval is equal to $f_{X_{0:n}}(x_{0:n})$. The algorithm is implemented recursively as follows:
`
Elias Coding Algorithm
  Initialize $c_0 = 0$ and $a_0 = 1$.
  For each $n = 0, 1, \ldots$
    Update $a_{n+1} \leftarrow a_n f_X(x_n)$
    Update $c_{n+1} \leftarrow c_n + a_n F_X(x_n)$

Here, $F_X$ denotes the cumulative distribution²,

$$F_X(a_i) \triangleq \sum_{j=0}^{i-1} f_X(a_j) \quad \text{where } \mathcal{A}_X = \{a_0, a_1, \ldots\}$$
`
We assume that the encoder and decoder both have access to the underlying distribution function, $f_X$ and hence $F_X$, or else they both use identical estimates for this function.

² Note the non-standard definition here, in which the probability of $a_i$ itself is not included in the summation.
`
`
`,..
`
`37
`
`:rr"'--
`
`16
`4'-
`
`44
`
`' .....--.;:----~
`
`38
`
`Information and Entropy
`
`o
`
`28
`C 7
`3 4' - - - - - : ) ( ' ;4 4'
`
`Figure 2.2.
`
`Elias coding for a memoryless binary source.
`
Example 2.2 Consider a binary memoryless source with $f_X(0) = \frac{1}{4}$ and $f_X(1) = \frac{3}{4}$ and suppose the source outputs the sequence "01101...". Figure 2.2 indicates the evolution of the intervals $[c_n, c_n + a_n)$.
The intervals, $[c_n, c_n + a_n)$, have the following easily verified properties:

1. The set of intervals, $[c_n, c_n + a_n)$, corresponding to each distinct vector, $x_{0:n} \in \mathcal{A}_X^n$, are disjoint and their union is [0, 1). That is, the set of all possible length n prefixes of the source output induces a partition of the unit interval, [0, 1).

2. The intervals corresponding to successively longer prefixes of the source output sequence are nested; i.e.,

$$[c_{n+1}, c_{n+1} + a_{n+1}) \subseteq [c_n, c_n + a_n)$$

3. The length of the interval associated with $x_{0:n}$ satisfies

$$a_n = \prod_{i=0}^{n-1} f_X(x_i) = f_{X_{0:n}}(x_{0:n})$$
`
`MAPPING INTERVALS TO CODEWORDS
Suppose we apply the recursive algorithm described above for a total of m source output symbols. The key observation behind Elias coding is that the particular outcome, $x_{0:m}$, may be uniquely identified by any number in the interval $[c_m, c_m + a_m)$, as a consequence of property 1 above. Since the interval has length $a_m$, it must contain at least one
$L_m$-bit binary fraction of the form

$$0.\underbrace{b_1 b_2 \cdots b_{L_m}}_{L_m \text{ bits}}$$

where $L_m$ is any integer satisfying $2^{-L_m} < a_m$. Thus, we conclude that the number of bits in the representation is

$$L_m \approx -\log_2 a_m = h_{X_{0:m}}(x_{0:m}) = \sum_{n=0}^{m-1} h_X(x_n)$$
In this way, the Elias coding algorithm firmly establishes the connection between $h_X(x)$ and the amount of information associated with the outcome X = x. Each individual outcome, $X_n = x_n$, reduces the interval length by the factor $f_X(x_n)$, adding exactly $h_X(x_n) = -\log_2 f_X(x_n)$ bits to the code length.

This in turn means that the average number of bits required to code m symbols from the source output sequence is $E[h_{X_{0:m}}(X_{0:m})] = mH(X)$.
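The recursion and the codeword selection are easy to prototype with exact rational arithmetic, which emulates (but does not remove the cost of) the infinite precision the algorithm requires. The following is a minimal sketch under that assumption, not the practical arithmetic coder of Section 2.3.

```python
from fractions import Fraction
from math import ceil, log2

def elias_encode(symbols, pmf):
    """Map a symbol sequence to an L_m-bit binary fraction inside [c_m, c_m + a_m).

    pmf maps each symbol to its probability; exact Fractions emulate the
    infinite-precision arithmetic assumed by the Elias algorithm.
    """
    alphabet = sorted(pmf)
    cdf, acc = {}, Fraction(0)
    for a in alphabet:                         # F_X(a_i): probabilities of a_j, j < i
        cdf[a] = acc
        acc += Fraction(pmf[a])

    c, a = Fraction(0), Fraction(1)
    for x in symbols:                          # the recursive interval update
        c, a = c + a * cdf[x], a * Fraction(pmf[x])

    L = ceil(-log2(a)) + 1                     # enough bits that 2**-L < a_m
    code = ceil(c * 2**L)                      # smallest L-bit fraction >= c_m
    assert Fraction(code, 2**L) < c + a        # the fraction lies inside the interval
    return format(code, f"0{L}b")

# Illustrative usage with f_X(0) = 1/4, f_X(1) = 3/4, as in Example 2.2.
print(elias_encode([0, 1, 1, 0, 1], {0: Fraction(1, 4), 1: Fraction(3, 4)}))
```

A decoder holding the same PMF recovers the symbols by repeating the interval subdivision and observing which sub-interval contains the received fraction, assuming it also knows how many symbols (and bits) to expect, which is the termination issue discussed next.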
`
`ELIAS TERMINATION
There is a subtle weakness in the above argument in that the decoder does not know a priori the number of bits, $L_m$, which are being used to represent the source output, $x_{0:m}$. Therefore, we ought to provide a mechanism for signalling this length and include the number of bits consumed by this mechanism in the overall bit count. In practical arithmetic coding algorithms, we will usually code a very large number of source outcomes, m, together, so that this cost may often be neglected. Nevertheless, it is worthwhile presenting a particular codeword termination policy suggested by Elias, for which there is no need to explicitly
`si