Chapter 2
ENTROPY AND CODING TECHNIQUES

2.1 INFORMATION AND ENTROPY
A binary digit, or "bit," b, takes one of the values b = 0 or b = 1. A single bit has the ability to convey a certain amount of information - the information corresponding to the outcome of a binary decision, or "event," such as a coin toss. If we have N bits, then we can identify the outcomes of N binary decisions.

Intuitively, the average amount of information associated with a binary decision depends upon prior knowledge which we have concerning the likelihoods of the possible outcomes. For example, there is little informative value to including snow conditions in the weather report during summer - in common parlance, the result is a foregone conclusion. By contrast, the binary events which convey most information on average are those which are equally likely. Similarly, the N-bit sequences which convey most information are those for which each bit has equally likely outcomes, regardless of the outcomes of the other bits in the sequence - loosely speaking, these are "entirely random" sequences of bits.

Source coding is the art of mapping each possible output from a given information source to a sequence of binary digits called "code bits." Ideally, the mapping has the property that the code bits are "entirely random," i.e., statistically independent, taking values of 0 and 1 with equal probability. In this way, the code bits convey the maximum possible amount of information. Then, provided the mapping is invertible, we can identify the number of code bits with the amount of information in the original source output.
The above concepts were formalized in the pioneering work of Claude Shannon [130]. A quantity known as "entropy" is defined in terms of the statistical properties of the information source. The entropy represents a lower bound on the average number of bits required to represent the source output. Moreover, it is possible to approach this lower bound arbitrarily closely. In fact, practical coding algorithms can achieve average bit rates which are extremely close to the entropy in many applications, and when they do so the code bits must be entirely random.
2.1.1 MATHEMATICAL PRELIMINARIES

RANDOM VARIABLES AND VECTORS
Let X denote a random variable. Associated with the random variable is a set of possible outcomes, known as the alphabet, $\mathcal{A}_X$. The outcome of the random variable is denoted x, and is one of the elements of $\mathcal{A}_X$. A random variable is said to be discrete if its alphabet is finite or at most countably infinite. That is, we can enumerate the elements of the alphabet,
$$\mathcal{A}_X = \{a_0, a_1, a_2, \ldots\}$$
In this case, the statistical properties of the random variable are described by its probability mass function (PMF),
$$f_X(x) \triangleq P(X = x) \quad \text{for each } x \in \mathcal{A}_X$$
In words, $f_X(x)$ is the probability of the outcome X = x. By contrast, a continuous random variable has uncountably many outcomes, e.g. $\mathcal{A}_X = \mathbb{R}$, the set of all real numbers. In this chapter we will be concerned exclusively with discrete alphabets. As an example, we model binary decisions as random variables whose alphabets have only two entries, usually written $\mathcal{A}_X = \{0, 1\}$. Binary random variables play a special role in coding.
The notion of a random variable is trivially extended to random vectors, $\mathbf{X}$, with alphabet, $\mathcal{A}_{\mathbf{X}}$, and PMF, $f_{\mathbf{X}}(\mathbf{x})$, for each vector, $\mathbf{x} \in \mathcal{A}_{\mathbf{X}}$. An m-dimensional random vector is a collection of m random variables, usually taken as a column vector,
$$\mathbf{X} = \begin{bmatrix} X_0 \\ X_1 \\ \vdots \\ X_{m-1} \end{bmatrix}$$
The PMF, $f_{\mathbf{X}}(\mathbf{x})$, is sometimes written longhand as
$$f_{\mathbf{X}}(\mathbf{x}) \equiv f_{X_0, X_1, \ldots, X_{m-1}}(x_0, x_1, \ldots, x_{m-1})$$
It denotes the probability that $X_0 = x_0$, $X_1 = x_1$, ..., and $X_{m-1} = x_{m-1}$ simultaneously. For this reason, it is often called the joint PMF, or joint distribution, for the m random variables.

From the joint distribution of a collection of m random variables, we can obtain the "marginal" distribution of any one of the random variables, $X_i$, as
$$f_{X_i}(x) = \sum_{\mathbf{x}\,:\,x_i = x} f_{\mathbf{X}}(\mathbf{x})$$
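To make the marginalization concrete, here is a minimal Python sketch; the joint PMF and function names are illustrative assumptions, not taken from the text. It simply sums the joint PMF over all outcome vectors whose i-th coordinate takes a given value.

```python
# A minimal sketch: marginalizing a joint PMF, represented as a dictionary
# mapping outcome tuples (x0, x1, ..., x_{m-1}) to probabilities.
# The joint PMF below is an illustrative example, not taken from the text.

def marginal(joint_pmf, i):
    """Return f_{X_i}(x) = sum of f_X(x-vector) over vectors whose i-th entry is x."""
    f_xi = {}
    for outcome, prob in joint_pmf.items():
        f_xi[outcome[i]] = f_xi.get(outcome[i], 0.0) + prob
    return f_xi

# Example joint PMF for two binary random variables (X0, X1).
f_joint = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}

print(marginal(f_joint, 0))  # {0: 0.6, 1: 0.4}
print(marginal(f_joint, 1))  # {0: 0.5, 1: 0.5}
```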
INDEPENDENCE AND CONDITIONAL PMF'S

We say that two random variables are statistically independent, or simply independent, if their joint distribution is separable; i.e.,
$$f_{X_0, X_1}(x_0, x_1) = f_{X_0}(x_0)\, f_{X_1}(x_1)$$
That is, the probability that both $X_0 = x_0$ and $X_1 = x_1$ is the product of the two marginal probabilities. As suggested by the introductory comments above, the notion of statistical independence plays an important role in coding.

We define the conditional distribution of $X_1$, given $X_0$, by
$$f_{X_1|X_0}(x_1, x_0) \triangleq \frac{f_{X_0, X_1}(x_0, x_1)}{f_{X_0}(x_0)}$$
The function, $f_{X_1|X_0}(\cdot, x_0)$, is interpreted as a modified PMF for $X_1$, where the modification is to reflect the fact that the outcome $X_0 = x_0$ is already known. If the two random variables are statistically independent, we expect that the outcome of $X_0$ has no bearing on the distribution of $X_1$, and indeed we find that
$$f_{X_1|X_0}(x_1, x_0) = f_{X_1}(x_1) \quad \text{if and only if } X_1, X_0 \text{ are independent}$$
We note that the marginal distribution of $X_0$ and the conditional distribution of $X_1$, given $X_0$, together are equivalent to the joint distribution of $X_1$ and $X_0$. More generally, we write $f_{X_n|X_{n-1},\ldots,X_0}(x_n, \ldots, x_0)$ for the conditional distribution of $X_n$, given $X_0$ through $X_{n-1}$. The joint distribution of all m random variables of an m-dimensional random vector, $\mathbf{X}$, may be recovered from
$$f_{\mathbf{X}}(\mathbf{x}) = f_{X_0}(x_0)\, f_{X_1|X_0}(x_1, x_0) \cdots f_{X_{m-1}|X_{m-2},\ldots,X_0}(x_{m-1}, \ldots, x_0) \tag{2.1}$$
and the random variables are said to be mutually independent if
$$f_{\mathbf{X}}(\mathbf{x}) = f_{X_0}(x_0)\, f_{X_1}(x_1) \cdots f_{X_{m-1}}(x_{m-1})$$
EXPECTATION

The expectation of a random variable, X, is denoted E[X] and defined by
$$E[X] \triangleq \sum_{x \in \mathcal{A}_X} x f_X(x)$$
It represents the statistical average or mean of the random variable X. Here, for the first time, we are concerned with the algebraic properties of random variables. More generally, let $g(\cdot)$ be any function. We may define $Y = g(X)$ to be the random variable whose outcomes are $y = g(x)$ whenever the outcome of X is x. Consequently, the distribution of Y may be found from
$$f_Y(y) = \sum_{x\,:\,g(x) = y} f_X(x)$$
It is readily shown that the expectation of the new random variable, Y, satisfies
$$E[Y] = E[g(X)] = \sum_{y \in \mathcal{A}_Y} y f_Y(y) = \sum_{x \in \mathcal{A}_X} g(x) f_X(x) \tag{2.2}$$
Given two random variables, $X_0$ and $X_1$, we may define conditional expectations in the most obvious way as
$$E[X_1 \mid X_0 = x_0] \triangleq \sum_{x \in \mathcal{A}_{X_1}} x f_{X_1|X_0}(x, x_0)$$
and for any function, $g(\cdot)$, we have
$$E[g(X_1) \mid X_0 = x_0] = \sum_{x \in \mathcal{A}_{X_1}} g(x) f_{X_1|X_0}(x, x_0)$$
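As a small illustration of equation (2.2), the sketch below (with an arbitrary illustrative PMF) computes E[g(X)] directly from the PMF of X, without first constructing the PMF of Y = g(X).

```python
# A minimal sketch of equation (2.2): E[g(X)] computed directly from the PMF of X.
# The PMF below is an illustrative example, not taken from the text.

def expectation(pmf, g=lambda x: x):
    """Return E[g(X)] = sum_x g(x) f_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())

f_X = {0: 0.5, 1: 0.25, 2: 0.25}

print(expectation(f_X))                   # E[X]   = 0*0.5 + 1*0.25 + 2*0.25 = 0.75
print(expectation(f_X, g=lambda x: x*x))  # E[X^2] = 0 + 0.25 + 4*0.25 = 1.25
```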
RANDOM PROCESSES

We conclude this section by introducing the concept of a discrete random process, denoted $\{X_n\}$. A random process is nothing but a sequence of individual random variables, $X_n$, $n \in \mathbb{Z}$, all having a common alphabet, $\mathcal{A}_X$. The key distinction from a random vector is that there are infinitely many random variables. The statistics are summarized by the vector PMF's, $f_{X_{i:j}}(\cdot)$, for all $i < j \in \mathbb{Z}$, where we use the notation, $X_{i:j}$, to refer to the $(j - i)$-dimensional random vector formed from the elements, $X_k$, $i \le k < j$, of the random process.

The random process, $\{X_n\}$, is said to be stationary if the vector PMF's satisfy
$$f_{X_{i:i+m}} = f_{X_{0:m}} \quad \text{for all } i, m \in \mathbb{Z},\ m > 0$$
That is, all collections of m consecutive random variables from the process have exactly the same joint distribution. Thus, a stationary random process is characterized by the PMF's, $f_{X_{0:m}}$, for each m = 1, 2, .... Alternatively, from equation (2.1) we see that stationary random processes are characterized by the marginal distribution, $f_{X_0} \equiv f_X$, together with the sequence of conditional distributions, $f_{X_m|X_{0:m}}$, for m = 1, 2, ....

In most applications we find that the conditional distributions satisfy
$$f_{X_m|X_{0:m}} = f_{X_m|X_{m-p:m}} \tag{2.3}$$
for a sufficiently large value of the parameter, p. That is, the conditional distribution of $X_m$, given $X_0$ through $X_{m-1}$, is actually a function of only the p most recent random variables, $X_{m-p}$ through $X_{m-1}$. We say that $X_m$ is "conditionally independent" of $X_0$ through $X_{m-p-1}$. Conditional independence is a phenomenon which we usually expect to encounter in the information sources which we model using random processes. Indeed, statistical dependencies among samples taken from natural physical phenomena such as images and audio are generally of a local nature. For stationary processes, conditional independence means that the entire process is described by a finite number of conditional PMF's,
$$f_{X_0},\ f_{X_1|X_0},\ f_{X_2|X_{0:2}},\ \ldots,\ f_{X_p|X_{0:p}}$$
These are called Markov random processes with parameter p. A Markov-1 random process is entirely described by $f_X$ and $f_{X_1|X_0}$. If p = 0, all elements of the random process are statistically independent with identical distribution, $f_X$. Such a random process is said to be IID (Independent and Identically Distributed). It is also said to be "memoryless."

Stationary random processes with conditional independence properties (i.e., Markov processes) play an extremely important role in coding, precisely because they are described by a finite number of conditional PMF's. By observing the outcomes of the random process over a finite period of time, we can hope to estimate these conditional PMF's and use these estimates to code future outcomes of the random process. In this way, we need not necessarily have any a priori knowledge concerning the statistics in order to effectively code the source output. Adaptive coders are based on this principle.
The technical condition required to enable estimation of the relevant PMF's from a finite number of outcomes is "ergodicity." To be more precise, suppose we observe the outcomes of random variables $X_0$ through $X_{M-1}$. For each m-dimensional vector, $\mathbf{y}$, let $K_{\mathbf{y},M}$ denote the number of occurrences of $\mathbf{y}$ as a "sub-string" of the observed sequence, $x_{0:M}$; i.e.,
$$K_{\mathbf{y},M} = \left\|\left\{\, i \mid 0 \le i \le M - m \text{ and } x_{i:i+m} = \mathbf{y} \,\right\}\right\|$$
It is natural to estimate the conditional PMF's according to
$$\hat{f}_{X_m|X_{0:m}}(x, \mathbf{y}) = \frac{K_{(\mathbf{y},x),M} + \beta}{\sum_{z \in \mathcal{A}_X} \left( K_{(\mathbf{y},z),M} + \beta \right)}$$
where $\beta$ is a small offset (e.g., $\beta = 1$), included to avoid undefined or ill-conditioned estimates when M is small, and $(\mathbf{y}, x)$ denotes the vector formed by appending x to $\mathbf{y}$. If the random process is ergodic, these estimates will converge to the actual conditional PMF's as M increases. Most random processes encountered in practice are ergodic. At least, practical coding algorithms are based on the assumption that the underlying random process is ergodic, so that we can estimate the statistics through observation.
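The counting estimate above is straightforward to implement. The sketch below is a minimal illustration for a Markov-1 style context; the sample sequence, the context length m = 1, and the offset β = 1 are illustrative assumptions.

```python
# A minimal sketch of the counting estimate above: estimate f_{X_m | X_{0:m}}(x, y)
# from an observed sequence by counting occurrences of the context y followed by x,
# with a small offset beta to keep the estimate well defined for unseen contexts.

from collections import Counter

def estimate_conditional_pmf(samples, m, alphabet, beta=1.0):
    counts = Counter()
    for i in range(len(samples) - m):
        context = tuple(samples[i:i + m])
        counts[(context, samples[i + m])] += 1

    def f_hat(x, context):
        num = counts[(tuple(context), x)] + beta
        den = sum(counts[(tuple(context), z)] + beta for z in alphabet)
        return num / den

    return f_hat

samples = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
f_hat = estimate_conditional_pmf(samples, m=1, alphabet=(0, 1))
print(f_hat(1, (0,)))  # estimated probability that X_n = 1 given X_{n-1} = 0
print(f_hat(1, (1,)))  # estimated probability that X_n = 1 given X_{n-1} = 1
```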
2.1.2 THE CONCEPT OF ENTROPY

ENTROPY OF A RANDOM VARIABLE

The entropy of a random variable, X, is defined as
$$H(X) \triangleq -\sum_{x \in \mathcal{A}_X} f_X(x) \log_2 f_X(x)$$
We shall find the following equivalent expression more convenient and intuitive:
$$H(X) = E[-\log_2 f_X(X)]$$
To clarify this expression, define the function, $h_X(\cdot)$, by
$$h_X(x) \triangleq -\log_2 f_X(x)$$
As with any function, then, we may define the random variable, $Y = h_X(X)$, and apply equation (2.2) to see that the entropy of X is the expectation of the new random variable, Y; i.e.,
$$H(X) = E[h_X(X)]$$
As we shall see, the quantity $h_X(x)$ may be interpreted as the amount of information associated with the event X = x, and then the entropy may be interpreted as the average (or expected) amount of information conveyed by the outcome of the random variable, X.
Entropy measures information in units of bits. The precise connection between entropy, as defined above, and the average number of bits required to code the outcome of the random variable will be explored in Sections 2.1.3 and 2.1.4. For the moment, however, it is instructive to reflect on the intuitive appeal of the definition, as suggested by the following properties:
1. $h_X(x)$ is a strictly decreasing function of the likelihood, $f_X(x)$. This agrees with the notion that a highly unlikely event carries considerable information when it occurs. For example, the appearance of snow in summer is a newsworthy event in most cities.

2. The entropy is bounded below by 0. Moreover, H(X) = 0 occurs if and only if X has a deterministic outcome, say $X = x_0$. That is,
$$f_X(x) = \begin{cases} 1 & \text{if } x = x_0 \\ 0 & \text{if } x \ne x_0 \end{cases}$$

3. For random variables with finite alphabets, the entropy is bounded above by $H(X) \le \log_2 \|\mathcal{A}_X\|$. Moreover, the upper bound occurs if and only if X has a uniform distribution; i.e., $f_X(x) = \frac{1}{\|\mathcal{A}_X\|}$, $\forall x \in \mathcal{A}_X$. Of particular interest is the case in which the alphabet consists of B-bit numbers, $\mathcal{A}_X = \{0, 1, \ldots, 2^B - 1\}$. Then $H(X) \le B$, with equality if and only if all $2^B$ outcomes are equally likely. Put another way, an information source whose outcomes are represented with B-bit numbers has an entropy of at most B bits, where this maximum occurs if and only if the B bits are "entirely random."
Example 2.1 Let X be a binary random variable with $f_X(0) = p$ and $f_X(1) = 1 - p$. Figure 2.1 plots H(X) as a function of the parameter, p. The figure clearly indicates the fact that the entropy is zero when X is deterministic (p = 0 or p = 1) and is maximized when the two outcomes are equally likely (p = 1/2).
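The entropy computation itself is a one-line sum. The following sketch evaluates H(X) for the binary PMF of Example 2.1 at a few values of p, reproducing the behaviour plotted in Figure 2.1; the chosen values of p are arbitrary.

```python
# A minimal sketch: entropy H(X) = -sum_x f_X(x) log2 f_X(x), evaluated for the
# binary PMF of Example 2.1 at a few values of p.  Terms with f_X(x) = 0
# contribute nothing, consistent with the convention 0 log2 0 = 0.

from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, entropy({0: p, 1: 1 - p}))
# H(X) is 0 at p = 0 and p = 1, and reaches its maximum of 1 bit at p = 0.5.
```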
JOINT AND CONDITIONAL ENTROPY

The definition of entropy extends naturally to random vectors, so that
$$H(\mathbf{X}) \triangleq E[-\log_2 f_{\mathbf{X}}(\mathbf{X})] = E[h_{\mathbf{X}}(\mathbf{X})]$$
In fact, since the alphabet of the random vector is discrete, we can enumerate its elements, $\mathcal{A}_{\mathbf{X}} = \{\mathbf{a}_0, \mathbf{a}_1, \ldots\}$, and define a random variable, K = k whenever $\mathbf{X} = \mathbf{a}_k$. Since the outcomes of K and $\mathbf{X}$ convey exactly the same information, we must insist that $H(\mathbf{X}) = H(K)$. The above definition then follows immediately.

[Figure 2.1. Entropy of a binary random variable, X, as a function of p = f_X(0).]

We sometimes use the longhand expression
$$H(\mathbf{X}) \equiv H(X_0, X_1, \ldots, X_{m-1})$$
where $\mathbf{X}$ is an m-dimensional random vector with elements, $X_0$ through $X_{m-1}$. We may also call this the joint entropy of the m random variables, since it is a function of their joint PMF.
The conditional entropy of X given Y is defined by
$$\begin{aligned}
H(X \mid Y) &\triangleq -\sum_{y \in \mathcal{A}_Y} f_Y(y) \sum_{x \in \mathcal{A}_X} f_{X|Y}(x, y) \log_2 f_{X|Y}(x, y) \\
&= -\sum_{y \in \mathcal{A}_Y} \sum_{x \in \mathcal{A}_X} f_{X,Y}(x, y) \log_2 f_{X|Y}(x, y) \\
&= E[-\log_2 f_{X|Y}(X, Y)]
\end{aligned}$$
H(X | Y) may be interpreted as the average additional information we receive from the outcome of X given that the outcome of Y is already known. This interpretation follows directly from the interpretation of H(X) as the average amount of information we receive from the outcome of X.¹ To see this, observe that
$$\begin{aligned}
H(X, Y) &= E[-\log_2 f_{X,Y}(X, Y)] \\
&= E[-\log_2 (f_Y(Y)\, f_{X|Y}(X, Y))] \\
&= E[-\log_2 f_Y(Y)] + E[-\log_2 f_{X|Y}(X, Y)] \\
&= H(Y) + H(X \mid Y)
\end{aligned} \tag{2.4}$$
¹ Although the properties of H(X) suggest an interpretation as a measure of information, the connection with information will be concretely established in Section 2.1.3.
Thus, H(X | Y) indeed represents the extra information, H(X, Y) - H(Y). When X and Y are independent random variables, we find that
$$H(X \mid Y) = E[-\log_2 f_{X|Y}(X, Y)] = E[-\log_2 f_X(X)] = H(X)$$
which agrees with the fact that the outcome of Y has no bearing on the information conveyed by X.
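Equation (2.4) is also easy to verify numerically. The sketch below checks H(X, Y) = H(Y) + H(X | Y) on a small illustrative joint PMF, not taken from the text.

```python
# A minimal numerical check of equation (2.4), H(X, Y) = H(Y) + H(X | Y),
# for a small illustrative joint PMF over (x, y) pairs.

from math import log2

f_XY = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

H_XY = -sum(p * log2(p) for p in f_XY.values())

# Marginal of Y, and conditional entropy H(X | Y) = -sum f(x,y) log2 [f(x,y)/f_Y(y)].
f_Y = {}
for (x, y), p in f_XY.items():
    f_Y[y] = f_Y.get(y, 0.0) + p
H_Y = -sum(p * log2(p) for p in f_Y.values())
H_X_given_Y = -sum(p * log2(p / f_Y[y]) for (x, y), p in f_XY.items())

print(H_XY, H_Y + H_X_given_Y)  # the two values agree
```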
An important property of the conditional entropy is summarized by the following theorem.

Theorem 2.1 Let X be a random variable. Let $\mathbf{Y}$ be a random vector with elements $Y_0$ through $Y_{m-1}$, and let $\mathbf{Y}'$ be a random vector consisting of any subset (possibly empty) of these elements. Then
$$H(X \mid \mathbf{Y}) \le H(X \mid \mathbf{Y}')$$
with equality if and only if $f_{X|\mathbf{Y}} = f_{X|\mathbf{Y}'}$; i.e., if and only if X is conditionally independent of the elements in $\mathbf{Y}$ which are missing from $\mathbf{Y}'$.

Corollary 2.2 For random variables, X and Y, $H(X \mid Y) \le H(X)$, with equality if and only if X and Y are independent.

Corollary 2.3 From equation (2.4), we also have $H(X, Y) \le H(X) + H(Y)$, with equality if and only if X and Y are independent.
These results have considerable intuitive appeal. If we know the outcomes of some collection of random variables, $Y_0$ through $Y_{m-1}$, which are not statistically independent of X, then this reduces the uncertainty of X and hence the amount of additional information conveyed by the outcome of X. As m increases, the uncertainty in X, and hence the conditional entropy, $H(X \mid \mathbf{Y})$, continues to decrease so long as each new random variable, $Y_m$, provides some new information about X which is not already present in the other random variables, $Y_0$ through $Y_{m-1}$.

Equation (2.4) is easily generalized to expand the entropy of any random vector, $\mathbf{X}$, as
$$\begin{aligned}
H(\mathbf{X}) &= E[-\log_2 f_{\mathbf{X}}(\mathbf{X})] \\
&= E[-\log_2 \left( f_{X_0}(X_0) \cdots f_{X_{m-1}|X_{m-2},\ldots,X_0}(X_{m-1}, \ldots, X_0) \right)] \\
&= H(X_0) + H(X_1 \mid X_0) + \cdots + H(X_{m-1} \mid X_{m-2}, \ldots, X_0)
\end{aligned} \tag{2.5}$$
Together with Theorem 2.1, this implies that
$$H(\mathbf{X}) \le H(X_0) + H(X_1) + \cdots + H(X_{m-1})$$
where equality holds if and only if the random variables, $X_i$, are all independent. As we shall see in Section 2.1.4, this expansion is particularly useful in coding.
ENTROPY RATE

Let $\{X_n\}$ be a discrete stationary random process. Since the random process has infinite extent, the total amount of information conveyed by the outcome of the random process will usually be infinite. In fact, for Markov processes it must be either infinite or zero. Thus, it is more meaningful to introduce the notion of an "information rate" for random processes. A close analogy is the characterization of stationary random processes by their power rather than their energy in the study of linear systems.

We begin by defining the m-th order entropy of the random process by
$$H^{(m)}(\{X_n\}) \triangleq \frac{1}{m} H(X_{0:m}) = \frac{1}{m} H(X_{i:i+m}), \quad \text{for all } i \in \mathbb{Z}$$
Thus, the 1st order entropy is simply the entropy of any given random variable, say $X_0$ (they all have the same distribution), taken individually from the process. The 2nd order entropy is half the joint entropy of any pair of consecutive random variables from the process. From Corollary 2.3,
$$H^{(2)}(\{X_n\}) \le \frac{1}{2}\left(H(X_0) + H(X_1)\right) = H^{(1)}(\{X_n\})$$
In fact, from Theorem 2.1, we see that
$$m H^{(m)}(\{X_n\}) = H(X_0) + H(X_1 \mid X_{0:1}) + \cdots + H(X_{m-1} \mid X_{0:m-1}) \ge m H(X_{m-1} \mid X_{0:m-1}) \tag{2.6}$$
and hence
$$\begin{aligned}
(m + 1) H^{(m+1)}(\{X_n\}) &= m H^{(m)}(\{X_n\}) + H(X_m \mid X_{0:m}) \\
&\le m H^{(m)}(\{X_n\}) + H(X_{m-1} \mid X_{0:m-1}) \\
&\le (m + 1) H^{(m)}(\{X_n\})
\end{aligned}$$
So $H^{(m)}(\{X_n\})$ is a monotonically decreasing function of m. Since it is bounded below by 0, the sequence must converge and we define the entropy rate of the random process to be
$$H(\{X_n\}) \triangleq \lim_{m \to \infty} H^{(m)}(\{X_n\}) \tag{2.7}$$
This quantity is interpreted as the average rate at which the random process conveys information as the outcomes $X_n = x_n$ are discovered one by one.

For a Markov-p source process, equation (2.7) may be recast as
$$H(\{X_n\}) = H(X_p \mid X_{0:p}) \tag{2.8}$$
This very simple form follows from equation (2.6) and the conditional independence property of the random process, according to which
$$\begin{aligned}
H^{(m)}(\{X_n\}) &= \frac{1}{m}\left(H(X_0) + H(X_1 \mid X_{0:1}) + \cdots + H(X_{p-1} \mid X_{0:p-1})\right) + \frac{m - p}{m} H(X_p \mid X_{0:p}) \\
&\to H(X_p \mid X_{0:p}) \quad \text{as } m \to \infty
\end{aligned}$$
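As a concrete illustration of equation (2.8), the following sketch computes the entropy rate of a stationary Markov-1 binary source as the conditional entropy of the next symbol given the previous one, averaged over the stationary marginal distribution. The transition probabilities are illustrative assumptions.

```python
# A minimal sketch: entropy rate of a stationary Markov-1 binary source, using
# equation (2.8): H({X_n}) = H(X_1 | X_0).  The transition probabilities below
# are illustrative assumptions, not values taken from the text.

from math import log2

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

a = 0.1   # P(X_n = 1 | X_{n-1} = 0)
b = 0.3   # P(X_n = 0 | X_{n-1} = 1)

# Stationary marginal distribution of the two-state chain.
pi0, pi1 = b / (a + b), a / (a + b)

entropy_rate = pi0 * binary_entropy(a) + pi1 * binary_entropy(b)
first_order = binary_entropy(pi1)   # H^{(1)} = H(X_0), which upper bounds the rate

print(entropy_rate, first_order)
```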
2.1.3 SHANNON'S NOISELESS SOURCE CODING THEOREM
In Section 2.1.2 we defined a quantity called entropy, having properties which we would expect of a measure of information. The value of entropy, however, as a tool for understanding and developing practical compression algorithms, arises from a rigorous connection between these definitions and fundamental bounds on coding performance. This connection was first established by Shannon's "noiseless source coding theorem" [130]. The essence of this theorem is that the entropy rate of a random process provides a lower bound on the average number of bits which must be spent in coding each of its outcomes, and also that this bound may be approached arbitrarily closely as the complexity of the coding scheme is allowed to grow without bound. Due to the importance of this result, we choose to reproduce Shannon's proof here, for the simple case of a stationary, memoryless random process.

Let $\{X_n\}$ be an IID random process, each element having distribution $f_X$ and entropy H(X). The entropy rate, $H(\{X_n\})$, in this case, is identical to H(X). For ease of expression we shall sometimes refer to the individual source outcomes, $x_0, x_1, \ldots$, as symbols. Then, we assess the information rate of the random process as the average number of bits per symbol required to represent the source output over a period of m consecutive symbols, in the limit as m becomes very large. Specifically, we construct a code which maps outcomes of the random vector, $X_{0:m}$, to L-bit codewords. This is a "fixed length" code, since each block of m symbols is represented by the same number of code-bits, L. Codes of this form are known as (m, L) codes. The ratio, $\frac{L}{m}$, represents the average number of bits spent coding each symbol. The idea is essentially to show that in the limit as m becomes very large, this ratio $\frac{L}{m}$ may be made arbitrarily close to H(X).

There are a total of $\|\mathcal{A}_X\|^m$ possible m-dimensional outcomes, $x_{0:m}$, and so it is clearly impossible to represent all outcomes perfectly unless $2^L \ge \|\mathcal{A}_X\|^m$; i.e., $\frac{L}{m} \ge \log_2 \|\mathcal{A}_X\|$. But we know that $H(X) \le \log_2 \|\mathcal{A}_X\|$, attaining this maximum value only when $f_X$ is the uniform distribution, with all outcomes equally likely. Thus, in order to establish a connection between entropy and the bit-rate of a fixed length code, we will need to admit the possibility that the coded representation might not be exact. Let $P_e(m, L)$ denote the probability that our L-bit code does not represent the random vector, $X_{0:m}$, exactly. The idea behind the noiseless source coding theorem is to show that $P_e(m, L)$ may be made arbitrarily small as m grows, provided the code-rate $\frac{L}{m} > H(X)$.
Theorem 2.4 Let $\{X_n\}$ be a discrete IID random process having finite entropy, H(X), and consider fixed length (m, L) codes, associating m-element outcome vectors, $x_{0:m}$, to L-bit codewords. Only one outcome vector may be associated with each codeword, so let $P_e(m, L)$ denote the probability of the event that $X_{0:m}$ has no associated codeword. Then, by making m sufficiently large, the error probability, $P_e(m, L)$, may be made arbitrarily small, so long as the code-rate satisfies
$$\frac{L}{m} > H(X)$$
Conversely, the error probability, $P_e(m, L)$, tends to 1 as $m \to \infty$ for codes having
$$\frac{L}{m} < H(X)$$
Proof. Consider the random variable, $h_{X_{0:m}}(X_{0:m})$, which we defined to be $-\log_2 f_{X_{0:m}}(X_{0:m})$. Since the elements of the random vector, $X_{0:m}$, are all independent, $f_{X_{0:m}}$ is separable and we obtain
$$h_{X_{0:m}}(X_{0:m}) = -\log_2 \prod_{i=0}^{m-1} f_X(X_i) = \sum_{i=0}^{m-1} h_X(X_i)$$
So $h_{X_{0:m}}(X_{0:m})$ is a sum of the IID random variables, $h_X(X_i)$. According to the weak law of large numbers, $\frac{1}{m}\sum_{i=0}^{m-1} h_X(X_i)$ converges to $E[h_X(X)] = H(X)$, as $m \to \infty$. Specifically, for any $\delta > 0$, let $\varepsilon(m, \delta)$ denote the probability
$$\varepsilon(m, \delta) = P\left(\left|\frac{1}{m}\sum_{i=0}^{m-1} h_X(X_i) - E[h_X(X)]\right| > \delta\right) = P\left(\left|\frac{1}{m} h_{X_{0:m}}(X_{0:m}) - H(X)\right| > \delta\right) \tag{2.9}$$
Then the weak law of large numbers states that
$$\lim_{m \to \infty} \varepsilon(m, \delta) = 0$$
Equivalently, let $\mathcal{T}(m, \delta)$ be the set of outcomes, $x_{0:m}$, for which
$$\left|\frac{1}{m} h_{X_{0:m}}(x_{0:m}) - H(X)\right| \le \delta$$
Then $\varepsilon(m, \delta) = P(X_{0:m} \notin \mathcal{T}(m, \delta)) \to 0$ as $m \to \infty$. For small $\delta$ and large m, so that $\varepsilon(m, \delta)$ is very small, we may think of $\mathcal{T}(m, \delta)$ as the set of "typical" outcomes. The idea is to assign codewords only to these typical outcomes, since the probability of anything else becomes vanishingly small as m grows. For each $x_{0:m} \in \mathcal{T}(m, \delta)$, we have
$$H(X) - \delta \le \frac{1}{m} h_{X_{0:m}}(x_{0:m}) \le H(X) + \delta$$
and hence the probability of each typical outcome is bounded by
$$2^{-m(H(X)+\delta)} \le f_{X_{0:m}}(x_{0:m}) \le 2^{-m(H(X)-\delta)} \tag{2.10}$$
Letting $\delta$ become very small, the typical outcomes all have essentially the same likelihood, so that if we assign codewords only to the typical outcomes, the resulting L-bit codewords will be uniformly distributed, or "entirely random."

Using equation (2.10), we see that
$$P(X_{0:m} \in \mathcal{T}(m, \delta)) = \sum_{x_{0:m} \in \mathcal{T}(m, \delta)} f_{X_{0:m}}(x_{0:m}) \ge 2^{-m(H(X)+\delta)} \|\mathcal{T}(m, \delta)\|$$
So the number of typical outcomes is bounded above by
$$\|\mathcal{T}(m, \delta)\| \le \frac{1 - \varepsilon(m, \delta)}{2^{-m(H(X)+\delta)}} \le 2^{m(H(X)+\delta)}$$
It follows that so long as we select $L \ge m(H(X) + \delta)$, we can represent all of the typical outcomes with a distinct codeword, and then the probability of error, $P_e(m, L)$, must be at most $\varepsilon(m, \delta)$, which tends to 0 as $m \to \infty$. This proves the first statement of the theorem, since $\delta > 0$ is arbitrary.
To prove the converse statement, we use equation (2.10) again to obtain a lower bound for the number of typical outcomes; i.e.,
$$\|\mathcal{T}(m, \delta)\| \ge \frac{1 - \varepsilon(m, \delta)}{2^{-m(H(X)-\delta)}}$$
Suppose that $L \le m(H(X) - 2\delta)$. Let T denote the number of elements from $\mathcal{T}(m, \delta)$ which are represented in this code. Then T satisfies
$$\frac{T}{\|\mathcal{T}(m, \delta)\|} \le \frac{2^L}{\|\mathcal{T}(m, \delta)\|} \le \frac{2^{-m\delta}}{1 - \varepsilon(m, \delta)}$$
So the fraction of typical outcomes which can be represented tends to 0 as $m \to \infty$, whenever the code rate is less than H(X). This suggests the validity of the second statement of the theorem. To make the proof rigorous, observe that the total probability associated with the T elements of $\mathcal{T}(m, \delta)$ which are represented by the code is at most
$$2^L \cdot 2^{-m(H(X)-\delta)} \le 2^{-m\delta}$$
and the total probability of all other outcomes is $\varepsilon(m, \delta)$, so
$$P_e(m, L) \ge 1 - \varepsilon(m, \delta) - 2^{-m\delta} \to 1 \quad \text{as } m \to \infty \qquad \blacksquare$$
Several points are worth noting concerning the noiseless source coding theorem. Firstly, for finite length codes, fixed length coding is incapable of guaranteeing that all source outcomes will be represented exactly. The solution to this dilemma is variable length codes, which are examined next. Despite this obstacle, the noiseless source coding theorem does indeed establish a strong connection between entropy and coding. The entropy clearly partitions the set of code rates into two classes. So long as the code-rate exceeds the entropy, we can make sure that the entire message is coded without error with arbitrarily high confidence; if the code-rate is less than the entropy, long messages will contain errors with probability approaching 1.

As noted in the proof of the theorem, reliable codes whose rate approaches the entropy have the property that their codewords all occur with equal likelihood. That is, the L-bit sequences are "entirely random." Recall that we began Section 2.1 with the claim that the representation of source outcomes with entirely random sequences of bits is the goal of source coding. This is perhaps the most important observation arising from the noiseless source coding theorem.

Shannon's original result has been extended over the years to random processes satisfying a variety of technical conditions. For more general random processes than the simple memoryless processes considered above, the key difficulty is to demonstrate convergence of $\varepsilon(m, \delta)$, as defined by equation (2.9). This is known as the entropy-ergodic property. Shannon himself extended the result to Markov processes, while extensions to more general ergodic random processes were developed by McMillan [107] and extended by Breiman [29, 30] and others. The more general result is often known as the Shannon-McMillan-Breiman theorem, or the asymptotic equipartition (AEP) theorem.
2.1.4 ELIAS CODING

As mentioned above, fixed length codes cannot generally guarantee lossless coding. In this section, we consider variable length codes. It is most instructive to describe a particular coding algorithm, whose ability to approach the entropy rate of a stationary Markov random process can be demonstrated rather easily. The algorithm is not practical as it stands, since its implementation requires infinite precision arithmetic. Nevertheless, it is the basis for a family of highly efficient practical coding techniques, known collectively as arithmetic coding. Indeed, one member of this family is at the heart of the JPEG2000 image compression standard (see Section 12.1). Practical arithmetic coding is the subject of Section 2.3. P. Elias is usually credited with conceiving the algorithm shortly after Shannon's original publication on information theory.
MAPPING OUTCOMES TO INTERVALS

Let $\{X_n\}$ be a stationary random process. To begin, we will restrict ourselves to memoryless processes, as in Section 2.1.3. In this case, we hope to be able to code the outcomes of the random process at an average rate of H(X) bits per symbol.

Following the notation developed above, we denote the first n outcomes of the random process by the vector, $x_{0:n}$. The algorithm is best understood as associating each such length n prefix of the source sequence with a unique interval on the real line,
$$[c_n, c_n + a_n) \subseteq [0, 1)$$
such that the length of this interval is equal to $f_{X_{0:n}}(x_{0:n})$. The algorithm is implemented recursively as follows:

Elias Coding Algorithm
    Initialize $c_0 = 0$ and $a_0 = 1$.
    For each $n = 0, 1, \ldots$
        Update $a_{n+1} \leftarrow a_n f_X(x_n)$
        Update $c_{n+1} \leftarrow c_n + a_n F_X(x_n)$

Here, $F_X$ denotes the cumulative distribution²,
$$F_X(a_i) \triangleq \sum_{j=0}^{i-1} f_X(a_j) \quad \text{where } \mathcal{A}_X = \{a_0, a_1, \ldots\}$$
We assume that the encoder and decoder both have access to the underlying distribution function, $f_X$, and hence $F_X$, or else they both use identical estimates for this function.

² Note the non-standard definition here, in which the probability of $a_i$ itself is not included in the summation.
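The recursion translates almost directly into code. The sketch below uses exact rational arithmetic to stand in for the infinite-precision real arithmetic assumed above; the binary source probabilities are an illustrative assumption. Note that, in code, $c_{n+1}$ must be computed from the old value $a_n$ before $a_n$ is overwritten.

```python
# A minimal sketch of the Elias recursion, using exact rational arithmetic
# (Fraction) in place of the infinite-precision real arithmetic assumed above.
# The binary source model below is an illustrative assumption.

from fractions import Fraction

f = {0: Fraction(1, 4), 1: Fraction(3, 4)}   # f_X (illustrative)
F = {0: Fraction(0), 1: Fraction(1, 4)}      # F_X, excluding the symbol's own probability

def elias_encode_interval(symbols):
    """Return (c_m, a_m): the interval [c_m, c_m + a_m) identifying the sequence."""
    c, a = Fraction(0), Fraction(1)
    for x in symbols:
        c = c + a * F[x]    # c_{n+1} <- c_n + a_n F_X(x_n), using the old a_n
        a = a * f[x]        # a_{n+1} <- a_n f_X(x_n)
    return c, a

c, a = elias_encode_interval([0, 1, 1, 0, 1])
print(c, a)                 # interval lower end and length
```

Because the intervals for distinct prefixes are disjoint, a decoder holding the same model can recover the symbols one by one by testing which sub-interval contains the encoded number.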
[Figure 2.2. Elias coding for a memoryless binary source.]
Example 2.2 Consider a binary memoryless source with $f_X(0) = \frac{1}{4}$ and $f_X(1) = \frac{3}{4}$, and suppose the source outputs the sequence "01101...". Figure 2.2 indicates the evolution of the intervals $[c_n, c_n + a_n)$.

The intervals, $[c_n, c_n + a_n)$, have the following easily verified properties:

1. The set of intervals, $[c_n, c_n + a_n)$, corresponding to each distinct vector, $x_{0:n} \in \mathcal{A}_X^n$, are disjoint and their union is [0, 1). That is, the set of all possible length n prefixes of the source output induces a partition of the unit interval, [0, 1).

2. The intervals corresponding to successively longer prefixes of the source output sequence are nested; i.e.,
$$[c_{n+1}, c_{n+1} + a_{n+1}) \subseteq [c_n, c_n + a_n)$$

3. The length of the interval associated with $x_{0:n}$ satisfies
$$a_n = \prod_{i=0}^{n-1} f_X(x_i) = f_{X_{0:n}}(x_{0:n})$$
MAPPING INTERVALS TO CODEWORDS

Suppose we apply the recursive algorithm described above for a total of m source output symbols. The key observation behind Elias coding is that the particular outcome, $x_{0:m}$, may be uniquely identified by any number in the interval $[c_m, c_m + a_m)$, as a consequence of property 1 above. Since the interval has length $a_m$, it must contain at least one $L_m$-bit binary fraction of the form
$$0.\underbrace{b_1 b_2 \cdots b_{L_m}}_{L_m \text{ bits}}$$
where $L_m$ is any integer satisfying $2^{-L_m} < a_m$. Thus, we conclude that the number of bits in the representation is
$$L_m \approx -\log_2 a_m = h_{X_{0:m}}(x_{0:m}) = \sum_{n=0}^{m-1} h_X(x_n)$$
In this way, the Elias coding algorithm firmly establishes the connection between $h_X(x)$ and the amount of information associated with the outcome X = x. Each individual outcome, $X_n = x_n$, reduces the interval length by the factor $f_X(x_n)$, adding exactly $h_X(x_n) = -\log_2 f_X(x_n)$ bits to the code length. This in turn means that the average number of bits required to code m symbols from the source output sequence is $E[h_{X_{0:m}}(X_{0:m})] = m H(X)$.
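Continuing in the same spirit, the short sketch below (with an illustrative source model and sequence) computes the interval length $a_m$, the ideal code length $-\log_2 a_m = \sum_n h_X(x_n)$, and the smallest integer $L_m$ satisfying $2^{-L_m} < a_m$.

```python
# A minimal sketch relating the interval length a_m to the codeword length L_m:
# choose the smallest integer L_m with 2^{-L_m} < a_m and compare it against the
# ideal code length sum_n h_X(x_n) = -log2 a_m.  The source model is illustrative.

from math import log2, floor

f = {0: 0.25, 1: 0.75}
sequence = [0, 1, 1, 0, 1]

a_m = 1.0
for x in sequence:
    a_m *= f[x]

ideal_bits = -log2(a_m)          # sum of h_X(x_n) over the sequence
L_m = floor(ideal_bits) + 1      # smallest integer with 2^{-L_m} < a_m

print(a_m, ideal_bits, L_m)
```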
ELIAS TERMINATION

There is a subtle weakness in the above argument, in that the decoder does not know a priori the number of bits, $L_m$, which are being used to represent the source output, $x_{0:m}$. Therefore, we ought to provide a mechanism for signalling this length and include the number of bits consumed by this mechanism in the overall bit count. In practical arithmetic coding algorithms, we will usually code a very large number of source outcomes, m, together, so that this cost may often be neglected. Nevertheless, it is worthwhile presenting a particular codeword termination policy suggested by Elias, for which there is no need to explicitly si
