To appear in IEEE J. Selected Areas in Communications.
Draft of August 13, 1997, 11:06 a.m.

Turbo Decoding as an Instance of
Pearl's "Belief Propagation" Algorithm

Robert J. McEliece
California Institute of Technology

David J. C. MacKay
Cambridge University

Jung-Fu Cheng
California Institute of Technology

Abstract.

In this paper we will describe the close connection between the now celebrated iterative turbo decoding algorithm of Berrou, Glavieux, and Thitimajshima, and an algorithm that has been well known in the artificial intelligence community for a decade, but which is relatively unknown to information theorists: Pearl's belief propagation algorithm. We shall see that if Pearl's algorithm is applied to the "belief network" of a parallel concatenation of two or more codes, the turbo decoding algorithm immediately results. Unfortunately, however, this belief network has loops, and Pearl only proved that his algorithm works when there are no loops, so an explanation of the excellent experimental performance of turbo decoding is still lacking.

However, we shall also show that Pearl's algorithm can be used to routinely derive previously known iterative, but suboptimal, decoding algorithms for a number of other error-control systems, including Gallager's low-density parity-check codes, serially concatenated codes, and product codes. Thus belief propagation provides a very attractive general methodology for devising low-complexity iterative decoding algorithms for hybrid coded systems.

* This work was supported by NSF grant no. NCR-9505975, AFOSR grant no. 5F49620-97-1-0313, and by a grant from Qualcomm, Inc. A portion of McEliece's contribution was done while he was visiting the Sony Corporation in Tokyo. The collaboration between MacKay and McEliece was begun at, and partially supported by, the Newton Institute for Mathematical Sciences, Cambridge, England.

1. Introduction and Summary.

"Turbo codes," which were introduced in 1993 by Berrou, Glavieux, and Thitimajshima [10], are the most exciting and potentially important development in coding theory in many years. Many of the structural properties of turbo codes have now been put on a firm theoretical footing ([7][18][20][21][27][45]), and several innovative variations on the turbo theme have appeared ([5][8][9][12][27][48]).

What is still lacking, however, is a satisfactory theoretical explanation of why the turbo decoding algorithm performs as well as it does. While we cannot yet announce a solution to this problem, we believe the answer may come from a close study of Pearl's belief propagation algorithm, which is largely unknown to information theorists, but well known in the artificial intelligence community. (The first mention of belief propagation in a communications paper, and indeed the paper that motivated this one, is that of MacKay and Neal [37]. See also [38] and [39].)

In this paper, we will review the turbo decoding algorithm as originally expounded by Berrou et al. [10], but perhaps explained more lucidly in [3], [18], or [50]. We will then describe Pearl's algorithm, first in its natural "AI" setting, and then show that if it is applied to the "belief network" of a turbo code, the turbo decoding algorithm immediately results. Unfortunately, however, this belief network has loops, and Pearl's algorithm only gives exact answers when there are no loops, so the existing body of knowledge about Pearl's algorithm does not solve the central problem of turbo decoding. Still, it is interesting and suggestive that Pearl's algorithm yields the turbo decoding algorithm so easily. Furthermore, we shall show that Pearl's algorithm can also be used to derive effective iterative decoding algorithms for a number of other error-control systems, including Gallager's low-density parity-check codes, the recently introduced low-density generator matrix codes, serially concatenated codes, and product codes. Some of these "BP" decoding algorithms agree with the ones previously derived by ad hoc methods, and some are new, but all prove to be remarkably effective. In short, belief propagation provides an attractive general method for devising low-complexity iterative decoding algorithms for hybrid coded systems. This is the message of the paper. (A similar message is given in the paper by Kschischang and Frey [33] in this issue.)

Here is an outline of the paper. In Section 2 we derive some simple but important results about, and introduce some compact notation for, "optimal symbol decision" decoding algorithms. In Section 3 we define what we mean by a turbo code, and review the turbo decoding algorithm. Our definitions are deliberately more general than what has previously appeared in the literature. In particular, our transmitted information is not binary, but rather comes from a q-ary alphabet, which means that we must deal with q-ary probability distributions instead of the traditional "log-likelihood ratios." Furthermore, the reader may be surprised to find no discussion of "interleavers," which are an essential component of all turbo-coding systems. This is because, as we will articulate fully in our concluding remarks, we believe the interleaver's contribution is to make the turbo code a "good" code, but it has nothing directly to do with the fact that the turbo decoding algorithm is a good approximation to an optimal decoder. In Section 4, we change gears and give a tutorial overview of the general probabilistic inference problem, with special
reference to Bayesian belief networks. In Section 5 we describe Pearl's BP algorithm, which can be defined on any belief network, and which gives an exact solution to the probabilistic inference problem when the belief network has no loops. In Section 6, we show that the turbo decoding algorithm follows from a routine application of Pearl's algorithm to the appropriate (loopy) belief network. In Section 7 we briefly sketch some other decoding algorithms that can be derived from BP considerations. Finally, in Section 8 we summarize our findings and venture some conclusions.

2. Preliminaries.

In this section, we will describe a general class of q-ary systematic encoders, and derive the optimal symbol-by-symbol decoding rule for a memoryless channel.

Let U = (U_1, ..., U_k) be a k-dimensional random vector of independent, but not necessarily equiprobable, symbols from a q-letter alphabet A, with Pr{U_i = a} = π_i(a), for a ∈ A. The vector U represents information to be transmitted reliably over an unreliable channel. We suppose that U is encoded systematically, i.e., mapped into a codeword X of the form

(2.1)    X = (U, X_1),

where U is the "systematic" part, and X_1 is the "nonsystematic" part, of the codeword X. In the rest of the paper, we will sometimes call X_1 a codeword fragment.

We assume that the codeword X is transmitted over a noisy channel with transition probabilities p(y | x) ≝ Pr{Y = y | X = x}, and received as Y = (Y_s, Y_1), where Y_s is the portion of Y corresponding to the systematic part U of the codeword, and Y_1 is the portion corresponding to the codeword fragment X_1. We assume further that the channel is memoryless, which implies that the conditional density factors according to the rule

(2.2)    p(y | x) = p(y_s, y_1 | u, x_1) = p(y_s | u) p(y_1 | x_1)

(2.3)             = ( ∏_{i=1}^{k} p(y_si | u_i) ) · p(y_1 | x_1),

where y_si denotes the ith component of y_s. The situation is depicted in Figure 1.

[Figure 1 (block diagram): the encoder E maps U = (U_1, ..., U_k) to the codeword fragment X_1; the memoryless channel maps U to Y_s = (Y_s1, ..., Y_sk) and X_1 to Y_1.]

Figure 1. The codeword X = (U, X_1) is transmitted over a memoryless channel and received as Y = (Y_s, Y_1).

The decoding problem is to "infer" the values of the hidden variables U_i based on the "evidence," viz., the observed values y_s and y_1 of the variables Y_s and Y_1. The optimal decision, i.e., the one that minimizes the probability of inferring an incorrect value for U_i, is the one based on the conditional probability, or "belief," that the information symbol in question has a given value:

(2.4)    BEL_i(a) ≝ Pr{U_i = a | Y_s = y_s, Y_1 = y_1}.

(A communication theorist would use the term "a posteriori probability," rather than "belief.") If a_0 is such that BEL_i(a_0) > BEL_i(a) for all a ≠ a_0, the decoder infers that U_i = a_0. The following straightforward computation is central to our results. In this computation, and for the rest of the paper, we will use Pearl's α notation [44].

2.1 Definition. If x = (x_1, ..., x_m) and y = (y_1, ..., y_m) are vectors of nonnegative real numbers, the notation

    x = α y

means that x_i = y_i / (∑_{k=1}^{m} y_k), for i = 1, ..., m. In other words, x is a probability vector whose components are proportional to those of y. (If f(x) and g(x) are nonnegative real-valued functions defined on a finite set, the notation f(x) = α g(x) is defined similarly.)

2.2 Lemma. If the likelihood p(y_si | u_i) is denoted by λ_i(u_i),¹ then the belief BEL_i(a) defined in (2.4) is given by

(2.5)    BEL_i(a) = α ∑_{u: u_i = a} p(y_1 | x_1) ∏_{j=1}^{k} λ_j(u_j) π_j(u_j)

                  = α λ_i(a) π_i(a) ∑_{u: u_i = a} p(y_1 | x_1) ∏_{j≠i} λ_j(u_j) π_j(u_j).

Proof: We have, by the definition (2.4), BEL_i(a) = Pr{U_i = a | Y = y}. Then

    Pr{U_i = a | Y = y} = Pr{Y = y, U_i = a} / Pr{Y = y}

                        = α Pr{Y = y, U_i = a}                        (using the α-notation)

                        = α ∑_{u: u_i = a} p(y, u)

                        = α ∑_{u: u_i = a} p(y | u) · p(u)
                        = α ∑_{u: u_i = a} p(y_1 | x_1) p(y_s | u) · ∏_{j=1}^{k} π_j(u_j)          (by (2.2))

                        = α ∑_{u: u_i = a} p(y_1 | x_1) · ∏_{j=1}^{k} λ_j(u_j) π_j(u_j)            (by (2.3))

                        = α λ_i(a) π_i(a) ∑_{u: u_i = a} p(y_1 | x_1) ∏_{j≠i} λ_j(u_j) π_j(u_j).

The last two lines of the above calculation are the assertions of the Lemma.

1 If the encoder is not systematic, i.e., if the uncoded information symbols U_i are not transmitted, these likelihoods should all be set equal to 1.

We see from (2.5) that BEL_i(a) is the product of three terms. The first term, λ_i(a), might be called the systematic evidence term. The second term, π_i(a), takes into account the a priori distribution of U_i. Note that the effect of the systematic evidence is to change the prior distribution of U_i from π_i(a) to α π_i(a)λ_i(a). The third term, which is more complicated, takes into account the geometry of the code. Following [10], we will call this term the extrinsic term, and denote it by E_i(a). The extrinsic term is so important to what follows that we shall introduce a special notation for it. (This notation will also prove useful in Section 5, where we shall use it to describe Pearl's algorithm; see Table 5.1, line 6.)

Thus let A_1, ..., A_k be finite alphabets, let U ⊆ A_1 × ··· × A_k, and let R denote the set of real numbers. Let g = (g_1, ..., g_k) be a function mapping U into R^k. In other words, g is a vector of k real-valued functions, and if u = (u_1, ..., u_k) ∈ U, then

    g(u) = (g_1(u_1), ..., g_k(u_k)).

Now suppose that K(u) is a real-valued function defined on the set U, which we call a kernel. The K-transform of g is the vector g′ = (g′_1, ..., g′_k), where g′_i is defined by

(2.6)    g′_i(a) = ∑_{u: u_i = a} K(u) ∏_{j≠i} g_j(u_j).

We summarize (2.6) by writing

(2.7)    g′ = g ◦ K.

Next, if f and g are vector-valued functions as above, we define their adjacent product h = fg as a simple componentwise product, i.e., h = (h_1, ..., h_k), where

(2.8)    h_i(a) = f_i(a) g_i(a).

Using the circle and adjacent notation,² we can express the result of Lemma 2.2 compactly. To do so, we take U = A^k, and define a kernel p(u) as

    p(u) ≝ p(y_1 | x_1),
where the codeword fragment x_1 = x_1(u) is a deterministic function of u. Then Lemma 2.2 can be summarized as follows:

(2.9)    BEL = α λπ (λπ ◦ p),

where λ(u) = (λ_1(u_1), ..., λ_k(u_k)) and π(u) = (π_1(u_1), ..., π_k(u_k)).

2 We assume that "adjacent" takes precedence over "circle," in order to minimize the use of parentheses.
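
The circle and adjacent notation translates directly into code. The following Python sketch is our own illustration of (2.6), (2.8), and (2.9), under the assumption that the alphabet is A = {0, 1, ..., q-1} and that the kernel is supplied as a function of the full information vector u; it reuses the alpha helper sketched after Definition 2.1, and it enumerates all of A^k, so it is only practical for very small k.

    from itertools import product

    def adjacent(f, g):
        """Componentwise product (2.8): h_i(a) = f_i(a) * g_i(a)."""
        return [[fi[a] * gi[a] for a in range(len(fi))] for fi, gi in zip(f, g)]

    def circle(g, kernel, q):
        """K-transform (2.6): g'_i(a) = sum_{u: u_i=a} K(u) * prod_{j != i} g_j(u_j)."""
        k = len(g)
        gp = [[0.0] * q for _ in range(k)]
        for u in product(range(q), repeat=k):          # brute force over A^k
            K = kernel(u)
            for i in range(k):
                prod_others = 1.0
                for j in range(k):
                    if j != i:
                        prod_others *= g[j][u[j]]
                gp[i][u[i]] += K * prod_others
        return gp

    def optimal_beliefs(lam, pi, kernel, q):
        """Optimal symbol-by-symbol beliefs, BEL = alpha lam*pi (lam*pi o p), cf. (2.9)."""
        lampi = adjacent(lam, pi)
        extrinsic = circle(lampi, kernel, q)           # the extrinsic term E_i(a)
        return [alpha(b) for b in adjacent(lampi, extrinsic)]

Here lam[i][a] and pi[i][a] stand for λ_i(a) and π_i(a), and kernel(u) should return p(y_1 | x_1(u)) for the received y_1.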

3. Systematic Parallel Concatenated (Turbo) Codes.

In this section, we will define what we mean by a turbo code, and present a general version of the turbo decoding algorithm.

With the same setup as in Section 2, suppose we have two systematic encodings of U:

    C_1 : U → (U, X_1)
    C_2 : U → (U, X_2).

One way to combine C_1 and C_2 into a single code is via the mapping

    C : U → X = (U, X_1, X_2),

which is called the parallel concatenation of C_1 and C_2, or the turbo code formed by combining C_1 and C_2.

Once again, we assume that the codeword X is transmitted through a noisy channel with transition probabilities p(y | x). It is received as Y = (Y_s, Y_1, Y_2), where Y_s is the component of Y corresponding to U, Y_1 is the component of Y corresponding to X_1, and Y_2 is the component of Y corresponding to X_2. We assume again that the channel is memoryless, which implies that the conditional density factors according to the rule

(3.1)    p(y | x) = p(y_s, y_1, y_2 | u, x_1, x_2) = p(y_s | u) p(y_1 | x_1) p(y_2 | x_2)

(3.2)             = ( ∏_{i=1}^{k} p(y_si | u_i) ) · p(y_1 | x_1) p(y_2 | x_2).

The situation is as depicted in Figure 2.

By Lemma 2.2, the optimal decisions for the turbo code are based on the beliefs

(3.3)    BEL_i(a) = α ∑_{u: u_i = a} p(y_1 | x_1) p(y_2 | x_2) ∏_{j=1}^{k} λ_j(u_j) π_j(u_j)

                  = α λ_i(a) π_i(a) ∑_{u: u_i = a} p(y_1 | x_1) p(y_2 | x_2) ∏_{j≠i} λ_j(u_j) π_j(u_j).

[Figure 2 (block diagram): U = (U_1, ..., U_k) is encoded by E_1 into X_1 and by E_2 into X_2; the memoryless channel maps U to Y_s = (Y_s1, ..., Y_sk), X_1 to Y_1, and X_2 to Y_2.]

Figure 2. A generic "turbo code": the codeword X = (U, X_1, X_2) is transmitted over a memoryless channel and received as Y = (Y_s, Y_1, Y_2).

For simplicity, and in accordance with engineering practice, from now on we will assume that the a priori probability density of the U_i's is uniform, i.e., π = (α1, ..., α1). With this assumption, using the notation introduced in Section 2, (3.3) becomes³

(3.4)    BEL = α λ (λ ◦ p_1 p_2),

where the kernels p_1 and p_2 are defined by

(3.5)    p_1(u) = p(y_1 | x_1)
         p_2(u) = p(y_2 | x_2).

The celebrated "turbo decoding algorithm" [10][50][3] is an iterative approximation to the optimal beliefs in (3.3) or (3.4), whose performance, while demonstrably suboptimal [41], has nevertheless proved to be "nearly optimal" in an impressive array of experiments. The heart of the turbo algorithm is an iteratively defined sequence π^(m) of product probability densities on A^k, defined by

(3.6)    π^(0) = (α1, ..., α1),

i.e., π^(0) is a list of k uniform densities on A, and for m ≥ 1,

(3.7)    π^(m) = { α λπ^(m−1) ◦ p_1   if m is odd,
                   α λπ^(m−1) ◦ p_2   if m is even.

Then the mth turbo belief vector is defined by

(3.8)    BEL^(m) = α λ π^(m) π^(m−1).

3 As we observed earlier, the effect of λ is to change the prior distribution from π to λπ. It follows that if there is a nonuniform prior π, it can be accounted for by replacing every occurrence of "λ" in our formulas with λπ.

The general form of (3.7) is shown in Figure 3.

In a "practical" decoder, the decision about the information bits is usually made after a fixed number of iterations. (The hope that the limit of (3.8) will exist is in general a vain one, since in [41] several examples of non-convergence are given.) If the decision is made after m iterations, the mth turbo decision is defined as

(3.9)    Û_i^(m) = arg max_{a∈A} BEL_i^(m)(a).

[Figure 3 (block diagram): decoder module D1 receives (Y_s, Y_1) and decoder module D2 receives (Y_s, Y_2); D1 passes π^(1), π^(3), π^(5), ... to D2, and D2 passes π^(2), π^(4), π^(6), ... back to D1.]

Figure 3. Block diagram of the turbo decoding procedure.
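
A brute-force rendering of the iteration (3.6)-(3.9) in code may help fix ideas. The Python sketch below is our own illustration, reusing alpha, adjacent, and circle from the sketches in Section 2; it enumerates A^k at every step, whereas a practical decoder would realize each "◦" operation with a low-complexity soft decoder such as BCJR, as discussed below.

    def turbo_decode(lam, kernel1, kernel2, q, iterations):
        """Iterative turbo decoding, (3.6)-(3.9), with uniform priors.

        lam[i][a] is the systematic likelihood lambda_i(a); kernel1(u) and kernel2(u)
        return p(y1 | x1(u)) and p(y2 | x2(u)).  Returns the hard decisions (3.9).
        """
        k = len(lam)
        pi_prev = [[1.0 / q] * q for _ in range(k)]      # (3.6): k uniform densities
        pi_cur = pi_prev
        for m in range(1, iterations + 1):
            kernel = kernel1 if m % 2 == 1 else kernel2  # (3.7): alternate the two kernels
            pi_next = [alpha(c) for c in circle(adjacent(lam, pi_cur), kernel, q)]
            pi_prev, pi_cur = pi_cur, pi_next
        # (3.8): BEL^(m) = alpha lam * pi^(m) * pi^(m-1)
        bel = [alpha(b) for b in adjacent(lam, adjacent(pi_cur, pi_prev))]
        # (3.9): decide each symbol by maximizing its belief
        return [max(range(q), key=lambda a: bel_i[a]) for bel_i in bel]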

We conclude this section by observing that, as we have stated it, the turbo algorithm ((3.7) and (3.9)) does not appear to be significantly simpler than the optimal algorithm (3.4), since (for example) λ ◦ p_1 is not, in general, much easier to compute than λ ◦ p_1 p_2. The following theorem, and the discussion that follows, sheds light on this problem.

3.1 Theorem. If the components of U are assumed to be independent, with Pr{U_i = u_i} = π_i^(m−1)(u_i), for i = 1, ..., k, then

(3.10)    π_i^(m)(a) = { α Pr{U_i = a | Y_s, Y_1} / (λ_i(a) π_i^(m−1)(a))   if m is odd,
                         α Pr{U_i = a | Y_s, Y_2} / (λ_i(a) π_i^(m−1)(a))   if m is even.

Proof: We consider the case m odd, the proof for even m being essentially the same. By reasoning similar to that in Lemma 2.2, we find that

(3.11)    Pr{U_i = a | Y_s, Y_1} = α ∑_{u: u_i = a} p(y_1 | u) ∏_{j=1}^{k} λ_j(u_j) π_j^(m−1)(u_j).

If we divide both sides of (3.11) by λ_i(a) π_i^(m−1)(a), we obtain

(3.12)    Pr{U_i = a | Y_s, Y_1} / (λ_i(a) π_i^(m−1)(a)) = ∑_{u: u_i = a} p(y_1 | u) ∏_{j≠i} λ_j(u_j) π_j^(m−1)(u_j)

                                                         = λπ^(m−1) ◦ p_1.

Since by (3.7), π^(m) = α λπ^(m−1) ◦ p_1, the Theorem follows.

The significance of Theorem 3.1 is that it tells us that the appropriate components of the vectors π^(m) can be computed by a decoder for C_1 (or C_2) which is capable of computing the probabilities Pr{U_i = a | Y_s, Y_1}, based on an observation of the noisy codeword Y = (Y_s, Y_1), i.e., an optimal "soft" symbol decision decoder. The ith component of the message passed to the second decoder module is then

(3.13)    π_i^(m)(a) = Pr{U_i = a | Y_s, Y_1} / (λ_i(a) π_i^(m−1)(a)),

which is the "extrinsic information" referred to earlier.
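
In code, the message (3.13) is just the soft decoder's posterior divided, symbol value by symbol value, by the systematic likelihood and the prior that were fed to that decoder. The sketch below is our own illustration, reusing the alpha helper from Section 2; normalizing the ratio, as in (3.10), is a choice we make here since only the relative values matter, and we assume the likelihoods and priors are strictly positive.

    def extrinsic_message(posterior, lam_i, prior_i):
        """Component i of the message passed to the other decoder, cf. (3.13).

        posterior[a] = Pr{U_i = a | Y_s, Y_1}, as produced by an optimal soft
        symbol-decision decoder for C_1; lam_i[a] = lambda_i(a); prior_i[a] is the
        prior pi_i^(m-1)(a) that was fed to that decoder.
        """
        ratio = [posterior[a] / (lam_i[a] * prior_i[a]) for a in range(len(posterior))]
        return alpha(ratio)   # normalize, as in (3.10)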

One of the keys to the success of turbo codes is to use component codes C_1 and C_2 for which a low-complexity soft bit decision algorithm exists. For example, the BCJR, or "APP," decoding algorithm [4] provides such an algorithm for any code, block or convolutional, that can be represented by a trellis.⁴

As far as is known, a code with a low-complexity optimal decoding algorithm cannot achieve high performance, which means that individually, the codes C_1 and C_2 must be relatively weak. The brilliant innovation of Berrou et al. [10] was to devise a code of the type shown in Figure 2, in which the individual codes C_1 and C_2 are indeed relatively weak (but have a low-complexity decoding algorithm), in such a way that the overall code is very powerful. Roughly speaking, they accomplished this by making the encoder E_2 identical to E_1, except for a random permutation (accomplished by the "interleaver") of the inputs. (The encoders were short-constraint-length systematic convolutional encoders with feedback.) However, since it is the object of this paper to study the decoding algorithm without regard to the resulting performance, we shall not discuss the constructive aspect of turbo codes further.

4 As we shall see in Section 4, the BCJR algorithm itself, and the many variations of it, are themselves special cases of Pearl's algorithm. In this application, the algorithm is provably exact, since the corresponding belief network has no loops.

4. Background on Probabilistic Inference, Bayesian Belief Networks, and Pearl's Algorithm.

In this section, we will give a tutorial overview of the so-called probabilistic inference problem of the artificial intelligence community, as well as a brief discussion of Pearl's algorithm, which solves the probabilistic inference problem in many important special cases.

Thus let X = {X_1, X_2, ..., X_N}⁵ be a set of N discrete random variables, where X_i assumes values in the finite alphabet A_i. The joint density function

    p(x) = p(x_1, x_2, ..., x_N) ≝ Pr{X_1 = x_1, ..., X_N = x_N}

is then a mapping from A_1 × ··· × A_N into the set of real numbers R. We assume that the marginal densities p(x_i) ≝ Pr{X_i = x_i} are also known. The marginal density function p(x_i) represents our a priori "belief" about the random variable X_i. Now suppose one or more of these random variables is measured, or "observed." This means that there is a subset J ⊆ {1, 2, ..., N} (the evidence set) such that for all j ∈ J, the random variable X_j is known to have a particular value, say a_j. The evidence is then defined to be the event

    E = {X_j = a_j : j ∈ J}.

The fundamental probabilistic inference problem is to compute the updated beliefs, i.e., the a posteriori or conditional probabilities p(X_i | E), for all i ∉ J.

The brute-force approach to computing p(X_i | E) is to sum the joint density p(x) over all the coordinates that correspond neither to i nor to J. To simplify notation, we assume i = 1 and J = {m + 1, ..., N}. Then we have

(4.1)    p(X_1 = a | E) = α ∑_{x_2, ..., x_m} p(a, x_2, ..., x_m, a_{m+1}, ..., a_N).

If X_i can assume q_i different values, then computing the sum in (4.1) for each possible value of a requires roughly q_1 q_2 ··· q_m additions in all, which is impractical unless m and the q_i's are very small numbers.
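
For readers who prefer code, the brute-force computation (4.1) can be written as a direct sum over the unobserved variables. The sketch below is our own illustration; the argument names p, q, i, and evidence are our own, and the joint density is assumed to be available as a function of a complete assignment.

    from itertools import product

    def brute_force_belief(p, q, i, evidence):
        """Compute p(X_i = a | E) for all a by direct summation, cf. (4.1).

        p        : function mapping a full assignment tuple (x_1, ..., x_N) to its
                   joint probability
        q        : list of alphabet sizes, variable j ranging over range(q[j])
        i        : index (0-based) of the variable whose belief is wanted
        evidence : dict {j: a_j} of observed variables (the evidence set J)
        """
        N = len(q)
        free = [j for j in range(N) if j != i and j not in evidence]
        belief = [0.0] * q[i]
        for a in range(q[i]):
            for vals in product(*(range(q[j]) for j in free)):
                x = [0] * N
                x[i] = a
                for j, v in evidence.items():
                    x[j] = v
                for j, v in zip(free, vals):
                    x[j] = v
                belief[a] += p(tuple(x))
        total = sum(belief)
        return [b / total for b in belief]   # the alpha normalization

The inner loop visits one assignment per term of (4.1), which is exactly the exponential cost the text warns about.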

The idea behind the "Bayesian belief network" approach [28][51] to this inference problem is to exploit any "partial independencies" which may exist among the X_i's to simplify belief updating. The simplest case of this is when the random variables X_1, ..., X_N are mutually independent, in which case the work in (4.1) can be avoided altogether, since an observation of one such variable cannot affect our belief in another. More generally, the partial independencies can be described by a directed acyclic graph, or DAG.

5 We have already used upper-case X's to denote codeword components, for example, in (2.1). We use upper-case X's here to denote arbitrary random variables, and hope no confusion will occur.

A DAG is a finite, directed graph in which there are no directed cycles. For example, Figure 4 shows a DAG with five vertices and five edges. Let us agree that if there is a directed edge a → b, then a will be called a "parent" of b, and b will be called a "child" of a. If the set of parents of a vertex v is denoted by pa(v), then we can describe the graph of Figure 4 as follows:

(4.2)    pa(v_1) = ∅,  pa(v_2) = ∅,  pa(v_3) = {v_1},  pa(v_4) = {v_1, v_2},  pa(v_5) = {v_3, v_4}.

[Figure 4 (DAG): vertices V1, ..., V5, with edges V1 → V3, V1 → V4, V2 → V4, V3 → V5, and V4 → V5.]

Figure 4. Simple example of a DAG which represents a five-variable directed Markov field (see (4.4)). This DAG is "loopy," with the vertices v_1, v_3, v_4, and v_5 forming a loop.

If G is a DAG, and if X is a set of random variables in one-to-one correspondence with the vertices of G, the joint density function p(x) is said to factor according to G if

(4.3)    p(x_1, ..., x_N) = ∏_{i=1}^{N} p(x_i | pa(x_i)),

where pa(x_i) denotes a value assignment for the parents of X_i. For example, a five-variable density function p(x_1, ..., x_5) factors according to the graph of Figure 4 if

(4.4)    p(x_1, x_2, x_3, x_4, x_5) = p(x_1) p(x_2) p(x_3 | x_1) p(x_4 | x_1, x_2) p(x_5 | x_3, x_4).
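
As a small illustration of the factorization (4.3)-(4.4) (ours, not the paper's), the DAG of Figure 4 can be encoded as a parent list, with each variable carrying a conditional probability table; the joint density is then the product of the local terms.

    # Parent sets for the DAG of Figure 4, cf. (4.2); vertices 0..4 stand for v1..v5.
    parents = {0: [], 1: [], 2: [0], 3: [0, 1], 4: [2, 3]}

    def joint(x, cpt):
        """Evaluate the factorization (4.3)/(4.4) at the assignment x = (x1, ..., x5).

        cpt[i](xi, pa_vals) returns p(x_i | pa(x_i)); for a root vertex the parent
        value tuple is empty and cpt[i] is simply the marginal density of X_i.
        """
        result = 1.0
        for i, pa in parents.items():
            result *= cpt[i](x[i], tuple(x[j] for j in pa))
        return result

With binary variables, each local table has at most four rows, so the whole five-variable joint is specified by a handful of numbers rather than 2^5 of them.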

A set of random variables X whose joint density function factors according to a given DAG is called a directed Markov field [35][32][65]. For example, if G is a directed chain, then X is an ordinary Markov chain. A DAG, together with the associated random variables X, is called a Bayesian belief network, or Bayesian network for short [28].

At this point we observe that the general coding framework of Figure 1 can be represented as the Bayesian network shown in Figure 5. From the decoder's viewpoint, the observed noisy information bits Y_si are probabilistic functions of the hidden information bits U_i. Similarly, the observed noisy codeword fragment Y_1 is a probabilistic function of the codeword X_1, which in turn is a deterministic function of the hidden input bits. (Figure 5 implies that the information bits U_i are independent.) The decoder's problem is thus to infer the values of the hidden variables U_i based on the evidence variables (Y_s1, ..., Y_sk) and Y_1.

[Figure 5 (Bayesian network): hidden information bits U_1, ..., U_k, each with a visible noisy child Y_s1, ..., Y_sk; all of U_1, ..., U_k are parents of the hidden codeword fragment X_1, whose visible child is the noisy codeword fragment Y_1.]

Figure 5. Bayesian network interpretation of the decoding problem.

Bayesian networks can sometimes lead to considerable simplifications of the probabilistic inference problem. The most important of these simplifications, for our purposes, is Pearl's belief propagation algorithm. In the 1980's Kim and Pearl [31][42][43][44] showed that if the DAG is a "tree," i.e., if there are no loops,⁶ then there are efficient distributed algorithms for solving the inference problem. If all the alphabets A_i have the same size q, Pearl's algorithm solves the inference problem on trees with O(Nq^e) computations, where e is the maximum number of parents of any vertex, rather than the O(q^m), where m is the number of unknown random variables, required by the brute-force method. The efficiency of belief propagation on trees stands in sharp contrast to the situation for general DAG's, since in 1990 Cooper [16] showed that the inference problem in general DAG's is
NP-hard. (See also [17] and [53] for more on the NP-hardness of probabilistic inference in Bayesian networks.)

6 A "loop" is a cycle in the underlying undirected graph. For example, in the DAG of Figure 4, v_1 → v_4 → v_5 → v_3 → v_1 is a loop.

Since the network in Figure 5 is a tree, Pearl's algorithm will apply. However, the result is uninteresting: Pearl's algorithm applied to this Bayesian network merely gives an alternative derivation of Lemma 2.2.

A more profitable application of Pearl's algorithm is to the classic "hidden Markov chain" inference problem, where the appropriate Bayesian network is shown in Figure 6. Here the result is a linear-time exact solution which is functionally identical to the celebrated "forward-backward algorithm" discovered in the 1960's and 1970's.⁷

[Figure 6 (Bayesian network): a chain X1 → X2 → X3 → X4 → X5, with each Xi having an observed child Yi.]

Figure 6. Bayesian network for the "hidden Markov chain" problem. Here X_1, ..., X_N form a Markov chain, and Y_1, ..., Y_N are noisy versions of X_1, ..., X_N. The problem is to compute the conditional probabilities of the hidden variables X_i based on the "evidence" variables Y_i.
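
For concreteness, here is a minimal forward-backward sketch for the chain of Figure 6. It is our own illustration, with our own names prior, A, and likelihood for the initial distribution, the transition matrix, and the observation likelihoods; on this loop-free network it computes exactly the conditional probabilities that Pearl's algorithm delivers.

    def forward_backward(prior, A, likelihood):
        """Posterior marginals of a hidden Markov chain (the network of Figure 6).

        prior[x]         : Pr{X_1 = x}
        A[x][xn]         : Pr{X_{t+1} = xn | X_t = x}
        likelihood[t][x] : p(y_t | X_t = x) for the observed y_t, t = 0..N-1
        """
        N, S = len(likelihood), len(prior)
        fwd = [[0.0] * S for _ in range(N)]
        bwd = [[1.0] * S for _ in range(N)]
        for x in range(S):                                   # forward pass, t = 0
            fwd[0][x] = prior[x] * likelihood[0][x]
        for t in range(1, N):
            for x in range(S):
                fwd[t][x] = likelihood[t][x] * sum(fwd[t - 1][xp] * A[xp][x] for xp in range(S))
        for t in range(N - 2, -1, -1):                       # backward pass
            for x in range(S):
                bwd[t][x] = sum(A[x][xn] * likelihood[t + 1][xn] * bwd[t + 1][xn] for xn in range(S))
        posteriors = []
        for t in range(N):                                   # combine and normalize
            unnorm = [fwd[t][x] * bwd[t][x] for x in range(S)]
            z = sum(unnorm)
            posteriors.append([u / z for u in unnorm])
        return posteriors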

For us, the important feature of Pearl's BP algorithm is that it can be defined for an arbitrary DAG which is not necessarily a tree, even though there is no guarantee that the algorithm will perform a useful calculation if there are loops in the DAG. We believe that the key to the success of turbo codes, and a potentially important research area for the AI community, is the experimentally observed fact that Pearl's algorithm works
"approximately" for some loopy, i.e., non-tree, DAG's.⁸ We shall explain the connection between turbo codes and BP in Section 6, after first describing the BP algorithm in detail in Section 5. For now, as a preview of coming attractions, we present Figure 7, which is a loopy Bayesian network appropriate for the turbo decoding problem.⁹

7 The forward-backward algorithm has a long and convoluted history that merits the attention of a science historian. It seems to have first appeared in the unclassified literature in two independent 1966 publications [6][11]. Soon afterwards, it appeared in papers on MAP detection of digital sequences in the presence of intersymbol interference [23]. It appeared explicitly as an algorithm for tracking the states of a Markov chain in the early 1970's [40][4] (see also the survey articles [47] and [49]). A similar algorithm (in "min-sum" form) appeared in a 1971 paper on equalization [62]. The algorithm was connected to the optimization literature in 1987 [63]. All this activity appears to have been completely independent of the developments in AI that led to Pearl's algorithm!

[Figure 7 (Bayesian network): hidden information bits U_1, ..., U_k, each with a visible noisy child Y_s1, ..., Y_sk; all of U_1, ..., U_k are parents of both hidden codeword fragments X_1 and X_2, whose visible children are the noisy codeword fragments Y_1 and Y_2.]

Figure 7. Bayesian network interpretation of the turbo decoding problem. Note the presence of many loops, e.g., U_1 → X_2 → U_2 → X_1 → U_1.

8 There is an "exact" inference algorithm for an arbitrary DAG, developed by Lauritzen and Spiegelhalter [34], which solves the inference problem with O(N_c q^J) computations, where N_c is the number of cliques in the undirected triangulated "moralized" graph G_m which can be derived from G, and J is the maximum number of vertices in any clique of G_m. However, this proves not to be helpful in the decoding problem, since the appropriate DAG produces moralized graphs with huge cliques. For example, the turbo codes in [10] have an associated G_m with a clique of size 16,384.

9 Our Figure 7 should be compared to Figure 2.5 in Wiberg [67], which describes the "Tanner graph" of a turbo code. The figures are similar, but there is a key difference. Wiberg incorporates the turbo code's interleaver, citing it (the interleaver) as necessary for ensuring that there are no short cycles in the graph. In our Figure 7, on the other hand, there are many short cycles. It is our belief that the presence of short cycles does not, at least in many cases, compromise the performance of the decoding algorithm, though it may degrade the quality of the code. We will expand on these remarks at the conclusion of the paper.

5. Detailed Description of Pearl's Algorithm.

In this section we will give a detailed functional description of Pearl's algorithm, as described in Chapter 4 of [44].

Pearl's belief propagation algorithm is a decentralized "message passing" algorithm, in which there is a processor associated with each vertex of G. Each processor can communicate only with its parents and children. Furthermore, the processor associated with a variable X is assumed to "know" the conditional density function p(x | u) ≝ Pr{X = x | U_1 = u_1, ..., U_M = u_M}, where U_1, ..., U_M are the parents of X. (If X has no parents, this knowledge is assumed to be the marginal density function p(x) ≝ Pr{X = x}.) Thus the "local environment" of a node X is as shown in Figure 8a.

[Figure 8: (a) The local environment of X: the parents U_1, U_2, ..., U_M of X, the children Y_1, Y_2, ..., Y_N of X, and the conditional density p(x | u) that X "knows." (b) Incoming messages to X, in vector notation (before activation of X): π_{U,X}(u) from the parents and λ_{Y,X}(x) from the children. (c) Outgoing messages from X, in vector notation (after activation of X): λ_{X,U}(u) to the parents and π_{X,Y}(x) to the children.]

Figure 8. A summary of Pearl's algorithm. (Boldface symbols denote random vectors; ordinary symbols represent random variables.)

When a processor is activated, it "reads" the messages received from each of its parents and children, updates its belief based on these messages, and then sends new messages back to its parents and children.

The message a node X receives from its parent U_i, denoted π_{U_i,X}(u_i), is in the form of a list of probabilities ("π" for "probability"), one for each value u_i ∈ A_{U_i}. Informally, π_{U_i,X}(u_i) is the probability of the event U_i = u_i, conditioned on the evidence in the tree already "known" to U_i. Similarly, the message X receives from its child Y_j, denoted λ_{Y_j,X}(x), is in the form of a list of nonnegative real numbers (likelihoods: "λ" for "likelihood"), one for each value of x ∈ A_X. Informally, λ_{Y_j,X}(x) is the probability of the
evidence Y_j "knows," conditioned on the event X = x. For simplicity, we adopt a vector notation for these incoming messages:

(5.1)    π_{U,X}(u) ≝ (π_{U_1,X}(u_1), ..., π_{U_M,X}(u_M))
         λ_{Y,X}(x) ≝ (λ_{Y_1,X}(x), ..., λ_{Y_N,X}(x)).

The situation is summarized in Figure 8b.

After X has been activated, the message X passes to its child Y_j, denoted π_{X,Y_j}(x), is a list of probabilities, one for each value of x. Roughly speaking, π_{X,Y_j}(x) is the probability of the event X = x, given the evidence in the tree already "known" to X, which now includes any new evidence which may have been contained in the incoming messages. Similarly, the message X passes to its parent U_i, denoted λ_{X,U_i}(u_i), is the probability of the evidence it now knows about, given the event U_i = u_i. Again we adopt a vector notation:

(5.2)    λ_{X,U}(u) ≝ (λ_{X,U_1}(u_1), ..., λ_{X,U_M}(u_M))
         π_{X,Y}(x) ≝ (π_{X,Y_1}(x), ..., π_{X,Y_N}(x)).

This situation is summarized in Figure 8c.
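
Before turning to the update rules themselves, it may help to see the bookkeeping in code. The Python sketch below is our own, with invented class and attribute names; it only records the incoming and outgoing messages of (5.1) and (5.2), together with the table p(x | u) that the node "knows," leaving the Table 5.1 updates to be filled in.

    class Node:
        """Bookkeeping for one vertex X in Pearl's message-passing scheme.

        Messages are stored as dictionaries of lists: pi_in[U] is the list
        pi_{U,X}(u) over u in A_U, and lam_in[Y] is the list lam_{Y,X}(x) over
        x in A_X, as in (5.1); the outgoing messages mirror (5.2).
        """

        def __init__(self, name, alphabet_size, parents, children, cond_table):
            self.name = name
            self.q = alphabet_size          # |A_X|
            self.parents = parents          # U_1, ..., U_M
            self.children = children        # Y_1, ..., Y_N
            self.p = cond_table             # p(x | u), the table X "knows"
            self.pi_in = {U: None for U in parents}     # pi_{U_i,X}, from parents
            self.lam_in = {Y: None for Y in children}   # lam_{Y_j,X}, from children
            self.pi_out = {Y: None for Y in children}   # pi_{X,Y_j}, to children
            self.lam_out = {U: None for U in parents}   # lam_{X,U_i}, to parents
            self.bel = [1.0 / alphabet_size] * alphabet_size   # BEL_X, updated on activation

When the node is activated, it reads pi_in and lam_in and recomputes its internal quantities, bel, and the outgoing messages in the order given in the text.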

Additionally, each node of the graph keeps track of a number of other quantities:

    μ_X(u) : A_{U_1} × ··· × A_{U_M} → R
    λ_X(x) : A_X → R
    π_X(x) : A_X → R
    γ_X(u) : A_{U_1} × ··· × A_{U_M} → R
    BEL_X(x) : A_X → R.

The quantities μ_X(u), λ_X(x), π_X(x), and γ_X(u) have no particular intrinsic significance, but the quantity BEL_X(x) is the heart of the algorithm, since when the algorithm terminates, BEL_X(x) gives the value of the desired conditional probability Pr{X = x | E}.

Here then is a complete description of Pearl's algorithm: when the node X is activated, it "reads" its incoming messages π_{U,X}(u) and λ_{Y,X}(x), and updates μ_X(u), λ_X(x), π_X(x), γ_X(u), BEL_X(x), λ_{X,U}(u), and π_{X,Y}(x), in that order, using the update rules in Table 5.1 and the initial values given in Table 5.2. (In Table 5.1 we use the n
