`Draft of August 13, 1997 11:06 a.m.
`
`Turbo Decoding as an Instance of
`Pearl’s “Belief Propagation” Algorithm
`
`Robert J. McEliece
`California Institute of Technology
`
`David J. C. MacKay
`Cambridge University
`
`Jung-Fu Cheng
`California Institute of Technology
`
`Abstract.
`
`In this paper we will describe the close connection between the now celebrated iterative
`turbo decoding algorithm of Berrou, Glavieux, and Thitimajshima, and an algorithm that
`has been well-known in the artificial intelligence community for a decade, but which is
`relatively unknown to information theorists: Pearl’s belief propagation algorithm. We shall
`see that if Pearl’s algorithm is applied to the “belief network” of a parallel concatenation
`of two or more codes, the turbo decoding algorithm immediately results. Unfortunately,
`however, this belief diagram has loops, and Pearl only proved that his algorithm works
`when there are no loops, so an explanation of the excellent experimental performance of
`turbo decoding is still lacking.
`
`However, we shall also show that Pearl’s algorithm can be used to routinely derive
`previously known iterative, but suboptimal, decoding algorithms for a number of other
`error-control systems, including Gallager’s low-density parity-check codes, serially con-
`catenated codes, and product codes. Thus belief propagation provides a very attractive
`general methodology for devising low-complexity iterative decoding algorithms for hybrid
`coded systems.
`
`* This work was supported by NSF grant no. NCR-9505975, AFOSR grant no. 5F49620-
`97-1-0313, and by a grant from Qualcomm, Inc. A portion of McEliece’s contribution was
`done while he was visiting the Sony corporation in Tokyo. The collaboration between
`MacKay and McEliece was begun at, and partially supported by, the Newton Institute for
`Mathematical Sciences, Cambridge, England.
`
`Hughes, Exh. 1033, p. 1
`
`
`
`1. Introduction and Summary.
`
`“Turbo codes,” which were introduced in 1993 by Berrou, Glavieux, and Thitimajshima
`[10], are the most exciting and potentially important development in coding theory in
`many years. Many of the structural properties of turbo codes have now been put on a firm
`theoretical footing ([7][18][20][21][27][45]), and several innovative variations on the turbo
`theme have appeared ([5][8][9][12][27][48]).
`
`What is still lacking, however, is a satisfactory theoretical explanation of why the
`turbo decoding algorithm performs as well as it does. While we cannot yet announce a
`solution to this problem, we believe the answer may come from a close study of Pearl’s
`belief propagation algorithm, which is largely unknown to information theorists, but well-
`known in the artificial intelligence community. (The first mention of belief propagation in
`a communications paper, and indeed the paper that motivated this one, is that of MacKay
`and Neal [37]. See also [38] and [39].)
`
`In this paper, we will review the turbo decoding algorithm as originally expounded
`by Berrou et al.
`[10], but which was perhaps explained more lucidly in [3], [18], or [50].
`We will then describe Pearl’s algorithm, first in its natural “AI” setting, and then show
`that if it is applied to the “belief network” of a turbo code, the turbo decoding algorithm
`immediately results. Unfortunately, however, this belief network has loops, and Pearl’s
`algorithm only gives exact answers when there are no loops, so the existing body of knowl-
`edge about Pearl’s algorithm does not solve the central problem of turbo decoding. Still, it
`is interesting and suggestive that Pearl’s algorithm yields the turbo decoding algorithm so
`easily. Furthermore, we shall show that Pearl’s algorithm can also be used to derive effec-
`tive iterative decoding algorithms for a number of other error-control systems, including
`Gallager’s low-density parity-check codes, the recently introduced low-density generator
`matrix codes, serially concatenated codes, and product codes. Some of these “BP” de-
`coding algorithms agree with the ones previously derived by ad hoc methods, and some
`are new, but all prove to be remarkably effective. In short, belief propagation provides
`an attractive general method for devising low-complexity iterative decoding algorithms for
`hybrid coded systems. This is the message of the paper. (A similar message is given in
`the paper by Kschischang and Frey [33] in this issue.)
`
`Here is an outline of the paper. In Section 2 we derive some simple but important
`results about, and introduce some compact notation for, “optimal symbol decision” de-
`coding algorithms. In Section 3 we define what we mean by a turbo code, and review the
`turbo decoding algorithm. Our definitions are deliberately more general than what has
`previously appeared in the literature. In particular, our transmitted information is not bi-
`nary, but rather comes from a q-ary alphabet, which means that we must deal with q-ary
`probability distributions instead of the traditional “log-likelihood ratios.” Furthermore,
`the reader may be surprised to find no discussion of “interleavers,” which are an essential
`component of all turbo-coding systems. This is because, as we will articulate fully in our
`concluding remarks, we believe the interleaver’s contribution is to make the turbo code
`a “good” code, but it has nothing directly to do with the fact that the turbo decoding
`algorithm is a good approximation to an optimal decoder. In Section 4, we change gears
`and give a tutorial overview of the general probabilistic inference problem, with special
`
`1
`
`Hughes, Exh. 1033, p. 2
`
`
`
`reference to Bayesian belief networks. In Section 5 we describe Pearl’s BP algorithm, which
`can be defined on any belief network, and which gives an exact solution to the probabilis-
`tic inference problem when the belief network has no loops. In Section 6, we show that
`the turbo decoding algorithm follows from a routine application of Pearl’s algorithm to
`the appropriate (loopy) belief network. In Section 7 we briefly sketch some other decoding
`algorithms that can be derived from BP considerations. Finally in Section 8 we summarize
`our findings and venture some conclusions.
`
`2. Preliminaries.
`
`In this section, we will describe a general class of q-ary systematic encoders, and derive
`the optimal symbol-by-symbol decoding rule for a memoryless channel.
`
`Let U = (U1, . . . , Uk) be ak -dimensional random vector of independent, but not
`necessarily equiprobable, symbols from a q-letter alphabet A, with Pr{Ui = a} = πi(a), for
`a ∈ A. The vector U represents information to be transmitted reliably over an unreliable
`channel. We suppose that U is encoded systematically, i.e., mapped into a codeword X of
`the form
`
`(2.1)
`
`X = (U , X1)
`where U is the “systematic” part, and X1 is the “nonsystematic” part, of the codeword
`X. In the rest of the paper, we will sometimes call X1 a codeword fragment.
`
`We assume that the codeword X is transmitted over a noisy channel with transition
`
`probabilities p(y | x)def= Pr{Y = y | X = x}, and received as Y = (Ys, Y1), where Ys is
`the portion of Y corresponding to the systematic part of the codeword U , and Y1 is the
`portion corresponding to the codeword fragment X1. We assume further that the channel
`is memoryless, which implies that the conditional density factors according to the rule
`p(y | x) =p(y s, y1 | u, x1)
`(cid:2)
`(cid:4)
`k(cid:3)
`= p(ys | u)p(y1 | x1)
`p(ysi | ui)
`· p(y1 | x1),
`
`(2.2)
`
`(2.3)
`
`=
`
`i=1
`where ysi denotes the ith component of ys. The situation is depicted in Figure 1.
`
`channel
`
`(cid:0)(cid:0)
`Ys = (Ys1, . . . , Ysk)
`(cid:0)(cid:0)
`Y1
`(cid:0)(cid:0)
`(cid:0)(cid:0)
`
`X1
`
`E
`
`U = (U1, . . . , Uk)
`
`Figure 1. The codeword X = (U , X1) is transmitted
`over a memoryless channel and received as Y = (Ys, Y1).
`
`2
`
`Hughes, Exh. 1033, p. 3
`
`
`
`The decoding problem is to “infer” the values of the hidden variables Ui based on the
`“evidence,” viz., the observed values ys and y1 of the variables Ys and Y1. The optimal
`decision, i.e., the one that minimizes the probability of inferring an incorrect value for Ui,
`is the one based on the conditional probability, or “belief,” that the information symbol
`in question has a given value:
`
`(2.4)
`
`BELi(a)def= Pr{Ui = a | Ys = ys, Y1 = y1}.
`
`(A communication theorist would use the term “a posteriori probability,” rather than
`
`“belief.”) If a0 is such that BELi(a0) > BELi(a), for all a (cid:3)= a0, the decoder infers that
`Ui = a0. The following straightforward computation is central to our results.
`computation, and for the rest of the paper, we will use Pearl’s α notation [44].
`
`In this
`
`2.1 Definition. If x = (x1, . . . , xm) and y = (y1, . . . , ym) are vectors of nonnegative real
`numbers, the notation
`
`x = α y
`
`m k
`
`=1 yk), for i = 1, . . . , m. In other words, x is a probability vector
`means that xi = yi/(
`whose components are proportional to those of y. (If f (x) and g(x) are nonnegative real-
`valued functions defined on a finite set, the notation f (x) =α g (x) is defined similarly.)
`
`(cid:5)
`
`defined in (2.4) is given by
`
`BELi(a) =α
`
`u :ui=a
`
`p(y1 | x1)
`(cid:6)
`
`λj(uj)πj(uj)
`
`k(cid:3)
`
`2.2 Lemma. If the likelihood p(ysi | ui)1 is denoted by λi(ui), then the belief BELi(a)
`(cid:6)
`k(cid:3)
`
`j=1
`
`p(y1 | x1)
`
`(2.5)
`
`= α λi(a)πi(a)
`
`λj(uj)πj(uj).
`
`u :ui=a
`
`j=1
`j(cid:2)=i
`
`Proof: We have, by the definition (2.4), BELi(a) = Pr{Ui = a | Y = y}. Then
`Pr{Y = y, Ui = a}
`Pr{Ui = a | Y = y} =
`Pr{Y = y}
`(cid:6)
`= α Pr{Y = y, Ui = a}
`(cid:6)
`
`= α
`
`u :ui=a
`
`= α
`
`u :ui=a
`
`p(y, u)
`p(y | u) · p(u)
`
`(using the α-notation)
`
`1 If the encoder is not systematic, i.e., if the uncoded information symbols Ui are not
`transmitted, these likelihoods should all be set equal to 1.
`
`3
`
`Hughes, Exh. 1033, p. 4
`
`
`
`(cid:6)
`(cid:6)
`
`u :ui=a
`
`u :ui=a
`
`= α
`
`= α
`
`p(y1 | x1)p(ys | u) · k(cid:3)
`p(y1 | x1) · k(cid:3)
`k(cid:3)
`(cid:6)
`
`πj(uj)
`
`by (2.2)
`
`j=1
`
`λj(uj)πj(uj)
`
`by (2.3)
`
`j=1
`
`p(y1 | x1)
`
`= α λi(a)πi(a)
`
`λj(uj)πj(uj).
`
`u :ui=a
`
`j=1
`j(cid:2)=i
`
`The last two lines of the above calculation are the assertions of the Lemma.
`
`We see from (2.5) that BELi(a) is the product of three terms. The first term, λi(a),
`might be called the systematic evidence term. The second term, πi(a), takes into account
`the a priori distribution of Ui. Note that the effect of the systematic evidence is, in effect,
`to change the prior distribution of Ui from πi(a) toαπ i(a)λi(a). The third term, which is
`more complicated, takes into account the geometry of the code. Following [10], we will call
`this term the extrinsic term, and denote it by Ei(a). The extrinsic term is so important
`to what follows that we shall introduce a special notation for it. (This notation will also
`prove useful in Section 5, where we shall use it to describe Pearl’s algorithm—see Table
`5.1, line 6.)
`
`Thus let A1, . . . , Ak be finite alphabets, let U ⊆ A1 × ··· × Ak, and let R denote
`the set of real numbers. Let g = (g1, . . . , gk) be a function mapping U into Rk. In other
`words, g is a vector of k real valued functions, and if u = (u1, . . . , uk) ∈ U , then
`g(u) = (g1(u1), . . . , gk(uk)).
`
`(cid:2) i
`
`(cid:2) k
`
`k(cid:3)
`
`(cid:2) 1
`
`(cid:6)
`
`Now suppose that K(u) is a real-valued function defined on the set U , which we call a
`(cid:2)
`= (g
`, . . . , g
`), where g
`is defined by
`kernel. The K-transform of g is the vector g
`
`(a) =
`
`(cid:2) i
`
`g
`
`(2.6)
`
`K(u)
`
`gj(uj).
`
`u :ui=a
`
`j=1
`j(cid:2)=i
`
`We summarize (2.6) by writing
`
`(2.7)
`
`(cid:2)
`
`g
`
`= g ◦ K.
`
`Next, if f and g are vector-valued functions as above, we define their adjacent product
`h = f g as a simple componentwise product, i.e., h = (h1, . . . , hk), where
`
`(2.8)
`
`hi(a) =f i(a)gi(a).
`
`Using the circle and adjacent notation,2 we can express the result of Lemma 2.2
`compactly. To do so, we take U = Ak, and define a kernel p(u) as
`p(u)def= p(y1 | x1),
`
`2 We assume that “adjacent” takes precedence over “circle,” in order to minimize the
`use of parentheses.
`
`4
`
`Hughes, Exh. 1033, p. 5
`
`
`
`where the codeword fragment x1 = x1(u) is a deterministic function of u. Then Lemma 2.2
`can be summarized as follows:
`
`(2.9)
`
`BEL = α λπ (λπ ◦ p) ,
`
`where λ(u) = (λ1(u1), . . . , λk(uk)) and π(u) = (π1(u1), . . . , πk(uk)).
`
`3. Systematic Parallel Concatenated (Turbo) Codes.
`
`In this section, we will define what we mean by a turbo code, and present a general version
`of the turbo decoding algorithm.
`
`With the same setup as in Section 2, suppose we have two systematic encodings of U :
`
`C1 : U → (U , X1)
`C2 : U → (U , X2).
`One way to combine C1 and C2 into a single code is via the mapping
`C : U → X = (U , X1, X2),
`which is called the parallel concatenation of C1 and C2, or the turbo code formed by com-
`bining C1 and C2.
`with transition probabilities p(y | x). It is received as Y = (Ys, Y1, Y2), where Ys is the
`component of Y corresponding to U , Y1 is the component of Y corresponding to X1, and
`Y2 is the component of Y corresponding to X2. We assume again that the channel is
`memoryless, which implies that the conditional density factors according to the rule
`
`Once again, we assume that the codeword X is transmitted through a noisy channel
`
`(3.1)
`
`(3.2)
`
`p(y | x) =p(y s, y1, y2 | u, x1, x2)
`(cid:2)
`(cid:4)
`k(cid:3)
`= p(ys | u)p(y1 | x1)p(y2 | x2)
`p(ysi | ui)
`· p(y1 | x1)p(y2 | x2),
`
`=
`
`i=1
`
`The situation is as depicted in Figure 2.
`
`By Lemma 2.2, the optimal decisions for the turbo code are based on the beliefs
`
`(cid:6)
`
`(3.3)
`
`= α λi(a)πi(a)
`
`p(y1 | x1)p(y2 | x2)
`
`BELi(a) =α
`
`u :ui=a
`
`p(y1 | x1)p(y2 | x2)
`(cid:6)
`
`u :ui=a
`
`5
`
`k(cid:3)
`
`j=1
`
`λj(uj)πj(uj)
`
`k(cid:3)
`
`λj(uj)πj(uj).
`
`j=1
`j(cid:2)=i
`
`Hughes, Exh. 1033, p. 6
`
`
`
`Ys = (Ys1, . . . , Ysk)
`(cid:0)(cid:0)
`(cid:0)(cid:0)
`Y1
`(cid:0)(cid:0)
`(cid:0)(cid:0)
`(cid:0)(cid:0)
`Y2
`(cid:0)(cid:0)
`
`channel
`
`X1
`
`X2
`
`E1
`
`E2
`
`U = (U1, . . . , Uk)
`
`Figure 2. A generic “turbo code:” The codeword
`X = (U , X1, X2) is transmitted over a memoryless
`channel and received as Y = (Ys, Y1, Y1).
`
`For simplicity, and in accordance with engineering practice, from now on we will
`assume that the a priori probability density of the Ui’s is uniform, i.e., π = (α 1, . . . , α 1).
`With this assumption, using the notation introduced in Section 2, (3.3) becomes3
`BEL = α λ (λ ◦ p1p2) ,
`
`(3.4)
`
`where the kernels p1 and p2 are defined by
`p1(u) =p(y 1 | x1)
`p2(u) =p(y 2 | x2).
`
`(3.5)
`
`The celebrated “turbo decoding algorithm” [10][50][3] is an iterative approximation to
`the optimal beliefs in (3.3) or (3.4), whose performance, while demonstrably suboptimal
`[41], has nevertheless proved to be “nearly optimal” in an impressive array of experiments.
`The heart of the turbo algorithm is an iteratively defined sequence π(m) of product prob-
`ability densities on Ak defined by
`
`π(0) = (α 1, . . . , α 1),
`(3.6)
`(cid:7)
`i.e., π(0) is a list of k uniform densities on A, and for m ≥ 1,
`α λπ(m−1) ◦ p1
`α λπ(m−1) ◦ p2
`
`if m is odd
`if m is even.
`
`(3.7)
`
`π(m) =
`
`Then the mth turbo belief vector is defined by
`BEL(m) = α λπ(m)π(m−1).
`
`(3.8)
`
`3 As we observed earlier, the effect of λ is to change the prior distribution from π to
`λπ. It follows that if there is a nonuniform prior π, it can be accounted for by replacing
`every occurrence of “λ” in our formulas with λπ.
`
`6
`
`Hughes, Exh. 1033, p. 7
`
`
`
`The general form of (3.7) is shown in Figure 3.
`
`In a “practical” decoder, the decision about the information bits is usually made after
`a fixed number of iterations. (The hope that the limit of (3.8) will exist is in general a
`vain one, since in [41] several examples of non-convergence are given.) If the decision is
`made after m iterations, the mth turbo decision is defined as
`BEL(m)
`i
`
`(a).
`
`(cid:8)U
`
`(m)
`i = arg max
`a∈A
`
`(3.9)
`
`Y1
`Ys
`
`D1
`
`π(2), π(4), π(6), . . .
`
`π(1), π(3), π(5), . . .
`
`D2
`
`Y2
`Ys
`
`Figure 3. Block diagram of turbo-decoding procedure.
`
`We conclude this section by observing that as we have stated it, the turbo algorithm
`((3.7) and (3.9)) does not appear to be significantly simpler than the optimal algorithm
`
`(3.4), since (for example) (λ◦p1) is not, in general, much easier to compute than (λ◦p1p2).
`
`The following theorem, and the discussion that follows, sheds light on this problem.
`
`i
`
`(3.10)
`
`π(m)
`i
`
`(a) =α
`
`= α
`
`i
`
`i
`
`(a)
`
`(a)
`
`if m is odd
`
`if m is even.
`
`3.1 Theorem. If the components of U are assumed to be independent, with Pr{Ui =
`ui} = π(m−1)
`(ui), for i = 1, . . . , k, then
`Pr{Ui = a | Ys, Y1}
`λi(a)π(m−1)
`Pr{Ui = a | Ys, Y2}
`λi(a)π(m−1)
`(cid:6)
`
`Proof: We consider the case m odd, the proof for even m being essentially the same. By
`reasoning similar to that in Lemma 2.2, we find that
`
`(3.11)
`
`Pr{Ui = a | Ys, Y1} =
`
`p(y1 | u)
`
`λj(uj)π(m−1)
`
`j
`
`(uj).
`
`k(cid:3)
`
`u :ui=a
`
`j=1
`
`7
`
`Hughes, Exh. 1033, p. 8
`
`
`
`If we divide both sides of (3.11) by λi(a)π(m−1)
`(cid:6)
`Pr{Ui = a | Ys, Y1}
`λi(a)π(m−1)
`
`(3.12)
`
`i
`
`(a)
`
`p(y1 | u)
`
`=
`
`u :ui=a
`
`j=1
`j(cid:2)=i
`
`i
`
`(a), we obtain
`
`k(cid:3)
`
`λj(uj)π(m−1)
`
`j
`
`(uj)
`
`= λπ(m−1) ◦ p1.
`Since by (3.7), π(m) = αλπ(m−1) ◦ p1, the Theorem follows.
`the vectors π(m) can be computed by a decoder for C1 (or C2) which is capable of computing
`the probabilities Pr{Ui = a | Ys, Y1}, based on an observation of the noisy codeword
`Y = (Ys, Y1), i.e., an optimal “soft” symbol decision decoder. The ith component of the
`message passed to the second decoder module is then
`
`The significance of Theorem 3.1 is that it tells us that the appropriate components of
`
`(3.13)
`
`π(m)
`i
`
`(a) =
`
`Pr{Ui = a | Ys, Y1}
`λi(a)π(m−1)
`
`(a)
`
`i
`
`,
`
`which is the “extrinsic information” referred to earlier.
`
`One of the keys to the success of turbo codes is to use component codes C1 and C2
`
`for which a low-complexity soft bit decision algorithm exists. For example, the BCJR, or
`“APP” decoding algorithm [4] provides such an algorithm for any code, block or convolu-
`tional, that can be represented by a trellis.4
`achieve high performance, which means that individually, the codes C1 and C2 must be
`type shown in Figure 2, in which the individual codes C1 and C2 are indeed relatively weak
`
`relatively weak. The brilliant innovation of Berrou et al. [10] was to devise a code of the
`
`As far as is known, a code with a low-complexity optimal decoding algorithm cannot
`
`(but have a low-complexity decoding algorithm), in such a way that that the overall code
`is very powerful. Roughly speaking, they accomplished this by making the encoder E2
`identical to E1, except for a random permutation (accomplished by the “interleaver”) of
`the inputs. (The encoders were short-constraint-length systematic convolutional encoders
`with feedback.) However, since it is the object of this paper to study the decoding algorithm
`without regard to the resulting performance, we shall not discuss the constructive aspect
`of turbo-codes further.
`
`4 As we shall see in Section 4, the BCJR algorithm itself, and the many variations of
`it, are themselves special cases of Pearl’s algorithm. In this application, the algorithm is
`provably exact, since the corresponding “belief” diagram has no loops.
`
`8
`
`Hughes, Exh. 1033, p. 9
`
`
`
`4. Background on Probabilistic Inference, Bayesian Belief Networks, and
`Pearl’s Algorithm
`
`In this section, we will give a tutorial overview of the so-called probabilistic inference
`problem of the artificial intelligence community, as well as a brief discussion of Pearl’s
`algorithm, which solves the probabilistic inference problem in many important special
`cases.
`
`Thus let X = {X1, X2, . . . , XN}5 be a set of N discrete random variables, where Xi
`assumes values in the finite alphabet Ai. The joint density function
`p(x) =p(x 1, x2, . . . , xN )def= Pr{X1 = x1, . . . , XN = xN}
`is then a mapping from A1 ×···× AN into the set of real numbers R. We assume that the
`marginal densities p(xi)def= Pr{Xi = xi} are also known. The marginal density function
`p(xi) represents our a priori “belief” about the random variable Xi. Now suppose one or
`more of these random variables is measured, or “observed.” This means that there is a
`subset J ⊆ {1, 2, . . . , N}, (the evidence set) such that for all j ∈ J, the random variable
`Xj is known to have a particular value, say aj. The evidence is then defined to be the
`event
`E = {Xj = aj : j ∈ J}.
`
`The fundamental probabilistic inference problem is to compute the updated beliefs, i.e., the
`
`a posteriori or conditional probabilities p(Xi|E), for all i /∈ J.
`The brute force approach to computing p(Xi|E) is to sum over all of the terms of
`(cid:6)
`p(x) which do not involve either i or J. To simplify notation, we assume i = 1, and
`J = {m + 1, . . . , N}. Then we have
`p(X1 = a|E) =α
`
`(4.1)
`
`p(a, x2, . . . , xm, am+1, . . . , aN ).
`
`x2,...,xm
`
`If Xi can assume qi different values, then computing the sum in (4.1) for each possible
`value of a requires q1q2 ··· qm additions, which is impractical unless m and the qi’s are very
`
`small numbers.
`
`The idea behind the “Bayesian belief network” approach [28][51] to this inference
`problem is to exploit any “partial independencies” which may exist among the Xi’s to
`simplify belief updating. The simplest case of this is when the random variables X1, . . . , XN
`are mutually independent, in which case the work in (4.1) can be avoided altogether, since
`an observation of one such variable cannot affect our belief in another. More generally, the
`partial independencies can be described by a directed acyclic graph, or DAG.
`
`5 We have already used upper-case X’s to denote codeword components, for example,
`(2.1). We use upper-case X’s here to denote arbitrary random variables, and hope no
`confusion will occur.
`
`9
`
`Hughes, Exh. 1033, p. 10
`
`
`
`An DAG is a finite, directed graph, in which there are no directed cycles. For example,
`Figure 4 shows a DAG with five vertices and five edges. Let us agree that if there is a
`directed edge a → b, then a will be called a “parent” of b, and b will be called a “child” of
`a. If the set of parents of a vertex v is denoted by pa(v), then we can describe the graph
`of Figure 4 as follows:
`
`(4.2)
`
`pa(v1) =∅
`pa(v2) =∅
`pa(v3) ={v 1}
`pa(v4) ={v 1, v2}
`pa(v5) ={v 3, v4}.
`
`V1
`
`V3
`
`V2
`
`V4
`
`V5
`
`Figure 4. Simple example of a DAG which represents a
`five-variable directed Markov field (see (4.4)). This DAG is
`“loopy,” with the vertices v1, v3, v4, and v5 forming a loop.
`
`If G is a DAG, and if X is a set of random variables in one-to-one correspondence
`with the vertices of G, the joint density function p(x) is said to factor according to G if
`
`N(cid:3)
`
`(4.3)
`
`p(x1, . . . , xN ) =
`
`i=1
`
`p(xi| pa(xi))
`
`where pa(xi) denotes a value assignment for the parents of Xi. For example, a five-variable
`density function p(x1, . . . , x5) factors according to the graph of Figure 4 if
`p(x1, x2, x3, x4, x5) =p(x 1)p(x2)p(x3|x1)p(x4|x1, x2)p(x5|x3, x4).
`
`(4.4)
`
`10
`
`Hughes, Exh. 1033, p. 11
`
`
`
`A set of random variables X whose density functions factor according to a given DAG is
`called a directed Markov field [35][32][65]. For example, if G is a directed chain, then X is
`an ordinary Markov chain. A DAG, together with the associated random variables X, is
`called a Bayesian belief network, or Bayesian network for short [28].
`
`At this point we observe that the general coding framework of Figure 1 can be rep-
`resented as the Bayesian network shown in Figure 5. From the decoder’s viewpoint, the
`observed noisy information bits Ysi are probabilistic functions of the hidden information
`bits Ui. Similarly, the observed noisy codeword fragment Y1 is a probabilistic function of
`the codeword X1, which in turn is a deterministic function of the hidden input bits. (Fig-
`ure 5 implies that the information bits Ui are independent.) The decoder’s problem is thus
`to infer the values of the hidden variables Ui based on the evidence variables (Ys1, . . . , Ysk)
`and Y1.
`
`Ys1
`
`Ys2
`
`Ys(k-1)
`
`Ysk
`
`Noisy information bits
`(visible)
`
`U1
`
`U2
`
`Uk-1
`
`Uk
`
`X1
`
`Y1
`
`Information bits
`(hidden)
`
`Codeword Fragment
`(hidden)
`
`Noisy codeword
`Fragment
`(visible)
`
`Figure 5. Bayesian network
`interpretation of the decoding problem.
`
`Bayesian networks can sometimes lead to considerable simplifications of the proba-
`bilistic inference problem. The most important of these simplifications, for our purposes,
`is Pearl’s belief propagation algorithm. In the 1980’s Kim and Pearl [31][42][43][44] showed
`that if the DAG is a “tree,” i.e., if there are no loops,6 then there are efficient distributed
`algorithms for solving the inference problem. If all the alphabets Ai have the same size q,
`Pearl’s algorithm solves the inference problem on trees with O(N qe) computations, where
`e is the maximum number of parents of any vertex, rather than O(qm), where m is the
`number of unknown random variables, which is required by the brute force method. The
`efficiency of belief propagation on trees stands in sharp contrast to the situation for general
`DAG’s, since in 1990 Cooper [16] showed that the inference problem in general DAG’s is
`
`6 A “loop” is a cycle in the underlying undirected graph. For example, in the DAG of
`Figure 4, v1 → v4 → v5 → v3 → v1 is a loop.
`
`11
`
`Hughes, Exh. 1033, p. 12
`
`
`
`NP-hard. (See also [17] and [53] for more on the NP-hardness of probabilistic inference in
`Bayesian networks.)
`
`Since the network in Figure 5 is a tree, Pearl’s algorithm will apply. However, the
`result is uninteresting: Pearl’s algorithm applied to this Bayesian network merely gives an
`alternative derivation of Lemma 2.2.
`
`A more profitable application of Pearl’s algorithm is to the classic “hidden Markov
`chain” inference problem, where the appropriate Bayesian network is shown in Figure 6.
`Here the result is a linear-time exact solution which is functionally identical to the cele-
`brated “forward-backward algorithm” discovered in 1960’s and 1970’s.7
`
`X1
`
`X2
`
`X3
`
`X4
`
`X5
`
`Y1
`
`Y2
`
`Y3
`
`Y4
`
`Y5
`
`Figure 6. Bayesian network for the “hidden Markov
`chain” problem. Here X1, . . . , XN form a Markov chain,
`and Y1, . . . , YN are noisy versions of X1, . . . , XN . The
`problem is compute the conditional probabilities of the
`hidden variables Xi based in the “evidence” variables Yi.
`
`For us, the important feature of Pearl’s BP algorithm is that it can be defined for
`an arbitrary DAG which is not necessarily a tree, even though there is no guarantee that
`the algorithm will perform a useful calculation if there are loops in the DAG. We believe
`that the key to the success of turbo-codes, and a potentially important research area
`for the AI community, is the experimentally observed fact that Pearl’s algorithm works
`
`7 The forward-backward algorithm has a long and convoluted history that merits the
`attention of science historian. It seems to have first appeared in the unclassified literature
`in two independent 1966 publications [6][11]. Soon afterwards, it appeared in papers MAP
`detection of digital sequences in the presence of intersymbol interference [23]. It appeared
`explicitly as an algorithm for tracking the states of a Markov chain in the early 1970’s
`[40][4] (see also the survey articles [47] and [49]). A similar algorithm (in “min-sum”
`form appeared in a 1971 paper on equalization [62]. The algorithm was connected to the
`optimization literature in 1987 [63]. All this activity appears to have been completely
`independent of the developments in AI that led to Pearl’s algorithm!
`
`12
`
`Hughes, Exh. 1033, p. 13
`
`
`
`“approximately” for some loopy, i.e., non-tree DAG’s.8 We shall explain the connection
`between turbo codes and BP in Section 6, after first describing the BP algorithm in detail
`in Section 5. For now, as a preview of coming attractions, we present Figure 7, which is a
`loopy Bayesian network appropriate for the turbo decoding problem.9
`
`Ys1
`
`Ys2
`
`Ys(k-1)
`
`Ysk
`
`U1
`
`U2
`
`Uk-1
`
`X1
`
`Y1
`
`Uk
`
`X2
`
`Y2
`
`Noisy information bits
`(visible)
`
`Information bits
`(hidden)
`
`Codeword Fragments
`(hidden)
`
`Noisy codeword
`Fragments
`(visible)
`
`Figure 7. Bayesian network interpretation of the
`turbo-decoding problem. Note the presence of
`
`many loops, i.e., U1 → X2 → U2 → X1 → U1.
`
`8 There is an “exact” inference algorithm for an arbitrary DAG, developed by Lauritzen
`and Spiegelhalter [34], which solves the inference problem with O(NcqJ ) computations,
`where Nc is the number of cliques in the undirected triangulated “moralized” graph Gm
`which can be derived from G, and J is the maximum number of vertices in any clique in
`Gm. However, this proves not to be helpful in the decoding problem, since the appropriate
`DAG produces moralized graphs with huge cliques. For example, the turbo codes in [10]
`have an associated Gm with a clique of size 16,384.
`9 Our Figure 7 should be compared to Figure 2.5 in Wiberg [67], which describes the
`“Tanner graph” of a turbo code. The figures are similar, but there is a key difference.
`Wiberg incorporates the turbo code’s interleaver, citing it (the interleaver) as necessary
`for ensuring that there are no short cycles in the graph. In our Figure 7, on the other
`hand, there are many short cycles. It is our belief the presence of short cycles does not,
`at least in many cases, compromise the performance of the decoding algorithm, though it
`may degrade the quality of the code. We will expand on these remarks at the conclusion
`of the paper.
`
`13
`
`Hughes, Exh. 1033, p. 14
`
`
`
`5. Detailed Description of Pearl’s algorithm.
`
`In this section we will give a detailed functional description of Pearl’s algorithm as described
`in Chapter 4 of [44].
`
`Pearl’s belief propagation algorithm is a decentralized “message passing” algorithm,
`in which there is a processor associated with each vertex of G. Each processor can com-
`municate only with its parents and children. Furthermore, the processor associated with
`a variable X is assumed to “know” the conditional density function p(x | u)def= Pr{X =
`x| U1 = u1, . . . , UM = uM}, where U1, . . . , UM are the parents of X. (If X has no parents,
`this knowledge is assumed to be the marginal density function p(x)def= Pr{X = x}.) Thus
`the “local environment” of a node X is as shown in Figure 8a.
`
`(u)
`
`X, U
`
`(x)
`
`X,Y
`
`λ π
`
`X
`
`U
`
`Y
`
`Parents of X
`
`U1 U2
`
`UM
`
`X
`
`“knows”
`p(x|u)
`
`π
`
`(u)
`
`U, X
`
`λ
`Y, X
`
`(x)
`
`X
`
`U
`
`Y
`
`Y1
`
`YN
`Y2
`Children of X
`
`(a) The local environment of X
`
`(b) Incoming messages to X,
`in vector notation.
`(Before activation of X.)
`
`(c) Outgoing messages from X,
`in vector notation.
`(After activation of X.)
`
`Figure 8. A summary of Pearl’s algorithm.
`(Boldface symbols denote random vectors;
`ordinary symbols represent random variables.)
`
`When a processor is activated, it “reads” the messages received from each of its parents
`and children, updates its belief based on these messages, and then sends new messages back
`to its parents and children.
`
`The message a node X receives from its parent Ui, denoted πUi,X (ui), is in the form
`of a list of probabilities (“π” for “probability”), one for each value ui ∈ AUi. Informally,
`πUi,X (ui) is the probability of the event Ui = ui, conditioned on the evidence in the
`tree already “known” to Ui. Similarly, the message X receives from its child Yj, denoted
`λYj ,X (x), is in the form of a list of nonnegative real numbers (likelihoods: “λ” for “like-
`lihood”) one for each value of x ∈ AX .
`Informally, λYj ,X (x) is the probability of the
`
`14
`
`Hughes, Exh. 1033, p. 15
`
`
`
`evidence Yj “knows,” conditioned on the event X = x. For simplicity, we adopt a vector
`notation for these incoming messages:
`
`(5.1)
`
`πU ,X (u)def= (πU1,X (u1), . . . , πUM ,X (uM ))
`λY ,X (x)def= (λY1,X (x), . . . , λYN ,X (x)).
`
`The situation is summarized in Figure 8b.
`
`After X has been activated, the message X passes to its child Yj, denoted πX,Yj (x), is a
`list of probabilities, one for each value of x. Roughly speaking, πX,Yj (x) is the probability of
`the event X = x, given the evidence in the tree already “known” to X, which now includes
`any new evidence which may have been contained in the incoming messages. Similarly, the
`message X passes to its parent Ui, denoted λX,Ui(ui), is the probability of the evidence it
`now knows about, given the event Ui = ui. Again we adopt a vector notation:
`
`(5.2)
`
`λX,U (u)def= (λX,U1(u1), . . . , λX,UM (uM ))
`πX,Y (x)def= (πX,Y1(x), . . . , πX,YN (x)).
`
`This situation is summarized in Figure 8c.
`
`Additionally, each node of the graph keeps track of a number of other quantities:
`→ R
`
`× ··· × AUM
`μX (u) :A U1
`λX (x) :A X → R
`πX (x) :A X → R
`× ··· × AUM
`γX (u) :A U1
`BELX (x) :A X → R.
`
`→ R
`
`The quantities μX (u), λX (x), πX (x), and γX (u) have no particular intrinsic significance,
`but the quantity BELX (x) is the heart of the algorithm, since when the algorithm termi-
`nates, BELX (x) gives the value of the desired conditional probability Pr{X = x | E}.
`
`Here then is a complete description of Pearl’s algorithm: When the node X is acti-
`vated, it “reads” its incoming messages πU ,X (u) and λY ,X (x), and updates μX (u), λX (x),
`πX (x), γX (u), BELX (x), λX,U (u) and πX,Y (x), in that order, using the update rules in
`Table 5.1, and the initial values given in Table 5.2 (In Table 5.1 we use the n