`
`Page 1 of 15
`
`SAMSUNG EXHIBIT1011
`
`
`
`Reducing the computations of the SVD array given by Brent and Luk
`
`B. Yang and J.F. Béhme
`
`Ruhr-Universitét, Departmentof Electrical Engineering
`4630 Bochum, West Germany
`
`ABSTRACT
`
`A new,efficient two plane rotations (TPR) method for computing two-sided rotations involved in
`singular value decomposition (SVD)is presented. By exploiting the commutative properties of some special
`types of 2x2 matrices, we show that a two-sided rotation can be computed by only two planerotations and a
`few additions. Moreover, if we use coordinate rotation digital computer (CORDIC) processors to implement
`the processing elements (PEs) of the SVD array given by Brent and Luk, the computational overhead of the
`diagonal PEs due to angle calculations can be avoided. The resulting SVD array has a homogeneous
`structure with identical diagonal and off-diagonal PEs.
`
`1.
`
`INTRODUCTION
`
`the SVD. Particular
`One important problem in linear algebra and digital signal processing is
`applications arise in beamforming and direction finding, spectrum analysis, digital image processing etc.
`Recently, there is a massive interest in parallel architectures for computing SVD due to the growing
`importanceof real-time signal processing and advances in VLSI devices. Brent and Luk2” have shown how
`a Jacobi method with parallel ordering can efficiently compute the SVD, and how the method can be
`implemented by a two-dimensional systolic array. The method is based on, as commonforall two-sided
`approaches, applying a sequence of two-sided rotations to 2x2 submatrices of the original matrix. The
`computational complexity is thus determined by how to compute the two-sidedrotations.
`
`Usually, a two-sided rotation is realized by four plane rotations, where two of them are applied from
`left to the two column vectors of the 2x2 submatrix and the other ones are applied from right to the row
`vectors, respectively. For the diagonal PEs of the SVD array, additional operations for calculating the
`rotation angles are required. This leads to an inhomogeneous array architecture containing two different
`types of PEs.In this paper, we develop a new TPR methodfor computing two-sided rotations. We show that
`the above computational complexity is reduced significantly, and the SVD array becomes homogeneous
`when using CORDICprocessors.
`
`The paper is organized as follows. In section 2, we briefly re-examine the Jacobi SVD method and the
`SVD array. Then, we develop the TPR methodin section 3. The CORDIC algorithm is described in section
`4. Different techniques for scaling correction are discussed, and examples of scaling corrected CORDIC
`sequences for different data formats are given. In section 5, an unified CORDIC implementation of all PEs
`
`92 / SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989)
`
`Page 1 of 15
`
`SAMSUNG EXHIBIT 1011
`
`
`
`of the SVDarrayis presented. Finally, some convergence aspects are discussed in section 6.
`
`The SVDof a matrix MeRis given by
`
`2.
`
`JACOBI SVD METHOD
`
`(1)
`M=UZV',
`where UeRN™® and VerRN* are orthogonal matrices and LERis a diagonal matrix of singular values.
`Based on an extension of the Jacobi eigenvalue algorithm, Kogbetliantz’, Forsythe and Henrici> proposed to
`diagonalize M by a sequence of two-sided rotations,
`
`Mo=M,=Muy = UxMyVy (k=0,1,2,-- +). (2)
`
`U, and V, represent rotations in the (i,j)-plane (1Si<jsN). The rotation angles are chosen to annihilate the
`elements of My at the positions (i,j) and (j,i). Usually, several sweeps are necessary to complete the SVD,
`where a sweep is a sequence of N(N-1)/2 two-sided rotations according to a special ordering of the
`N(N-1)/2 different index pairs (i,j). Unfortunately, the best known cyclic orderings, namely, the cyclic row
`ordering
`
`(ij) = (1,2), ... , (LN), (2,3), «. » (2,N), «> (N-LN)
`and the equivalent cyclic column ordering are not suitable for parallel computations.
`
`(3)
`
`Recently, Brent and Luk”? suggested a parallel ordering allowing a high degree of parallelism. It
`enables
`the use of
`a
`square
`systolic
`array with
`[N/2]x[N/2] PEs to implement the Jacobi SVD method
`(Figure 1). For doing this, the matrix M is partitioned into
`2x2 submatrices. Each PE contains one submatrix and
`
`performs a two-sided rotation B = R(0,)'AR(®,) ,
`
`where
`
`(4)
`
`am (2 2] ae ie |
`-
`
`
`
`= ~Lbg,Lan ag) b22 ©)
`
`denote the input and output matrices, and
`
`cos@ sin@
`R®)=|sind cos0 (6)
`
`Fi
`1. The
`SVD
`ma eric ae
`array given by
`describes
`a
`plane
`rotation
`through
`the
`angle
`8,
`respectively. Notice
`that
`there
`are
`two
`different
`computation modes. In a diagonal PE, the rotation angles
`@, and @ are generated to diagonalize the 2x2 submatrix (b;2=b2;=0) stored. We call this the generation
`
` i
`
`SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989) / 93
`
`Page 2 of 15
`
`Page 2 of 15
`
`
`
`
`
`
`
`mode. Then, the rotation angles are propagated to all off-diagonal PEs in the same row and the same
`column, in which pure rotations (4) with the received angles are performed. Wecall this the rotation mode.
`Clearly, if we directly compute (4) in the rotation mode, we require four plane rotations. For the generation
`mode, additional operations for calculating 6, and @, are needed.
`
`3. TPR METHOD FOR COMPUTING TWO-SIDED ROTATIONSNeSEEERLAMONS
`
`In order to develop the TPR method for computing two-sided rotations more efficiently, we first
`discuss the commutative properties of two special types, the rotation-type and the reflection-type, of 2x2
`matrices, namely,
`
`and A= i. I | eseR}
`A =| L§ | | c.ser }
`Obviously, the plane rotation matrix and the Givensreflection matrix® with c2+s2=1 are two elements of the
`sets &™ and «™*, respectively. The following results can be shown by elementary manipulations.
`Lemma 1.
`If Aj #™™ and Aye .«™,
`then A;Ao=AoA,e M™,
`Lemma 2.
`If Aj M™* and Age #™, then A,Ap=ALA€ 4,
`
`In particular, if we consider two planerotations, we know,
`Lemma 3.
`If R(;) and R(@) are plane rotations described by (6), then R(8,)R(62)=R(6)+6,) and
`R(,)"R(O2)=R(0,-0)).
`
`Now,we give a theorem describing the rotation mode of the TPR method.
`Theorem.
`If the 2x2 matrix A and the tworotation angles @, and 62 are given, then the two-sided
`rotation (4) can be computed by twoplanerotations, ten additions and four scalings by 1/2:
`Pi = (a22 + a3))/2
`P2 = (a2 — a31)/2
`{a = (a2) —ay)/2 ’
`{ oo = (a1 + ap)/2 ’
`6_ = 6-6, ,
`6, = 62+ 0;
`7
`
`(7)
`(8)
`
`a] = x00[2]
`fn) = Ref
`tp|
`=
`R@d|g|
`ty)
`=
`R@ Iq)
`big = +t) + ty
`by = 1 -r
`era ™1+% °
`(eee aia
`Proof. Using (7), the input matrix A can be reformulated by
`
`»
`
`9)
`(10)
`
`A = A,+A) = |
`| —P2
`Pi =]
`a4:
`Pi
`a2 P2
`Clearly, R(®;), R(®2) in (4) and A, are elements of #'™ while A, belongs to #™. This leads to the
`following expression for the output matrix B by using the lemmas 1-3,
`
`94 / SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989)
`
`Page 3 of 15
`
`Page 3 of 15
`
`
`
`ll
`
`w
`
`RO)"ARO.) = R(O)'A:R(G,) + RO)"A2R(2)
`
`RQ)"R(O)Ai + RO)"RO)"Ay = R(O2-0))A; + ROz+0))"Ay
`—P2 es
`sad| qi
`T
`nie G2 P2
`
`-q1
`Pi
`
`Pi
`
`A direct consequence of the above theorem shows how a two-sided rotation in the generation mode can
`be computed in a similar way.
`Corollary.
`If the 2x2 matrix A is given, we can diagonalize A and calculate the corresponding rotation
`angles 0, and 02 by two Cartesian to polar coordinates conversions, eight additions and four scalings by 1/2:
`Pi = (a2 + a33)/2
`P2 = (a2 — a43)/2
`a
`{a = (a2) —aj)/2 ’”
`Ve = (a2) + aj2)/2 ’”
`(11)
`ry = sign(p:){p4 + qi
`13: = sign(p2){p? + @ ;
`(12)
`{o. = arctan(q;/Ppi)
`le, = arctan(q2/p2)
`
` ddeesee
`
`tT
`tq)
`
`-ty Rb
`TY +
`tg
`In}
`
`°
`
`=
`
`This completes the proof.
`
`0, = (0,-0)/2 ,
`
`0 = (0,+0)/2 ,
`
`bi = 1-2,
`
`by = +12.
`
`(13)
`
`(14)
`
`Proof. Regarding equation (10), t;=t2=0 follows directly from bj.=b2);=0.
`
`In equation (12), we choose the rotation through the smaller angle. All vectors lying in the first or the
`fourth quadrant are rotated to the positive x-axis, and all vectors lying in the second andthe third quadrant
`are rotated to the negative x-axis. For vectors on the y-axis, the rotation direction is arbitrary. Thus, the
`generated rotation angles @_ and 6, satisfy |@_|,
`|®,| < 7/2. This results in
`
`|0,| <7/2
`
`and
`
`|| <sx/2
`
`(15)
`
`due to (13).
`
`Weremark that Delosme’” has also proposed to use (14) for computing b,; and b22 in the genération
`mode. However, he did not recognize the fact that the above approach is generally possible because of the
`commutative properties of the rotation-type and the reflection-type of matrices. So, he still requires four
`plane rotations for computing a two-sided rotation in the rotation mode.
`
`4. THE CORDIC ALGORITHM
`
`In the previous section, we have seen that the main operations of the TPR-methodare plane rotations
`and Cartesian to polar coordinates conversions. Although these operations can be carried out by multiplier-
`
`SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989) /. 95
`
`Page 4 of 15
`
`Page 4 of 15
`
`
`
`
`
`adder based processors supported by software or special hardware units, a simpler approach is the use of
`dedicated processors that map algorithms more effectively to hardware. The CORDIC processor is a
`powerful one for calculating these functions.
`
`The CORDIC algorithm was originally designed by Volder® as an iterative procedure for computing
`plane rotations and Cartesian to polar coordinates conversions. It was later generalized and unified by
`Walther? enabling a CORDICprocessorto calculate more functions, including hyperbolic functions as well
`as multiplications and divisions, in a similar way. In the following, only Volder's CORDIC algorithm is
`considered for the sake of simplicity because only trigonometric functions are involved in SVD applications.
`
`The CORDICalgorithm consists of iterative shift-add operations on a three-componentvector,
`
`Bs | 7 a _
`Yai
`y,+0,5.x.
`Zi,
`= 2,-€6.0,
`
`cos(a.,)
`|
`1
`“05C) G;sin(@,)
`(i=0,1,....n-1)
`,
`
`-9,sin(a.) *]
`
`cos(a,)
`y;
`
`diay
`
`(17)
`
`in whichthe iteration stepsize 0<8.<1 is defined by
`(18)
`6, = tan(a,) = 20,
`Theset of integers {S(i)} is called CORDIC sequence. Equation (16) can be interpreted, except for a scaling
`factor of
`
`=
`
`1
`
`=
`
`(19)
`148° ,
`k ~ Cos a, ~
`as a rotation of (x,; y)" through the angle a, where o.=11 gives the rotation direction. After n iterations
`(16) and (17),the roast data are given by
`x
`cosa -sinw Xo
`Yn = sin® cosa Yos
`(20)
`ZO Z) — €&
`(21)
`with the overall scaling factor K = i k; and the cumulative rotation angle o = . 6... Now,if the CORDIC
`sequencesatisfies the following paeeiedace condition
`
`(i=0,1,....n-2) ,
`
`to force Y,, OF Z, to zero, provided that the input data x
`
`(22)
`
`0 Yo
`
`n-1
`a- 2asca
`*
`jsitt J
`cc
`we can choose O,=sign(x,y;) or 0,=sign(€z.)
`and Zo lie in the convergence region
`es |
`yl tar
`9,
`
`
`n-1 fory+0arctan(y,/x
`C=
`(23)
`for z_ +0
`Izo|
`Both, plane rotations as well as Cartesian to polar coordinates conversions can be calculated in this way
`
`i=o '
`
`96 / SPIE Vol. 1152 AdvancedAlgorithms andArchitectures for Signal Processing IV (1989)
`
`Page 5 of 15
`
`Page 5 of 15
`
`
`
`(Table 1), where e=+1 is a sign parameter.
`
`Zot€arctan(yo/xXo9)|Yn = K(exosinzg+ yocosZo)
`
`X, = K( xpcoszp—€ypsinzo)
`
`ra Ksign(xo){x3+y9
`
`Table 1 : CORDICtrigonometric functions
`
`In Table 1, only the principal values |arctan(yo/xo) |<m/2 are calculated when computing Cartesian to
`polar coordinates conversions. Correspondingly, x, may be the positive or the negative vector norm
`according to the sign of x9. These properties for computing (12) are enabled by the particular choice of
`o.=-sign(x,y.), in contrast to o.=-sign(y,) in most other CORDIC publications. So, it is guaranteed that a
`vector is always rotated through the smaller angle to the x-axis. In this case, a convergence region of C2900
`is sufficient for the generation mode.
`
`A disadvantage of the CORDIC algorithm is the need to correct the scaling factor K. The traditional
`CORDIC sequence
`
`{S@} = (0,1, 2,3,...,p-1,p }
`
`(24)
`
`with n=p+1 CORDICiterations for a mantissa length p delivers, for example, a scaling factor of K=1.64676
`for p=16. Using multiplications of x, and y, by K? will degrade the algorithm performancesignificantly.
`Welike to compensate the scaling effect by using a minimum numberofshift-add operations.
`
`SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989) / 97
`
`For a two-sided rotation that is implemented by four plane rotations, each matrix element undergoes
`two plane rotations so that the total scaling factor to be corrected is K. In this case, Cavallaro and Luk!°
`pointed out that there is a simple systematic approach for scaling correction if the traditional CORDIC
`sequence (24) is used. They proposed to use [p/4] scaling iterations of the type x — x-2Ix with jeJ={1, 3,
`5, ... ,2[p/4]-1} and oneshift operation 21. The remaining scaling error is boundedasfollows),
`:
`.
`?p
`:
`1-271 (1-2)-K?| J ja -2? 0-2). (142-7)|
`< gel
`
`jes
`
`jes
`
`i=0
`
`(25)
`
`This approach, however,fails in the TPR method. Here, each matrix element undergoes only one plane
`rotation. So, the scaling factor to be corrected is K. For solving this more difficult problem, different
`techniques have been developed. Haviland and Tuszynski!! used similar scaling iterations as Cavallaro and
`Luk. Ahmed}? proposed to makeKto a powerof the machine radix by repeating some CORDICiterations.
`Delosme!*
`combined both methods above for minimizing the computational overhead. Deprettere et al.4
`suggested the double-shift concept for providing more freedom by optimizing CORDIC sequences.
`
`Wedesigned a computer program)> for a systematic search of CORDIC sequences with comparable
`
`) Whenreplacing [p/4] by [(p-1)/4] or [(p+1)/4], the upper boundin (25) becomes 2-P or 2-P-2,respectively.
`
`Page 6 of 15
`
`Page 6 of 15
`
`
`
`
`
`complexity for correcting K and found sequences requiring the same number [p/4] of additional iterations.
`We examinedall relevant combinations ofthe shift parameters S(i) (i=0,1,....n-1). In particular, we allowed
`a difference two between adjacentshifts, for example S(i)={0, 1, 3, 3, 3, ...}, such that much more CORDIC
`sequences could be tested. The resulting scaling factor K is corrected by a sequence of nx shift-add
`operations,
`
`ge TT[14n@27®) -K = 14+AK
`
`—(T(j)integers,
`
`(j)=+1) ,
`
`(26)
`
`j=l
`that are characterized by the set of parameters {T(0), n(1)T(1),... , 1(n,)T(n,)}. This results in a total
`number of L=n+n, iterations. We note that the remaining scaling error AK is a systematic error in contrast
`to the other two types of CORDICerrors, the angle resolution and the roundingerrors. So, it is particularly
`important
`to keep AK much smaller than 27, especially if repeated CORDIC operations have to be
`performed on the same data such as in SVD applications.
`
`In the following, for five different mantissa lengthes p=16, 20, 24, 28 and 32, examples of such
`CORDIC sequences, including the total numberofiterations L=n+n,, the remaining scaling error AK and
`the convergenceregionCare given.
`
`
`
`L=17+4, C=100°, AK=-271601 2)
`
`p=16:
`
`p=20:
`
`p=24:
`
`p=28 :
`
`p=32:
`
`{S@} = (0123 --.... 15 16} ,
`{n@T@} = {142-5 +9 410},
`{S@} = (0123 ...... 19 20} ,
`L=21+5, C=100°, AK=2~23-5
`{n@T@} = {1 +2-5 +9 +10 +16} ,
`{S@} = (1123345566788910 ---.-. 23 24},
`{n@TG)} = {0-2 +6},
`L=294+2, C=91°, AK=-2-2%13
`{S@} = (112334556678891011 1213141415 ...... 27 28},
`{n@T)} = {0-2 +6},
`L=3442, C=91°, AK=2732-53
`{S@} = {(001333456789910 ...... 31 32},
`{n@T@} = {1-3-8 +16 -25 -27},
`L=36+5, C=145°, AK~-2739-93
`In order to meet the convergence condition (22) and to provide a convergence region C2>90°, the minimum
`number of CORDICiterations for a mantissa length of p is p+1. So, for all CORDIC sequences given above,
`the number L-(p+1) of additional iterations for scaling correction is p/4. Moreover, except for the first
`CORDICsequence, the remaining scaling error AK is always significantly smaller than 27?.
`
`that similar results can also be obtained by using Deprettere's double-shift concept.
`We suspect
`However, this method requires a slightly increased hardware complexity and will not be discussed in this
`paper.
`
`2) When appendingan additional scaling iteration with (5)T(5)=+16, the scaling accuracy can be enhanced to AK=2-23.
`
`98 / SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989)
`
`
`Page 7 of 15
`
`Page 7 of 15
`
`
`
`5. CORDIC IMPLEMENTATIONS OF THE SVD PES
`
`Wefirst introduce a CORDIC processor symbol as shown in Figure 2. The descriptions inside the box
`determine uniquely the function modeof the processor. The output data x, y and z are assumed to bescaling
`corrected.
`
`
`
`The operations (7)-(10) and (11)-(14) of the TPR method can be mapped onto a two CORDIC
`processor architecture. In Figure 3, the diagonal PEs of the SVD
`array working in the generation mode are implemented by two
`CORDIC processors computing Cartesian to polar coordinates
`conversions and eight adders. The dotted inputs of the adders
`represent negated inputs. In Figure 4, the off-diagonal PEs of the
`SVD array working in the rotation mode are implemented by two
`CORDIC processors computing plane rotations and ten adders.
`Clearly, both CORDIC implementations have nearly the same
`architecture. All PEs of the SVD array can thus be implemented by
`one unified CORDIC SVD module (Figure 5) without considerably increased hardware cost. In this case, the
`SVD array depicted in Figure 1 becomes homogeneous. The different computation modes of the diagonal
`and off-diagonal PEs can be easily selected by one control bit.
`
`Figure 2. CORDIC symbol
`
`
`
`Weremark that Figure 5 is more a "graphic program" describing the sequence of operations to be
`computed rather than a hardware block diagram.In particular, the twelve adders that are paired into three
`pre-butterflys and three post-butterflys can be integrated into the two CORDIC processors without separate
`hardware implementations. Because of the recursive nature of the Jacobi SVD method, only recursive
`CORDICprocessors can be used. Each of them consists mainly of two barrel-shifters and three adders for
`computing the iterative shift-add operations (16) and (17). So,it is natural to modify the CORDICprocessor
`architecture slightly and to use the existing six adders contained in the two CORDIC processors for
`computing both, the pre-butterfly and the post-butterfly operations. The only computational overhead are
`two iterations resulting in a total computation time of L+2 iterations for one CORDIC SVD module. In
`Figure 6, the principal architecture of such a two CORDIC processor SVD module is shown, where the
`dashed lines and boxes represent the additional hardware components enabling the CORDIC processors to
`compute the butterfly operations. It can be easily verified that the upper four adders devoted to x and y can
`perform the following types of operations 2xt274y, xt2Sy (CORDICiteration) and xt2Sx (scaling
`iteration), while the lower two adders devoted to z can compute 2712,42712, and zta.
`
`Now, we compare the area and time complexity of the above CORDIC SVD module with that of the
`CORDIC SVD modules proposed by Cavallaro and Luk!®, Let Acsva and Tesyq denote the area and time
`complexity of a CORDIC SVD module and A, and T, those of a CORDIC processor, respectively.
`Cavallaro and Luk have shown thattheir most efficient parallel diagonalization method requires A,,yg=2A,
`
`SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989) / 99
`
`_
`
`Page 8 of 15
`
`Page 8 of 15
`
`
`
`a1
`
`Figure 3. CORDIC implementation of the
`diagonal PE of the SVD array
`
`an
`
`421
`
`822
`
`12
`
`26,
`
`
`
`421
`
`aN
`of the unified CORDIC SVD module
`
`Figure 4. CORDIC implementation of the
`off-diagonal PE of the SVD array
`
`post--butterflys
`pre—butterflys
`
`CateanCau,
`
`Figure 6. The principal architecture
`
`Figure 5. An unified CORDIC SVD module
`for implementing all PEs of the SVD array
`
`100 / SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989)
`
`Page9of 15
`
`Page 9 of 15
`
`
`
`and T.syg=3T, for the diagonal PEs and Agsyq=2A, and Tosyq=2T,, for the off-diagonal PEs. By using the
`TPR method, we need only Agsyg=2A, and T,syg=T, for all PEs. In other words, the time complexity is
`reduced by more than 50 percent.
`
`6.
`SOME CONVERGENCE CONSIDERATIONS
`Forsythe and Henrici> have proven the convergence of the Jacobi SVD method with the cyclic row
`ordering or the cyclic column ordering if the rotation angles 6; and 0, in (4) are restricted to a closed
`interval H inside the openinterval (-1/2, 7/2). They have also demonstrated that this condition may fail to
`hold, i.e. 6; and @, may have the values +n/2, if the off-diagonal elements by. and b2) in (5) have to be
`exactly annihilated. As a remedy, they suggested an under- or over-rotation by computing (4) with angles
`(1-¥)@, and (1-y)®, (-l<y<1) and proved its convergence. This causes complications for CORDIC
`implementations. Because no longerthe rotation angles @, and 62 generated according to (12) and (13) are
`used, an additional two-sided rotation (4) in the rotation mode with the modified angles (1—y)6, and (1—y)®2
`has to be performed in the diagonal PEs. However, the finite mantissa length in the real arithmetic allows
`only an approximative computation of the rotation angles and implies under- or over-rotations. In the
`practical applications, the Jacobi SVD method converges without using under- or Over-rotations as shown by
`the experimental results of Brent and Luk’.
`
`Recently, Luk and Park!” have shown that various parallel orderings, including the parallel ordering of
`Brent and Luk? for odd N,are equivalent to the cyclic orderings and share the same convergenceproperties.
`Although the parallel architectures for computing SVD for these parallel orderings may be different, the
`basic operations are always two-sided rotations (4). So, the TPR method can be applied to any parallel
`orderings.
`
`7. CONCLUSION
`
`In this paper, we have presented a novel algorithm for computing two-sided rotations requiring only
`two plane rotations and a few additions. This results in a reduced computational complexity of the Jacobi
`SVD method. Moreover, we have presented an unified CORDIC SVD module for implementing all PEs of
`the SVD array given by Brent and Luk. This leads to a homogeneous SVDarray that is simple, regular and
`offers twice the computational speed.
`
`REFERENCES
`
`1.
`
`Speiser, J.M., "Signal Processing Computational Needs", Proc. SPIE Advanced Algorithms and
`Architectures for Signal Processing I, vol.696, pp.2-6, 1986
`2. Brent, R.P. and Luk, F.T., "The Solution of Singular Value and Symmetric Eigenvalue Problems on
`Multiprocessor Arrays" SIAM J. Sci. Stat. Comput., vol.6, no.1, pp.69-84, 1985
`3. Brent, R.P., Luk, F.T. and Van Loan, C.F., "Computation of the Singular Value Decomposition Using
`
`&
`
`Page 10 of 15
`
`SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989) / 101
`
`
`
`
`
`Page 10 of 15
`
`
`
`Mesh-Connected Processors", Journal of VLSI and Computer Systems, vol.1, no.3, pp.242-270, 1985
`Kogbetliantz, E.G., "Solution of Linear Equations by Diagonalization of Coefficient Matrices", Quart.
`Appla. Math., vol.13, pp.123-132, 1955
`Forsythe, G.E. and Henrici, P., "The Cyclic Jacobi Method for Computing the Principal Values of a
`Complex Matrix", Trans. Amer. Math. Soc., vol.94, pp.1—23, 1960
`Golub, G.H. and Van Loan, C.F., Matrix Computations, pp.44, Johns Hopkins University Press,
`Baltimore, 1983
`Delosme, J.M., "A Processor for Two-dimensional Symmetric Eigenvalue and Singular Value Arrays",
`Proc. 21st Asilomar Conference on Circuits, Systems and Computers, Nov. 1987
`Volder, J.E.,
`"The CORDIC Trigonometric Computing Technique", JRE Trans. on Electronic
`Computers, vol.8, pp.330-334, 1959
`Walther, J.S., "An Unified Algorithm for Elementary Functions", Proc. Spring Joint Computer
`Conference, pp.379-385, 1971
`Cavallaro, J.R. and Luk, F.T., "CORDIC Arithmetic for an SVD Processor", Journal of Parallel and
`Distributed Computing, vol.5, pp.271-290, 1988
`Haviland, G.L. and Tuszynski, A.A., "A CORDIC Arithmetic Processor Chip", JEEE Trans. on
`Computers, vol.29, no.2, pp.68-79, 1980
`Ahmed, H.M., Signal Processing Algorithms and Architectures, Ph.D., Department of Electrical
`Engineering, Stanford University, Dec. 1981
`Delosme, J.M., "VLSI Implementation of Rotations in Pseudo-Euclidean Spaces", Proc. IEEE ICASSP,
`pp.927—930, Boston, 1983
`Deprettere, E.F., Dewilde, P. and Udo, R., "Pipeliend CORDIC Architectures for Fast VLSI Filtering
`and Array Processing", Proc. IEEE ICASSP, pp.41A.6.1-41A.6.4, San Diego, March 1984
`K6nig, D.,
`(in German) Optimierung des CORDIC-Verfahrens fir Prozessoren mit verschiedenen
`Datenformaten, Tech. Report, Lehrstuhl fiir Signaltheorie, Ruhr-Universitit Bochum, FRG, 1989
`Cavallaro, J.R. and Luk, F.T., "Architectures for a CORDIC SVD Processor", Proc. SPIE Real-Time
`Signal Processing IX, vol.698, pp.45-53, 1986
`Luk, F.T. and Park, H., "On the Equivalence and Convergence of Parallel Jacobi SVD Algorithms",
`Proc. SPIE Advanced Algorithms and Architectures for Signal Processing II, vol.826, pp.152-159,
`1987
`
`
`
`102 / SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989)
`
`Page 11 of 15 |
`
`
`
`10.
`
`11.
`
`12.
`
`13.
`
`14.
`
`15.
`
`16.
`
`17.
`
`Page 11 of 15
`
`
`
`PROCEEDINGS
`@SPIE—The International Society for Optical Engineering
`
`Franklin T. Luk
`Chair/Editor
`
`8-10 August 1989
`San Diego, California
`
`Advanced Algorithms
`and Architectures for
`Signal Processing IV
`
` Page 12 of 15
`
`
`
`Volume 1152
`
`;
`
`- a
`
`_
`
`Page 12 of 15
`
`
`
`
`
`PROCEEDINGS
`SPIE—TheInternational Society for Optical Engineering
`
`Advanced Algorithms
`‘and Architectures for
`Signal Processing [V
`
`Franklin T. Luk
`Chair/Editor
`
`8-10 August 1989
`San Diego, California
`
`Sponsored by
`SPIE—TheInternational Society for Optical Engineering
`
`Cooperating Organizations
`Applied Optics Laboratory/New Mexico State University
`Center for Applied Optics Studies/Rose-HulmanInstitute of Technology
`Center for Applied Optics/University of Alabama in Huntsville
`Center for Electro-Optics/University of Dayton
`Center for Excellence in Optical Data Processing/Carnegie Mellon University
`Jet Propulsion Laboratory/California Institute of Technology
`Optical Sciences Center/University of Arizona
`Optoelectronic Computing Systems Center/University of Colorado,
`Colorado State University
`
`Published by
`SPIE—TheInternational Society for Optical Engineering
`P.O. Box 10, Bellingham, Washington 98227-0010 USA
`Telephone 206/676-3290 (Pacific Time) @ Telex 46-7053
`
`a)
`
`PROCEEDINGS
`
`1
`Volume 1152
`
`
`
`SPIE (TheSociety of Photo-Optical Instrumentation Engineers) is a nonprofit society dedicated to advancing engineering
`andscientific applications of optical, electro-optical, and optoelectronic instrumentation, systems, and technology.
`
`Page 13 of 15
`
`Page 13 of 15
`
`
`
`
`
`ADVANCED ALGORITHMS AND ARCHITECTURES FOR SIGNAL PROCESSING IV
`
`Volume 1152
`
`CONTENTS
`
`1152-01
`
`SESSION 1
`1152-02
`
`1152-03
`
`1152-04
`
`1152-05
`
`1152-06
`
`1152-07
`
`SESSION 2
`1152-08
`
`1152-09
`
`1152-10
`
`1152-11
`
`1152-12
`
`1152-13
`
`1152-14
`
`1152-15
`
`1152-16
`
`Conference Committee ... 020 s0cccccccccces cu cccce nna ste e ened eee e eens gue e eee reece rene reemerneaees vi
`THtrOduction . ccs sas 6.6 dec dative eu epie esis corspeeeree ereewin osnn cia aieis ae ¥i4i6.a'4.6 5a a9 8 CWS RIS eypgiael arene ereece «oe vii
`
`KEYNOTE ADDRESS
`Algorithmic engineering: an emerging discipline
`J. G. McWhirter, Royal Signals and Radar Establishment (UK). ...-.6- 0+ secre e eset eee e entre eee e ence eens Z
`
`ADAPTIVE FILTERING
`Chaotic model of sea clutter using a neural network
`S. Haykin, H. Leung, McMaster Univ. (Canada). .....- 6. esse cece cece eee e tence nee nee e cence ence ee cees 18
`Multichannelfast transversalfilter algorithms for adaptive broadband beamforming
`D. T.M.Slock, Philips Research Lab. (Belgium); T. Kailath, Stanford Univ: nhs ccs savewwewueses soe este eee 22
`Data equalization using highly nonlinear adaptive architectures
`C. F. N. Cowan,G.J. Gibson, S. Siu, Edinburgh Univ. (UK). ..... esses cece eee eee eee ener eee e ee enes 34
`VLSI architectures for square root covariance Kalmanfiltering
`F. M.Gaston,G. Irwin, Queen’s Univ. (UK). 00... cece eee eee cece e eee ene ee ene ene e ete nee ene eeees 44
`QR-decomposition basedlattice filter algorithms
`L K.Proudler, J. G. McWhirter, T. J. Shepherd, Royal Signals and Radar Establishment CUR). ss scosessicce ma areaeracs 6 56
`Parallel weight extraction by a systolic least squares algorithm
`J. E. Hudson, STC Technology Ltd. (UK); T. J. Shepherd, RoyalSignals and RadarEstablishment (UK). ....... 68
`
`NUMERICAL TECHNIQUES
`Updating singular value decompositions: a parallel implementation
`M. Moonen,Catholic Univ. of Leuven (Belgium); P. Van Dooren,Philips Research Lab. (Belgium);
`J. Vandewalle, Catholic Univ. of Leuven (Belgium)...........0e sees cence eee ener nen e ener ene e een e ene ees 80
`Reducing the computations of the singular value decomposition array given by Brent and Luk
`B. Yang, J. F. Bohme, Ruhr-Univ. (FRG). 00.6.0 e cece cece renee eee e eee een n ene teen ease cn se ences eenees 92
`Lineararray for covariance differencing via hyperbolic singular value decomposition
`A. W.Bojanczyk, A. O. Steinhardt, Cornell Univ... ..... 0. eee e cece eee e ener n ener e nen en cnr n eee e nee. 103
`Transputer implementation of system identification and control using singular value decompositions
`W. E. Larimore, D. Martin, Computational Engineering, Inc.; F. T. Luk, Cornell Univ. .....--++++s+sseeeees 108
`Jacobi-like algorithm for computing the generalized Schur form of a regular pencil
`J.P. Charlier, P. Van Dooren,Philips Research Lab. (Belgium). ......0--0 sss seer eetee etre tree e eres ee ees 119
`CORDICalgorithms: theory and extensions
`JicML Delosmee; Yale Unive. g:o:5 sree aces sais oie oie owt eiore con nin nishsiafoiatn gesesitin teins tivie/#ntefe )cleke olerele/ se elese.sisiece 131
`Asymptotic analysis of the total least squares ESPRIT algorithm
`B. E. Ottersten, Stanford Univ.; M. Viberg, Stanford Univ. and Linkdping Univ. (Sweden); T. Kailath,
`Stanford Univ. ....ccccccccccccvvcccccccveccccnssececcsceteeseseccrerewseeeereeeeceeeeeeseseeees 146
`Analysis of the constrained total least squares method andits relation to harmonic super-resolution
`T. Abatzoglou, Lockheed Palo Alto Research Lab. ......0ssseeee essen eee e nese nese nett nent seen eeennes 158
`Separation of sources using higher order cumulants
`P. Comon, Thomson SINTRA (France). 0.0... sec cece cece cence nee ene n eee n tee ee sees eee eee eee ences 170
`W
`i
`.
`4 tar ARft
`(continued)
`bbevt*
`&
`eemEt
`“epA BA-= yo.
`Tiss7-SeoNV¥dvancedAlgorithms andArchitectures forSignalProcessingIV(1989) / iii
`
`Page 14 of 15
`
`Page 14 of 15
`
`
`
`
`
`
`
`“AKSIOVS
`sASVSR
`1a
`
`
`
`
`
`The papers appearingin this book comprise the proceedings of the meeting mentioned on
`the cover andtitle page. They reflect the authors’ opinions and are published as presented
`and without change,in the interests of timely dissemination. Their inclusion in this
`publication does not necessarily constitute endorsement by the editors or by SPIE.
`
`Please use the following formatto cite material from this book:
`Author(s), “Title of Paper,” Advanced Algorithms and Architectures for Signal
`Processing IV, Franklin T. Luk, Editor, Proc. SPIE 1152, page numbers(1989).
`
`Library of Congress Catalog Card No. 89-43265
`ISBN 0-8194-0188-9
`
`Copyright © 1989,The Society of Photo-Optical Instrumentation Engineers.
`Copying of material in this bookfor sale orfor internal or personal use beyondthefair use provisions granted by
`the U.S. Copyright Law is subject to paymentof copying fees. The Transactional Reporting Service basefee for
`this volume is $2.00 perarticle and should be paid directly to Copyright Clearance Center, 27 CongressStreet,
`Salem, MA 01970.For those organizations that have been granted a photocopylicense by CCC, a separate
`system of payment has been arranged. The fee code for users of the Transactional Reporting Service is
`0-8194-0188-9/89/$2.00.
`‘
`
`Individual readers of this book and nonprofitlibraries acting for them are permitted to makefair use of the
`materialin it, such as to copy anarticle for teaching or research, without paymentof a fee. Republication or
`systematic or multiple reproduction of any material in this book (including abstracts) is prohibited except with
`the permission of SPIE and oneof the authors.
`,
`
`Permission is granted to quote excerpts from articles in this book in other scientific or technical works with
`acknowledgmentofthe source, including the author's name,thetitle of the book, SPIE volume number, page
`number(s), and year. Reproduction of figures and tablesis likewise permitted in other articles and books
`provided that the same acknowledgmentof the source is printed with the