`
`
`
Reducing the computations of the SVD array given by Brent and Luk

B. Yang and J.F. Böhme

Ruhr-Universität, Department of Electrical Engineering
4630 Bochum, West Germany
`
`ABSTRACT
`
A new, efficient two plane rotations (TPR) method for computing the two-sided rotations involved in singular value decomposition (SVD) is presented. By exploiting the commutative properties of some special types of 2x2 matrices, we show that a two-sided rotation can be computed by only two plane rotations and a few additions. Moreover, if we use coordinate rotation digital computer (CORDIC) processors to implement the processing elements (PEs) of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle calculations can be avoided. The resulting SVD array has a homogeneous structure with identical diagonal and off-diagonal PEs.
`
`1.
`
`INTRODUCTION
`
One important problem in linear algebra and digital signal processing is the SVD. Particular applications arise in beamforming and direction finding, spectrum analysis, digital image processing etc.1 Recently, there has been massive interest in parallel architectures for computing the SVD due to the growing importance of real-time signal processing and advances in VLSI devices. Brent and Luk2,3 have shown how a Jacobi method with parallel ordering can efficiently compute the SVD, and how the method can be implemented by a two-dimensional systolic array. The method is based, as is common for all two-sided approaches, on applying a sequence of two-sided rotations to 2x2 submatrices of the original matrix. The computational complexity is thus determined by how the two-sided rotations are computed.
`
Usually, a two-sided rotation is realized by four plane rotations: two of them are applied from the left to the two column vectors of the 2x2 submatrix, and the other two are applied from the right to the row vectors. For the diagonal PEs of the SVD array, additional operations for calculating the rotation angles are required. This leads to an inhomogeneous array architecture containing two different types of PEs. In this paper, we develop a new TPR method for computing two-sided rotations. We show that the above computational complexity is reduced significantly, and that the SVD array becomes homogeneous when using CORDIC processors.
`
The paper is organized as follows. In section 2, we briefly review the Jacobi SVD method and the SVD array. Then, we develop the TPR method in section 3. The CORDIC algorithm is described in section 4. Different techniques for scaling correction are discussed, and examples of scaling-corrected CORDIC sequences for different data formats are given. In section 5, a unified CORDIC implementation of all PEs of the SVD array is presented. Finally, some convergence aspects are discussed in section 6.

92 / SPIE Vol. 1152 Advanced Algorithms and Architectures for Signal Processing IV (1989)

SAMSUNG EXHIBIT 1011
`
2. JACOBI SVD METHOD

The SVD of a matrix M ∈ ℝ^{N×N} is given by

M = U Σ V^T ,  (1)

where U ∈ ℝ^{N×N} and V ∈ ℝ^{N×N} are orthogonal matrices and Σ ∈ ℝ^{N×N} is a diagonal matrix of singular values. Based on an extension of the Jacobi eigenvalue algorithm, Kogbetliantz4 as well as Forsythe and Henrici5 proposed to diagonalize M by a sequence of two-sided rotations,

M_0 = M ,   M_{k+1} = U_k^T M_k V_k   (k = 0, 1, 2, ...) .  (2)

U_k and V_k represent rotations in the (i,j)-plane (1 ≤ i < j ≤ N). The rotation angles are chosen to annihilate the elements of M_k at the positions (i,j) and (j,i). Usually, several sweeps are necessary to complete the SVD, where a sweep is a sequence of N(N-1)/2 two-sided rotations according to a special ordering of the N(N-1)/2 different index pairs (i,j). Unfortunately, the best known cyclic orderings, namely the cyclic row ordering

(i,j) = (1,2), ..., (1,N), (2,3), ..., (2,N), ..., (N-1,N)  (3)

and the equivalent cyclic column ordering, are not suitable for parallel computations.
`
Recently, Brent and Luk2,3 suggested a parallel ordering allowing a high degree of parallelism. It enables the use of a square systolic array with ⌈N/2⌉ × ⌈N/2⌉ PEs to implement the Jacobi SVD method (Figure 1). For doing this, the matrix M is partitioned into 2x2 submatrices. Each PE contains one submatrix and performs a two-sided rotation

B = R(θ1)^T A R(θ2) ,  (4)

where

A = [a11 a12; a21 a22]  and  B = [b11 b12; b21 b22]  (5)

denote the input and output matrices, and

R(θ) = [cos θ  sin θ; -sin θ  cos θ]  (6)

describes a plane rotation through the angle θ, respectively.

[Figure 1. The SVD array given by Brent and Luk]

Notice that there are two different computation modes. In a diagonal PE, the rotation angles θ1 and θ2 are generated to diagonalize the stored 2x2 submatrix (b12 = b21 = 0). We call this the generation
`
`
`
`
`
`
`
`
mode. Then, the rotation angles are propagated to all off-diagonal PEs in the same row and the same column, in which pure rotations (4) with the received angles are performed. We call this the rotation mode. Clearly, if we directly compute (4) in the rotation mode, we require four plane rotations. For the generation mode, additional operations for calculating θ1 and θ2 are needed.
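The two computation modes can be made concrete with a minimal NumPy sketch (ours, not from the paper): `two_sided_rotation_direct` is the conventional rotation mode computed as four plane rotations, and `generation_angles` is a hypothetical helper that picks θ1 and θ2 so that the result comes out diagonal, using the sum and difference angles derived in section 3.

```python
import numpy as np

def R(theta):
    """Plane rotation matrix as defined in (6)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def two_sided_rotation_direct(A, theta1, theta2):
    """Rotation mode, computed directly as B = R(theta1)^T A R(theta2):
    two plane rotations applied to the columns from the left and two
    applied to the rows from the right."""
    return R(theta1).T @ A @ R(theta2)

def generation_angles(A):
    """Generation mode: angles that annihilate b12 and b21.
    Hypothetical helper; arctan2 is used here instead of the paper's
    smaller-angle arctan, so the branch choice may differ."""
    (a11, a12), (a21, a22) = A
    th_plus = np.arctan2(a21 + a12, a22 - a11)   # theta1 + theta2
    th_minus = np.arctan2(a21 - a12, a22 + a11)  # theta2 - theta1
    return (th_plus - th_minus) / 2, (th_plus + th_minus) / 2
```

Feeding the generated angles back into the direct computation yields a diagonal 2x2 block whose determinant equals that of the input.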
`
`~—-—%———
`3. TPR METHOD FOR COMPUTING TWO—SIDED ROTATIONS
`
`In order to develop the TPR method for computing two-sided rotations more efficiently, we first
`discuss the commutative properties of two special types, the rotation-type and the reflection—type, of 2x2
`matrices, namely,
`
`flr°f={ [: j] ‘c,se|R}.
`and
`‘c,selR}
`flr°[={ [‘2 z]
`Obviously, the plane rotation matrix and the Givens reflection matrix6 with cZ+s2=1 are two elements of the
`sets J! m and .1! M, respectively. The following results can be shown by elementary manipulations.
`
Lemma 1. If A1 ∈ M_rot and A2 ∈ M_rot, then A1 A2 = A2 A1 ∈ M_rot.

Lemma 2. If A1 ∈ M_ref and A2 ∈ M_rot, then A1 A2 = A2^T A1 ∈ M_ref.

In particular, if we consider two plane rotations, we know:

Lemma 3. If R(θ1) and R(θ2) are plane rotations described by (6), then R(θ1) R(θ2) = R(θ1+θ2) and R(θ1)^T R(θ2) = R(θ2-θ1).
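These identities are easy to confirm numerically. The sketch below (our own naming: `rot_type`, `ref_type`) checks Lemmas 1-3 on random parameters; note that c² + s² = 1 is not required for the commutation properties.

```python
import numpy as np

rng = np.random.default_rng(0)

def rot_type(c, s):
    """Rotation-type matrix [c s; -s c]."""
    return np.array([[c, s], [-s, c]])

def ref_type(c, s):
    """Reflection-type matrix [c s; s -c]."""
    return np.array([[c, s], [s, -c]])

A1 = rot_type(*rng.standard_normal(2))
A2 = rot_type(*rng.standard_normal(2))
F = ref_type(*rng.standard_normal(2))

# Lemma 1: rotation-type matrices commute.
assert np.allclose(A1 @ A2, A2 @ A1)

# Lemma 2: a rotation-type factor passes through a reflection-type
# factor at the cost of a transposition: F A = A^T F.
assert np.allclose(F @ A1, A1.T @ F)

# Lemma 3: composition of plane rotations (6).
def R(t):
    return rot_type(np.cos(t), np.sin(t))

t1, t2 = rng.standard_normal(2)
assert np.allclose(R(t1) @ R(t2), R(t1 + t2))
assert np.allclose(R(t1).T @ R(t2), R(t2 - t1))
```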
`
Now, we give a theorem describing the rotation mode of the TPR method.

Theorem. If the 2x2 matrix A and the two rotation angles θ1 and θ2 are given, then the two-sided rotation (4) can be computed by two plane rotations, ten additions and four scalings by 1/2:

p1 = (a22 + a11)/2 ,  q1 = (a21 - a12)/2 ,  p2 = (a22 - a11)/2 ,  q2 = (a21 + a12)/2 ,  (7)

θ− = θ2 - θ1 ,  θ+ = θ2 + θ1 ,  (8)

(r1, t1)^T = R(θ−) (p1, q1)^T ,  (r2, t2)^T = R(θ+) (p2, q2)^T ,  (9)

b11 = r1 - r2 ,  b12 = -t1 + t2 ,  b21 = t1 + t2 ,  b22 = r1 + r2 .  (10)

Proof. Using (7), the input matrix A can be reformulated as

A = A1 + A2 = [p1  -q1; q1  p1] + [-p2  q2; q2  p2] .

Clearly, R(θ1), R(θ2) in (4) and A1 are elements of M_rot, while A2 belongs to M_ref. This leads to the following expression for the output matrix B by using Lemmas 1-3:
`
`
`
`
B = R(θ1)^T A R(θ2) = R(θ1)^T A1 R(θ2) + R(θ1)^T A2 R(θ2)
  = R(θ1)^T R(θ2) A1 + R(θ1)^T R(θ2)^T A2 = R(θ2-θ1) A1 + R(θ2+θ1)^T A2
  = R(θ−) [p1  -q1; q1  p1] + R(θ+)^T [-p2  q2; q2  p2]
  = [r1  -t1; t1  r1] + [-r2  t2; t2  r2] .

This completes the proof.
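As a numerical sanity check, the rotation mode of the theorem can be coded directly; this sketch (our `tpr_rotate`) follows (7)-(10) and should agree with the four-rotation product R(θ1)^T A R(θ2) to machine precision.

```python
import numpy as np

def R(theta):
    """Plane rotation (6)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def tpr_rotate(A, theta1, theta2):
    """Two-sided rotation B = R(theta1)^T A R(theta2) computed by the
    TPR method: two plane rotations, ten additions, four scalings by 1/2."""
    (a11, a12), (a21, a22) = A
    p1, q1 = (a22 + a11) / 2, (a21 - a12) / 2             # (7)
    p2, q2 = (a22 - a11) / 2, (a21 + a12) / 2
    th_minus, th_plus = theta2 - theta1, theta2 + theta1  # (8)
    r1, t1 = R(th_minus) @ (p1, q1)                       # (9)
    r2, t2 = R(th_plus) @ (p2, q2)
    return np.array([[r1 - r2, -t1 + t2],                 # (10)
                     [t1 + t2, r1 + r2]])
```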
`
A direct consequence of the above theorem shows how a two-sided rotation in the generation mode can be computed in a similar way.

Corollary. If the 2x2 matrix A is given, we can diagonalize A and calculate the corresponding rotation angles θ1 and θ2 by two Cartesian to polar coordinates conversions, eight additions and four scalings by 1/2:

p1 = (a22 + a11)/2 ,  q1 = (a21 - a12)/2 ,  p2 = (a22 - a11)/2 ,  q2 = (a21 + a12)/2 ,  (11)

r1 = sign(p1) √(p1² + q1²) ,  θ− = arctan(q1/p1) ,  r2 = sign(p2) √(p2² + q2²) ,  θ+ = arctan(q2/p2) ,  (12)

θ1 = (θ+ - θ−)/2 ,  θ2 = (θ+ + θ−)/2 ,  (13)

b11 = r1 - r2 ,  b22 = r1 + r2 .  (14)

Proof. Regarding equation (10), t1 = t2 = 0 follows directly from b12 = b21 = 0.
`
In equation (12), we choose the rotation through the smaller angle. All vectors lying in the first or the fourth quadrant are rotated to the positive x-axis, and all vectors lying in the second or the third quadrant are rotated to the negative x-axis. For vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation angles θ− and θ+ satisfy |θ−|, |θ+| ≤ π/2. This results in

|θ1| ≤ π/2  and  |θ2| ≤ π/2  (15)

due to (13).
`
We remark that Delosme7 has also proposed to use (14) for computing b11 and b22 in the generation mode. However, he did not recognize that this approach is generally possible because of the commutative properties of the rotation-type and reflection-type matrices. So, he still requires four plane rotations for computing a two-sided rotation in the rotation mode.
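The generation mode of the corollary can likewise be sketched in a few lines (our `tpr_generate`; the plain arctan assumes p1 ≠ 0 and p2 ≠ 0, matching the smaller-angle convention of (12)):

```python
import numpy as np

def R(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def tpr_generate(A):
    """Diagonalize the 2x2 matrix A per (11)-(14): two Cartesian to
    polar conversions, eight additions, four scalings by 1/2.
    Assumes p1 != 0 and p2 != 0 (otherwise the direction is arbitrary)."""
    (a11, a12), (a21, a22) = A
    p1, q1 = (a22 + a11) / 2, (a21 - a12) / 2       # (11)
    p2, q2 = (a22 - a11) / 2, (a21 + a12) / 2
    r1 = np.sign(p1) * np.hypot(p1, q1)             # (12): signed norms,
    r2 = np.sign(p2) * np.hypot(p2, q2)             # smaller-angle arctan
    th_minus, th_plus = np.arctan(q1 / p1), np.arctan(q2 / p2)
    theta1 = (th_plus - th_minus) / 2               # (13)
    theta2 = (th_plus + th_minus) / 2
    B = np.array([[r1 - r2, 0.0], [0.0, r1 + r2]])  # (14)
    return theta1, theta2, B
```

Applying the returned angles in the rotation mode reproduces the diagonal matrix B, and both angles stay within the bound (15).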
`
4. THE CORDIC ALGORITHM

In the previous section, we have seen that the main operations of the TPR method are plane rotations and Cartesian to polar coordinates conversions. Although these operations can be carried out by multiplier-adder based processors supported by software or special hardware units, a simpler approach is the use of dedicated processors that map algorithms more effectively to hardware. The CORDIC processor is a powerful one for calculating these functions.

The CORDIC algorithm was originally designed by Volder8 as an iterative procedure for computing plane rotations and Cartesian to polar coordinates conversions. It was later generalized and unified by Walther9, enabling a CORDIC processor to calculate more functions, including hyperbolic functions as well as multiplications and divisions, in a similar way. In the following, only Volder's CORDIC algorithm is considered for the sake of simplicity, because only trigonometric functions are involved in SVD applications.
`
The CORDIC algorithm consists of iterative shift-add operations on a three-component vector,

(x_{i+1}, y_{i+1})^T = (x_i - σ_i δ_i y_i , y_i + σ_i δ_i x_i)^T = k_i [cos α_i  -σ_i sin α_i; σ_i sin α_i  cos α_i] (x_i, y_i)^T ,  (16)

z_{i+1} = z_i - ε σ_i α_i   (i = 0, 1, ..., n-1) ,  (17)

in which the iteration stepsize 0 < δ_i ≤ 1 is defined by

δ_i = tan(α_i) = 2^{-S(i)} .  (18)

The set of integers {S(i)} is called the CORDIC sequence. Equation (16) can be interpreted, except for a scaling factor of

k_i = 1/cos(α_i) = √(1 + δ_i²) ,  (19)

as a rotation of (x_i, y_i)^T through the angle α_i, where σ_i = ±1 gives the rotation direction. After n iterations of (16) and (17), the resulting data are given by

(x_n, y_n)^T = K [cos α  -sin α; sin α  cos α] (x_0, y_0)^T ,  (20)

z_n = z_0 - ε α  (21)

with the overall scaling factor K = ∏_{i=0}^{n-1} k_i and the cumulative rotation angle α = Σ_{i=0}^{n-1} σ_i α_i. Now, if the CORDIC sequence satisfies the convergence condition

α_i - Σ_{j=i+1}^{n-1} α_j ≤ α_{n-1}   (i = 0, 1, ..., n-2) ,  (22)

we can choose σ_i = -sign(x_i y_i) or σ_i = sign(ε z_i) to force y_n or z_n to zero, provided that the input data x_0, y_0 and z_0 lie in the convergence region C = Σ_{i=0}^{n-1} α_i:

|arctan(y_0/x_0)| ≤ C  for y_n → 0 ,   |z_0| ≤ C  for z_n → 0 .  (23)
`
`
`
`
The resulting CORDIC functions are summarized in Table 1, where ε = ±1 is a sign parameter.

Table 1: CORDIC trigonometric functions

z_n → 0 :  x_n = K (x_0 cos z_0 - ε y_0 sin z_0) ,  y_n = K (ε x_0 sin z_0 + y_0 cos z_0)
y_n → 0 :  x_n = K sign(x_0) √(x_0² + y_0²) ,  z_n = z_0 + ε arctan(y_0/x_0)
`
In Table 1, only the principal values |arctan(y_0/x_0)| ≤ π/2 are calculated when computing Cartesian to polar coordinates conversions. Correspondingly, x_n may be the positive or the negative vector norm according to the sign of x_0. These properties for computing (12) are enabled by the particular choice of σ_i = -sign(x_i y_i), in contrast to σ_i = -sign(y_i) in most other CORDIC publications. So, it is guaranteed that a vector is always rotated through the smaller angle to the x-axis. In this case, a convergence region of C ≥ 90° is sufficient for the generation mode.
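For illustration, here is a floating-point model of the algorithm (ours; a hardware CORDIC uses fixed-point shift-adds, and K would be corrected by the shift-add iterations discussed below rather than by a division):

```python
import numpy as np

def cordic(x, y, z, mode, p=16):
    """Volder CORDIC, iterations (16)-(17) with the traditional
    sequence S(i) = 0, 1, ..., p and epsilon = +1. mode='vector'
    drives y -> 0 with sigma_i = -sign(x_i y_i) (rotation through the
    smaller angle); mode='rotate' drives z -> 0."""
    K = 1.0
    for s in range(p + 1):
        delta = 2.0 ** (-s)                  # (18): delta_i = 2^-S(i)
        alpha = np.arctan(delta)
        K *= np.sqrt(1.0 + delta * delta)    # (19): k_i
        if mode == 'vector':
            sigma = -1.0 if x * y > 0 else 1.0
        else:
            sigma = 1.0 if z > 0 else -1.0
        x, y = x - sigma * delta * y, y + sigma * delta * x
        z = z - sigma * alpha
    return x / K, y / K, z                   # divide out K for clarity

# Cartesian to polar: (3, 4) -> roughly (5, 0, arctan(4/3))
xn, yn, zn = cordic(3.0, 4.0, 0.0, mode='vector')
```

The residual in y and z is bounded by the angle resolution of the last iteration, about 2^-16 here.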
`
A disadvantage of the CORDIC algorithm is the need to correct the scaling factor K. The traditional CORDIC sequence

{S(i)} = {0, 1, 2, 3, ..., p-1, p}  (24)

with n = p+1 CORDIC iterations for a mantissa length p delivers, for example, a scaling factor of K ≈ 1.64676 for p = 16. Correcting the scaling by multiplying x_n and y_n by K^{-1} would degrade the algorithm performance significantly. We would like to compensate the scaling effect by using a minimum number of shift-add operations.
`
For a two-sided rotation that is implemented by four plane rotations, each matrix element undergoes two plane rotations, so that the total scaling factor to be corrected is K². In this case, Cavallaro and Luk10 pointed out that there is a simple systematic approach for scaling correction if the traditional CORDIC sequence (24) is used. They proposed to use ⌈p/4⌉ scaling iterations of the type x ← x - 2^{-2j} x with j ∈ J = {1, 3, 5, ..., 2⌈p/4⌉-1} and one shift operation 2^{-1}. The remaining scaling error is bounded as follows1):

| 2^{-1} ∏_{j∈J} (1 - 2^{-2j}) · K² - 1 | = | 1 - 2^{-1} ∏_{j∈J} (1 - 2^{-2j}) ∏_{i=0}^{p} (1 + 2^{-2i}) | < 2^{-p-1} .  (25)
`
This approach, however, fails in the TPR method. Here, each matrix element undergoes only one plane rotation. So, the scaling factor to be corrected is K. For solving this more difficult problem, different techniques have been developed. Haviland and Tuszynski11 used similar scaling iterations as Cavallaro and Luk. Ahmed12 proposed to make K a power of the machine radix by repeating some CORDIC iterations. Delosme13 combined both methods above for minimizing the computational overhead. Deprettere et al.14 suggested the double-shift concept, providing more freedom for optimizing CORDIC sequences.

We designed a computer program15 for a systematic search of CORDIC sequences with comparable

1) When replacing ⌈p/4⌉ by ⌈(p-1)/4⌉ or ⌈(p+1)/4⌉, the upper bound in (25) becomes 2^{-p} or 2^{-p-2}, respectively.
`
`
`
`
`
`
`
`
complexity for correcting K, and found sequences requiring the same number ⌈p/4⌉ of additional iterations. We examined all relevant combinations of the shift parameters S(i) (i = 0, 1, ..., n-1). In particular, we allowed a difference of two between adjacent shifts, for example {S(i)} = {0, 1, 3, 3, 3, ...}, so that many more CORDIC sequences could be tested. The resulting scaling factor K is corrected by a sequence of n_k shift-add operations,

2^{-T(0)} ∏_{j=1}^{n_k} [ 1 + η(j) 2^{-T(j)} ] · K = 1 + ΔK   (T(j) integers, η(j) = ±1) ,  (26)

that are characterized by the set of parameters {T(0), η(1)T(1), ..., η(n_k)T(n_k)}. This results in a total number of L = n + n_k iterations. We note that the remaining scaling error ΔK is a systematic error, in contrast to the other two types of CORDIC errors, the angle resolution and the rounding errors. So, it is particularly important to keep ΔK much smaller than 2^{-p}, especially if repeated CORDIC operations have to be performed on the same data, such as in SVD applications.

In the following, for five different mantissa lengths p = 16, 20, 24, 28 and 32, examples of such CORDIC sequences are given, including the total number of iterations L = n + n_k, the remaining scaling error ΔK and the convergence region C.
`
p=16:  {S(i)} = {0 1 2 3 ... 15 16} ,
       {η(j)T(j)} = {1 +2 -5 +9 +10} ,
       L = 17+4 ,  C = 100° ,  ΔK = -2^{-16.01} 2)

p=20:  {S(i)} = {0 1 2 3 ... 19 20} ,
       {η(j)T(j)} = {1 +2 -5 +9 +10 +16} ,
       L = 21+5 ,  C = 100° ,  ΔK = 2^{-23.05}

p=24:  {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 ... 23 24} ,
       {η(j)T(j)} = {0 -2 +6} ,
       L = 29+2 ,  C = 91° ,  ΔK = -2^{-29.13}

p=28:  {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 11 12 13 14 14 15 ... 27 28} ,
       {η(j)T(j)} = {0 -2 +6} ,
       L = 34+2 ,  C = 91° ,  ΔK = 2^{-32.53}

p=32:  {S(i)} = {0 0 1 3 3 3 4 5 6 7 8 9 9 10 ... 31 32} ,
       {η(j)T(j)} = {1 -3 -8 +16 -25 -27} ,
       L = 36+5 ,  C = 145° ,  ΔK = -2^{-39.93}
`
In order to meet the convergence condition (22) and to provide a convergence region C ≥ 90°, the minimum number of CORDIC iterations for a mantissa length of p is p+1. So, for all CORDIC sequences given above, the number L - (p+1) of additional iterations for scaling correction is p/4. Moreover, except for the first CORDIC sequence, the remaining scaling error ΔK is always significantly smaller than 2^{-p}.
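As a check on the first example, the sketch below evaluates (26) for the p = 16 sequence with the correction parameters {T(0) = 1, η(j)T(j) = +2, -5, +9, +10}; the remaining error comes out near -2^{-16}, and appending the extra iteration +16 of footnote 2 pushes it to about 2^{-23}.

```python
import math

K = 1.0                                       # scaling factor of {S(i)} = {0, ..., 16}
for s in range(17):
    K *= math.sqrt(1.0 + 4.0 ** (-s))
corr = 2.0 ** (-1)                            # T(0) = 1
for eT in (+2, -5, +9, +10):                  # eta(j) and T(j) packed as signed shifts
    corr *= 1.0 + math.copysign(2.0 ** (-abs(eT)), eT)
dK = corr * K - 1.0                           # remaining scaling error of (26)
assert abs(dK) < 2.0 ** (-15) and dK < 0.0
```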
`
We suspect that similar results can also be obtained by using Deprettere's double-shift concept. However, this method requires a slightly increased hardware complexity and will not be discussed in this paper.
`
`
2) When appending an additional scaling iteration with η(5)T(5) = +16, the scaling accuracy can be enhanced to ΔK = 2^{-23}.
`
`
`
`
5. CORDIC IMPLEMENTATIONS OF THE SVD PEs

We first introduce a CORDIC processor symbol as shown in Figure 2. The descriptions inside the box uniquely determine the function mode of the processor. The output data x, y and z are assumed to be scaling corrected.

[Figure 2. CORDIC symbol]

The operations (7)-(10) and (11)-(14) of the TPR method can be mapped onto a two-CORDIC-processor architecture. In Figure 3, the diagonal PEs of the SVD array working in the generation mode are implemented by two CORDIC processors computing Cartesian to polar coordinates conversions and eight adders. The dotted inputs of the adders represent negated inputs. In Figure 4, the off-diagonal PEs of the SVD array working in the rotation mode are implemented by two CORDIC processors computing plane rotations and ten adders. Clearly, both CORDIC implementations have nearly the same architecture. All PEs of the SVD array can thus be implemented by one unified CORDIC SVD module (Figure 5) without considerably increased hardware cost. In this case, the SVD array depicted in Figure 1 becomes homogeneous. The different computation modes of the diagonal and off-diagonal PEs can be easily selected by one control bit.
`
We remark that Figure 5 is more a "graphic program" describing the sequence of operations to be computed than a hardware block diagram. In particular, the twelve adders that are paired into three pre-butterflies and three post-butterflies can be integrated into the two CORDIC processors without separate hardware implementations. Because of the recursive nature of the Jacobi SVD method, only recursive CORDIC processors can be used. Each of them consists mainly of two barrel shifters and three adders for computing the iterative shift-add operations (16) and (17). So, it is natural to modify the CORDIC processor architecture slightly and to use the existing six adders contained in the two CORDIC processors for computing both the pre-butterfly and the post-butterfly operations. The only computational overhead is two iterations, resulting in a total computation time of L+2 iterations for one CORDIC SVD module. In Figure 6, the principal architecture of such a two-CORDIC-processor SVD module is shown, where the dashed lines and boxes represent the additional hardware components enabling the CORDIC processors to compute the butterfly operations. It can be easily verified that the upper four adders devoted to x and y can perform the following types of operations: 2^{-1}x ± 2^{-1}y, x ± 2^{-s}y (CORDIC iteration) and x ± 2^{-s}x (scaling iteration), while the lower two adders devoted to z can compute 2^{-1}z1 ± 2^{-1}z2 and z ± α_i.

Now, we compare the area and time complexity of the above CORDIC SVD module with that of the CORDIC SVD modules proposed by Cavallaro and Luk16. Let A_csvd and T_csvd denote the area and time complexity of a CORDIC SVD module, and A_c and T_c those of a CORDIC processor, respectively. Cavallaro and Luk have shown that their most efficient parallel diagonalization method requires A_csvd ≈ 2A_c
`
`
`
`
`
`
`
`
[Figure 3. CORDIC implementation of the diagonal PE of the SVD array]

[Figure 4. CORDIC implementation of the off-diagonal PE of the SVD array]

[Figure 5. A unified CORDIC SVD module for implementing all PEs of the SVD array]

[Figure 6. The principal architecture of the unified CORDIC SVD module]
`
`
`
and T_csvd = 3T_c for the diagonal PEs, and A_csvd ≈ 2A_c and T_csvd = 2T_c for the off-diagonal PEs. By using the TPR method, we need only A_csvd ≈ 2A_c and T_csvd = T_c for all PEs. In other words, the time complexity is reduced by more than 50 percent.
`
`6.
`
`SOME CONVERGENCE CONSIDERATIONS
`
Forsythe and Henrici5 have proven the convergence of the Jacobi SVD method with the cyclic row ordering or the cyclic column ordering if the rotation angles θ1 and θ2 in (4) are restricted to a closed interval H inside the open interval (-π/2, π/2). They have also demonstrated that this condition may fail to hold, i.e. θ1 and θ2 may take the values ±π/2, if the off-diagonal elements b12 and b21 in (5) have to be exactly annihilated. As a remedy, they suggested an under- or over-rotation by computing (4) with the angles (1-γ)θ1 and (1-γ)θ2 (-1 < γ < 1) and proved its convergence. This causes complications for CORDIC implementations: because the rotation angles θ1 and θ2 generated according to (12) and (13) are no longer used, an additional two-sided rotation (4) in the rotation mode with the modified angles (1-γ)θ1 and (1-γ)θ2 has to be performed in the diagonal PEs. However, the finite mantissa length of the real arithmetic allows only an approximate computation of the rotation angles and thus implies under- or over-rotations anyway. In practical applications, the Jacobi SVD method converges without using under- or over-rotations, as shown by the experimental results of Brent and Luk3.
`
Recently, Luk and Park17 have shown that various parallel orderings, including the parallel ordering of Brent and Luk2 for odd N, are equivalent to the cyclic orderings and share the same convergence properties. Although the parallel architectures for computing the SVD may differ for these parallel orderings, the basic operations are always two-sided rotations (4). So, the TPR method can be applied to any parallel ordering.
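To close, a compact numerical sketch (ours, sequential rather than systolic) of how the pieces combine: the Jacobi SVD method with the cyclic row ordering (3), where each 2x2 generation step uses the TPR angles of section 3. Convergence is only checked experimentally here, in line with the discussion above.

```python
import numpy as np

def R(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s], [-s, c]])

def smaller_arctan(q, p):
    """arctan(q/p) branch of (12): result in [-pi/2, pi/2]."""
    th = np.arctan2(q, p)
    if th > np.pi / 2:
        th -= np.pi
    elif th < -np.pi / 2:
        th += np.pi
    return th

def tpr_angles(A2):
    """Generation-mode angles (11)-(13) for a 2x2 block."""
    (a11, a12), (a21, a22) = A2
    th_minus = smaller_arctan((a21 - a12) / 2, (a22 + a11) / 2)
    th_plus = smaller_arctan((a21 + a12) / 2, (a22 - a11) / 2)
    return (th_plus - th_minus) / 2, (th_plus + th_minus) / 2

def jacobi_svd(M, sweeps=10):
    """Singular values of a square M by cyclic-row-ordering sweeps of
    two-sided rotations; returned sorted in decreasing order."""
    M = M.astype(float).copy()
    N = M.shape[0]
    for _ in range(sweeps):
        for i in range(N - 1):
            for j in range(i + 1, N):
                t1, t2 = tpr_angles(M[np.ix_([i, j], [i, j])])
                G1, G2 = np.eye(N), np.eye(N)
                G1[np.ix_([i, j], [i, j])] = R(t1)
                G2[np.ix_([i, j], [i, j])] = R(t2)
                M = G1.T @ M @ G2          # annihilates m_ij and m_ji
    return np.sort(np.abs(np.diag(M)))[::-1]
```

On a random 4x4 matrix, a few sweeps suffice to match a library SVD to high accuracy.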
`
`7. CONCLUSION
`
In this paper, we have presented a novel algorithm for computing two-sided rotations that requires only two plane rotations and a few additions. This results in a reduced computational complexity of the Jacobi SVD method. Moreover, we have presented a unified CORDIC SVD module for implementing all PEs of the SVD array given by Brent and Luk. This leads to a homogeneous SVD array that is simple, regular and offers twice the computational speed.
`
REFERENCES

1. Speiser, J.M., "Signal Processing Computational Needs", Proc. SPIE Advanced Algorithms and Architectures for Signal Processing I, vol.696, pp.2-6, 1986
2. Brent, R.P. and Luk, F.T., "The Solution of Singular Value and Symmetric Eigenvalue Problems on Multiprocessor Arrays", SIAM J. Sci. Stat. Comput., vol.6, no.1, pp.69-84, 1985
3. Brent, R.P., Luk, F.T. and Van Loan, C.F., "Computation of the Singular Value Decomposition Using Mesh-Connected Processors", Journal of VLSI and Computer Systems, vol.1, no.3, pp.242-270, 1985
4. Kogbetliantz, E.G., "Solution of Linear Equations by Diagonalization of Coefficient Matrices", Quart. Appl. Math., vol.13, pp.123-132, 1955
5. Forsythe, G.E. and Henrici, P., "The Cyclic Jacobi Method for Computing the Principal Values of a Complex Matrix", Trans. Amer. Math. Soc., vol.94, pp.1-23, 1960
6. Golub, G.H. and Van Loan, C.F., Matrix Computations, p.44, Johns Hopkins University Press, Baltimore, 1983
7. Delosme, J.M., "A Processor for Two-dimensional Symmetric Eigenvalue and Singular Value Arrays", Proc. 21st Asilomar Conference on Circuits, Systems and Computers, Nov. 1987
8. Volder, J.E., "The CORDIC Trigonometric Computing Technique", IRE Trans. on Electronic Computers, vol.8, pp.330-334, 1959
9. Walther, J.S., "A Unified Algorithm for Elementary Functions", Proc. Spring Joint Computer Conference, pp.379-385, 1971
10. Cavallaro, J.R. and Luk, F.T., "CORDIC Arithmetic for an SVD Processor", Journal of Parallel and Distributed Computing, vol.5, pp.271-290, 1988
11. Haviland, G.L. and Tuszynski, A.A., "A CORDIC Arithmetic Processor Chip", IEEE Trans. on Computers, vol.29, no.2, pp.68-79, 1980
12. Ahmed, H.M., Signal Processing Algorithms and Architectures, Ph.D. dissertation, Department of Electrical Engineering, Stanford University, Dec. 1981
13. Delosme, J.M., "VLSI Implementation of Rotations in Pseudo-Euclidean Spaces", Proc. IEEE ICASSP, pp.927-930, Boston, 1983
14. Deprettere, E.F., Dewilde, P. and Udo, R., "Pipelined CORDIC Architectures for Fast VLSI Filtering and Array Processing", Proc. IEEE ICASSP, pp.41A.6.1-41A.6.4, San Diego, March 1984
15. König, D., Optimierung des CORDIC-Verfahrens für Prozessoren mit verschiedenen Datenformaten (in German: "Optimization of the CORDIC Method for Processors with Different Data Formats"), Tech. Report, Lehrstuhl für Signaltheorie, Ruhr-Universität Bochum, FRG, 1989
16. Cavallaro, J.R. and Luk, F.T., "Architectures for a CORDIC SVD Processor", Proc. SPIE Real-Time Signal Processing IX, vol.698, pp.45-53, 1986
17. Luk, F.T. and Park, H., "On the Equivalence and Convergence of Parallel Jacobi SVD Algorithms", Proc. SPIE Advanced Algorithms and Architectures for Signal Processing II, vol.826, pp.152-159, 1987
`
`
PROCEEDINGS

SPIE—The International Society for Optical Engineering

Advanced Algorithms and Architectures for Signal Processing IV

Franklin T. Luk, Chair/Editor

8-10 August 1989, San Diego, California

Volume 1152
`
`
`
`'
`7m
`W
`N '
`'
`———J
`
`W
`
`Page 12 of 15
`
`
`
`
`
`PROCEEDINGS
`SPIE—The International Society for Optical Engineering
`
`Advanced Algorithms
`//and Architectures for
`Signal Processing IV
`
`Franklin T. Luk
`
`Chair/Editor
`
`8-10 August 1989
`San Diego, California
`
`Sponsored by
`SPIE—The International Society for Optical Engineering
`
`Cooperating Organizations
`Applied Optics Laboratory/New Mexico State University
`Center for Applied Optics Studies/Roseel-Iulman Institute of Technology
`Center for Applied Optics/University of Alabama in Huntsville
`Center for Electro—Optics/University of Dayton
`Center for Excellence in Optical Data Processing/Carnegie Mellon University
`Jet Propulsion Laboratory/California Institute of Technology
`Optical Sciences Center/University of Arizona
`Optoelectronic Computing Systems Center/University of Colorado,
`Colorado State University
`
`Published by
`SPIE—The lntemational Society for Optical Engineering
`P.O. Box 10, Bellingham, Washington 98227—0010 USA
`Telephone 206/6763290 (Pacific Time) . Telex 46—7053
`
`E R l E 2
`
`PROCEEDINGS
`'3
`
`l 5 2
`V l
`0 ume 1
`
`
`
`SPIE (The Society of PhotoOptical Instrumentation Engineers) is a nonprofit society dedicated to advancing engineering
`and scientific applications of optical, electranptical, and optoelectronic instrumentation, systems, and technology.
`
`Page 13 of 15
`
`Page 13 of 15
`
`
`
`
`
`
`
`
`
`
`
`