Vol. 12, No. 4, pp. 713–725, October 1991

© 1991 Society for Industrial and Applied Mathematics
009

REDUCING THE COMPUTATIONS OF THE SINGULAR VALUE
DECOMPOSITION ARRAY GIVEN BY BRENT AND LUK*

B. YANG† AND J. F. BÖHME†
`
`Abstract. A new, efficient, two-plane rotation (TPR) method for computing two-sided rotations involved
`in singular value decomposition (SVD) is presented. It is shown that a two-sided rotation can be evaluated by
`only two plane rotations and a few additions. This leads to significantly reduced computations. Moreover, if
coordinate rotation digital computer (CORDIC) processors are used for realizing the processing elements (PEs)
`of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle
`calculations can be avoided. The resulting SVD array has a homogeneous structure with identical diagonal and
`off-diagonal PEs. Similar results can also be obtained if the TPR method is applied to Luk’s triangular SVD
`array and to Stewart’s Schur decomposition array.
`
`Key words. singular value decomposition, systolic arrays, CORDIC, two-sided rotations, VLSI
`
AMS(MOS) subject classification. 15A18
`
`1. Introduction. One important problem in linear algebra and digital signal pro-
`cessing is the singular value decomposition (SVD). Typical applications arise in beam-
`forming and direction finding, spectrum analysis, digital image processing, etc. [1]. Re-
`cently, there has been a massive interest in parallel architectures for computing SVD
`because of the high computational complexity of SVD, the growing importance of real-
`time signal processing, and the rapid advances in very large scale integration (VLSI) that
`make low-cost, high-density and fast processing memory devices available.
`There are different numerically stable methods for computing complete singular
`value and singular vector systems of dense matrices, for example, the Jacobi SVD method,
`the QR method, and the one-sided Hestenes method. For parallel implementations, the
`Jacobi SVD method is far superior in terms of simplicity, regularity, and local com-
`munications. Brent, Luk, and Van Loan have shown how the Jacobi SVD method with
`parallel ordering can be implemented by a two-dimensional systolic array [2], [3]. Various
`coordinate rotation digital computer (CORDIC) realizations of the SVD array have been
`reported by Cavallaro and Luk [4] and Delosme [5], [6].
`The Jacobi SVD method is based on, as common for all two-sided approaches,
`applying a sequence of two-sided rotations to 2 X 2 submatrices of the original matrix.
`The computational complexity is thus determined by how to compute the two—sided
rotations. In most previous works, a two-sided rotation is evaluated in a straightforward
manner by four plane rotations, where two of them are applied from the left to the two
column vectors of the 2 × 2 submatrix and the other two are applied from the right to the
row vectors, respectively. In the diagonal processing elements (PEs), additional operations
for calculating rotation angles are required. This leads to an inhomogeneous array
architecture containing two different types of PEs.
In this paper, we develop a two-plane rotation (TPR) method for computing two-
sided rotations. We show that the above computational complexity can be reduced sig-
nificantly because each two-sided rotation can be evaluated by only two plane rotations
`and a few additions. Moreover, the SVD array given by Brent and Luk becomes ho-
`mogeneous with identical diagonal and off-diagonal PEs when CORDIC processors are
`
* Received by the editors September 28, 1989; accepted for publication (in revised form) August 2, 1990.
† Department of Electrical Engineering, Ruhr-Universität Bochum, 4630 Bochum, Germany.
`
`
`
`
`
`
`
`
`
Downloaded 01/04/13 to 128.148.252.35. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
`
`
`
`
`
`
`
`
`
LG 1008

Page 1 of 13

SAMSUNG EXHIBIT 1030
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
used. In a recent work [6], Delosme has also indicated this possibility in connection
with "rough rotations" independently. He has taken, however, a different approach that
is based on encoding the rotation angles. He still requires four plane rotations in
the off-diagonal PEs, while diagonal and off-diagonal operations can be overlapped.
`Our paper is organized as follows. In § 2, we briefly reexamine Jacobi’s SVD method
`and Brent and Luk’s SVD array. Then, we develop the TPR method in § 3. The CORDIC
`algorithm is described in § 4, where in particular CORDIC scaling correction techniques
`are discussed and examples of scaling—corrected CORDIC sequences are given. In § 5, a
`unified CORDIC SVD module for all PEs of the SVD array is presented. This module
`is compared to those proposed by Cavallaro, Luk, and Delosme in § 6. Finally, we stress
`the applicability of the TPR method to several other problems.
`
2. Jacobi SVD method. In this paper, we consider real, square, and nonsymmetric
matrices. Let M ∈ ℝ^{N×N} be a matrix of dimension N. The SVD is given by

(1)  M = U Σ V^T,

where U ∈ ℝ^{N×N} and V ∈ ℝ^{N×N} are orthogonal matrices containing the left and right
singular vectors, respectively, and Σ ∈ ℝ^{N×N} is a diagonal matrix of singular values. The
superscript T denotes matrix transpose. Based on an extension of the Jacobi eigenvalue
algorithm [7], Kogbetliantz [8] and Forsythe and Henrici [9] proposed to diagonalize
M by a sequence of two-sided rotations,

(2)  M₀ = M,  M_{k+1} = U_k^T M_k V_k   (k = 0, 1, 2, ⋯).

U_k and V_k describe two rotations in the (i, j)-plane (1 ≤ i < j ≤ N), where the rotation
angles are chosen to annihilate the elements of M_k at the positions (i, j) and (j, i).
Usually, several sweeps are necessary to complete the SVD, where a sweep is a sequence
of N(N − 1)/2 two-sided rotations according to a special ordering of the N(N − 1)/2
different index pairs (i, j).
`For sequential computing on a uniprocessor system, possibly the most frequently
`used orderings are the cyclic orderings, namely, the cyclic row ordering
`
(3)  (i, j) = (1, 2), (1, 3), ⋯, (1, N), (2, 3), ⋯, (2, N), ⋯, (N − 1, N),
`
or the equivalent cyclic column ordering. Sameh [10] and Schwiegelshohn and Thiele
[11] have shown how to implement the cyclic row ordering on a ring-connected or a
mesh-connected processor array. Recently, a variety of parallel orderings have been de-
veloped. Luk and Park [12] have shown that these parallel orderings are essentially equiv-
alent to the cyclic orderings and thus share the same convergence properties.
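For illustration, one sweep of the cyclic row ordering (3) can be generated in a few lines of code (a sketch; the function name is ours, not from the paper):

```python
def cyclic_row_ordering(N):
    """One sweep of index pairs (i, j), 1 <= i < j <= N, in cyclic row order, cf. (3)."""
    return [(i, j) for i in range(1, N) for j in range(i + 1, N + 1)]

print(cyclic_row_ordering(4))        # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(len(cyclic_row_ordering(4)))   # N(N-1)/2 = 6
```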
Brent and Luk have suggested a particular parallel ordering and developed a square
systolic array consisting of ⌈N/2⌉ × ⌈N/2⌉ PEs for implementing the Jacobi SVD method
(Fig. 1). To do this, the matrix M is partitioned into 2 × 2 submatrices. Each PE contains
`one submatrix and performs a two-sided rotation
`
(4)  B = R(θ₁)^T A R(θ₂),

where

(5)  A = [a₁₁ a₁₂; a₂₁ a₂₂]  and  B = [b₁₁ b₁₂; b₂₁ b₂₂]
`
`
`
`
`
`
`
`
`
`
`
FIG. 1. The SVD array given by Brent and Luk.
`
`denote the submatrix before and after the two-sided rotation, respectively, and
`
(6)  R(θ) = [cos θ  sin θ; −sin θ  cos θ]
describes a plane rotation through the angle θ. At first, the diagonal PEs (symbolized by
a double square in Fig. 1) generate the rotation angles to diagonalize the 2 × 2 submatrices
(b₁₂ = b₂₁ = 0) stored in them. This means that θ₁ and θ₂ are first calculated from the
elements of A, and then relation (4) is used to compute b₁₁ and b₂₂. We call this the
generation mode. Then, the rotation angles are sent to all off-diagonal PEs in the following
way: the angles associated with the left-side rotations propagate along the rows, while the
angles associated with the right-side rotations propagate along the columns. Once these
angles are received, the off-diagonal PEs perform the two-sided rotations (4) on their
stored data. We call this the rotation mode. Clearly, if we compute the rotation mode
straightforwardly, we require four plane rotations. For the generation mode, additional
operations for calculating θ₁ and θ₂ are required.
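As a baseline, the straightforward rotation mode can be sketched as follows: the two-sided rotation (4) is evaluated with four plane rotations, two applied from the left to the columns and two from the right to the rows (helper names are ours, and ideal floating-point trigonometry stands in for the hardware):

```python
import math

def R(theta):
    """Plane rotation matrix in the convention of (6)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

def two_sided_rotation(A, theta1, theta2):
    """Straightforward evaluation of B = R(theta1)^T A R(theta2):
    four plane rotations on the columns and rows of A."""
    return matmul(matmul(transpose(R(theta1)), A), R(theta2))

A = [[4.0, 3.0], [1.0, 2.0]]
B = two_sided_rotation(A, 0.3, 0.7)
```

Since both factors are orthogonal, the Frobenius norm and determinant of the block are preserved, which is a convenient sanity check.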
`
3. TPR method for computing two-sided rotations. In order to develop the TPR
method for computing two-sided rotations more efficiently, we first discuss the com-
mutative properties of two special types of 2 × 2 matrices, the rotation-type and the
reflection-type. We define

(7)  M^rot = { [x y; −y x] : x, y ∈ ℝ }  and  M^ref = { [x y; y −x] : x, y ∈ ℝ }.

The former is called rotation-type because it has the same matrix structure as a 2 × 2
plane rotation matrix. Similarly, the latter is called reflection-type because it has the
same matrix structure as a 2 × 2 Givens reflection matrix [13]. Note that x and y need
not be normalized to x² + y² = 1. Using the above definitions, the following results can
be shown by some elementary manipulations.
LEMMA 1. If A₁ ∈ M^rot and A₂ ∈ M^rot, then A₁A₂ = A₂A₁ ∈ M^rot.
LEMMA 2. If A₁ ∈ M^rot and A₂ ∈ M^ref, then A₁A₂ = A₂A₁^T ∈ M^ref.
In particular, if we consider two plane rotations, we know the following.
LEMMA 3. If R(θ₁) and R(θ₂) are plane rotations described by (6), then
R(θ₁)R(θ₂) = R(θ₁ + θ₂) and R(θ₁)^T R(θ₂) = R(θ₂ − θ₁).
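These commutation rules are easy to check numerically; a small sketch (the helper names are ours):

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

def close(X, Y):
    return all(abs(X[i][j] - Y[i][j]) < 1e-12 for i in range(2) for j in range(2))

A1 = [[2.0, -1.0], [1.0, 2.0]]      # rotation-type, element of M^rot
A2 = [[0.5, 3.0], [-3.0, 0.5]]      # rotation-type
B2 = [[0.7, -2.0], [-2.0, -0.7]]    # reflection-type, element of M^ref

# Lemma 1: rotation-type matrices commute; the product is again rotation-type.
P = matmul(A1, A2)
assert close(P, matmul(A2, A1))
assert P[0][0] == P[1][1] and P[0][1] == -P[1][0]

# Lemma 2: A1 * B2 = B2 * A1^T; the product is again reflection-type.
Q = matmul(A1, B2)
assert close(Q, matmul(B2, transpose(A1)))
assert abs(Q[0][1] - Q[1][0]) < 1e-12 and abs(Q[0][0] + Q[1][1]) < 1e-12
```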
Now, we give a theorem describing the rotation mode of the TPR method.
THEOREM. If the 2 × 2 matrix A and the two rotation angles θ₁ and θ₂ are given,
then the two-sided rotation (4) can be computed by two plane rotations, ten additions,
and four scalings by ½:

(8)  p₁ = (a₂₂ + a₁₁)/2,  q₁ = (a₂₁ − a₁₂)/2,
     p₂ = (a₂₂ − a₁₁)/2,  q₂ = (a₂₁ + a₁₂)/2,

(9)  θ₋ = θ₂ − θ₁,  θ₊ = θ₂ + θ₁,

(10)  [r₁; t₁] = R(θ₋)[p₁; q₁],  [r₂; t₂] = R(θ₊)[p₂; q₂],

(11)  b₁₁ = r₁ − r₂,  b₁₂ = −t₁ + t₂,
      b₂₁ = t₁ + t₂,  b₂₂ = r₁ + r₂.
`
Proof. Using (8), the matrix A can be reformulated as

A = A₁ + A₂ = [p₁ −q₁; q₁ p₁] + [−p₂ q₂; q₂ p₂].

Clearly, R(θ₁), R(θ₂) in (4) and A₁ are elements of M^rot, while A₂ belongs to M^ref. This
leads to the following reformulation of the matrix B by using Lemmas 1–3:

B = R(θ₁)^T A R(θ₂)
  = R(θ₁)^T A₁ R(θ₂) + R(θ₁)^T A₂ R(θ₂)
  = R(θ₁)^T R(θ₂) A₁ + R(θ₁)^T R(θ₂)^T A₂
  = R(θ₂ − θ₁) A₁ + R(θ₂ + θ₁)^T A₂
  = R(θ₋)[p₁ −q₁; q₁ p₁] + R(θ₊)^T[−p₂ q₂; q₂ p₂]
  = [r₁ − r₂  −t₁ + t₂; t₁ + t₂  r₁ + r₂].

This completes the proof.
`The generation mode of the TPR method follows directly from the above theorem.
COROLLARY. If the 2 × 2 matrix A is given, we can diagonalize A and calculate
the corresponding rotation angles θ₁ and θ₂ by two Cartesian-to-polar coordinate con-
versions, eight additions, and four scalings by ½:

(12)  p₁ = (a₂₂ + a₁₁)/2,  q₁ = (a₂₁ − a₁₂)/2,
      p₂ = (a₂₂ − a₁₁)/2,  q₂ = (a₂₁ + a₁₂)/2,

(13)  r₁ = sign(p₁) √(p₁² + q₁²),  θ₋ = arctan(q₁/p₁),
      r₂ = sign(p₂) √(p₂² + q₂²),  θ₊ = arctan(q₂/p₂),

(14)  θ₁ = (θ₊ − θ₋)/2,  θ₂ = (θ₊ + θ₋)/2,

(15)  b₁₁ = r₁ − r₂,  b₂₂ = r₁ + r₂.

Proof. Regarding (11), b₁₂ = b₂₁ = 0 is equivalent to t₁ = t₂ = 0. Equation (13)
then follows from (10). This completes the proof.
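Both modes are easy to exercise in software. The sketch below implements (8)–(11) and (12)–(15) with ideal floating-point arithmetic (function names are ours) and checks that the rotation mode with the generated angles reproduces the diagonalized block:

```python
import math

def rotvec(theta, x, y):
    """Apply the plane rotation R(theta) of (6) to the vector (x, y)."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x + s * y, -s * x + c * y

def tpr_rotate(A, theta1, theta2):
    """Rotation mode (8)-(11): two plane rotations, ten additions, four halvings."""
    (a11, a12), (a21, a22) = A
    p1, q1 = (a22 + a11) / 2, (a21 - a12) / 2          # (8)
    p2, q2 = (a22 - a11) / 2, (a21 + a12) / 2
    tm, tp = theta2 - theta1, theta2 + theta1          # (9)
    r1, t1 = rotvec(tm, p1, q1)                        # (10)
    r2, t2 = rotvec(tp, p2, q2)
    return [[r1 - r2, -t1 + t2], [t1 + t2, r1 + r2]]   # (11)

def tpr_generate(A):
    """Generation mode (12)-(15): two Cartesian-to-polar conversions.
    p = 0 (a vector on the y-axis) is not handled in this sketch."""
    (a11, a12), (a21, a22) = A
    p1, q1 = (a22 + a11) / 2, (a21 - a12) / 2          # (12)
    p2, q2 = (a22 - a11) / 2, (a21 + a12) / 2
    tm, tp = math.atan(q1 / p1), math.atan(q2 / p2)    # (13)
    r1 = math.copysign(math.hypot(p1, q1), p1)
    r2 = math.copysign(math.hypot(p2, q2), p2)
    theta1, theta2 = (tp - tm) / 2, (tp + tm) / 2      # (14)
    return theta1, theta2, [[r1 - r2, 0.0], [0.0, r1 + r2]]   # (15)

A = [[4.0, 3.0], [1.0, 2.0]]
theta1, theta2, B = tpr_generate(A)
B2 = tpr_rotate(A, theta1, theta2)   # reproduces B, with b12 = b21 = 0
```

Note that b₁₁ b₂₂ = r₁² − r₂² equals det A, so the product of the computed singular values (up to sign) can serve as a quick consistency check.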
In equation (13), we choose the rotation through the smaller angle. All vectors
lying in the first or the fourth quadrant are rotated onto the positive x-axis, and all vectors
lying in the second or the third quadrant are rotated onto the negative x-axis. For
vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation
angles θ₋ and θ₊ satisfy |θ₋|, |θ₊| ≤ 90°. This results in

(16)  |θ₁| ≤ 90°  and  |θ₂| ≤ 90°

due to (14).
Equation (16) is important with respect to the convergence of the Jacobi SVD
method. Forsythe and Henrici [9] have proven the convergence for cyclic orderings if
the rotation angles θ₁ and θ₂ are restricted to a closed interval inside the open interval
(−90°, 90°). They have also demonstrated that this condition may fail to hold, i.e., θ₁
and θ₂ may be ±90°, if the off-diagonal elements b₁₂ and b₂₁ in (5) have to be exactly
annihilated. As a remedy, they suggested an under- or overrotation by computing the
two-sided rotation (4) with angles (1 ± γ)θ₁ and (1 ± γ)θ₂ (−1 < γ < 1) and proved
its convergence. In practice, however, the finite machine accuracy in the real arithmetic
allows only an approximate computation of the rotation angles and implies under- or
overrotations. So the Jacobi SVD method converges without using under- or overrotations,
as shown by the experimental results of Brent, Luk, and Van Loan [3]. In the case of CORDIC
implementations, the effect of implicit under- or overrotations is more apparent: the
angles ±90° can never be exactly calculated because of the limited angle resolution arc-
tan(2^{−p}) of the CORDIC algorithm, where p denotes the mantissa length.
`
4. The CORDIC algorithm. In the previous section, we have seen that the main
operations of the TPR method are plane rotations and Cartesian-to-polar coordinate
conversions. These operations can be carried out by multiplier-adder-based processors
supported by software or special hardware units. An alternative approach is the use of
dedicated processors that usually map algorithms more effectively to hardware. The
CORDIC processor is such a powerful one for calculating trigonometric functions.
The CORDIC algorithm was originally designed by Volder [14] as an iterative pro-
cedure for computing plane rotations and Cartesian-to-polar coordinate conversions. It
was later generalized and unified by Walther [15], enabling a CORDIC processor to
calculate more functions, including hyperbolic functions, as well as multiplications and
divisions. In the following, we consider Volder's CORDIC algorithm because only trig-
onometric functions are involved in SVD applications.
The CORDIC algorithm consists of iterative shift-add operations on a three-com-
ponent vector,

(17)  [x_{i+1}; y_{i+1}] = [x_i − σ_i δ_i y_i; y_i + σ_i δ_i x_i]
                        = (1/cos α_i) · [cos α_i  −σ_i sin α_i; σ_i sin α_i  cos α_i] · [x_i; y_i],

(18)  z_{i+1} = z_i − σ_i α_i   (0 < δ_i ≤ 1; σ_i = ±1; i = 0, 1, ⋯, n − 1),

in which the iteration stepsize δ_i is defined by

(19)  δ_i = tan(α_i) = 2^{−S(i)}.

The set of integers {S(i)} parametrizing the iterations is called the CORDIC sequence.
Equation (17) can be interpreted, except for a scaling factor of

(20)  k_i = 1/cos(α_i) = √(1 + δ_i²),

as a rotation of (x_i, y_i)^T through the angle α_i, where the sign σ_i = ±1 gives the rotation
direction. After n iterations, the results are given by
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
(21)  [x_n; y_n] = K · [cos α  −sin α; sin α  cos α] · [x₀; y₀],

(22)  z_n = z₀ − α,
`
`
`
`
with the overall scaling factor K = ∏_i k_i and the total rotation angle α = Σ_i σ_i α_i. Now,
if the CORDIC sequence satisfies the following convergence condition

(23)  α_i − Σ_{j=i+1}^{n−1} α_j ≤ α_{n−1}   (i = 0, 1, ⋯, n − 2),
`
we can choose the sign parameter

(24)  σ_i = −sign(x_i y_i)  for y_n → 0,
      σ_i = sign(z_i)       for z_n → 0

to force y_n or z_n to zero, provided that the input data x₀, y₀, and z₀ lie in the conver-
gence region

(25)  C = Σ_{i=0}^{n−1} α_i ≥ |arctan(y₀/x₀)|  (for y_n → 0)  or  C ≥ |z₀|  (for z_n → 0).
`
In this way, two different types of CORDIC trigonometric functions can be computed
(Table 1). In the mode y_n → 0, the Cartesian coordinates (x₀, y₀) of a plane vector are
converted to the polar representation, where the parameter ε = ±1 determines the sign
of the phase angle calculated. When z_n → 0, a given plane vector is rotated through the
angle z₀, where ε = ±1 controls the rotation direction.
In Table 1, the principal value |arctan(y₀/x₀)| ≤ 90° of the inverse tangent function
is calculated when computing Cartesian-to-polar coordinate conversions. Correspond-
ingly, x_n may be positive or negative according to the sign of x₀. So, it is guaranteed that
a vector is always rotated through the smaller angle onto the x-axis, in accordance with
(13). In this case, a convergence region of C ≥ 90° is sufficient for the generation mode
of the two-sided rotation.
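The iterations (17)–(19) with the sign choice (24) can be modeled directly in software. A sketch of Volder's algorithm follows (names are ours; the scaling factor K is deliberately left uncorrected, matching Table 1):

```python
import math

def sign(v):
    return 1.0 if v >= 0.0 else -1.0

def cordic(x, y, z, S, mode):
    """Volder CORDIC, iterations (17)-(18) over the sequence {S(i)} of (19).
    mode 'y': drive y -> 0 (Cartesian-to-polar); mode 'z': drive z -> 0 (rotation)."""
    for s in S:
        d = 2.0 ** (-s)                                   # delta_i = tan(alpha_i)
        sigma = -sign(x * y) if mode == 'y' else sign(z)  # (24)
        x, y = x - sigma * d * y, y + sigma * d * x       # (17), scaled by k_i
        z = z - sigma * math.atan(d)                      # (18)
    return x, y, z

p = 16
S = list(range(p + 1))                                    # Volder's sequence (26)
K = math.prod(math.sqrt(1.0 + 4.0 ** (-s)) for s in S)

# y -> 0: (x0, y0) = (3, 1) goes to (K*sqrt(10), ~0); z accumulates arctan(y0/x0)
x, y, z = cordic(3.0, 1.0, 0.0, S, 'y')

# z -> 0: (1, 0) is rotated through z0 = 0.5 (and scaled by K)
u, v, w = cordic(1.0, 0.0, 0.5, S, 'z')
```

The residual in the driven component is bounded by the angle resolution arctan(2^{−p}), which the assertions below reflect.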
`
One main drawback of the CORDIC algorithm is the need to correct the scal-
ing factor K that arises during the iterations (17). For example, if we use Volder's
CORDIC sequence

(26)  {S(i)} = {0, 1, 2, 3, ⋯, p − 1, p}

with n = p + 1 CORDIC iterations for a mantissa accuracy of 2^{−p}, the scaling factor is
K ≈ 1.64676. Compensating this undesired scaling effect with a minimum number of
computations is of particular importance.
Clearly, multiplying x_n and y_n in Table 1 by K^{−1} will degrade the algorithm per-
formance substantially. Most of the scaling correction techniques are based on shift-add oper-
ations. For a two-sided rotation that is implemented by four plane rotations, each matrix
element undergoes two plane rotations, so that the total scaling factor to be corrected
is K². In this case, Cavallaro and Luk [16] have pointed out that there is a simple
systematic approach for scaling correction when using the CORDIC sequence (26).
They proposed to use ⌈p/4⌉ scaling iterations of the type x ← x − 2^{−2j}x with j ∈ J =
{1, 3, 5, ⋯, 2⌈p/4⌉ − 1} and one shift operation 2^{−1}. The remaining scaling error is
`
TABLE 1
CORDIC trigonometric functions (ε = ±1).

y_n → 0:  x_n = K sign(x₀) √(x₀² + y₀²),      z_n = z₀ + ε arctan(y₀/x₀)
z_n → 0:  x_n = K(x₀ cos z₀ − ε y₀ sin z₀),   y_n = K(ε x₀ sin z₀ + y₀ cos z₀)
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
bounded by 2^{−p−1}:¹

(27)  |1 − 2^{−1} ∏_{j∈J} (1 − 2^{−2j}) · K²|
         = |1 − 2^{−1} ∏_{j∈J} (1 − 2^{−2j}) · ∏_{i=0}^{p} (1 + 2^{−2i})| < 2^{−p−1}.
`
This approach, however, fails in the TPR method. Here, each matrix element un-
dergoes only one plane rotation. The scaling factor to be corrected is thus K rather than
K². In order to solve this more difficult problem, different techniques have been developed
in the literature. Haviland and Tuszynski [17] used similar scaling iterations as Cavallaro
and Luk. Ahmed [18] repeated some CORDIC iterations to force K to a power of the
machine radix. Delosme [19] combined the methods of Haviland, Tuszynski, and
Ahmed to minimize the number of computations. Deprettere, Dewilde, and Udo [20]
suggested the double-shift concept.
We designed a computer program [21] for a systematic search of CORDIC
sequences. We allow shift parameters S(i) (i = 0, 1, ⋯, n − 1) with differences
S(i + 1) − S(i) ∈ {0, 1, 2} to provide more optimization freedom. For an efficient
scaling correction, we require that the scaling factor K be corrected by a sequence of n_k
shift-add operations,
`
(28)  2^{−T(0)} · ∏_{j=1}^{n_k} (1 + η(j) 2^{−T(j)}) · K = 1 + ΔK   (T(j) integers, η(j) = ±1).

These additional scaling iterations are parametrized by the set of signed integers
{T(0), η(1)T(1), ⋯, η(n_k)T(n_k)}. The total number of iterations is L = n + n_k.
In (28), ΔK denotes the remaining relative scaling error after the scaling correction.
We emphasize that this is a systematic error with a constant sign. By contrast, the other
two types of CORDIC errors, the angular error due to the limited angle resolution and
the rounding error, are of a statistical nature because they may be positive or negative.
The scaling error is thus more critical with respect to error accumulation when repeated
CORDIC operations on the same data have to be computed, as in SVD applications.
Roughly speaking, the total scaling error after k CORDIC function calls increases linearly
with k, a fact that has been verified by our numerical experiments. For this reason, we
require |ΔK| to be much smaller than 2^{−p}.
We found catalogues of CORDIC sequences with complexity comparable to those
of Cavallaro and Luk. In the following, five examples for different mantissa lengths p =
16, 20, 24, 28, and 32 are given, including the total number of iterations L = n + n_k, the
convergence region C, and the remaining scaling error ΔK:
`
p = 16:  {S(i)} = {0 1 2 3 ⋯ 15 16},
         {T(0), η(j)T(j)} = {1, +2, −5, +9, +10},  L = 17 + 4,  C ≈ 100°,  ΔK ≈ −2^{−16.0},²

p = 20:  {S(i)} = {0 1 2 3 ⋯ 19 20},
         {T(0), η(j)T(j)} = {1, +2, −5, +9, +10, +16},  L = 21 + 5,  C ≈ 100°,  ΔK ≈ 2^{−23.0},
`
¹ When replacing ⌈p/4⌉ by ⌈(p − 1)/4⌉ or ⌈(p + 1)/4⌉, the upper bound in (27) becomes 2^{−p} or 2^{−p−2},
respectively.
² When appending an additional scaling iteration with η(5)T(5) = +16, the scaling accuracy can be
enhanced to ΔK ≈ 2^{−23.0}.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
p = 24:  {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 ⋯ 23 24},
         {T(0), η(j)T(j)} = {0, −2, +6},  L = 29 + 2,  C ≈ 91°,  ΔK ≈ 2^{−⋯},

p = 28:  {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 11 12 13 14 14 15 ⋯ 27 28},
         {T(0), η(j)T(j)} = {0, −2, +6},  L = 34 + 2,  C ≈ 91°,  ΔK ≈ 2^{−⋯},

p = 32:  {S(i)} = {0 0 1 3 3 3 4 5 6 7 8 9 9 10 ⋯ 31 32},
         {T(0), η(j)T(j)} = {1, −3, −8, +16, −25, −27},  L = 36 + 5,  C ≈ 173°,  ΔK ≈ −2^{−39.⋯}.
`
Remember that in order to meet the convergence condition (23) and to provide a
convergence region C ≥ 90°, the minimum number of CORDIC iterations is p + 1. So,
for all CORDIC sequences given above, the number L − (p + 1) of additional iterations
for scaling correction is p/4. Moreover, except for the first CORDIC sequence, the
remaining scaling error |ΔK| is significantly smaller than 2^{−p}. This leads to improved
numerical properties compared with other CORDIC sequences reported in the literature.
We also remark that if the symmetric eigenvalue problem is considered, for which a
convergence region of C ≥ 45° is sufficient [2], the total number of CORDIC iterations
L can be further reduced. An example that is nearly identical to the last CORDIC sequence
given above is

p = 32:  {S(i)} = {1 3 3 3 4 5 6 7 8 9 9 10 ⋯ 31 32},
         {T(0), η(j)T(j)} = {0, −3, −8, +16, −25, −27},  L = 34 + 5,  C ≈ 83°,  ΔK ≈ −2^{−39.93}.

For comparison, Delosme [5] has also given an optimized CORDIC sequence for the
same situation. His sequence requires one iteration more (L = 40) and achieves a scaling
accuracy of ΔK ≈ 2^{−33.16}.
We suspect that similar results can also be obtained by using Deprettere's double-
shift concept. However, this method requires a slightly increased hardware complexity
and will not be discussed in this paper.
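The scaling-factor bookkeeping above is easy to reproduce. The following sketch checks the Cavallaro–Luk bound (27) for p = 16 and the remaining error ΔK of the first catalogue entry (our variable names; double precision stands in for exact arithmetic):

```python
import math

p = 16
S = list(range(p + 1))               # Volder's sequence (26), n = p + 1 iterations
K = math.prod(math.sqrt(1.0 + 4.0 ** (-s)) for s in S)   # K ~ 1.64676

# Cavallaro-Luk correction of K^2: ceil(p/4) iterations x <- x - 2^(-2j)x, one shift 2^-1
J = [2 * j + 1 for j in range(math.ceil(p / 4))]          # J = {1, 3, 5, 7}
err = abs(1.0 - 0.5 * math.prod(1.0 - 4.0 ** (-j) for j in J) * K * K)
assert err < 2.0 ** (-p - 1)                              # the bound (27)

# TPR correction of K itself, per (28) with {T(0), eta(j)T(j)} = {1, +2, -5, +9, +10}
corr = 0.5 * (1 + 2.0 ** -2) * (1 - 2.0 ** -5) * (1 + 2.0 ** -9) * (1 + 2.0 ** -10)
dK = corr * K - 1.0                  # systematic error with constant (negative) sign
assert dK < 0.0 and abs(dK) < 2.0 ** -15
```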
`
5. CORDIC implementation of the SVD PEs. For easy illustration, we first introduce
a CORDIC processor symbol, as shown in Fig. 2. The descriptions inside the box uniquely
determine the function mode of the CORDIC processor according to Table 1. The output
data x and y are assumed to be scaling corrected.
It is now simple to map the operations (8)–(11) and (12)–(15) of the TPR method
onto a two-CORDIC-processor architecture. In Fig. 3, the diagonal PEs of the SVD array
are implemented by two CORDIC processors and eight adders. The dotted inputs of the
adders represent negated inputs. Because the diagonal PEs work in the generation mode,
`
[Figure: CORDIC processor symbol with inputs x₀, y₀, z₀, a mode description (e.g., y → 0, ε = ±1), and outputs x, y, z.]

FIG. 2. CORDIC symbol.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
FIG. 3. CORDIC implementation of the diagonal PE of the SVD array.
`
both CORDIC processors are driven in the "y_n → 0" mode for computing Cartesian-to-
polar coordinate conversions. In Fig. 4, the off-diagonal PEs working in the rotation
mode are implemented by two CORDIC processors and ten adders. Here, the CORDIC
processors are driven in the "z_n → 0" mode for performing plane rotations.
Obviously, both CORDIC implementations have nearly the same architecture. All
PEs of the SVD array can thus be implemented by one unified CORDIC SVD module
(Fig. 5) without considerably increased hardware cost. The different computation modes
of the diagonal and off-diagonal PEs are easily "programmed" by one control bit. The
resulting SVD array is similar to that in Fig. 1, but homogeneous with identical PEs.
We remark that Fig. 5 is more a "graphic program" describing the sequence of
operations to be computed than a hardware block diagram. We show in the fol-
lowing that the 12 adders, which are paired into three pre-butterflys and three post-butterflys,
can be integrated into the two CORDIC processors without separate hardware realizations.
The Jacobi SVD method is a recursive method. Each PE of the SVD array has to exchange
data with its diagonal neighbors. Because of this data dependency, only recursive CORDIC
processors can be used here. Such a processor is an arithmetic unit consisting of mainly three adders
and two barrel shifters. It carries out the L iterations of the CORDIC algorithm in L
cycles by using data feedback. The two CORDIC processors contained in one CORDIC
SVD module require six adders altogether. So, it is natural to modify the CORDIC
processor architecture slightly and to use the existing six adders for computing both the
pre-butterfly and the post-butterfly operations. The resulting CORDIC SVD module has
`
`
`
FIG. 4. CORDIC implementation of the off-diagonal PE of the SVD array.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
[Figure: dataflow of the module, with pre-butterflys feeding two CORDIC processors followed by post-butterflys.]

FIG. 5. A unified CORDIC SVD module for implementing all PEs of the SVD array.
`
`the hardware complexity of two recursive CORDIC processors and requires a total com-
`putation time of L + 2 iterations.
In Fig. 6, the principal architecture of such a two-CORDIC-processor SVD module
is shown. The dashed lines and boxes represent the additional hardware components
`
`
`
`
`
`
`
`
`
`
`
FIG. 6. The principal architecture of the unified CORDIC SVD module.
`
`
`
`
enabling the CORDIC processors to compute the butterfly operations. It is easily verified
that the upper four adders devoted to x and y can perform the following types of operations:
2^{−1}x ± 2^{−1}y (pre-butterfly), x ± y (post-butterfly), x ± 2^{−S(i)}y (CORDIC iteration), and
x ± 2^{−T(j)}x (scaling iteration), while the lower two adders devoted to z can compute
2^{−1}z₁ ± 2^{−1}z₂ (pre-butterfly), z₁ ± z₂ (post-butterfly), and z ± α_i (CORDIC iteration), re-
spectively. The cross switch between the registers "Y₁" and "X₂" is needed to exchange
data when the CORDIC SVD module switches from the pre-butterfly operations into
the CORDIC iterations or from the CORDIC iterations into the post-butterfly operations,
respectively. We see from Fig. 3 that the output data pairs of the pre-butterflys are
(p₁, p₂) and (q₁, q₂), while the desired input data pairs for the CORDIC iterations are
(p₁, q₁) and (p₂, q₂), respectively. So, p₂ and q₁ have to be exchanged.
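Functionally, the unified module can be modeled as a single routine whose mode is selected by one control bit, with the pre-butterflys in front of the two CORDIC operations and the post-butterflys behind them. A behavioral sketch (names are ours; ideal trigonometry replaces the CORDIC iterations, and p = 0 inputs are not handled):

```python
import math

def unified_svd_module(A, generate, theta1=0.0, theta2=0.0):
    """Behavioral model of the unified CORDIC SVD module (Fig. 5).
    generate=True : generation mode (12)-(15), returns (theta1, theta2, B);
    generate=False: rotation mode (8)-(11), returns B."""
    (a11, a12), (a21, a22) = A
    # pre-butterflys: four additions, four shifts by 1/2
    p1, q1 = (a22 + a11) / 2, (a21 - a12) / 2
    p2, q2 = (a22 - a11) / 2, (a21 + a12) / 2
    if generate:   # both CORDICs in the y_n -> 0 mode (Cartesian-to-polar)
        tm, tp = math.atan(q1 / p1), math.atan(q2 / p2)
        r1, t1 = math.copysign(math.hypot(p1, q1), p1), 0.0
        r2, t2 = math.copysign(math.hypot(p2, q2), p2), 0.0
    else:          # both CORDICs in the z_n -> 0 mode (plane rotation)
        tm, tp = theta2 - theta1, theta2 + theta1
        c, s = math.cos(tm), math.sin(tm)
        r1, t1 = c * p1 + s * q1, -s * p1 + c * q1
        c, s = math.cos(tp), math.sin(tp)
        r2, t2 = c * p2 + s * q2, -s * p2 + c * q2
    # post-butterflys: additions only
    B = [[r1 - r2, -t1 + t2], [t1 + t2, r1 + r2]]
    return ((tp - tm) / 2, (tp + tm) / 2, B) if generate else B

# diagonal PE: generate the angles and the diagonalized block
A = [[4.0, 3.0], [1.0, 2.0]]
t1, t2, Bd = unified_svd_module(A, True)
# off-diagonal PE: identical hardware, rotation mode with the received angles
Br = unified_svd_module(A, False, t1, t2)
```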
`
6. Comparisons. We now compare the new CORDIC SVD module with those pro-
posed by Cavallaro and Luk [4] and Delosme [5]. Let A_svd and T_svd denote the area
and time complexity of a CORDIC SVD module, and A_cordic and T_cordic those of a CORDIC
processor, respectively. Cavallaro and Luk have shown that their most efficient parallel
diagonalization method requires A_svd ≈ 2A_cordic and T_svd ≈ 3T_cordic for the diagonal
PEs and A_svd ≈ 2A_cordic and T_svd ≈ 2T_cordic for the off-diagonal PEs. By using the TPR
method, we require A_svd ≈ 2A_cordic and T_svd ≈ T_cordic for all PEs. In other words, with
approximately the same hardware complexity, the computation time is reduced by more
than 50 percent.
A comparison to Delosme's method is more difficult because he follows a quite
different approach. Therefore, only rough performance estimates are given here. In our
method, we compute the rotation angles explicitly. After these computations have been
completed in the diagonal PEs, the angles propagate to the off-diagonal ones. We assume
that the propagation from one PE to its neighbors takes one cycle T_cycle, the time required
for computing one CORDIC iteration. This implies local communications without
broadcasting data. At the beginning of the second propagation cycle, the angles reach
the diagonal neighbors of the diagonal PEs, which complete their computations after
T_cordic. This means that the diagonal PEs have to wait for a delay time T_delay = T_cycle +
T_cordic before they can exchange data with their diagonal neighbors.³ The total time elapsed
between two adjacent activities at each PE is thus T_svd + T_delay = 2T_cordic + T_cycle ≈ 2T_cordic
because T_cycle is negligible with respect to T_cordic = L·T_cycle.
Delosme does not compute the rotation angles explicitly. He rather calculates en-
codings of the angles, i.e., sequences of signs ±1, and sends them to the off-diagonal PEs.
This enables an overlap of diagonal and off-diagonal rotations because the encoding signs
are obtained recursively and become available before the completion of the diagonal oper-
ations. Accordingly, no delay time is required (T_delay = 0), provided that the SVD array
size (half of the matrix size) is smaller than the number of CORDIC scaling iterations
n_k (for details, see [5]). The drawback is, however, that the TPR method cannot be
applied to the off-diagonal PEs. Four plane rotations are hence required, resulting in
T_svd = 2T_cordic for two CORDIC processors in one module. In other words, the time
complexities T_svd + T_delay of both methods are nearly identical and equal 2T_cordic. If,
however, multiple problems are interleaved, the fraction of idle time, which is 50 percent
in our case, can be reduced to almost zero. In such a situation, our method provides
double the speed of Delosme's.

³ If the propagation time is assumed to be T_cordic, we get the well-known result T_delay = 2T_cordic given
by Brent and Luk [2].
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
In terms of area complexity, both CORDIC SVD modules contain two CORDIC
processors. Our module consists of essentially six adders, four barrel shifters, and one
ROM table containing n angle values. Delosme's architecture requires four carry-save
adders, four adders, and eight barrel shifters. So, as a rough estimate, both SVD modules
have the same order of area complexity.
Perhaps the most important advantage of Delosme's approach is the 2-bit-wide
horizontal and vertical data connections for sending angle encodings serially rather than
sending the full angle values in parallel. The prices are the upper bound on the SVD array
size depending on the number of CORDIC scaling iterations, the relatively complicated
timing, and a nonregular CORDIC architecture design. We also mention that while
Delosme's method presumes a CORDIC implementation, the TPR method is applicable
to other computing architectures.
`
7. Other applications of the TPR method. Another advantage of the TPR method
is its relatively wide range of applications. We indicate some of them in the
following.
For the SVD of a rectangular matrix, a well-known method is first to triangularize
the matrix by QR decomposition and then to apply the Jacobi SVD procedure to the
triangular factor. Luk [22] has shown that both steps can be implemented by one triangular
systolic array. Each PE contains a 2 × 2 submatrix. It applies two plane rotations (through
the same angle) to the two column vectors at the QR step and a two-sided rotation at
the SVD step. For computing the SVD step, the PE can be realized by the CORDIC
SVD module, as before. On the other hand, the two CORDIC processors contained in
the module are also appropriate for performing the two plane rotations of the QR step. The
CORDIC SVD module presented in this paper thus provides a suitable PE for Luk's
triangular SVD array.
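In the QR step, one Givens angle is generated (obtainable with a CORDIC in the y_n → 0 mode) and the same plane rotation is applied to both columns of the stored 2 × 2 block. A hedged sketch (names are ours):

```python
import math

def qr_step(A):
    """Annihilate the (2,1) entry of the 2x2 block A with one Givens rotation
    applied from the left; both columns see the same angle."""
    (a11, a12), (a21, a22) = A
    theta = math.atan2(a21, a11)          # angle generation
    c, s = math.cos(theta), math.sin(theta)
    # the same plane rotation applied to both columns of A
    return theta, [[c * a11 + s * a21, c * a12 + s * a22],
                   [-s * a11 + c * a21, -s * a12 + c * a22]]

theta, T = qr_step([[3.0, 1.0], [4.0, 2.0]])
# T[1][0] ~ 0 and T[0][0] ~ 5.0, the norm of the first column
```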
Stewart [23] has proposed a square systolic array for computing the Schur decom-
position (SD) of a non-Hermitian matrix, which, for example, is useful for evaluating
the eigenvalues of the matrix. His approach is similar to the Jacobi SVD method. It is
based on applying a sequence of two-sided rotations to 2 × 2 submatrices, where the left
and right rotation angles are identical to make the diagonal submatrices upper triangular.
While the diagonal PEs perform operations different from those in SVD, the off-diagonal
PEs have exactly the same computational task as in SVD computing. Therefore, the