Downloaded 01/04/13 to 128.148.252.35. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
SIAM J. MATRIX ANAL. APPL.
Vol. 12, No. 4, pp. 713-725, October 1991
`
`© 1991 Society for Industrial and Applied Mathematics
`009
`
`REDUCING THE COMPUTATIONS OF THE SINGULAR VALUE
`DECOMPOSITION ARRAY GIVEN BY BRENT AND LUK*
`
B. YANG† AND J. F. BÖHME†
`
Abstract. A new, efficient, two-plane rotation (TPR) method for computing two-sided rotations involved
in singular value decomposition (SVD) is presented. It is shown that a two-sided rotation can be evaluated by
only two plane rotations and a few additions. This leads to significantly reduced computations. Moreover, if
coordinate rotation digital computer (CORDIC) processors are used for realizing the processing elements (PEs)
of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle
calculations can be avoided. The resulting SVD array has a homogeneous structure with identical diagonal and
off-diagonal PEs. Similar results can also be obtained if the TPR method is applied to Luk's triangular SVD
array and to Stewart's Schur decomposition array.
`
`Key words. singular value decomposition, systolic arrays, CORDIC, two-sided rotations, VLSI
`
AMS(MOS) subject classification. 15A18
`
1. Introduction. One important problem in linear algebra and digital signal processing
is the singular value decomposition (SVD). Typical applications arise in beamforming
and direction finding, spectrum analysis, digital image processing, etc. [1]. Recently,
there has been a massive interest in parallel architectures for computing the SVD
because of the high computational complexity of the SVD, the growing importance of
real-time signal processing, and the rapid advances in very large scale integration (VLSI)
that make low-cost, high-density, and fast processing and memory devices available.
There are different numerically stable methods for computing the complete singular
value and singular vector systems of dense matrices, for example, the Jacobi SVD method,
the QR method, and the one-sided Hestenes method. For parallel implementations, the
Jacobi SVD method is far superior in terms of simplicity, regularity, and local
communications. Brent, Luk, and Van Loan have shown how the Jacobi SVD method with
parallel ordering can be implemented by a two-dimensional systolic array [2], [3]. Various
coordinate rotation digital computer (CORDIC) realizations of the SVD array have been
reported by Cavallaro and Luk [4] and Delosme [5], [6].
The Jacobi SVD method is based, as is common for all two-sided approaches, on
applying a sequence of two-sided rotations to 2 × 2 submatrices of the original matrix.
The computational complexity is thus determined by how the two-sided rotations are
computed. In most previous works, a two-sided rotation is evaluated in a straightforward
manner by four plane rotations, where two of them are applied from the left to the two
column vectors of the 2 × 2 submatrix and the other ones are applied from the right to the
row vectors, respectively. In the diagonal processing elements (PEs), additional operations
for calculating rotation angles are required. This leads to an inhomogeneous array
architecture containing two different types of PEs.
In this paper, we develop a two-plane rotation (TPR) method for computing two-sided
rotations. We show that the above computational complexity can be reduced
significantly because each two-sided rotation can be evaluated by only two plane rotations
and a few additions. Moreover, the SVD array given by Brent and Luk becomes
homogeneous with identical diagonal and off-diagonal PEs when CORDIC processors are
`
* Received by the editors September 28, 1989; accepted for publication (in revised form) August 2, 1990.
† Department of Electrical Engineering, Ruhr-Universität Bochum, 4630 Bochum, Germany.
`713
`
Page 1 of 13
SAMSUNG EXHIBIT 1030
LG 1008
used. In a recent work [6], Delosme has also indicated this possibility in connection
with "rough rotations" independently. He has taken, however, a different approach that
is based on encoding the rotation angles. He has still required four plane rotations in
the off-diagonal PEs, while diagonal and off-diagonal operations can be overlapped.
Our paper is organized as follows. In § 2, we briefly reexamine Jacobi's SVD method
and Brent and Luk's SVD array. Then, we develop the TPR method in § 3. The CORDIC
algorithm is described in § 4, where in particular CORDIC scaling correction techniques
are discussed and examples of scaling-corrected CORDIC sequences are given. In § 5, a
unified CORDIC SVD module for all PEs of the SVD array is presented. This module
is compared to those proposed by Cavallaro, Luk, and Delosme in § 6. Finally, we stress
the applicability of the TPR method to several other problems.
2. Jacobi SVD method. In this paper, we consider real, square, and nonsymmetric
matrices. Let M ∈ R^{N×N} be a matrix of dimension N. The SVD is given by

(1)  M = U Σ V^T,

where U ∈ R^{N×N} and V ∈ R^{N×N} are orthogonal matrices containing the left and right
singular vectors, and Σ ∈ R^{N×N} is a diagonal matrix of singular values, respectively. The
superscript T denotes matrix transpose. Based on an extension of the Jacobi eigenvalue
algorithm [7], Kogbetliantz [8] and Forsythe and Henrici [9] proposed to diagonalize
M by a sequence of two-sided rotations,

(2)  M_0 = M,    M_{k+1} = U_k^T M_k V_k    (k = 0, 1, 2, ⋯).

U_k and V_k describe two rotations in the (i, j)-plane (1 ≤ i < j ≤ N), where the rotation
angles are chosen to annihilate the elements of M_{k+1} at the positions (i, j) and (j, i).
Usually, several sweeps are necessary to complete the SVD, where a sweep is a sequence
of N(N − 1)/2 two-sided rotations according to a special ordering of the N(N − 1)/2
different index pairs (i, j).
For sequential computing on a uniprocessor system, possibly the most frequently
used orderings are the cyclic orderings, namely, the cyclic row ordering

(3)  (i, j) = (1, 2), (1, 3), ⋯, (1, N), (2, 3), ⋯, (2, N), ⋯, (N − 1, N)
`
or the equivalent cyclic column ordering. Sameh [10] and Schwiegelshohn and Thiele
[11] have shown how to implement the cyclic row ordering on a ring-connected or a
mesh-connected processor array. Recently, a variety of parallel orderings have been
developed. Luk and Park [12] have shown that these parallel orderings are essentially
equivalent to the cyclic orderings and thus share the same convergence properties.
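As a small illustration (ours, not part of the original paper), one sweep of the cyclic row ordering (3) can be enumerated as follows; the function name is hypothetical:

```python
def cyclic_row_ordering(N):
    """Index pairs (i, j), 1 <= i < j <= N, of one sweep in cyclic row ordering (3)."""
    return [(i, j) for i in range(1, N) for j in range(i + 1, N + 1)]

# One sweep visits all N(N - 1)/2 pairs, row by row:
# cyclic_row_ordering(4) -> [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```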
Brent and Luk have suggested a particular parallel ordering and developed a square
systolic array consisting of ⌈N/2⌉ × ⌈N/2⌉ PEs for implementing the Jacobi SVD method
(Fig. 1). To do this, the matrix M is partitioned into 2 × 2 submatrices. Each PE contains
one submatrix and performs a two-sided rotation
`
(4)  B = R(θ1)^T A R(θ2),

where

(5)  A = ( a11  a12 ; a21  a22 )    and    B = ( b11  b12 ; b21  b22 )
`
FIG. 1. The SVD array given by Brent and Luk.
`
`denote the submatrix before and after the two-sided rotation, respectively, and
`
(6)  R(θ) = ( cos θ   sin θ ; −sin θ   cos θ )
`
describes a plane rotation through the angle θ. At first, the diagonal PEs (symbolized by
a double square in Fig. 1) generate the rotation angles to diagonalize the 2 × 2 submatrices
(b12 = b21 = 0) stored in them. This means that θ1 and θ2 are first calculated from the
elements of A and then relation (4) is used to compute b11 and b22. We call this the
generation mode. Then, the rotation angles are sent to all off-diagonal PEs in the following
way: the angles associated with the left-side rotations propagate along the rows while the
angles associated with the right-side rotations propagate along the columns. Once these
angles are received, the off-diagonal PEs perform the two-sided rotations (4) on their
stored data. We call this the rotation mode. Clearly, if we compute the rotation mode
straightforwardly, we require four plane rotations. For the generation mode, additional
operations for calculating θ1 and θ2 are required.
`
3. TPR method for computing two-sided rotations. In order to develop the TPR
method for computing two-sided rotations more efficiently, we first discuss the
commutative properties of two special types, the rotation-type and the reflection-type, of
2 × 2 matrices. We define

(7)  M_rot = { ( x  y ; −y  x ) : x, y ∈ R }    and    M_ref = { ( x  y ; y  −x ) : x, y ∈ R }.

The former is called rotation-type because it has the same matrix structure as a 2 × 2
plane rotation matrix. Similarly, the latter is called reflection-type because it has the
same matrix structure as a 2 × 2 Givens reflection matrix [13]. Note that x and y need
not be normalized to x² + y² = 1. Using the above definitions, the following results can
be shown by some elementary manipulations.
LEMMA 1. If A1 ∈ M_rot and A2 ∈ M_rot, then A1 A2 = A2 A1 ∈ M_rot.
LEMMA 2. If A1 ∈ M_rot and A2 ∈ M_ref, then A2 A1 = A1^T A2 ∈ M_ref.
In particular, if we consider two plane rotations, we know the following.
LEMMA 3. If R(θ1) and R(θ2) are plane rotations described by (6), then
R(θ1) R(θ2) = R(θ1 + θ2) and R(θ1)^T R(θ2) = R(θ2 − θ1).
Now, we give a theorem describing the rotation mode of the TPR method.
THEOREM. If the 2 × 2 matrix A and the two rotation angles θ1 and θ2 are given,
then the two-sided rotation (4) can be computed by two plane rotations, ten additions,
and four scalings by 1/2:

(8)  p1 = (a22 + a11)/2,    p2 = (a22 − a11)/2,    q1 = (a21 − a12)/2,    q2 = (a21 + a12)/2,

(9)  θ− = θ2 − θ1,    θ+ = θ2 + θ1,

(10)  ( r1 ; t1 ) = R(θ−) ( p1 ; q1 ),    ( r2 ; t2 ) = R(θ+) ( p2 ; q2 ),

(11)  b11 = r1 − r2,    b12 = −t1 + t2,    b21 = t1 + t2,    b22 = r1 + r2.

Proof. Using (8), the matrix A can be reformulated as

A = A1 + A2 = ( p1  −q1 ; q1  p1 ) + ( −p2  q2 ; q2  p2 ).

Clearly, R(θ1), R(θ2) in (4) and A1 are elements of M_rot while A2 belongs to M_ref. This
leads to the following reformulation of the matrix B by using Lemmas 1-3:

B = R(θ1)^T A R(θ2)
  = R(θ1)^T A1 R(θ2) + R(θ1)^T A2 R(θ2)
  = R(θ1)^T R(θ2) A1 + R(θ1)^T R(θ2)^T A2
  = R(θ2 − θ1) A1 + R(θ2 + θ1)^T A2
  = ( r1  −t1 ; t1  r1 ) + ( −r2  t2 ; t2  r2 )
  = ( r1 − r2   −t1 + t2 ; t1 + t2   r1 + r2 ).
`This completes the proof.
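The rotation mode (8)-(11) is easy to check numerically. The following sketch is our own illustration (the function names are hypothetical): it evaluates B once directly as R(θ1)^T A R(θ2) and once with the two plane rotations of the theorem.

```python
import math

def R(theta):
    """Plane rotation (6)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(X):
    return [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]

def tpr_rotation_mode(A, th1, th2):
    """Two-sided rotation B = R(th1)^T A R(th2) via two plane rotations, (8)-(11)."""
    (a11, a12), (a21, a22) = A
    p1, p2 = (a22 + a11) / 2, (a22 - a11) / 2        # (8): four scalings by 1/2
    q1, q2 = (a21 - a12) / 2, (a21 + a12) / 2
    thm, thp = th2 - th1, th2 + th1                  # (9)
    c, s = math.cos(thm), math.sin(thm)              # (10): first plane rotation
    r1, t1 = c * p1 + s * q1, -s * p1 + c * q1
    c, s = math.cos(thp), math.sin(thp)              # (10): second plane rotation
    r2, t2 = c * p2 + s * q2, -s * p2 + c * q2
    return [[r1 - r2, -t1 + t2], [t1 + t2, r1 + r2]]  # (11): four additions

# Self-check against the straightforward four-rotation evaluation
A = [[1.0, 2.0], [3.0, 4.0]]
th1, th2 = 0.3, -0.7
B_tpr = tpr_rotation_mode(A, th1, th2)
B_direct = matmul(matmul(transpose(R(th1)), A), R(th2))
assert all(abs(B_tpr[i][j] - B_direct[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

In a hardware PE the two trigonometric steps of (10) are the only plane rotations; everything else is additions and shifts.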
The generation mode of the TPR method follows directly from the above theorem.
COROLLARY. If the 2 × 2 matrix A is given, we can diagonalize A and calculate
the corresponding rotation angles θ1 and θ2 by two Cartesian-to-polar coordinate
conversions, eight additions, and four scalings by 1/2:

(12)  p1 = (a22 + a11)/2,    p2 = (a22 − a11)/2,    q1 = (a21 − a12)/2,    q2 = (a21 + a12)/2,

(13)  r1 = sign(p1) √(p1² + q1²),    r2 = sign(p2) √(p2² + q2²),

(14)  θ− = arctan(q1/p1),    θ+ = arctan(q2/p2),

(15)  θ1 = (θ+ − θ−)/2,    θ2 = (θ+ + θ−)/2,    b11 = r1 − r2,    b22 = r1 + r2.

Proof. Regarding (11), b12 = b21 = 0 is equivalent to t1 = t2 = 0. Equation (13)
follows then from (10). This completes the proof.
In equation (13), we choose the rotation through the smaller angle. All vectors
lying in the first or the fourth quadrant are rotated onto the positive x-axis, and all vectors
lying in the second and the third quadrant are rotated onto the negative x-axis. For
vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation
angles θ− and θ+ satisfy |θ−|, |θ+| ≤ 90°. This results in

(16)  |θ1| ≤ 90°    and    |θ2| ≤ 90°,

due to (15).
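For concreteness, the generation mode (12)-(15), including the smaller-angle choice just described, can be sketched as follows (our own illustration; `smaller_angle` is a hypothetical helper implementing (13)-(14)):

```python
import math

def smaller_angle(p, q):
    """Angle th with |th| <= 90 deg rotating (p, q) onto the x-axis, per (13)-(14)."""
    if p == 0.0:
        return math.copysign(math.pi / 2, q)  # rotation direction arbitrary on the y-axis
    return math.atan(q / p)                   # principal value of the inverse tangent

def tpr_generation_mode(A):
    """Return (th1, th2, b11, b22) such that R(th1)^T A R(th2) = diag(b11, b22)."""
    (a11, a12), (a21, a22) = A
    p1, p2 = (a22 + a11) / 2, (a22 - a11) / 2               # (12)
    q1, q2 = (a21 - a12) / 2, (a21 + a12) / 2
    r1 = math.copysign(math.hypot(p1, q1), p1)              # (13)
    r2 = math.copysign(math.hypot(p2, q2), p2)
    thm, thp = smaller_angle(p1, q1), smaller_angle(p2, q2) # (14)
    th1, th2 = (thp - thm) / 2, (thp + thm) / 2             # (15)
    return th1, th2, r1 - r2, r1 + r2
```

Applying the returned angles to A via (4) should reproduce a diagonal matrix with entries b11 and b22.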
Equation (16) is important with respect to the convergence of the Jacobi SVD
method. Forsythe and Henrici [9] have proven the convergence for cyclic orderings if
the rotation angles θ1 and θ2 are restricted to a closed interval inside the open interval
(−90°, 90°). They have also demonstrated that this condition may fail to hold, i.e., θ1
and θ2 may be ±90°, if the off-diagonal elements b12 and b21 in (5) have to be exactly
annihilated. As a remedy, they suggested an under- or overrotation by computing the
two-sided rotation (4) with angles (1 − γ)θ1 and (1 − γ)θ2 (−1 < γ < 1) and proved
its convergence. In practice, however, the finite machine accuracy in the real arithmetic
allows only an approximative computation of the rotation angles and implies under- or
overrotations. So the Jacobi SVD method converges without using under- or overrotations,
as shown by the experimental results of Brent, Luk, and Van Loan [3]. In case of CORDIC
implementations, the effect of implicit under- or overrotations is more apparent. The
angles ±90° can never be exactly calculated because of the limited angle resolution
arctan(2^{−p}) of the CORDIC algorithm, where p denotes the mantissa length.
`
4. The CORDIC algorithm. In the previous section, we have seen that the main
operations of the TPR method are plane rotations and Cartesian-to-polar coordinate
conversions. These operations can be carried out by multiplier-adder-based processors
supported by software or special hardware units. An alternative approach is the use of
dedicated processors that usually map algorithms more effectively to hardware. The
CORDIC processor is such a powerful one for calculating trigonometric functions.
The CORDIC algorithm was originally designed by Volder [14] as an iterative
procedure for computing plane rotations and Cartesian-to-polar coordinate conversions. It
was later generalized and unified by Walther [15], enabling a CORDIC processor to
calculate more functions, including hyperbolic functions, as well as multiplications and
divisions. In the following, we consider Volder's CORDIC algorithm because only
trigonometric functions are involved in SVD applications.
The CORDIC algorithm consists of iterative shift-add operations on a three-component
vector,

(17)  ( x_{i+1} ; y_{i+1} ) = ( x_i − σ_i δ_i y_i ; y_i + σ_i δ_i x_i )
                            = (1/cos α_i) ( cos α_i  −σ_i sin α_i ; σ_i sin α_i  cos α_i ) ( x_i ; y_i ),

(18)  z_{i+1} = z_i − ε σ_i α_i    (0 < δ_i ≤ 1, σ_i = ±1, ε = ±1; i = 0, 1, ⋯, n − 1),

in which the iteration stepsize δ_i is defined by

(19)  δ_i = tan(α_i) = 2^{−S(i)}.

The set of integers {S(i)} parametrizing the iterations is called the CORDIC sequence.
Equation (17) can be interpreted, except for a scaling factor of

(20)  k_i = 1/cos(α_i) = √(1 + δ_i²),

as a rotation of (x_i, y_i)^T through the angle α_i, where the sign σ_i = ±1 gives the rotation
direction. After n iterations, the results are given by

(21)  ( x_n ; y_n ) = K ( cos α  −sin α ; sin α  cos α ) ( x_0 ; y_0 ),

(22)  z_n = z_0 − ε α,
with the overall scaling factor K = Π_i k_i and the total rotation angle α = Σ_i σ_i α_i. Now,
if the CORDIC sequence satisfies the following convergence condition

(23)  α_i − Σ_{j=i+1}^{n−1} α_j ≤ α_{n−1}    (i = 0, 1, ⋯, n − 2),

we can choose the sign parameter

(24)  σ_i = −sign(x_i y_i)  for y_n → 0,    σ_i = sign(ε z_i)  for z_n → 0

to force y_n or z_n to zero, provided that the input data x_0, y_0, and z_0 lie in the
convergence region

(25)  C = Σ_{i=0}^{n−1} α_i ≥ |arctan(y_0/x_0)|  for y_n → 0,    C ≥ |z_0|  for z_n → 0.
`
In this way, two different types of CORDIC trigonometric functions can be computed
(Table 1). In the mode y_n → 0, the Cartesian coordinate (x_0, y_0) of a plane vector is
converted to its polar representation, where the parameter ε = ±1 determines the sign
of the phase angle calculated. When z_n → 0, a given plane vector is rotated through the
angle z_0, where ε = ±1 controls the rotation direction.
In Table 1, the principal value |arctan(y_0/x_0)| ≤ 90° of the inverse tangent function
is calculated when computing Cartesian-to-polar coordinate conversions. Correspondingly,
x_n may be positive or negative according to the sign of x_0. So, it is guaranteed that
a vector is always rotated through the smaller angle onto the x-axis in accordance with
(13). In this case, a convergence region of C ≥ 90° is sufficient for the generation mode
of the two-sided rotation.
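The two function modes of (17)-(25) and Table 1 can be sketched in software as follows. This is our own illustration, not the hardware realization; it fixes the classical sequence S(i) = i and runs in double precision:

```python
import math

def cordic(x, y, z, mode, eps=1, n=32):
    """Volder CORDIC iterations (17)-(18) with S(i) = i, i.e., delta_i = 2**-i per (19).
    mode 'y' forces y_n -> 0 (Cartesian-to-polar); mode 'z' forces z_n -> 0 (rotation).
    The outputs x, y carry the uncorrected scaling factor K = prod_i sqrt(1 + 2**(-2*i))."""
    for i in range(n):
        delta = 2.0 ** (-i)                       # iteration stepsize (19); a shift in hardware
        alpha = math.atan(delta)
        if mode == 'y':
            sigma = -1.0 if x * y > 0 else 1.0    # (24): sigma_i = -sign(x_i y_i)
        else:
            sigma = 1.0 if eps * z > 0 else -1.0  # (24): sigma_i = sign(eps z_i)
        x, y = x - sigma * delta * y, y + sigma * delta * x   # shift-add rotation (17)
        z = z - eps * sigma * alpha                           # angle accumulation (18)
    return x, y, z

K = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(32))

# y -> 0 mode: polar representation of (3, 4), cf. Table 1
x, y, z = cordic(3.0, 4.0, 0.0, mode='y', eps=1)
assert abs(x - K * 5.0) < 1e-6 and abs(z - math.atan2(4.0, 3.0)) < 1e-6

# z -> 0 mode: rotate (1, 0) through 0.5 rad, scaled by K
x, y, z = cordic(1.0, 0.0, 0.5, mode='z', eps=1)
assert abs(x - K * math.cos(0.5)) < 1e-6 and abs(y - K * math.sin(0.5)) < 1e-6
```

Note that both results are still scaled by K; correcting this factor is the subject of the remainder of this section.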
One main drawback of the CORDIC algorithm is the need of correcting the scaling
factor K that arises during the iterations (17). For example, if we use Volder's
CORDIC sequence

(26)  {S(i)} = {0, 1, 2, 3, ⋯, p − 1, p}

with n = p + 1 CORDIC iterations for a mantissa accuracy of 2^{−p}, the scaling factor is
K ≈ 1.64676. Compensating this undesired scaling effect with a minimum number of
computations is of particular importance.
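To make the numbers concrete, the factors k_i of (20) accumulate as follows, and a shift-add correction of the form (28) drives the product back toward 1. This is our own sketch with hypothetical function names; the p = 24 sequence below is reconstructed from the text and may be inexact:

```python
import math

def scale_factor(S):
    """Overall scaling factor K = prod_i k_i with k_i = sqrt(1 + 2**(-2*S(i)))  (20)."""
    return math.prod(math.sqrt(1.0 + 2.0 ** (-2 * s)) for s in S)

def corrected(S, T0, signed_T):
    """Left-hand side of (28): 2**-T(0) * prod_j (1 + eta(j) * 2**-T(j)) * K = 1 + dK,
    where each element of signed_T is the signed integer eta(j) * T(j)."""
    out = 2.0 ** (-T0) * scale_factor(S)
    for t in signed_T:
        out *= 1.0 + math.copysign(1.0, t) * 2.0 ** (-abs(t))
    return out

# Volder's sequence (26) with p = 16: K is approximately 1.64676
K = scale_factor(range(17))

# Example: a catalogue sequence for p = 24 (as reconstructed from the text)
# corrected by {T(0), eta(1)T(1), eta(2)T(2)} = {0, -2, +6}
S24 = [1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9] + list(range(10, 25))
dK = corrected(S24, 0, [-2, +6]) - 1.0   # remaining relative scaling error
```

Each correction factor (1 ± 2^{−T}) costs one shift-add, which is why minimizing the number of such terms matters.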
Clearly, multiplying x_n and y_n in Table 1 by K^{−1} will degrade the algorithm
performance substantially. Most of the scaling correction techniques are based on shift-add
operations. For a two-sided rotation that is implemented by four plane rotations, each matrix
element undergoes two plane rotations so that the total scaling factor to be corrected
is K². In this case, Cavallaro and Luk [16] have pointed out that there is a simple
systematic approach for scaling correction when using the CORDIC sequence (26).
They proposed to use ⌈p/4⌉ scaling iterations of the type x ← x − 2^{−2j}x with j ∈ J =
{1, 3, 5, ⋯, 2⌈p/4⌉ − 1} and one shift operation 2^{−1}. The remaining scaling error is
`
TABLE 1
CORDIC trigonometric functions (ε = ±1).

y_n → 0:    x_n = K sign(x_0) √(x_0² + y_0²),    z_n = z_0 + ε arctan(y_0/x_0)
z_n → 0:    x_n = K (x_0 cos z_0 − ε y_0 sin z_0),    y_n = K (ε x_0 sin z_0 + y_0 cos z_0)
bounded by 2^{−p−1}:¹

(27)  | 1 − 2^{−1} Π_{j∈J} (1 − 2^{−2j}) · K² |
        = | 1 − 2^{−1} Π_{j∈J} (1 − 2^{−2j}) · Π_{i=0}^{p} (1 + 2^{−2i}) | ≤ 2^{−p−1}.
`
This approach, however, fails in the TPR method. Here, each matrix element
undergoes only one plane rotation. The scaling factor to be corrected is thus K rather than
K². In order to solve this more difficult problem, different techniques have been developed
in the literature. Haviland and Tuszynski [17] used similar scaling iterations as Cavallaro
and Luk. Ahmed [18] repeated some CORDIC iterations to force K to a power of the
machine radix. Delosme [19] combined both methods of Haviland, Tuszynski, and
Ahmed for minimizing the number of computations. Deprettere, Dewilde, and Udo [20]
suggested the double-shift concept.
We designed a computer program [21] for a systematic search of CORDIC
sequences. We allow shift parameters S(i) (i = 0, 1, ⋯, n − 1) with differences
S(i + 1) − S(i) ∈ {0, 1, 2} to provide more optimization freedom. For an efficient
scaling correction, we require that the scaling factor K be corrected by a sequence of n_s
shift-add operations,

(28)  2^{−T(0)} · Π_{j=1}^{n_s} (1 + η(j) 2^{−T(j)}) · K = 1 + ΔK    (T(j) integers, η(j) = ±1).

These additional scaling iterations are parametrized by the set of signed integers
{T(0), η(1)T(1), ⋯, η(n_s)T(n_s)}. The total number of iterations is L = n + n_s.
In (28), ΔK denotes the remaining relative scaling error after the scaling correction.
We emphasize that this is a systematic error with a constant sign. By contrast, the other
two types of CORDIC errors, the angular error due to the limited angle resolution and
the rounding error, are of statistical nature because they may be positive or negative.
The scaling error is thus more critical with respect to error accumulation when repeated
CORDIC operations on the same data have to be computed as in SVD applications.
Roughly speaking, the total scaling error after k CORDIC function calls increases linearly
with k, a fact that has been verified by our numerical experiments. For this reason, we
require |ΔK| to be much smaller than 2^{−p}.
We found catalogues of CORDIC sequences with complexity comparable to those
of Cavallaro and Luk. In the following, five examples for different mantissa lengths p =
16, 20, 24, 28, and 32, including the total number of iterations L = n + n_s, the convergence
region C, and the remaining scaling error ΔK, are given:

p = 16:  {S(i)} = {0 1 2 3 ⋯ 15 16},
         {η(j)T(j)} = {1 +2 −5 +9 +10},    L = 17 + 4,    C ≈ 100°,    ΔK ≈ −2^{−16.9} ²

p = 20:  {S(i)} = {0 1 2 3 ⋯ 19 20},
         {η(j)T(j)} = {1 +2 −5 +9 +10 +16},    L = 21 + 5,    C ≈ 100°,    ΔK ≈ 2^{−25.9}

¹ When replacing ⌈p/4⌉ by ⌈(p − 1)/4⌉ or ⌈(p + 1)/4⌉, the upper bound in (27) becomes 2^{−p} or 2^{−p−2},
respectively.
² When appending an additional scaling iteration with η(5)T(5) = +16, the scaling accuracy can be
enhanced to ΔK ≈ 2^{−25.9}.
`
p = 24:  {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 ⋯ 23 24},
         {η(j)T(j)} = {0 −2 +6},    L = 29 + 2,    C ≈ 91°,    ΔK ≈ 2^{−27.3}

p = 28:  {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 11 12 13 14 14 15 ⋯ 27 28},
         {η(j)T(j)} = {0 −2 +6},    L = 34 + 2,    C ≈ 91°,    ΔK ≈ 2^{−32.5}

p = 32:  {S(i)} = {0 0 1 3 3 3 4 5 6 7 8 9 9 10 ⋯ 31 32},
         {η(j)T(j)} = {1 −3 −8 +16 −25 −27},    L = 36 + 5,    C ≈ 145°,    ΔK ≈ −2^{−39.5}
`
Remember that in order to meet the convergence condition (23) and to provide a
convergence region C ≥ 90°, the minimum number of CORDIC iterations is p + 1. So,
for all CORDIC sequences given above, the number L − (p + 1) of additional iterations
for scaling correction is p/4. Moreover, except for the first CORDIC sequence, the
remaining scaling error |ΔK| is significantly smaller than 2^{−p}. This leads to improved
numerical properties compared with other CORDIC sequences reported in the literature.
We also remark that if the symmetric eigenvalue problem is considered, for which a
convergence region of C ≥ 45° is sufficient [2], the total number of CORDIC iterations
can be further reduced. An example that is nearly identical to the last CORDIC sequence
given above is

p = 32:  {S(i)} = {1 3 3 3 4 5 6 7 8 9 9 10 ⋯ 31 32},
         {η(j)T(j)} = {0 −3 −8 +16 −25 −27},    L = 34 + 5,    C ≈ 55°,    ΔK ≈ −2^{−39.3}

For comparison, Delosme [5] has also given an optimized CORDIC sequence for the
same situation. His sequence requires one iteration more (L = 40) and achieves a scaling
accuracy of ΔK ≈ 2^{−33.6}.
We suspect that similar results can also be obtained by using Deprettere's double-shift
concept. However, this method requires a slightly increased hardware complexity
and will not be discussed in this paper.
`
5. CORDIC implementation of the SVD PEs. For easy illustration, we first introduce
a CORDIC processor symbol as shown in Fig. 2. The descriptions inside the box determine
uniquely the function mode of the CORDIC processor according to Table 1. The output
data x and y are assumed to be scaling corrected.
It is now simple to map the operations (8)-(11) and (12)-(15) of the TPR method
onto a two-CORDIC-processor architecture. In Fig. 3, the diagonal PEs of the SVD array
are implemented by two CORDIC processors and eight adders. The dotted inputs of the
adders represent negated inputs. Because the diagonal PEs work in the generation mode,
`
FIG. 2. CORDIC symbol.
`
`
FIG. 3. CORDIC implementation of the diagonal PE of the SVD array.
`
both CORDIC processors are driven in the "y → 0" mode for computing Cartesian-to-polar
coordinate conversions. In Fig. 4, the off-diagonal PEs working in the rotation
mode are implemented by two CORDIC processors and ten adders. Here, the CORDIC
processors are driven in the "z → 0" mode for performing plane rotations.
Obviously, both CORDIC implementations have nearly the same architecture. All
PEs of the SVD array can thus be implemented by one unified CORDIC SVD module
(Fig. 5) without considerably increased hardware cost. The different computation modes
of the diagonal and off-diagonal PEs are easily "programmed" by one control bit. The
resulting SVD array is similar to that in Fig. 1, but homogeneous with identical PEs.
We remark that Fig. 5 is more a "graphic program" describing the sequence of
operations to be computed than a hardware block diagram. We show in the following
that the 12 adders that are paired into three pre-butterflys and three post-butterflys
can be integrated into the two CORDIC processors without separate hardware realizations.
The Jacobi SVD method is a recursive method. Each PE of the SVD array has to exchange
data with its diagonal neighbors. Because of this data dependency, only recursive CORDIC
processors can be used here. Such a processor is an arithmetic unit consisting mainly of three adders
and two barrel shifters. It carries out the L iterations of the CORDIC algorithm in L
cycles by using data feedback. The two CORDIC processors contained in one CORDIC
SVD module require six adders altogether. So, it is natural to modify the CORDIC
processor architecture slightly and to use the existing six adders for computing both the
pre-butterfly and the post-butterfly operations. The resulting CORDIC SVD module has
`
`
`
FIG. 4. CORDIC implementation of the off-diagonal PE of the SVD array.
FIG. 5. A unified CORDIC SVD module for implementing all PEs of the SVD array.
`
the hardware complexity of two recursive CORDIC processors and requires a total
computation time of L + 2 iterations.
In Fig. 6, the principal architecture of such a two-CORDIC-processor SVD module
is shown. The dashed lines and boxes represent the additional hardware components
`
FIG. 6. The principal architecture of the unified CORDIC SVD module.
enabling the CORDIC processors to compute the butterfly operations. It is easily verified
that the upper four adders devoted to x and y can perform the following types of operations:
2^{−1}x ± 2^{−1}y (pre-butterfly), x ± y (post-butterfly), x ± 2^{−S(i)}y (CORDIC iteration), and
x ± 2^{−T(j)}x (scaling iteration), while the lower two adders devoted to z can compute
2^{−1}z1 ± 2^{−1}z2 (pre-butterfly), z1 ± z2 (post-butterfly), and z ± α_i (CORDIC iteration),
respectively. The cross switch between the registers "Y1" and "X2" is needed to exchange
data when the CORDIC SVD module switches from the pre-butterfly operations into
the CORDIC iterations or from the CORDIC iterations into the post-butterfly operations,
respectively. To see this, note from Fig. 3 that the output data pairs of the pre-butterflys are
(p1, p2) and (q1, q2), while the desired input data pairs for the CORDIC iterations are
(p1, q1) and (p2, q2), respectively. So, p2 and q1 have to be exchanged.
`
6. Comparisons. We now compare the new CORDIC SVD module with those
proposed by Cavallaro and Luk [4] and Delosme [5]. Let A_csvd and T_csvd denote the area
and time complexity of a CORDIC SVD module and A_cordic and T_cordic those of a CORDIC
processor, respectively. Cavallaro and Luk have shown that their most efficient parallel
diagonalization method requires A_csvd = 2 A_cordic and T_csvd ≈ 3 T_cordic for the diagonal
PEs and A_csvd = 2 A_cordic and T_csvd ≈ 2 T_cordic for the off-diagonal PEs. By using the TPR
method, we require A_csvd = 2 A_cordic and T_csvd ≈ T_cordic for all PEs. In other words, having
approximately the same hardware complexity, the computation time is reduced by
more than 50 percent.
`A comparison to Delosme’s method is more difficult because he follows a quite
`different approach. Therefore, only rough performance estimates are given here. In our
`method, we compute the rotation angles explicitly. After these computations have been
`completed in the diagonal PEs, the angles propagate to the off-diagonal ones. We assume
`that the propagation from one PEtoits neighbors takes one cycle Tyyce, the time required
`for computing one CORDIC iteration. This implies local communications without
`broadcasting data. At the beginning of the second propagation cycle, the angles reach
`the diagonal neighbors of the diagonal PEs which complete their computations after
`T.sva. This means that the diagonal PEs have to wait for a delay time Tactay = Teycre +
`Tesva before they can exchange data with their diagonal neighbors.* Thetotal time elapsed
`between two adjacent activities at each PE is thus Toya + Tactay = 2Tesva + Teycie © 2 Teordic
`because Teycie is negligible with respect to Teoraic = L* Teycte-
Delosme does not compute the rotation angles explicitly. He rather calculates
encodings of the angles, i.e., sequences of signs ±1, and sends them to the off-diagonal PEs.
This enables overlap of diagonal and off-diagonal rotations because the encoding signs
are recursively obtained and become available before the completion of diagonal
operations. Accordingly, no delay time is required (T_delay = 0), provided that the SVD array
size (half of the matrix size) is smaller than the number of CORDIC scaling iterations
n_s (for details, see [5]). The drawback is, however, that the TPR method cannot be
applied to the off-diagonal PEs. Four plane rotations are hence required, resulting in
T_csvd ≈ 2 T_cordic for two CORDIC processors in one module. In other words, the time
complexities T_csvd + T_delay of both methods are nearly identical and equal 2 T_cordic. If,
however, multiple problems are interleaved, the fraction of idle time that is 50 percent
in our case can be reduced to almost zero. In such a situation, our method provides
double the speed compared with Delosme's.
`
³ If the propagation time is assumed to be T_csvd, we get the well-known result T_delay = 2 T_csvd given by Brent
and Luk [2].
`
In terms of area complexity, both CORDIC SVD modules contain two CORDIC
processors. Our module consists essentially of six adders, four barrel shifters, and one
ROM table containing the n angle values α_i. Delosme's architecture requires four carry-save
adders, four adders, and eight barrel shifters. So, as a rough estimate, both SVD modules
have the same order of area complexity.
Perhaps the most important advantage of Delosme's approach is the 2-bit wide
horizontal and vertical data connections for sending angle encodings serially rather than
sending the full angle values in parallel. The prices are the upper bound on the SVD array
size depending on the number of CORDIC scaling iterations, the relatively complicated
timing, and a nonregular CORDIC architecture design. We also mention that while
Delosme's method presumes a CORDIC implementation, the TPR method is applicable
to other computing architectures.
`
7. Other applications of the TPR method. Another advantage of the TPR method
seems to be the relatively wide range of applications. We indicate some of them in the
following.
For the SVD of a rectangular matrix, a well-known method is first to triangularize
the matrix by QR decomposition and then to apply the Jacobi SVD procedure to the
triangular factor. Luk [22] has shown that both steps can be implemented by one triangular
systolic array. Each PE contains a 2 × 2 submatrix. It applies two plane rotations (through
the same angle) to the two column vectors at the QR step and a two-sided rotation at
the SVD step. For computing the SVD step, the PE can be realized by the CORDIC
SVD module, as before. On the other side, the two CORDIC processors contained in
the module are also appropriate to perform the two plane rotations of the QR step. The
CORDIC SVD module presented in this paper thus provides a suitable PE for Luk's
triangular SVD array.
Stewart [23] has proposed a square systolic array for computing the Schur decomposition
(SD) of a non-Hermitian matrix which, for example, is useful for evaluating
the eigenvalues of the matrix. His approach is similar to the Jacobi SVD method. It is
based on applying a sequence of two-sided rotations to 2 × 2 submatrices, where the left
and right rotation angles are identical to make the diagonal submatrices upper triangular.
While the diagonal PEs perform operations different from those in SVD, the off-diagonal
PEs have exactly the same computational task as in SVD computing. Therefore