SIAM J. MATRIX ANAL. APPL.
Vol. 12, No. 4, pp. 713-725, October 1991

© 1991 Society for Industrial and Applied Mathematics
009
`
REDUCING THE COMPUTATIONS OF THE SINGULAR VALUE
DECOMPOSITION ARRAY GIVEN BY BRENT AND LUK*

B. YANG† AND J. F. BÖHME†
`
Abstract. A new, efficient, two-plane rotation (TPR) method for computing two-sided rotations involved
`in singular value decomposition (SVD) is presented. It is shown that a two-sided rotation can be evaluated by
`only two plane rotations and a few additions. This leads to significantly reduced computations. Moreover, if
`coordinate rotation digital computer (CORDIC) processors are used for realizing the processing elements (PEs)
`of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle
`calculations can be avoided. The resulting SVD array has a homogeneous structure with identical diagonal and
`off-diagonal PEs. Similar results can also be obtained if the TPR method is applied to Luk’s triangular SVD
`array and to Stewart’s Schur decomposition array.
`
`Key words. singular value decomposition, systolic arrays, CORDIC, two-sided rotations, VLSI
`
AMS(MOS) subject classification. 15A18
`
`1. Introduction. One important problem in linear algebra and digital signal pro-
`cessing is the singular value decomposition (SVD). Typical applications arise in beam-
`forming and direction finding, spectrum analysis, digital image processing, etc. [1]. Re-
`cently, there has been a massive interest in parallel architectures for computing SVD
`because of the high computational complexity of SVD, the growing importance of real-
`time signal processing, and the rapid advances in very large scale integration (VLSI) that
`make low-cost, high-density and fast processing memory devices available.
`There are different numerically stable methods for computing complete singular
`value and singular vector systems of dense matrices, for example, the Jacobi SVD method,
`the QR method, and the one-sided Hestenes method. For parallel implementations, the
`Jacobi SVD method is far superior in terms of simplicity, regularity, and local com-
`munications. Brent, Luk, and Van Loan have shown how the Jacobi SVD method with
`parallel ordering can be implemented by a two-dimensional systolic array [2 ], [3]. Various
`coordinate rotation digital computer (CORDIC) realizations of the SVD array have been
`reported by Cavallaro and Luk [4] and Delosme [5], [6].
`The Jacobi SVD method is based on, as common for all two-sided approaches,
`applying a sequence of two-sided rotations to 2 X 2 submatrices of the original matrix.
`The computational complexity is thus determined by how to compute the two-sided
`rotations. In most previous works, a two-sided rotation is evaluated in a straightforward
`manner by four plane rotations, where two of them are applied from left to the two
`column vectors of the 2 x 2 submatrix and the other ones are applied from right to the
`row vectors, respectively. In the diagonal processing elements (PEs), additional operations
`for calculating rotation angles are required. This leads to an inhomogeneous array ar-
chitecture containing two different types of PEs.
In this paper, we develop a two-plane rotation (TPR) method for computing two-
sided rotations. We show that the above computational complexity can be reduced sig-
`nificantly because each two-sided rotation can be evaluated by only two plane rotations
`and a few additions. Moreover, the SVD array given by Brent and Luk becomes ho-
`mogeneous with identical diagonal and off-diagonal PEs when CORDIC processors are
`
* Received by the editors September 28, 1989; accepted for publication (in revised form) August 2, 1990.
† Department of Electrical Engineering, Ruhr-Universität Bochum, 4630 Bochum, Germany.
`
used. In a recent work [6], Delosme has also indicated this possibility in connection
with "rough rotations" independently. He has taken, however, a different approach that
is based on encoding the rotation angles. He still requires four plane rotations in
the off-diagonal PEs, while diagonal and off-diagonal operations can be overlapped.
`Our paper is organized as follows. In § 2, we briefly reexamine Jacobi’s SVD method
`and Brent and Luk’s SVD array. Then, we develop the TPR method in § 3. The CORDIC
`algorithm is described in § 4, where in particular CORDIC scaling correction techniques
`are discussed and examples of scaling-corrected CORDIC sequences are given. In § 5, a
`unified CORDIC SVD module for all PEs of the SVD array is presented. This module
`is compared to those proposed by Cavallaro, Luk, and Delosme in § 6. Finally, we stress
`the applicability of the TPR method to several other problems.
`
2. Jacobi SVD method. In this paper, we consider real, square, and nonsymmetric
matrices. Let M ∈ R^{N×N} be a matrix of dimension N. The SVD is given by

(1) M = U Σ V^T,

where U ∈ R^{N×N} and V ∈ R^{N×N} are orthogonal matrices containing the left and right
singular vectors, and Σ ∈ R^{N×N} is a diagonal matrix of singular values, respectively. The
superscript T denotes matrix transpose. Based on an extension of the Jacobi eigenvalue
algorithm [7], Kogbetliantz [8] and Forsythe and Henrici [9] proposed to diagonalize
M by a sequence of two-sided rotations,

(2) M_0 = M,  M_{k+1} = U_k^T M_k V_k  (k = 0, 1, 2, …).

U_k and V_k describe two rotations in the (i, j)-plane (1 ≤ i < j ≤ N), where the rotation
angles are chosen to annihilate the elements of M_k at the positions (i, j) and (j, i).
Usually, several sweeps are necessary to complete the SVD, where a sweep is a sequence
of N(N − 1)/2 two-sided rotations according to a special ordering of the N(N − 1)/2
different index pairs (i, j).
`For sequential computing on a uniprocessor system, possibly the most frequently
`used orderings are the cyclic orderings, namely, the cyclic row ordering
`
(3) (i, j) = (1, 2), (1, 3), …, (1, N), (2, 3), …, (2, N), …, (N − 1, N),
`
`or the equivalent cyclic column ordering. Sameh [10] and Schwiegelshohn and Thiele
`[11] have shown how to implement the cyclic row ordering on a ring-connected or a
`mesh-connected processor array. Recently, a variety of parallel orderings have been de-
`veloped. Luk and Park [12 ] have shown that these parallel orderings are essentially equiv-
`alent to the cyclic orderings and thus share the same convergence properties.
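As a small aside (our illustration, not part of the original paper), one sweep of the cyclic row ordering (3) is easy to enumerate:

```python
def cyclic_row_ordering(N):
    """One sweep of the cyclic row ordering (3): the N(N-1)/2 pairs (i, j), 1 <= i < j <= N."""
    return [(i, j) for i in range(1, N + 1) for j in range(i + 1, N + 1)]

# For N = 4 a sweep visits (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
```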
Brent and Luk have suggested a particular parallel ordering and developed a square
systolic array consisting of ⌈N/2⌉ × ⌈N/2⌉ PEs for implementing the Jacobi SVD method
(Fig. 1). To do this, the matrix M is partitioned into 2 × 2 submatrices. Each PE contains
one submatrix and performs a two-sided rotation
`
(4) B = R(θ1)^T A R(θ2),

where

(5) A = ( a11  a12 ; a21  a22 )  and  B = ( b11  b12 ; b21  b22 )
`
FIG. 1. The SVD array given by Brent and Luk.
`
denote the submatrix before and after the two-sided rotation, respectively, and

(6) R(θ) = ( cos θ  sin θ ; −sin θ  cos θ )

describes a plane rotation through the angle θ. At first, the diagonal PEs (symbolized by
a double square in Fig. 1) generate the rotation angles to diagonalize the 2 × 2 submatrices
(b12 = b21 = 0) stored in them. This means that θ1 and θ2 are first calculated from the
elements of A and then relation (4) is used to compute b11 and b22. We call this the
generation mode. Then, the rotation angles are sent to all off-diagonal PEs in the following
way: the angles associated with the left-side rotations propagate along the rows while the
angles associated with the right-side rotations propagate along the columns. Once these
angles are received, the off-diagonal PEs perform the two-sided rotations (4) on their
stored data. We call this the rotation mode. Clearly, if we compute the rotation mode
straightforwardly, we require four plane rotations. For the generation mode, additional
operations for calculating θ1 and θ2 are required.
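For reference, the straightforward rotation mode can be sketched as follows (our own illustration; the helper names are ours). It evaluates (4) literally via 2 × 2 matrix products, which is the four-plane-rotation baseline that the TPR method improves on; the convention of (6) is assumed.

```python
import math

def rot(theta):
    # Plane rotation R(theta) as in (6).
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def matmul2(X, Y):
    # Product of two 2 x 2 matrices.
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(X):
    return [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]

def two_sided_rotation(A, theta1, theta2):
    # Straightforward evaluation of (4): B = R(theta1)^T A R(theta2).
    return matmul2(matmul2(transpose2(rot(theta1)), A), rot(theta2))
```

Since R(θ1)^T and R(θ2) are orthogonal, B has the same Frobenius norm as A, which is a convenient sanity check.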
`
3. TPR method for computing two-sided rotations. In order to develop the TPR
method for computing two-sided rotations more efficiently, we first discuss the com-
mutative properties of two special types, the rotation-type and the reflection-type, of
2 × 2 matrices. We define

(7) M^rot = { ( x  y ; −y  x ) | x, y ∈ R }  and  M^ref = { ( x  y ; y  −x ) | x, y ∈ R }.

The former is called rotation-type because it has the same matrix structure as a 2 × 2
plane rotation matrix. Similarly, the latter is called reflection-type because it has the
same matrix structure as a 2 × 2 Givens reflection matrix [13]. Note that x and y need
not be normalized to x² + y² = 1. Using the above definitions, the following results can
be shown by some elementary manipulations.
LEMMA 1. If A1 ∈ M^rot and A2 ∈ M^rot, then A1 A2 = A2 A1 ∈ M^rot.
LEMMA 2. If A1 ∈ M^ref and A2 ∈ M^rot, then A1 A2 = A2^T A1 ∈ M^ref.
`In particular, if we consider two plane rotations, we know the following.
LEMMA 3. If R(θ1) and R(θ2) are plane rotations described by (6), then
R(θ1) R(θ2) = R(θ1 + θ2)  and  R(θ1)^T R(θ2) = R(θ2 − θ1).
Now, we give a theorem describing the rotation mode of the TPR method.
THEOREM. If the 2 × 2 matrix A and the two rotation angles θ1 and θ2 are given,
then the two-sided rotation (4) can be computed by two plane rotations, ten additions,
and four scalings by 1/2:
`
(8) p1 = (a22 + a11)/2,  q1 = (a21 − a12)/2,

(9) p2 = (a22 − a11)/2,  q2 = (a21 + a12)/2,

(10) θ− = θ2 − θ1,  θ+ = θ2 + θ1,
     (r1, t1)^T = R(θ−) (p1, q1)^T,  (r2, t2)^T = R(θ+) (p2, q2)^T,

(11) b11 = r1 − r2,  b12 = −t1 + t2,  b21 = t1 + t2,  b22 = r1 + r2.
`
Proof. Using (8) and (9), the matrix A can be reformulated as

A = A1 + A2 = ( p1  −q1 ; q1  p1 ) + ( −p2  q2 ; q2  p2 ).

Clearly, R(θ1), R(θ2) in (4) and A1 are elements of M^rot while A2 belongs to M^ref. This
leads to the following reformulation of the matrix B by using Lemmas 1-3:

B = R(θ1)^T A R(θ2)
  = R(θ1)^T A1 R(θ2) + R(θ1)^T A2 R(θ2)
  = R(θ1)^T R(θ2) A1 + R(θ1)^T R(θ2)^T A2
  = R(θ2 − θ1) A1 + R(θ1 + θ2)^T A2
  = R(θ−) ( p1  −q1 ; q1  p1 ) + R(θ+)^T ( −p2  q2 ; q2  p2 ).

Applying (10) to the two matrix products in the last expression yields (11).
This completes the proof.
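The theorem translates directly into code. The following sketch (ours; the helper names are not from the paper) evaluates (8)-(11) with exactly two plane rotations plus additions and scalings, again assuming the convention of (6):

```python
import math

def rot_apply(theta, x, y):
    # Apply R(theta) of (6) to a vector: R(theta) (x, y)^T.
    c, s = math.cos(theta), math.sin(theta)
    return c * x + s * y, -s * x + c * y

def tpr_rotation_mode(A, theta1, theta2):
    # Rotation mode of the TPR method, eqs. (8)-(11):
    # two plane rotations, ten additions, four scalings by 1/2.
    (a11, a12), (a21, a22) = A
    p1, q1 = (a22 + a11) / 2, (a21 - a12) / 2       # (8)
    p2, q2 = (a22 - a11) / 2, (a21 + a12) / 2       # (9)
    r1, t1 = rot_apply(theta2 - theta1, p1, q1)     # (10), angle theta_-
    r2, t2 = rot_apply(theta2 + theta1, p2, q2)     # (10), angle theta_+
    return [[r1 - r2, -t1 + t2], [t1 + t2, r1 + r2]]   # (11)
```

The result agrees with the direct four-rotation evaluation of B = R(θ1)^T A R(θ2).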
The generation mode of the TPR method follows directly from the above theorem.
COROLLARY. If the 2 × 2 matrix A is given, we can diagonalize A and calculate
the corresponding rotation angles θ1 and θ2 by two Cartesian-to-polar coordinates con-
versions, eight additions, and four scalings by 1/2:

(12) p1 = (a22 + a11)/2,  p2 = (a22 − a11)/2,  q1 = (a21 − a12)/2,  q2 = (a21 + a12)/2,

(13) r1 = sign(p1) √(p1² + q1²),  r2 = sign(p2) √(p2² + q2²),
     θ− = arctan(q1/p1),  θ+ = arctan(q2/p2),

(14) θ1 = (θ+ − θ−)/2,  θ2 = (θ+ + θ−)/2,

(15) b11 = r1 − r2,  b22 = r1 + r2.
`
Proof. Regarding (11), b12 = b21 = 0 is equivalent to t1 = t2 = 0. Equation (13)
then follows from (10). This completes the proof.
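Correspondingly, a sketch of the generation mode (ours; Python's `hypot` and `atan` stand in for the two Cartesian-to-polar coordinates conversions of (13)):

```python
import math

def tpr_generation_mode(A):
    # Generation mode of the TPR method, eqs. (12)-(15).
    # Returns (theta1, theta2, B) with B = R(theta1)^T A R(theta2) diagonal.
    (a11, a12), (a21, a22) = A
    p1, p2 = (a22 + a11) / 2, (a22 - a11) / 2                     # (12)
    q1, q2 = (a21 - a12) / 2, (a21 + a12) / 2
    r1 = math.hypot(p1, q1) if p1 >= 0 else -math.hypot(p1, q1)   # (13)
    r2 = math.hypot(p2, q2) if p2 >= 0 else -math.hypot(p2, q2)
    tm = math.atan(q1 / p1) if p1 != 0 else math.copysign(math.pi / 2, q1)
    tp = math.atan(q2 / p2) if p2 != 0 else math.copysign(math.pi / 2, q2)
    theta1, theta2 = (tp - tm) / 2, (tp + tm) / 2                 # (14)
    return theta1, theta2, [[r1 - r2, 0.0], [0.0, r1 + r2]]       # (15)
```

For A = ((1, 2), (3, 4)) this returns b11 ≈ −0.366 and b22 ≈ 5.465; their absolute values are the singular values of A, and b11 · b22 equals det A = −2.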
In equation (13), we choose the rotation through the smaller angle. All vectors
lying in the first or the fourth quadrant are rotated onto the positive x-axis, and all vectors
lying in the second or the third quadrant are rotated onto the negative x-axis. For
vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation
angles θ− and θ+ satisfy |θ−|, |θ+| ≤ 90°. This results in

(16) |θ1| ≤ 90°  and  |θ2| ≤ 90°

due to (14).
Equation (16) is important with respect to the convergence of the Jacobi SVD
method. Forsythe and Henrici [9] have proven the convergence for cyclic orderings if
the rotation angles θ1 and θ2 are restricted to a closed interval inside the open interval
(−90°, 90°). They have also demonstrated that this condition may fail to hold, i.e., θ1
and θ2 may be ±90°, if the off-diagonal elements b12 and b21 in (5) have to be exactly
annihilated. As a remedy, they suggested an under- or overrotation by computing the
two-sided rotation (4) with angles (1 − γ)θ1 and (1 − γ)θ2 (−1 < γ < 1) and proved
its convergence. In practice, however, the finite machine accuracy of real arithmetic
allows only an approximate computation of the rotation angles and implies under- or
overrotations. So the Jacobi SVD method converges without using under- or overrotations,
as shown by the experimental results of Brent, Luk, and Van Loan [3]. In the case of CORDIC
implementations, the effect of implicit under- or overrotations is more apparent. The
angles ±90° can never be exactly calculated because of the limited angle resolution arc-
tan(2^{−p}) of the CORDIC algorithm, where p denotes the mantissa length.
`
4. The CORDIC algorithm. In the previous section, we have seen that the main
operations of the TPR method are plane rotations and Cartesian-to-polar coordinates
conversions. These operations can be carried out by multiplier-adder-based processors
supported by software or special hardware units. An alternative approach is the use of
dedicated processors that usually map algorithms more effectively to hardware. The
CORDIC processor is such a powerful one for calculating trigonometric functions.
The CORDIC algorithm was originally designed by Volder [14] as an iterative pro-
cedure for computing plane rotations and Cartesian-to-polar coordinates conversions. It
was later generalized and unified by Walther [15], enabling a CORDIC processor to
calculate more functions, including hyperbolic functions, as well as multiplications and
divisions. In the following, we consider Volder's CORDIC algorithm because only trig-
onometric functions are involved in SVD applications.
The CORDIC algorithm consists of iterative shift-add operations on a three-com-
ponent vector,

(17) ( x_{i+1} ; y_{i+1} ) = ( x_i − σ_i δ_i y_i ; y_i + σ_i δ_i x_i )
                       = (1/cos α_i) ( cos(σ_i α_i)  −sin(σ_i α_i) ; sin(σ_i α_i)  cos(σ_i α_i) ) ( x_i ; y_i ),

(18) z_{i+1} = z_i − σ σ_i α_i

(0 < δ_i ≤ 1; σ_i = ±1; σ = ±1; i = 0, 1, …, n − 1),

in which the iteration stepsize δ_i is defined by

(19) δ_i = tan(α_i) = 2^{−S(i)}.

The set of integers {S(i)} parametrizing the iterations is called the CORDIC sequence.
Equation (17) can be interpreted, except for a scaling factor of

(20) k_i = 1/cos(α_i) = √(1 + δ_i²),

as a rotation of (x_i, y_i)^T through the angle σ_i α_i, where the sign σ_i = ±1 gives the rotation
direction. After n iterations, the results are given by

(21) ( x_n ; y_n ) = K ( cos α  −sin α ; sin α  cos α ) ( x_0 ; y_0 ),

(22) z_n = z_0 − σα,
`
with the overall scaling factor K = Π_i k_i and the total rotation angle α = Σ_i σ_i α_i. Now,
if the CORDIC sequence satisfies the following convergence condition

(23) α_i − Σ_{j=i+1}^{n−1} α_j ≤ α_{n−1}  (i = 0, 1, …, n − 2),

we can choose the sign parameter

(24) σ_i = −sign(x_i y_i)  for y_n → 0,   σ_i = sign(σ z_i)  for z_n → 0

to force y_n or z_n to zero, provided that the input data x_0, y_0, and z_0 lie in the conver-
gence region

(25) C := Σ_{i=0}^{n−1} α_i ≥ |arctan(y_0/x_0)|  for y_n → 0,   C ≥ |z_0|  for z_n → 0.

In this way, two different types of CORDIC trigonometric functions can be computed
(Table 1). In the mode y_n → 0, the Cartesian coordinates (x_0, y_0) of a plane vector are
converted to the polar representation, where the parameter σ = ±1 determines the sign
of the phase angle calculated. When z_n → 0, a given plane vector is rotated through the
angle z_0, where σ = ±1 controls the rotation direction.
In Table 1, the principal value |arctan(y_0/x_0)| ≤ 90° of the inverse tangent function
is calculated when computing Cartesian-to-polar coordinates conversions. Correspond-
ingly, x_n may be positive or negative according to the sign of x_0. So, it is guaranteed that
a vector is always rotated through the smaller angle onto the x-axis in accordance with
(13). In this case, a convergence region of C ≥ 90° is sufficient for the generation mode
of the two-sided rotation.
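As a floating-point sketch of (17)-(19) and (24) with the sequence (26) below (our illustration; a hardware CORDIC replaces the multiplications by shift-add operations and corrects the scaling factor K separately):

```python
import math

def cordic(x, y, z, mode, sigma=1, p=16):
    # Volder CORDIC, eqs. (17)-(19), with {S(i)} = {0, 1, ..., p}.
    # mode 'y': drive y_n -> 0 (Cartesian-to-polar conversion).
    # mode 'z': drive z_n -> 0 (plane rotation through z_0).
    K = 1.0
    for S in range(p + 1):
        delta = 2.0 ** (-S)                   # (19): delta_i = tan(alpha_i) = 2^(-S(i))
        alpha = math.atan(delta)
        if mode == 'y':
            s_i = -1 if x * y > 0 else 1      # (24): sigma_i = -sign(x_i y_i)
        else:
            s_i = 1 if sigma * z > 0 else -1  # (24): sigma_i = sign(sigma z_i)
        x, y = x - s_i * delta * y, y + s_i * delta * x   # (17)
        z -= sigma * s_i * alpha                          # (18)
        K *= math.sqrt(1 + delta * delta)     # (20): k_i = 1 / cos(alpha_i)
    return x, y, z, K

# In the y -> 0 mode, (3, 4) is converted to (K * 5, 0) while z accumulates arctan(4/3),
# in accordance with Table 1.
```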
`
One main drawback of the CORDIC algorithm is the need to correct the scal-
ing factor K that arises during the iterations (17). For example, if we use Volder's
CORDIC sequence

(26) {S(i)} = {0, 1, 2, 3, …, p − 1, p},

with n = p + 1 CORDIC iterations for a mantissa accuracy of 2^{−p}, the scaling factor is
K ≈ 1.64676. Compensating this undesired scaling effect with a minimum number of
computations is of particular importance.
Clearly, multiplying x_n and y_n in Table 1 by K^{−1} will degrade the algorithm per-
formance substantially. Most of the scaling correction techniques are based on shift-add oper-
ations. For a two-sided rotation that is implemented by four plane rotations, each matrix
element undergoes two plane rotations so that the total scaling factor to be corrected
is K². In this case, Cavallaro and Luk [16] have pointed out that there is a simple
systematic approach for scaling correction when using the CORDIC sequence (26).
They proposed to use ⌈p/4⌉ scaling iterations of the type x ← x − 2^{−2j} x with j ∈ J =
{1, 3, 5, …, 2⌈p/4⌉ − 1} and one shift operation 2^{−1}. The remaining scaling error is
`
TABLE 1
CORDIC trigonometric functions (σ = ±1).

y_n → 0:  x_n = K sign(x_0) √(x_0² + y_0²),   z_n = z_0 + σ arctan(y_0/x_0)
z_n → 0:  x_n = K(x_0 cos z_0 − σ y_0 sin z_0),   y_n = K(σ x_0 sin z_0 + y_0 cos z_0)
`
`
bounded by 2^{−p−1}:¹

(27) |1 − 2^{−1} Π_{j∈J} (1 − 2^{−2j}) · K²|
     = |1 − 2^{−1} Π_{j∈J} (1 − 2^{−2j}) Π_{i=0}^{p} (1 + 2^{−2i})| < 2^{−p−1}.
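The identity (27) can be checked numerically; the sketch below (ours) evaluates K in floating point rather than by shift-add operations:

```python
import math

def volder_K(p):
    # Scaling factor K = prod_{i=0}^{p} sqrt(1 + 2^(-2i)) of the sequence (26).
    K = 1.0
    for i in range(p + 1):
        K *= math.sqrt(1 + 2.0 ** (-2 * i))
    return K

def k2_correction_error(p):
    # Residual |1 - 2^(-1) prod_{j in J} (1 - 2^(-2j)) K^2| of (27),
    # with J = {1, 3, 5, ..., 2*ceil(p/4) - 1}.
    corr = 0.5
    for j in range(1, 2 * math.ceil(p / 4), 2):
        corr *= 1 - 2.0 ** (-2 * j)
    return abs(1 - corr * volder_K(p) ** 2)
```

For p = 16, 24, and 32 the residual indeed stays below the bound 2^{−p−1}.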
`
This approach, however, fails for the TPR method. Here, each matrix element un-
dergoes only one plane rotation. The scaling factor to be corrected is thus K rather than
K². In order to solve this more difficult problem, different techniques have been developed
in the literature. Haviland and Tuszynski [17] used scaling iterations similar to those of Cavallaro
and Luk. Ahmed [18] repeated some CORDIC iterations to force K to a power of the
machine radix. Delosme [19] combined the methods of Haviland, Tuszynski, and
Ahmed to minimize the number of computations. Deprettere, Dewilde, and Udo [20]
suggested the double-shift concept.
We designed a computer program [21] for a systematic search of CORDIC
sequences. We allow shift parameters S(i) (i = 0, 1, …, n − 1) with differences
S(i + 1) − S(i) ∈ {0, 1, 2} to provide more optimization freedom. For an efficient
scaling correction, we require that the scaling factor K be corrected by a sequence of m
shift-add operations,

(28) 2^{−T(0)} Π_{j=1}^{m} (1 + η(j) 2^{−T(j)}) · K = 1 + ΔK  (T(j) integers, η(j) = ±1).

These additional scaling iterations are parametrized by the set of signed integers
{T(0), η(1)T(1), …, η(m)T(m)}. The total number of iterations is L = n + m.
In (28), ΔK denotes the remaining relative scaling error after the scaling correction.
We emphasize that this is a systematic error with a constant sign. By contrast, the other
two types of CORDIC errors, the angular error due to the limited angle resolution and
the rounding error, are of statistical nature because they may be positive or negative.
The scaling error is thus more critical with respect to error accumulation when repeated
CORDIC operations on the same data have to be computed, as in SVD applications.
Roughly speaking, the total scaling error after k CORDIC function calls increases linearly
with k, a fact that has been verified by our numerical experiments. For this reason, we
require |ΔK| to be much smaller than 2^{−p}.
We found catalogues of CORDIC sequences with complexity comparable to those
of Cavallaro and Luk. In the following, five examples for different mantissa lengths p =
16, 20, 24, 28, and 32, including the total number of iterations L = n + m, the convergence
region C, and the remaining scaling error ΔK, are given:
`
p = 16: {S(i)} = {0 1 2 3 … 15 16},
        {η(j)T(j)} = {1 +2 −5 +9 +10},  L = 17 + 4,  C ≈ 100°,  ΔK ≈ −2^{−16.0};²

p = 20: {S(i)} = {0 1 2 3 … 19 20},
        {η(j)T(j)} = {1 +2 −5 +9 +10 +16},  L = 21 + 5,  C ≈ 100°,  ΔK ≈ 2^{−23};
`
¹ When replacing ⌈p/4⌉ by ⌈(p − 1)/4⌉ or ⌈(p + 1)/4⌉, the upper bound in (27) becomes 2^{−p} or 2^{−p−2},
respectively.
² When appending an additional scaling iteration with η(5)T(5) = +16, the scaling accuracy can be
enhanced to ΔK ≈ 2^{−23}.
`
`
p = 24: {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 … 23 24},
        {η(j)T(j)} = {0 −2 +6},  L = 29 + 2,  C ≈ 91°,  ΔK ≈ −2^{−29};

p = 28: {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 11 12 13 14 14 15 … 27 28},
        {η(j)T(j)} = {0 −2 +6},  L = 34 + 2,  C ≈ 91°,  ΔK ≈ 2^{−32.53};

p = 32: {S(i)} = {0 0 1 3 3 3 4 5 6 7 8 9 9 10 … 31 32},
        {η(j)T(j)} = {1 −3 −8 +16 −25 −27},  L = 36 + 5,  C ≈ 145°,  ΔK ≈ −2^{−39}.
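The residual ΔK of (28) for the sequences above can be checked in the same spirit (our sketch, again evaluating K in floating point); here for the p = 16 entry:

```python
import math

def cordic_K(S):
    # K = prod_i sqrt(1 + 2^(-2 S(i))) for a CORDIC sequence {S(i)}.
    K = 1.0
    for s in S:
        K *= math.sqrt(1 + 2.0 ** (-2 * s))
    return K

def scaling_residual(S, T0, etaT):
    # Delta K of (28): 2^(-T(0)) * prod_j (1 + eta(j) 2^(-T(j))) * K - 1,
    # with the scaling iterations given as signed integers eta(j)T(j).
    f = 2.0 ** (-T0)
    for t in etaT:
        f *= 1 + (1 if t > 0 else -1) * 2.0 ** (-abs(t))
    return f * cordic_K(S) - 1

# p = 16 example: {S(i)} = {0, 1, ..., 16}, {eta(j)T(j)} = {1, +2, -5, +9, +10}
dK = scaling_residual(list(range(17)), 1, [+2, -5, +9, +10])
```

The residual is negative with magnitude near 2^{−16}; appending the additional scaling iteration +16 of footnote 2 drives it below 2^{−22}.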
`
Remember that in order to meet the convergence condition (23) and to provide a
convergence region C ≥ 90°, the minimum number of CORDIC iterations is p + 1. So,
for all CORDIC sequences given above, the number L − (p + 1) of additional iterations
for scaling correction is p/4. Moreover, except for the first CORDIC sequence, the
remaining scaling error |ΔK| is significantly smaller than 2^{−p}. This leads to improved
numerical properties compared with other CORDIC sequences reported in the literature.
We also remark that if the symmetric eigenvalue problem is considered, for which a
convergence region of C ≥ 45° is sufficient [2], the total number of CORDIC iterations
L can be further reduced. An example that is nearly identical to the last CORDIC sequence
given above is
`
p = 32: {S(i)} = {1 3 3 3 4 5 6 7 8 9 9 10 … 31 32},
        {η(j)T(j)} = {0 −3 −8 +16 −25 −27},  L = 34 + 5,  C ≈ 55°,  ΔK ≈ −2^{−39.93}.
`
For comparison, Delosme [5] has also given an optimized CORDIC sequence for the
same situation. His sequence requires one iteration more (L = 40) and achieves a scaling
accuracy of ΔK ≈ 2^{−33.16}.
We suspect that similar results can also be obtained by using Deprettere's double-
shift concept. However, this method requires a slightly increased hardware complexity
and will not be discussed in this paper.
`
5. CORDIC implementation of the SVD PEs. For easy illustration, we first introduce
a CORDIC processor symbol as shown in Fig. 2. The descriptions inside the box determine
uniquely the function mode of the CORDIC processor according to Table 1. The output
data x and y are assumed to be scaling corrected.
It is now simple to map the operations (8)-(11) and (12)-(15) of the TPR method
onto a two-CORDIC-processor architecture. In Fig. 3, the diagonal PEs of the SVD array
are implemented by two CORDIC processors and eight adders. The dotted inputs of the
adders represent negated inputs. Because the diagonal PEs work in the generation mode,
`
`FIG. 2. CORDIC symbol.
`
`
`
`
`
FIG. 3. A CORDIC implementation of the diagonal PE of the SVD array.
`
both CORDIC processors are driven in the "y → 0" mode for computing Cartesian-to-
polar coordinates conversions. In Fig. 4, the off-diagonal PEs working in the rotation
mode are implemented by two CORDIC processors and ten adders. Here, the CORDIC
processors are driven in the "z → 0" mode for performing plane rotations.
Obviously, both CORDIC implementations have nearly the same architecture. All
PEs of the SVD array can thus be implemented by one unified CORDIC SVD module
(Fig. 5) without considerably increased hardware cost. The different computation modes
of the diagonal and off-diagonal PEs are easily "programmed" by one control bit. The
resulting SVD array is similar to that in Fig. 1, but homogeneous with identical PEs.
We remark that Fig. 5 is more a "graphic program" describing the sequence of
operations to be computed than a hardware block diagram. We show in the fol-
lowing that the 12 adders that are paired into three pre-butterflies and three post-butterflies
can be integrated into the two CORDIC processors without separate hardware realizations.
The Jacobi SVD method is a recursive method. Each PE of the SVD array has to exchange
data with its diagonal neighbors. Because of this data dependency, only recursive CORDIC
processors can be used here. Such a processor is an arithmetic unit consisting mainly of three adders
and two barrel shifters. It carries out the L iterations of the CORDIC algorithm in L
cycles by using data feedback. The two CORDIC processors contained in one CORDIC
SVD module require six adders altogether. So, it is natural to modify the CORDIC
processor architecture slightly and to use the existing six adders for computing both the
pre-butterfly and the post-butterfly operations. The resulting CORDIC SVD module has
`
`
FIG. 4. CORDIC implementation of the off-diagonal PE of the SVD array.
`
`
FIG. 5. A unified CORDIC SVD module for implementing all PEs of the SVD array.
`
the hardware complexity of two recursive CORDIC processors and requires a total com-
putation time of L + 2 iterations.
In Fig. 6, the principal architecture of such a two-CORDIC-processor SVD module
is shown. The dashed lines and boxes represent the additional hardware components
`
`
`
`
`
FIG. 6. The principal architecture of the unified CORDIC SVD module.
`
`
`
enabling the CORDIC processors to compute the butterfly operations. It is easily verified
that the upper four adders devoted to x and y can perform the following types of operations:
2^{−1}x ± 2^{−1}y (pre-butterfly), x ± y (post-butterfly), x ± 2^{−S(i)}y (CORDIC iteration), and
x ± 2^{−T(j)}x (scaling iteration), while the lower two adders devoted to z can compute
2^{−1}z1 ± 2^{−1}z2 (pre-butterfly), z1 ± z2 (post-butterfly), and z ± α_i (CORDIC iteration), re-
spectively. The cross switch between the registers "Y1" and "X2" is needed to exchange
data when the CORDIC SVD module switches from the pre-butterfly operations into
the CORDIC iterations or from the CORDIC iterations into the post-butterfly operations,
respectively. Then, we see from Fig. 3 that the output data pairs of the pre-butterflies are
(p1, p2) and (q1, q2), while the desired input data pairs for the CORDIC iterations are
(p1, q1) and (p2, q2), respectively. So, p2 and q1 have to be exchanged.
`
6. Comparisons. We now compare the new CORDIC SVD module with those pro-
posed by Cavallaro and Luk [4] and Delosme [5]. Let A_csvd and T_csvd denote the area
and time complexity of a CORDIC SVD module and A_cordic and T_cordic those of a CORDIC
processor, respectively. Cavallaro and Luk have shown that their most efficient parallel
diagonalization method requires A_csvd ≈ 2A_cordic and T_csvd ≈ 3T_cordic for the diagonal
PEs and A_csvd ≈ 2A_cordic and T_csvd ≈ 2T_cordic for the off-diagonal PEs. By using the TPR
method, we require A_csvd ≈ 2A_cordic and T_csvd ≈ T_cordic for all PEs. In other words, having
approximately the same hardware complexity, the computation time is reduced by more
than 50 percent.
A comparison to Delosme's method is more difficult because he follows a quite
different approach. Therefore, only rough performance estimates are given here. In our
method, we compute the rotation angles explicitly. After these computations have been
completed in the diagonal PEs, the angles propagate to the off-diagonal ones. We assume
that the propagation from one PE to its neighbors takes one cycle T_cycle, the time required
for computing one CORDIC iteration. This implies local communications without
broadcasting data. At the beginning of the second propagation cycle, the angles reach
the diagonal neighbors of the diagonal PEs, which complete their computations after
T_csvd. This means that the diagonal PEs have to wait for a delay time T_delay = T_cycle +
T_csvd before they can exchange data with their diagonal neighbors.³ The total time elapsed
between two adjacent activities at each PE is thus T_csvd + T_delay = 2T_csvd + T_cycle ≈ 2T_cordic
because T_cycle is negligible with respect to T_cordic = L · T_cycle.
Delosme does not compute the rotation angles explicitly. He rather calculates en-
codings of the angles, i.e., sequences of signs ±1, and sends them to the off-diagonal PEs.
This enables overlap of diagonal and off-diagonal rotations because the encoding signs
are recursively obtained and become available before the completion of diagonal oper-
ations. Accordingly, no delay time is required (T_delay = 0), provided that the SVD array
size (half of the matrix size) is smaller than the number of CORDIC scaling iterations
m (for details, see [5]). The drawback is, however, that the TPR method cannot be
applied to the off-diagonal PEs. Four plane rotations are hence required, resulting in
T_csvd = 2T_cordic for the two CORDIC processors in one module. In other words, the time
complexities T_csvd + T_delay of both methods are nearly identical and equal 2T_cordic. If,
however, multiple problems are interleaved, the fraction of idle time that is 50 percent
in our case can be reduced to almost zero. In such a situation, our method provides
double the speed compared with Delosme's.
`
³ If the propagation time is assumed to be T_csvd, we get the well-known result T_delay = 2T_csvd given by Brent
and Luk [2].
`
`
In terms of area complexity, both CORDIC SVD modules contain two CORDIC
processors. Our module consists of essentially six adders, four barrel shifters, and one
ROM table containing n angle values. Delosme's architecture requires four carry-save
adders, four adders, and eight barrel shifters. So, as a rough estimate, both SVD modules
have the same order of area complexity.
Perhaps the most important advantage of Delosme's approach is the 2-bit wide
horizontal and vertical data connections for sending angle encodings serially rather than
sending the full angle values in parallel. The prices are the upper bound on the SVD array
size depending on the number of CORDIC scaling iterations, the relatively complicated
timing, and a nonregular CORDIC architecture design. We also mention that while
Delosme's method presumes a CORDIC implementation, the TPR method is applicable
to other computing architectures.
`
7. Other applications of the TPR method. Another advantage of the TPR method
seems to be its relatively wide range of applications. We indicate some of them in the
following.
For the SVD of a rectangular matrix, a well-known method is first to triangularize
the matrix by QR decomposition and then to apply the Jacobi SVD procedure to the
triangular factor. Luk [22] has shown that both steps can be implemented by one triangular
systolic array. Each PE contains a 2 × 2 submatrix. It applies two plane rotations (through
the same angle) to the two column vectors in the QR step and a two-sided rotation in
the SVD step. For computing the SVD step, the PE can be realized by the CORDIC
SVD module, as before. On the other hand, the two CORDIC processors contained in
the module are also appropriate to perform the two plane rotations of the QR step. The
CORDIC SVD module presented in this paper thus provides a suitable PE for Luk's
triangular SVD array.
Stewart [23] has proposed a square systolic array for computing the Schur decom-
position (SD) of a non-Hermitian matrix which, for example, is useful for evaluating
the eigenvalues of the matrix. His approach is similar to the Jacobi SVD method. It is
based on applying a sequence of two-sided rotations to 2 × 2 submatrices, where the left
and right rotation angles are identical to make the diagonal submatrices upper triangular.
While the diagonal PEs perform operations different from those in SVD, the off-diagonal
PEs have exactly the same computational task as in SVD computing. Therefore, the
CORDIC SVD module can also be used in Stewart's SD array.
Even in sequential computations on a uniprocessor system, one can still apply the
TPR method to reduce the computational complexity of two-sided rotations.
`
8. Conclusion. We have investigated a novel algorithm for computing two-sided
rotations requiring only two plane rotat