Vol. 12, No. 4, pp. 713-725, October 1991
© 1991 Society for Industrial and Applied Mathematics

REDUCING THE COMPUTATIONS OF THE SINGULAR VALUE DECOMPOSITION ARRAY GIVEN BY BRENT AND LUK*

B. YANG† AND J. F. BÖHME†
`
Abstract. A new, efficient, two-plane rotation (TPR) method for computing the two-sided rotations involved in the singular value decomposition (SVD) is presented. It is shown that a two-sided rotation can be evaluated by only two plane rotations and a few additions. This leads to significantly reduced computations. Moreover, if coordinate rotation digital computer (CORDIC) processors are used for realizing the processing elements (PEs) of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle calculations can be avoided. The resulting SVD array has a homogeneous structure with identical diagonal and off-diagonal PEs. Similar results can also be obtained if the TPR method is applied to Luk's triangular SVD array and to Stewart's Schur decomposition array.

Key words. singular value decomposition, systolic arrays, CORDIC, two-sided rotations, VLSI

AMS(MOS) subject classification. 15A18
`
1. Introduction. One important problem in linear algebra and digital signal processing is the singular value decomposition (SVD). Typical applications arise in beamforming and direction finding, spectrum analysis, digital image processing, etc. [1]. Recently, there has been a massive interest in parallel architectures for computing the SVD because of the high computational complexity of the SVD, the growing importance of real-time signal processing, and the rapid advances in very large scale integration (VLSI) that make low-cost, high-density, and fast processing and memory devices available.

There are different numerically stable methods for computing complete singular value and singular vector systems of dense matrices, for example, the Jacobi SVD method, the QR method, and the one-sided Hestenes method. For parallel implementations, the Jacobi SVD method is far superior in terms of simplicity, regularity, and local communications. Brent, Luk, and Van Loan have shown how the Jacobi SVD method with parallel ordering can be implemented by a two-dimensional systolic array [2], [3]. Various coordinate rotation digital computer (CORDIC) realizations of the SVD array have been reported by Cavallaro and Luk [4] and Delosme [5], [6].

The Jacobi SVD method is based, as is common for all two-sided approaches, on applying a sequence of two-sided rotations to 2 × 2 submatrices of the original matrix. The computational complexity is thus determined by how the two-sided rotations are computed. In most previous works, a two-sided rotation is evaluated in a straightforward manner by four plane rotations, where two of them are applied from the left to the two column vectors of the 2 × 2 submatrix and the other two are applied from the right to the row vectors, respectively. In the diagonal processing elements (PEs), additional operations for calculating the rotation angles are required. This leads to an inhomogeneous array architecture containing two different types of PEs.

In this paper, we develop a two-plane rotation (TPR) method for computing two-sided rotations. We show that the above computational complexity can be reduced significantly because each two-sided rotation can be evaluated by only two plane rotations and a few additions. Moreover, the SVD array given by Brent and Luk becomes homogeneous with identical diagonal and off-diagonal PEs when CORDIC processors are

* Received by the editors September 28, 1989; accepted for publication (in revised form) August 2, 1990.
† Department of Electrical Engineering, Ruhr-Universität Bochum, 4630 Bochum, Germany.
`
`
used. In a recent work [6], Delosme has also independently indicated this possibility in connection with "rough rotations." He has taken, however, a different approach that is based on encoding the rotation angles. He still requires four plane rotations in the off-diagonal PEs, while diagonal and off-diagonal operations can be overlapped.

Our paper is organized as follows. In § 2, we briefly reexamine the Jacobi SVD method and Brent and Luk's SVD array. Then, we develop the TPR method in § 3. The CORDIC algorithm is described in § 4, where in particular CORDIC scaling correction techniques are discussed and examples of scaling-corrected CORDIC sequences are given. In § 5, a unified CORDIC SVD module for all PEs of the SVD array is presented. This module is compared to those proposed by Cavallaro, Luk, and Delosme in § 6. Finally, we stress the applicability of the TPR method to several other problems.
`
2. Jacobi SVD method. In this paper, we consider real, square, and nonsymmetric matrices. Let M ∈ ℝ^{N×N} be a matrix of dimension N. The SVD is given by

(1)    M = U Σ V^T,

where U ∈ ℝ^{N×N} and V ∈ ℝ^{N×N} are orthogonal matrices containing the left and right singular vectors, respectively, and Σ ∈ ℝ^{N×N} is a diagonal matrix of singular values. The superscript T denotes matrix transpose. Based on an extension of the Jacobi eigenvalue algorithm [7], Kogbetliantz [8] and Forsythe and Henrici [9] proposed to diagonalize M by a sequence of two-sided rotations,

(2)    M_{k+1} = U_k^T M_k V_k,    M_0 = M    (k = 0, 1, 2, ···).

U_k and V_k describe two rotations in the (i, j)-plane (1 ≤ i < j ≤ N), where the rotation angles are chosen to annihilate the elements of M_k at the positions (i, j) and (j, i). Usually, several sweeps are necessary to complete the SVD, where a sweep is a sequence of N(N − 1)/2 two-sided rotations according to a special ordering of the N(N − 1)/2 different index pairs (i, j).
For sequential computing on a uniprocessor system, possibly the most frequently used orderings are the cyclic orderings, namely, the cyclic row ordering

(3)    (i, j) = (1, 2), (1, 3), ···, (1, N), (2, 3), ···, (2, N), ···, (N − 1, N)

or the equivalent cyclic column ordering. Sameh [10] and Schwiegelshohn and Thiele [11] have shown how to implement the cyclic row ordering on a ring-connected or a mesh-connected processor array. Recently, a variety of parallel orderings have been developed. Luk and Park [12] have shown that these parallel orderings are essentially equivalent to the cyclic orderings and thus share the same convergence properties.
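As an illustration of (2) with the cyclic row ordering (3), the following sketch diagonalizes a small matrix by repeated sweeps of two-sided rotations. It is only a behavioral model in floating point, not a systolic implementation; the angle computation anticipates the half-angle relations derived in § 3, and all function names are ours.

```python
import numpy as np

def R(t):
    """Plane rotation (6)."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s], [-s, c]])

def rotation_angles(A):
    """Angles (theta1, theta2) that diagonalize a 2 x 2 block A."""
    p1, p2 = (A[1, 1] + A[0, 0]) / 2, (A[1, 1] - A[0, 0]) / 2
    q1, q2 = (A[1, 0] - A[0, 1]) / 2, (A[1, 0] + A[0, 1]) / 2
    tm = np.arctan(q1 / p1) if p1 else np.pi / 2   # theta2 - theta1
    tp = np.arctan(q2 / p2) if p2 else np.pi / 2   # theta2 + theta1
    return (tp - tm) / 2, (tp + tm) / 2

def sweep(M):
    """One sweep of N(N-1)/2 two-sided rotations in cyclic row ordering (3)."""
    N = M.shape[0]
    for i in range(N - 1):
        for j in range(i + 1, N):
            t1, t2 = rotation_angles(M[np.ix_([i, j], [i, j])])
            U, V = np.eye(N), np.eye(N)
            U[np.ix_([i, j], [i, j])] = R(t1)
            V[np.ix_([i, j], [i, j])] = R(t2)
            M = U.T @ M @ V          # annihilates M[i,j] and M[j,i]
    return M

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
D = M.copy()
for _ in range(10):                  # several sweeps until convergence
    D = sweep(D)
# the magnitudes of the diagonal of D are the singular values of M
```

After a few sweeps the off-diagonal mass decays rapidly, in line with the convergence results cited above.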
Brent and Luk have suggested a particular parallel ordering and developed a square systolic array consisting of ⌈N/2⌉ × ⌈N/2⌉ PEs for implementing the Jacobi SVD method (Fig. 1). To do this, the matrix M is partitioned into 2 × 2 submatrices. Each PE contains one submatrix and performs a two-sided rotation

(4)    B = R(θ1)^T A R(θ2),

where

(5)    A = (a11  a12; a21  a22)    and    B = (b11  b12; b21  b22)
`
`
`
FIG. 1. The SVD array given by Brent and Luk.
`
denote the submatrix before and after the two-sided rotation, respectively, and

(6)    R(θ) = (cos θ  sin θ; −sin θ  cos θ)

describes a plane rotation through the angle θ. At first, the diagonal PEs (symbolized by a double square in Fig. 1) generate the rotation angles to diagonalize the 2 × 2 submatrices (b12 = b21 = 0) stored in them. This means that θ1 and θ2 are first calculated from the elements of A and then relation (4) is used to compute b11 and b22. We call this the generation mode. Then, the rotation angles are sent to all off-diagonal PEs in the following way: the angles associated with the left-side rotations propagate along the rows, while the angles associated with the right-side rotations propagate along the columns. Once these angles are received, the off-diagonal PEs perform the two-sided rotations (4) on their stored data. We call this the rotation mode. Clearly, if we compute the rotation mode straightforwardly, we require four plane rotations. For the generation mode, additional operations for calculating θ1 and θ2 are required.
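For reference, the straightforward rotation mode just described can be sketched as follows: the left factor R(θ1)^T is applied to the two column vectors of A and the right factor R(θ2) to the two row vectors, i.e., four plane rotations in total (the function names are ours, chosen for illustration only).

```python
import numpy as np

def R(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s], [-s, c]])      # plane rotation (6)

def two_sided_direct(A, t1, t2):
    """B = R(t1)^T A R(t2) evaluated by four plane rotations."""
    # R(t)^T = R(-t): two rotations applied from the left to the columns
    tmp = np.column_stack([R(-t1) @ A[:, 0], R(-t1) @ A[:, 1]])
    # ... and two applied from the right to the rows
    return np.vstack([R(-t2) @ tmp[0, :], R(-t2) @ tmp[1, :]])

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = two_sided_direct(A, 0.3, -0.7)
```

The TPR method of the next section replaces these four plane rotations by two.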
`
3. TPR method for computing two-sided rotations. In order to develop the TPR method for computing two-sided rotations more efficiently, we first discuss the commutative properties of two special types of 2 × 2 matrices, the rotation-type and the reflection-type. We define

(7)    ℳ^rot = { (x  −y; y  x) | x, y ∈ ℝ }    and    ℳ^ref = { (x  y; y  −x) | x, y ∈ ℝ }.

The former is called rotation-type because it has the same matrix structure as a 2 × 2 plane rotation matrix. Similarly, the latter is called reflection-type because it has the same matrix structure as a 2 × 2 Givens reflection matrix [13]. Note that x and y need not be normalized to x² + y² = 1. Using the above definitions, the following results can be shown by some elementary manipulations.

LEMMA 1. If A1 ∈ ℳ^rot and A2 ∈ ℳ^rot, then A1 A2 = A2 A1 ∈ ℳ^rot.

LEMMA 2. If A1 ∈ ℳ^ref and A2 ∈ ℳ^rot, then A1 A2 = A2^T A1 ∈ ℳ^ref.

In particular, if we consider two plane rotations, we know the following.

LEMMA 3. If R(θ1) and R(θ2) are plane rotations described by (6), then R(θ1) R(θ2) = R(θ1 + θ2) and R(θ1)^T R(θ2) = R(θ2 − θ1).

Now, we give a theorem describing the rotation mode of the TPR method.

THEOREM. If the 2 × 2 matrix A and the two rotation angles θ1 and θ2 are given, then the two-sided rotation (4) can be computed by two plane rotations, ten additions,
`
`
and four scalings by ½:

(8)    p1 = (a22 + a11)/2,    p2 = (a22 − a11)/2,
       q1 = (a21 − a12)/2,    q2 = (a21 + a12)/2,

(9)    θ− = θ2 − θ1,    θ+ = θ2 + θ1,

(10)   (r1; t1) = R(θ−) (p1; q1),    (r2; t2) = R(θ+) (p2; q2),

(11)   b11 = r1 − r2,    b12 = t2 − t1,    b21 = t1 + t2,    b22 = r1 + r2.
Proof. Using (8), the matrix A can be reformulated as

A = A1 + A2 = (p1  −q1; q1  p1) + (−p2  q2; q2  p2).

Clearly, R(θ1), R(θ2) in (4) and A1 are elements of ℳ^rot, while A2 belongs to ℳ^ref. This leads to the following reformulation of the matrix B by using Lemmas 1-3:

B = R(θ1)^T A R(θ2)
  = R(θ1)^T A1 R(θ2) + R(θ1)^T A2 R(θ2)
  = R(θ1)^T R(θ2) A1 + R(θ1)^T R(θ2)^T A2
  = R(θ2 − θ1) A1 + R(θ2 + θ1)^T A2
  = (r1  −t1; t1  r1) + (−r2  t2; t2  r2).

Comparing the elements on both sides with (5) gives (10) and (11). This completes the proof.
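The rotation mode (8)-(11) is easily checked numerically; in the sketch below, exact trigonometric plane rotations stand in for the CORDIC operations discussed later, and the helper names are ours.

```python
import numpy as np

def R(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s], [-s, c]])      # plane rotation (6)

def tpr_rotation_mode(A, t1, t2):
    """B = R(t1)^T A R(t2) via two plane rotations, ten additions,
    and four scalings by 1/2, following (8)-(11)."""
    p1, p2 = (A[1, 1] + A[0, 0]) / 2, (A[1, 1] - A[0, 0]) / 2   # (8)
    q1, q2 = (A[1, 0] - A[0, 1]) / 2, (A[1, 0] + A[0, 1]) / 2
    tm, tp = t2 - t1, t2 + t1                                   # (9)
    r1, u1 = R(tm) @ np.array([p1, q1])                         # (10), first rotation
    r2, u2 = R(tp) @ np.array([p2, q2])                         # (10), second rotation
    return np.array([[r1 - r2, u2 - u1],                        # (11)
                     [u1 + u2, r1 + r2]])

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 2))
t1, t2 = 0.4, -1.1
B = tpr_rotation_mode(A, t1, t2)
```

The result agrees with the direct evaluation B = R(θ1)^T A R(θ2) to machine precision, while using only two plane rotations.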
The generation mode of the TPR method follows directly from the above theorem.

COROLLARY. If the 2 × 2 matrix A is given, we can diagonalize A and calculate the corresponding rotation angles θ1 and θ2 by two Cartesian-to-polar coordinates conversions, eight additions, and four scalings by ½:

(12)    p1 = (a22 + a11)/2,    p2 = (a22 − a11)/2,
        q1 = (a21 − a12)/2,    q2 = (a21 + a12)/2,

(13)    r1 = sign(p1) √(p1² + q1²),    r2 = sign(p2) √(p2² + q2²),
        θ− = arctan(q1/p1),    θ+ = arctan(q2/p2),

(14)    θ1 = (θ+ − θ−)/2,    θ2 = (θ+ + θ−)/2,

(15)    b11 = r1 − r2,    b22 = r1 + r2.

Proof. Regarding (11), b12 = b21 = 0 is equivalent to t1 = t2 = 0. Equation (13) then follows from (10). This completes the proof.
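Analogously, the generation mode (12)-(15) amounts to two Cartesian-to-polar conversions; a quick numerical check (again with exact arithmetic in place of CORDIC, and with our own function names):

```python
import numpy as np

def R(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s], [-s, c]])      # plane rotation (6)

def tpr_generation_mode(A):
    """Diagonalize the 2 x 2 block A and return (B, theta1, theta2)
    via two Cartesian-to-polar conversions, following (12)-(15)."""
    p1, p2 = (A[1, 1] + A[0, 0]) / 2, (A[1, 1] - A[0, 0]) / 2   # (12)
    q1, q2 = (A[1, 0] - A[0, 1]) / 2, (A[1, 0] + A[0, 1]) / 2
    r1 = np.sign(p1) * np.hypot(p1, q1)                         # (13)
    r2 = np.sign(p2) * np.hypot(p2, q2)
    tm, tp = np.arctan(q1 / p1), np.arctan(q2 / p2)
    t1, t2 = (tp - tm) / 2, (tp + tm) / 2                       # (14)
    B = np.array([[r1 - r2, 0.0], [0.0, r1 + r2]])              # (15)
    return B, t1, t2

A = np.array([[2.0, -1.0], [0.5, 3.0]])
B, t1, t2 = tpr_generation_mode(A)
```

Using the principal value of the arctangent keeps the generated angles within ±90°, as discussed next.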
In equation (13), we choose the rotation through the smaller angle. All vectors lying in the first or the fourth quadrant are rotated onto the positive x-axis, and all vectors lying in the second or the third quadrant are rotated onto the negative x-axis. For vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation angles θ− and θ+ satisfy |θ−|, |θ+| ≤ 90°. Due to (14), this results in

(16)    |θ1| ≤ 90°    and    |θ2| ≤ 90°.

Equation (16) is important with respect to the convergence of the Jacobi SVD method. Forsythe and Henrici [9] have proven convergence for cyclic orderings if the rotation angles θ1 and θ2 are restricted to a closed interval inside the open interval (−90°, 90°). They have also demonstrated that this condition may fail to hold, i.e., θ1 and θ2 may be ±90°, if the off-diagonal elements b12 and b21 in (5) have to be exactly annihilated. As a remedy, they suggested an under- or overrotation by computing the two-sided rotation (4) with the angles (1 − γ)θ1 and (1 − γ)θ2 (−1 < γ < 1) and proved its convergence. In practice, however, the finite machine accuracy of real arithmetic allows only an approximate computation of the rotation angles and thus implies under- or overrotations. So the Jacobi SVD method converges without explicit under- or overrotations, as shown by the experimental results of Brent, Luk, and Van Loan [3]. In the case of CORDIC implementations, the effect of implicit under- or overrotations is more apparent. The angles ±90° can never be exactly calculated because of the limited angle resolution arctan(2^{−p}) of the CORDIC algorithm, where p denotes the mantissa length.
4. The CORDIC algorithm. In the previous section, we have seen that the main operations of the TPR method are plane rotations and Cartesian-to-polar coordinates conversions. These operations can be carried out by multiplier-adder-based processors supported by software or special hardware units. An alternative approach is the use of dedicated processors that usually map algorithms more effectively to hardware. The CORDIC processor is such a powerful one for calculating trigonometric functions.

The CORDIC algorithm was originally designed by Volder [14] as an iterative procedure for computing plane rotations and Cartesian-to-polar coordinates conversions. It was later generalized and unified by Walther [15], enabling a CORDIC processor to calculate more functions, including hyperbolic functions as well as multiplications and divisions. In the following, we consider Volder's CORDIC algorithm because only trigonometric functions are involved in SVD applications.

The CORDIC algorithm consists of iterative shift-add operations on a three-component vector,

(17)    x_{i+1} = x_i − σ_i δ_i y_i,
        y_{i+1} = y_i + σ_i δ_i x_i,    i.e.,  (x_{i+1}; y_{i+1}) = (1/cos α_i) (cos α_i  −σ_i sin α_i; σ_i sin α_i  cos α_i) (x_i; y_i),
        z_{i+1} = z_i − ε σ_i α_i

(18)    (0 < δ_i ≤ 1;  σ_i = ±1;  ε = ±1;  i = 0, 1, ···, n − 1),

in which the iteration stepsize δ_i and the elementary rotation angle α_i are defined by

(19)    δ_i = 2^{−S(i)},    α_i = arctan δ_i.
`
The set of integers {S(i)} parametrizing the iterations is called the CORDIC sequence. Equation (17) can be interpreted, except for a scaling factor of

(20)    k_i := 1/cos(α_i) = √(1 + δ_i²),

as a rotation of (x_i, y_i)^T through the angle α_i, where the sign σ_i = ±1 gives the rotation direction. After n iterations, the results are given by

(21)    (x_n; y_n) = K (cos α  −sin α; sin α  cos α) (x_0; y_0),

(22)    z_n = z_0 − εα,
`
`
`
`
with the overall scaling factor K = ∏_i k_i and the total rotation angle α = Σ_i σ_i α_i. Now, if the CORDIC sequence satisfies the following convergence condition

(23)    α_i − Σ_{j=i+1}^{n−1} α_j ≤ α_{n−1}    (i = 0, 1, ···, n − 2),

we can choose the sign parameter

(24)    σ_i = −sign(x_i y_i)  for y_n → 0,    σ_i = sign(ε z_i)  for z_n → 0

to force y_n or z_n to zero, provided that the input data x_0, y_0, and z_0 lie in the convergence region

(25)    C := Σ_{i=0}^{n−1} α_i ≥ |arctan(y_0/x_0)|  for y_n → 0,    C ≥ |z_0|  for z_n → 0.

In this way, two different types of CORDIC trigonometric functions can be computed (Table 1). In the mode y_n → 0, the Cartesian coordinates (x_0, y_0) of a plane vector are converted to the polar representation, where the parameter ε = ±1 determines the sign of the phase angle calculated. When z_n → 0, a given plane vector is rotated through the angle z_0, where ε = ±1 controls the rotation direction.

In Table 1, the principal value |arctan(y_0/x_0)| ≤ 90° of the inverse tangent function is calculated when computing Cartesian-to-polar coordinates conversions. Correspondingly, x_n may be positive or negative according to the sign of x_0. So, it is guaranteed that a vector is always rotated through the smaller angle onto the x-axis, in accordance with (13). In this case, a convergence region of C ≥ 90° is sufficient for the generation mode of the two-sided rotation.
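Both function modes of Table 1 can be simulated in floating point. The sketch below is a behavioral model of iterations (17)-(19), not a hardware description; it uses Volder's sequence S(i) = i for p = 32, leaves the scaling factor K uncorrected, and all names are ours.

```python
import math

def cordic(x, y, z, mode, eps=1, p=32):
    """Behavioral model of iterations (17)-(19) with S(i) = i.
    mode 'y' drives y_n -> 0 (vectoring); mode 'z' drives z_n -> 0 (rotation).
    The returned x, y carry the uncorrected scaling factor K."""
    for i in range(p + 1):
        d = 2.0 ** -i                          # delta_i = 2^{-S(i)}
        a = math.atan(d)                       # alpha_i
        if mode == 'y':
            s = -1 if x * y > 0 else 1         # sigma_i = -sign(x_i y_i), (24)
        else:
            s = 1 if eps * z > 0 else -1       # sigma_i = sign(eps z_i), (24)
        x, y = x - s * d * y, y + s * d * x    # shift-add iteration (17)
        z = z - eps * s * a
    return x, y, z

# overall scaling factor K for n = 33 iterations
K = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(33))

xv, yv, zv = cordic(3.0, 4.0, 0.0, 'y')    # Cartesian-to-polar conversion
xr, yr, zr = cordic(1.0, 0.0, 0.5, 'z')    # rotation through z0 = 0.5
```

The vectoring call returns x_n ≈ K·5 and z_n ≈ arctan(4/3), and the rotation call returns K·(cos 0.5, sin 0.5), as listed in Table 1.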
One main drawback of the CORDIC algorithm is the need to correct the scaling factor K that arises during the iterations (17). For example, if we use Volder's CORDIC sequence

(26)    {S(i)} = {0, 1, 2, 3, ···, p − 1, p},

with n = p + 1 CORDIC iterations for a mantissa accuracy of 2^{−p}, the scaling factor is K = 1.64676. Compensating this undesired scaling effect with a minimum number of computations is of particular importance.

Clearly, multiplying x_n and y_n in Table 1 by K^{−1} will degrade the algorithm performance substantially. Most scaling correction techniques are based on shift-add operations. For a two-sided rotation that is implemented by four plane rotations, each matrix element undergoes two plane rotations, so that the total scaling factor to be corrected is K². In this case, Cavallaro and Luk [16] have pointed out that there is a simple systematic approach for scaling correction when using the CORDIC sequence (26). They proposed to use ⌈p/4⌉ scaling iterations of the type x ← x − 2^{−2j}x with j ∈ J = {1, 3, 5, ···, 2⌈p/4⌉ − 1} and one shift operation by 2^{−1}. The remaining scaling error is
`
TABLE 1
CORDIC trigonometric functions (ε = ±1).

y_n → 0:    x_n = K sign(x_0) √(x_0² + y_0²),    z_n = z_0 + ε arctan(y_0/x_0)
z_n → 0:    x_n = K(x_0 cos z_0 − ε y_0 sin z_0),    y_n = K(ε x_0 sin z_0 + y_0 cos z_0)
`
`
bounded by 2^{−p−1}:¹

(27)    |1 − 2^{−1} ∏_{j∈J} (1 − 2^{−2j}) · K²| = |1 − 2^{−1} ∏_{j∈J} (1 − 2^{−2j}) ∏_{i=0}^{p} (1 + 2^{−2i})| < 2^{−p−1}.
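This correction scheme is easy to check numerically. The sketch below (our own; the choice p = 24 is arbitrary) multiplies K² for the sequence (26) by the shift 2^{−1} and the ⌈p/4⌉ scaling factors and confirms that the residual error is tiny.

```python
import math

p = 24                                                  # example mantissa length
K2 = math.prod(1 + 4.0 ** -i for i in range(p + 1))     # K^2 for sequence (26)
J = range(1, 2 * math.ceil(p / 4), 2)                   # j = 1, 3, ..., 2*ceil(p/4) - 1
corr = 0.5 * math.prod(1 - 4.0 ** -j for j in J)        # shift + scaling iterations
err = abs(1 - corr * K2)                                # left-hand side of (27)
```

The residual is far below the working accuracy, in agreement with the bound (27).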
`
This approach, however, fails in the TPR method. Here, each matrix element undergoes only one plane rotation. The scaling factor to be corrected is thus K rather than K². In order to solve this more difficult problem, different techniques have been developed in the literature. Haviland and Tuszynski [17] used scaling iterations similar to those of Cavallaro and Luk. Ahmed [18] repeated some CORDIC iterations to force K to a power of the machine radix. Delosme [19] combined the methods of Haviland, Tuszynski, and Ahmed to minimize the number of computations. Deprettere, Dewilde, and Udo [20] suggested the double-shift concept.

We designed a computer program [21] for a systematic search of CORDIC sequences. We allow shift parameters S(i) (i = 0, 1, ···, n − 1) with differences S(i + 1) − S(i) ∈ {0, 1, 2} to provide more optimization freedom. For an efficient scaling correction, we require that the scaling factor K be corrected by a sequence of n_k shift-add operations,

(28)    2^{−T(0)} ∏_{j=1}^{n_k} (1 + η(j) 2^{−T(j)}) · K = 1 + ΔK    (T(j) integers, η(j) = ±1).

These additional scaling iterations are parametrized by the set of signed integers {T(0), η(1)T(1), ···, η(n_k)T(n_k)}. The total number of iterations is L = n + n_k.

In (28), ΔK denotes the remaining relative scaling error after the scaling correction. We emphasize that this is a systematic error with a constant sign. By contrast, the other two types of CORDIC errors, the angular error due to the limited angle resolution and the rounding error, are of a statistical nature because they may be positive or negative. The scaling error is thus more critical with respect to error accumulation when repeated CORDIC operations on the same data have to be computed, as in SVD applications. Roughly speaking, the total scaling error after k CORDIC function calls increases linearly with k, a fact that has been verified by our numerical experiments. For this reason, we require |ΔK| to be much smaller than 2^{−p}.

We found catalogues of CORDIC sequences with complexity comparable to those of Cavallaro and Luk. In the following, five examples for different mantissa lengths p = 16, 20, 24, 28, and 32, including the total number of iterations L = n + n_k, the convergence region C, and the remaining scaling error ΔK, are given:
`
p = 16:  {S(i)} = {0 1 2 3 ··· 15 16},
         {η(j)T(j)} = {1 +2 −5 +9 +10},    L = 17 + 4,    C = 100°,    ΔK = −2^{−16.01},²

p = 20:  {S(i)} = {0 1 2 3 ··· 19 20},
         {η(j)T(j)} = {1 +2 −5 +9 +10 +16},    L = 21 + 5,    C = 100°,    ΔK = 2^{−23.05},

¹ When replacing ⌈p/4⌉ by ⌈(p − 1)/4⌉ or ⌈(p + 1)/4⌉, the upper bound in (27) becomes 2^{−p} or 2^{−p−2}, respectively.
² When appending an additional scaling iteration with η(5)T(5) = +16, the scaling accuracy can be enhanced to ΔK = 2^{−23}.
`
`
`
p = 24:  {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 ··· 23 24},
         {η(j)T(j)} = {0 −2 +6},    L = 29 + 2,    C = 91°,    ΔK = −2^{−29.13},

p = 28:  {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 11 12 13 14 14 15 ··· 27 28},
         {η(j)T(j)} = {0 −2 +6},    L = 34 + 2,    C = 91°,    ΔK = 2^{−32.53},

p = 32:  {S(i)} = {0 0 1 3 3 3 4 5 6 7 8 9 9 10 ··· 31 32},
         {η(j)T(j)} = {1 −3 −8 +16 −25 −27},    L = 36 + 5,    C = 145°,    ΔK = −2^{−39.93}.

Remember that in order to meet the convergence condition (23) and to provide a convergence region C ≥ 90°, the minimum number of CORDIC iterations is p + 1. So, for all CORDIC sequences given above, the number L − (p + 1) of additional iterations for scaling correction is p/4. Moreover, except for the first CORDIC sequence, the remaining scaling error |ΔK| is significantly smaller than 2^{−p}. This leads to improved numerical properties compared with other CORDIC sequences reported in the literature. We also remark that if the symmetric eigenvalue problem is considered, for which a convergence region of C ≥ 45° is sufficient [2], the total number of CORDIC iterations L can be further reduced. An example that is nearly identical to the last CORDIC sequence given above is

p = 32:  {S(i)} = {1 3 3 3 4 5 6 7 8 9 9 10 ··· 31 32},
         {η(j)T(j)} = {0 −3 −8 +16 −25 −27},    L = 34 + 5,    ΔK = −2^{−39.93}.

For comparison, Delosme [5] has also given an optimized CORDIC sequence for the same situation. His sequence requires one iteration more (L = 40) and achieves a scaling accuracy of ΔK = 2^{−33.16}.

We suspect that similar results can also be obtained by using Deprettere's double-shift concept. However, this method requires a slightly increased hardware complexity and will not be discussed in this paper.
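The first catalogue entry (p = 16) can be checked directly: with n = 17 Volder iterations, the correction 2^{−1}(1 + 2^{−2})(1 − 2^{−5})(1 + 2^{−9})(1 + 2^{−10}) of (28) leaves a residual ΔK near −2^{−16}. A minimal sketch (names ours):

```python
import math

# p = 16 entry: {S(i)} = {0 1 2 ... 16}, {T(0), eta(j)T(j)} = {1, +2, -5, +9, +10}
K = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(17))   # scaling factor, n = 17
corr = 2.0 ** -1                                             # initial shift 2^{-T(0)}
for t in (+2, -5, +9, +10):
    corr *= 1 + math.copysign(2.0 ** -abs(t), t)             # (1 + eta(j) 2^{-T(j)})
dK = corr * K - 1                                            # relative error in (28)
```

The computed ΔK is negative and of magnitude close to 2^{−16}, matching the catalogue entry.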
`
5. CORDIC implementation of the SVD PEs. For easy illustration, we first introduce a CORDIC processor symbol as shown in Fig. 2. The descriptions inside the box uniquely determine the function mode of the CORDIC processor according to Table 1. The output data x and y are assumed to be scaling corrected.

It is now simple to map the operations (8)-(11) and (12)-(15) of the TPR method onto a two-CORDIC-processor architecture. In Fig. 3, the diagonal PEs of the SVD array are implemented by two CORDIC processors and eight adders. The dotted inputs of the adders represent negated inputs. Because the diagonal PEs work in the generation mode,
`
FIG. 2. CORDIC symbol.

FIG. 3. CORDIC implementation of the diagonal PE of the SVD array.
`
both CORDIC processors are driven in the "y → 0" mode for computing Cartesian-to-polar coordinates conversions. In Fig. 4, the off-diagonal PEs working in the rotation mode are implemented by two CORDIC processors and ten adders. Here, the CORDIC processors are driven in the "z → 0" mode for performing plane rotations.

FIG. 4. CORDIC implementation of the off-diagonal PE of the SVD array.

Obviously, both CORDIC implementations have nearly the same architecture. All PEs of the SVD array can thus be implemented by one unified CORDIC SVD module (Fig. 5) without considerably increased hardware cost. The different computation modes of the diagonal and off-diagonal PEs are easily "programmed" by one control bit. The resulting SVD array is similar to that in Fig. 1, but homogeneous with identical PEs.

FIG. 5. A unified CORDIC SVD module for implementing all PEs of the SVD array.

We remark that Fig. 5 is more a "graphic program" describing the sequence of operations to be computed than a hardware block diagram. We show in the following that the 12 adders, which are paired into three pre-butterflies and three post-butterflies, can be integrated into the two CORDIC processors without separate hardware realizations. The Jacobi SVD method is a recursive method. Each PE of the SVD array has to exchange data with its diagonal neighbors. Because of this data dependency, only recursive CORDIC processors can be used here. Such a processor is an arithmetic unit consisting mainly of three adders and two barrel shifters. It carries out the L iterations of the CORDIC algorithm in L cycles by using data feedback. The two CORDIC processors contained in one CORDIC SVD module thus provide six adders altogether. So, it is natural to modify the CORDIC processor architecture slightly and to use the existing six adders for computing both the pre-butterfly and the post-butterfly operations. The resulting CORDIC SVD module has the hardware complexity of two recursive CORDIC processors and requires a total computation time of L + 2 iterations.
In Fig. 6, the principal architecture of such a two-CORDIC-processor SVD module is shown. The dashed lines and boxes represent the additional hardware components

FIG. 6. The principal architecture of the unified CORDIC SVD module.
`
`
`
`
enabling the CORDIC processors to compute the butterfly operations. It is easily verified that the upper four adders devoted to x and y can perform the following types of operations: 2^{−1}x ± 2^{−1}y (pre-butterfly), x ± y (post-butterfly), x ± 2^{−S}y (CORDIC iteration), and x ± 2^{−S}x (scaling iteration), while the lower two adders devoted to z can compute 2^{−1}z1 ± 2^{−1}z2 (pre-butterfly), z1 ± z2 (post-butterfly), and z ± α (CORDIC iteration), respectively. The cross switch between the registers "Y1" and "X2" is needed to exchange data when the CORDIC SVD module switches from the pre-butterfly operations to the CORDIC iterations, or from the CORDIC iterations to the post-butterfly operations, respectively. We see from Fig. 3 that the output data pairs of the pre-butterflies are (p1, p2) and (q1, q2), while the desired input data pairs for the CORDIC iterations are (p1, q1) and (p2, q2), respectively. So, p2 and q1 have to be exchanged.
`
6. Comparisons. We now compare the new CORDIC SVD module with those proposed by Cavallaro and Luk [4] and Delosme [5]. Let A_csvd and T_csvd denote the area and time complexity of a CORDIC SVD module, and A_cordic and T_cordic those of a CORDIC processor, respectively. Cavallaro and Luk have shown that their most efficient parallel diagonalization method requires A_csvd = 2A_cordic and T_csvd = 3T_cordic for the diagonal PEs, and A_csvd = 2A_cordic and T_csvd = 2T_cordic for the off-diagonal PEs. By using the TPR method, we require A_csvd = 2A_cordic and T_csvd = T_cordic for all PEs. In other words, having approximately the same hardware complexity, the computation time is reduced by more than 50 percent.

A comparison to Delosme's method is more difficult because he follows a quite different approach. Therefore, only rough performance estimates are given here. In our method, we compute the rotation angles explicitly. After these computations have been completed in the diagonal PEs, the angles propagate to the off-diagonal ones. We assume that the propagation from one PE to its neighbors takes one cycle T_cycle, the time required for computing one CORDIC iteration. This implies local communications without broadcasting data. At the beginning of the second propagation cycle, the angles reach the diagonal neighbors of the diagonal PEs, which complete their computations after T_csvd. This means that the diagonal PEs have to wait for a delay time T_delay = T_cycle + T_csvd before they can exchange data with their diagonal neighbors.³ The total time elapsed between two adjacent activities at each PE is thus T_csvd + T_delay = 2T_csvd + T_cycle ≈ 2T_cordic, because T_cycle is negligible with respect to T_cordic = L · T_cycle.

Delosme does not compute the rotation angles explicitly. He rather calculates encodings of the angles, i.e., sequences of signs ±1, and sends them to the off-diagonal PEs. This enables an overlap of diagonal and off-diagonal rotations because the encoding signs are obtained recursively and become available before the completion of the diagonal operations. Accordingly, no delay time is required (T_delay = 0), provided that the SVD array size (half of the matrix size) is smaller than the number of CORDIC scaling iterations n_k (for details, see [5]). The drawback is, however, that the TPR method cannot be applied to the off-diagonal PEs. Four plane rotations are hence required, resulting in T_csvd = 2T_cordic for the two CORDIC processors in one module. In other words, the time complexities T_csvd + T_delay of both methods are nearly identical and equal 2T_cordic. If, however, multiple problems are interleaved, the fraction of idle time, which is 50 percent in our case, can be reduced to almost zero. In such a situation, our method provides double the speed compared with Delosme's.

³ If the propagation time is assumed to be T_csvd, we get the well-known result T_total = 2T_csvd given by Brent and Luk [2].
`
`
`
In terms of area complexity, both CORDIC SVD modules contain two CORDIC processors. Our module consists essentially of six adders, four barrel shifters, and one ROM table containing n angle values. Delosme's architecture requires four carry-save adders, four adders, and eight barrel shifters. So, as a rough estimate, both SVD modules have the same order of area complexity.

Perhaps the most important advantage of Delosme's approach is the 2-bit-wide horizontal and vertical data connections for sending angle encodings serially rather than sending the full angle values in parallel. The prices are the upper bound on the SVD array size depending on the number of CORDIC scaling iterations, the relatively complicated timing, and a nonregular CORDIC architecture design. We also mention that while Delosme's method presumes a CORDIC implementation, the TPR method is applicable to other computing architectures.
`
7. Other applications of the TPR method. Another advantage of the TPR method seems to be its relatively wide range of applications. We indicate some of them in the following.

For the SVD of a rectangular matrix, a well-known method is first to triangularize the matrix by a QR decomposition and then to apply the Jacobi SVD procedure to the triangular factor. Luk [22] has shown that both steps can be implemented by one triangular systolic array. Each PE contains a 2 × 2 submatrix. It applies two plane rotations (through the same angle) to the two column vectors at the QR step and a two-sided rotation at the SVD step. For computing the SVD step, the PE can be realized by the CORDIC SVD module, as before. On the other hand, the two CORDIC processors contained in the module are also appropriate for performing the two plane rotations of the QR step. The CORDIC SVD module presented in this paper thus provides a suitable PE for Luk's triangular SVD array.
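The QR step can be sketched with ordinary Givens plane rotations. The following is the standard sequential formulation, not Luk's systolic scheduling, and all names are ours:

```python
import numpy as np

def givens_qr(M):
    """Triangularize M by plane rotations; returns an orthogonal Q
    and the upper-triangular factor R with M = Q @ R."""
    m, n = M.shape
    Q, Rm = np.eye(m), M.astype(float).copy()
    for j in range(n):
        for i in range(m - 1, j, -1):        # zero the subdiagonal of column j
            a, b = Rm[j, j], Rm[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            # one plane rotation applied to rows j and i
            rj, ri = Rm[j, :].copy(), Rm[i, :].copy()
            Rm[j, :], Rm[i, :] = c * rj + s * ri, -s * rj + c * ri
            # accumulate the same rotation into Q
            qj, qi = Q[:, j].copy(), Q[:, i].copy()
            Q[:, j], Q[:, i] = c * qj + s * qi, -s * qj + c * qi
    return Q, Rm

M = np.arange(15, dtype=float).reshape(5, 3) + np.eye(5, 3)
Q, Rup = givens_qr(M)
```

The Jacobi SVD procedure from § 2 can then be applied to the triangular factor, which is what Luk's array does in place.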
Stewart [23] has proposed a square systolic array for computing the Schur decomposition (SD) of a non-Hermitian matrix, which is useful, for example, for evaluating the eigenvalues of the matrix. His approach is similar to the Jacobi SVD method. It is based on applying a sequence of two-sided rotations to 2 × 2 submatrices, where the left and right rotation angles are identical to make the diagonal submatrices upper triangular. While the diagonal PEs perform operations different from those in the SVD case, the off-diagonal PEs have exactly the same computational task as in SVD computing. Therefore, the CORDIC SVD module can also be used in Stewart's SD array.

Even in sequential computations on a uniprocessor system, one can still apply the TPR method to reduce the computational complexity of two-sided rotations.
`
8. Conclusion. We have investigated a novel algorithm for computing two-sided rotations requiring only two plane rotations and a few additions. This results in significantly reduced computations.