throbber
SIAM J, MATRIX ANAL. APPL
`Vol. 12, No. 4, pp. 713- 725, October 1991
`
`© 1991 Society for Industrial and Applied Mathematics
`009
`
`REDUONG THE COMPUTATIONS OF THE SINGULAR VALUE
`DECOMPOSITION ARRAY GIVEN BY BRENT AND LUK*
`
`B. YANGt AND J. F. BOHMEt
`
`Abstract. A new, efficient, two-plane rotation (TPR ) method for computing two-sided rotations involved
`in singular value decomposition (SVD) is presented. It is shown that a two-sided rotation can be evaluated by
`only two plane rotations and a few additions. This leads to significantly reduced computations. Moreover, if
`coordinate rotation digital computer ( CORDIC} processors are used for realizing the processing elements ( PEs)
`of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle
`calculations can be avoided. The resulting SVD array has a homogeneous structure with identical diagonal and
`off-diagonal PEs. Similar results can also be obtained if the TPR method is applied to Luk's triangular SVD
`array and to Stewart's Schur decomposition array.
`
`Key words. singular value decomposition, systolic arrays, CORDIC, two-sided rotations, VLSI
`
`AMS(MOS) subje<:t classification. 15A I 8
`
`1. Introduction. One important problem in linear algebra and digital signal pro(cid:173)
`cessing is the singular value decomposition (SYD). Typical applications arise in beam(cid:173)
`forming and direction finding, spectrum analysis, digital image processing, etc. [ l}. Re(cid:173)
`cently, there has been a massive interest in parallel architectures for computing SYD
`because of the high computational complexity of SYD, the growing importance of real(cid:173)
`time signal processing, and the rapid advances in very large scale integration (VLSI) that
`make low-cost, high-density and fast processing memory devices available.
`There are different numerically stable methods for computing complete singular
`value and singular vector systems of dense matrices, for example, the Jacobi SYD method,
`the QR method, and the one-sided Hestenes method. For parallel implementations, the
`Jacobi SYD method is far superior in terms of simplicity, regularity, and local com(cid:173)
`munications. Brent, Luk, and Van Loan have shown how the Jacobi SYD method with
`parallel ordering can be implemented by a two-dimensional systolic array [ 2}, [ 3}. Various
`coordinate rotation digital computer ( CORDIC) realizations of the SYD array have been
`reported by Cavallaro and Luk [ 4] and Delosme [ 5}, [ 6].
`The Jacobi SYD method is based on, as common for aU two-sided approaches,
`applying a sequence of two-sided rotations to 2 X 2 submatrices of the original matrix.
`The computational complexity is thus determined by how to compute the two-sided
`rotations. In most previous works, a two-sided rotation is evaluated in a straightforward
`manner by four plane rotations, where two of them are applied from left to the two
`column vectors of the 2 X 2 submatrix and the other ones are applied from right to the
`row vectors, respectively. In the diagonal processing elements ( PEs), additional operations
`for calculating rotation angles are required. This leads to an inhomogeneous array ar(cid:173)
`chitecture containing two different types of PEs.
`In this paper, we develop a two-plane rotation (TPR) method for computing two(cid:173)
`sided rotations. We show that the above computational complexity can be reduced sig(cid:173)
`nificantly because each two-sided rotation can be evaluated by only two plane rotations
`and a few additions. Moreover, the SYD array given by Brent and Luk becomes ho(cid:173)
`mogeneous with identical diagonal and off-diagonal PEs when CORDIC processors are
`
`* Received by the editors September 28, 1989; accepted for publication ( in revised form) August 2, 1990.
`t Department of Electrical Engineering, Ruhr-Universitlit Bochum, 4630 Bochum, Germany.
`713
`
`00
`N
`
`

`

`714
`
`B. YANG AND J. F. BOHME
`
`used. In a recent work [ 6], Delosme has also indicated this possibility in connection
`with "rough rotations" independently. He has taken, however, a different approach that
`is based on encoding the rotation angles. He has still required four plane rotations on
`the off-diagonal PEs while diagonal and off-diagonal operations can be overlapped.
`Our paper is organized as follows. In§ 2, we briefly reexamine Jacobi's SYD method
`and Brent and Luk's SYD array. Then, we develop the TPR method in§ 3. The CORDIC
`algorithm is described in§ 4, where in particular CORDIC scaling correction techniques
`are discussed and examples of scaling-corrected CORDIC sequences are given. In § 5, a
`unified CORDIC SVD module for all PEs of the SYD array is presented. This module
`is compared to those proposed by Cavallaro, Luk, and Delosme in § 6. Finally, we stress
`the applicability of the TPR method to several other problems.
`
`2. Jacobi SVD method. In this paper, we consider real, square, and nonsymmetric
`matrices. Let ME ~NxN be a matrix of dimension N. The SYD is given by
`
`( I)
`
`where U E ~NxN and VE ~NxN are orthogonal matrices containing the left and right
`singular vectors, and ~ E ~Nx N is a diagonal matrix of singular values, respectively. The
`superscript T denotes matrix transpose. Based on an extension of the Jacobi eigenvalue
`algorithm [ 7], Kogbetliantz [ 8] and Forsythe and Henrici [ 9] proposed to diagonalize
`M by a sequence of two-sided rotations,
`
`(2)
`
`Mo = M,
`
`(k = O, 1, 2, .. ·).
`
`Uk and Vk describe two rotations in the (i, j)-plane ( I ~ i <j ~ N ), where the rotation
`angles are chosen to annihilate the elements of M k at the positions (i, j) and (j, i).
`Usually, several sweeps are necessary to complete the SYD, where a sweep is a sequence
`of N(N -
`I )/2 two-sided rotations according to a special ordering of the N(N - 1 )/2
`different index pairs ( i, j).
`For sequential computing on a uniprocessor system, possibly the most frequently
`used orderings are the cyclic orderings, namely, the cyclic row ordering
`
`(3)
`
`(i,j) = (l ,2),(1,3), ... ,(l,N),(2,3), ... ,(2,N), ... ,(N- 1,N)
`
`or the equivalent cyclic column ordering. Sameh [!OJ and Schwiegelshohn and Thiele
`[ 11] have shown how to implement the cyclic row ordering on a ring-connected or a
`mesh-connected processor array. Recently, a variety of parallel orderings have been de(cid:173)
`veloped. Luk and Park [ 12] have shown that these parallel orderings are essentially equiv(cid:173)
`alent to the cyclic orderings and thus share the same convergence properties.
`Brent and Luk have suggested a particular parallel ordering and developed a square
`systolic array consisting off N / 21 X r N / 21 PEs for implementing the Jacobi SYD method
`( Fig. I). To do this, the matrix Mis partitioned into 2 X 2 submatrices. Each PE contains
`one submatrix and performs a two-sided rotation
`
`(4)
`
`where
`
`(5 )
`
`V'>
`M
`r--i
`V'>
`N
`00
`:!
`00
`N
`
`

`

`REDUCED SVD COMPUTATIONS AND HOMOGENEOUS SVD ARRAY
`
`715
`
`FIG. I. The SYD array given by Brent and Luk.
`
`denote the submatrix before and after the two-sided rotation, respectively, and
`cos O sin O)
`- sin O cos 0
`
`R(O) =
`
`(
`
`(6)
`
`describes a plane rotation through the angle 0. At first, the diagonal PEs (symbolized by
`a double square in Fig. I) generate the rotation angles to diagonalize the 2 X 2 submatrices
`(b12 = b21 = 0) stored in them. This means that 01 and 02 are first calculated from the
`elements of A and then relation ( 4) is used to compute b11 and b22. We call this the
`generation mode. Then, the rotation angles are sent to all off-diagonal PEs in the following
`way: the angles associated to the left-side rotations propagate along the rows while the
`angles associated to the right-side rotations propagate along the columns. Once these
`angles are received, the off-diagonal PEs perform the two-sided rotations ( 4) on their
`stored data. We call this the rotation mode. Clearly, if we compute the rotation mode
`straightforwardly, we require four plane rotations. For the generation mode, additional
`operations for calculating 01 and 02 are required.
`
`3. TPR method for computing two-sided rotations. In order to develop the TPR
`method for computing two-sided rotations more efficiently, we first discuss the com(cid:173)
`mutative properties of two special types, the rotation-type and the reflection-type, of
`2 X 2 matrices. We define
`
`(7)
`
`vl{'01={(_; ;)jx,yE~} and Arcr ={(; _;)jx,yE~}-
`
`V"l
`M
`r--i
`V"l
`N
`00
`:!
`00
`N
`
`In particular, if we consider two plane rotations, we know the following.
`LEMMA 3. If R(01) and R(02 ) are plane rotations described by (6), then
`R(01)R(02 ) = R(01 + 02) and R(Oi) TR(02 ) = R(02 - 01).
`Now, we give a theorem describing the rotation mode of the TPR method.
`THEOREM. If the 2 X 2 matrix A and the two rotation angles 01 and 02 are given,
`then the two-sided rotation ( 4) can be computed by two plane rotations, ten additions,
`
`The former is called rotation-type because it has the same matrix structure as a 2 X 2
`plane rotation matrix. Similarly, the latter is called reflection-type because it has the
`same matrix structure as a 2 X 2 Givens reflection matrix [ 13]. Note that x and y must
`not be normalized to x 2 + y 2 = I. Using the above definitions, the following results can
`be shown by some elementary manipulations.
`LEMMA I. If A1 EA'°' and A2 E A'°t, then A1A2 = A2A1 E A'°1
`LEMMA 2. If A1 E Arcr and A2 E A'0
`then A1A2 = AfA1 E .,I/ref_
`
`\
`
`•
`
`

`

`716
`
`B. YANG AND J. F. BOHME
`
`and/our scalings by ½:
`
`Pi = (a22 + a11 )/2,
`
`P2 = (a22 - a11 )/2,
`
`qi =(a21 - a12)/2,
`()_ = 82- 81'
`
`q2= (a21 +a12)/2,
`
`0+= 82+ 01,
`
`(8)
`
`(9)
`
`(10)
`
`( 11)
`
`b22=r1 + r2.
`b21=t1+t2,
`Proof Using ( 8), the matrix A can be reformulated as
`
`A=A1+A2=
`(
`
`P1
`- q')+(-p2 q2).
`Pi
`Qi P2
`q,
`Clearly, R(O, ), R(82) in ( 4) and A 1 are elements of Jtro• while A 2 belongs to Jtrcr_ This
`leads to the following reformulation of the matrix B by using Lemmas 1-3:
`B = R(Oi) T AR(02)
`= R(8, )T A , R(02) + R(8i)T A2R(02)
`= R(8i) TR(82)A, +R(8i) TR (0i)T Ai
`= R(82- 8, )A, + R(02 + 8, )T A 2
`
`V'l
`M
`r---i
`V'l
`N
`00
`:!
`00
`N
`
`This completes the proof.
`The generation mode of the TPR method follows directly from the above theorem.
`COROLLARY . I/the 2 X 2 matrix A is given, we can diagonalize A and calculate
`the corresponding rotation angles 81 and 82 by two Cartesian-to-polar coordinates con(cid:173)
`versions, eight additions, and four scalings by ½:
`
`( 12)
`
`(13)
`
`(14)
`
`p, = (a22+a11)/2,
`
`Pi= (a22 - a1, )/2,
`
`q, = (a21 -a,2)/2,
`r, = sign (P1) V PT + QT,
`o_ = arctan (qi/ p, ),
`81 = (8+- 8-)/2,
`
`Qi = (a2, + a 12)/2,
`r2 = sign (P2) V p~ + qL
`B+ = arctan (qi/Pi ),
`
`82=(8++ 0_)/2,
`
`(15)
`b22=r1+r2.
`b11=r, - r2,
`Proof. Regarding ( 11), b12 = b21 = 0 is equivalent to t, = ti = 0. Equation ( 13)
`follows then from ( 10). This completes the proof.
`In equation ( 13), we choose the rotation through the smaller angle. All vectors
`lying in the first or the fourth quadrant are rotated onto the positive x-axis, and all vectors
`lying in the second and the third quadrant are rotated onto the negative x-axis. For
`vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation
`
`

`

`REDUCED SYD COMPUTATIONS AND HOMOGENEOUS SYD ARRAY
`angles e_ and 8+ satisfy 18- 1, l8+1 ~ 90°. This results in
`l81 I ~ 90° and
`182I ~ 90°,
`( 16)
`due to ( 14 ).
`Equation ( 16) is important with respect to the convergence of the Jacobi SVD
`method. Forsythe and Henrici [9] have proven the convergence for cyclic orderings if
`the rotation angles 81 and 82 are restricted to a closed interval inside the open interval
`( - 90°, 90°). They have also demonstrated that this condition may fail to hold, i.e., 81
`and 82 may be ± 90°, if the off-diagonal elements b12 and b21 in (5) have to be exactly
`annihilated. As a remedy, they suggested an under- or overrotation by computing the
`'Y )82 (-1 < 'Y < I) and proved
`two-sided rotation ( 4) with angles ( 1 -
`'Y )81 and ( 1 -
`its convergence. In practice, however, the finite machine accuracy in the real arithmetic
`allows only an approximative computation of the rotation angles and implies under- or
`overrotations. So the Jacobi SVD method converges without using under- or overrotations
`as shown by the experimental results of Brent, Luk, and Van Loan [3]. In case ofCORDIC
`implementations, the effect of implicit under- or overrotations is more apparent. The
`angles ± 90° can never be exactly calculated because of the limited angle resolution arc(cid:173)
`tan ( 2 - p) of the CORDIC algorithm, where p denotes the mantissa length.
`4. The CORDIC algorithm. In the previous section, we have seen that the main
`operations of the TPR-method are plane rotations and Cartesian-to-polar coordinates
`conversions. These operations can be carried out by multiplier-adder-based processors
`supported by software or special hardware units. An alternative approach is the use of
`dedicated processors that usually map algorithms more effectively to hardware. The
`CORDIC processor is such a powerful one for calculating trigonometric functions.
`The CORD IC algorithm was originally designed by Voider [ 14] as an iterative pro(cid:173)
`cedure for computing plane rotations and Cartesian-to-polar coordinates conversions. It
`was later generalized and unified by Walther [ 15], enabling a CORD IC processor to
`calculate more functions, including hyperbolic functions, as well as multiplications and
`divisions. In the following, we consider Voider's CORDIC algorithm because only trig(cid:173)
`onometric functions are involved in SYD applications.
`The CORDIC algorithm consists of iterative shift-add operations on a three-com(cid:173)
`ponent vector,
`X; + 1)
`(X; - er;o;y;)
`cos (a;)
`(
`1
`Y;+1 = y;+er;O;X; = cos(a;) er;sin( a ;)
`Z;+1 = Z; - eer;a;
`(0<o;< l ; er;= ±l;e = ±l;i = 0, I, ··· ,n- I),
`( 18)
`in which the iteration stepsize o; is defined by
`(19)
`
`717
`
`- er; sin (a;) )(x;),
`cos(a;) Y;
`
`(I 7)
`
`(
`
`V"l
`M
`r--i
`V"l
`N
`00
`:!
`00
`N
`
`The set of integers { S( i) } parametrizing the iterations is called CORDIC sequence.
`Equation ( 17) can be interpreted, except for a scaling factor of
`
`1-
`
`= Vi +o2
`k ·= -
`'cos(a;)
`' '
`
`(20)
`
`as a rotation of(x;, y ;)T through the angle a;, where the sign er; = ± I gives the rotation
`direction. After n iterations, the results are given by
`
`(21)
`
`(22)
`
`(
`
`x,, )
`(cos a
`y,, =K sin a
`
`- sin a )( xo ) ,
`cos a Yo
`z,, = z0 - ea,
`
`

`

`718
`
`B. y ANG AND J. F. BOHME
`
`with the overall scaling factor K = IL k; and the total rotation angle a = L; u;a;. Now,
`if the CORDIC sequence satisfies the following convergence condition
`n-1
`
`(23)
`
`a; - L ai;;i,. a,,_,
`
`(i=0, I,··· ,n-2),
`
`j • i + I
`we can choose the sign parameter
`
`( 24 )
`
`u;= { - sign (x;y;)
`sign ( cz;)
`
`fory 11-+0,
`forz11 -+0
`
`to force y 11 or z,, to zero, provided that the input data x0 , y0 , and z0 lie in the conver(cid:173)
`gence region
`
`(25)
`
`,, _ ,
`{larctan(yo/Xo)I
`C= L a;ei;;
`;- o
`
`lzol
`
`foryn-o,
`
`forz11 -+0.
`
`In this way, two different types of CORDIC trigonometric functions can be computed
`(Table 1 ). In the mode y,, -+ 0, the Cartesian coordinate (xo, y0 ) of a plane vector is
`converted to its polar representation, where the parameter c = ± I determines the sign
`of the phase angle calculated. When z11 - . 0, a given plane vector is rotated through the
`angle z0 , where c = ± I controls the rotation direction.
`In Table I, the principal value I arctan (y0 / .xo) I ;;i,. 90° of the inverse tangent function
`is calculated when computing Cartesian-to-polar coordinates conversions. Correspond(cid:173)
`ingly, x,, may be positive or negative according to the sign of x0 . So, it is guaranteed that
`a vector is always rotated through the smaller angle onto the x-axis in accordance with
`( 13). In this case, a convergence region of C ~ 90° is sufficient for the generation mode
`of the two-sided rotation.
`One main drawback of the CORDIC algorithm is the need of correcting the scal(cid:173)
`ing factor K that arises during the iterations ( 17). For example, if we use Voider's
`CORDIC sequence
`
`{S(i)} = {0, 1,2, 3, · · · ,p-1 ,p},
`(26)
`with n = p + I CORDIC iterations for a mantissa accuracy of 2 - P, the scaling factor is
`K = 1.64676. Compensating this undesired scaling effect with a minimum number of
`computations is of particular importance.
`Clearly, multiplying x,, and y,, in Table I by K- 1 will degrade the algorithm per(cid:173)
`formance substantially. Most of the scaling correction issues are based on shift-add oper(cid:173)
`ations. For a two-sided rotation that is implemented by four plane rotations, each matrix
`element undergoes two plane rotations so that the total scaling factor to be corrected
`is K 2• In this case, Cavallaro and Luk [ I 6] have pointed out that there is a simple
`systematic approach for scaling correction when using the CORD IC sequence ( 26).
`They proposed to use r p / 41 scaling iterations of the type x +- x - 2- 2i x with j E J =
`I} and one shift operation 2- 1
`{I, 3, 5, • • •, 2rp/4l -
`• The remaining scaling error is
`
`TABLE I
`CORDIC trigonometric fimclions (e = ± 1).
`
`y .... 0
`
`Z11 _.. 0
`
`x. = K sign (x0 ) Vxij + yij
`z. = z0 + c arctan (yo/x,,)
`
`x . = K(xo cos Zo - <Yo sin Zo)
`Yn = K(cxo sin Zo + Yo cos zo )
`
`V"l
`M
`r--i
`V"l
`N
`00
`:!
`00
`N
`
`

`

`REDUCED SYD COMPUTATIONS AND HOMOGENEOUS SYD ARRAY
`
`719
`
`bounded by 2- p- I, 1
`
`(27) 11- 2- 1 II ( 1- 2- 21) ·K21 =11 -2 - 1 II (l -2 - 21), II ( l +2-2;)1<2-p-l _
`
`JEJ
`
`JEJ
`
`1-0
`
`This approach, however, fails in the TPR method. Here, each matrix element un(cid:173)
`dergoes only one plane rotation. The scaling factor to be corrected is thus K rather than
`K 2
`• In order to solve this more difficult problem, different techniques have been developed
`in the literature. Haviland and Tuszynski [ 17] used similar scaling iterations as Cavallaro
`and Luk. Ahmed [ 18] repeated some CORD IC iterations to force K to a power of the
`machine radix. Delosme (19] combined both methods of Haviland, Tuszynski, and
`Ahmed for minimizing the number of computations. Deprettere, Dewilde, and Udo [ 20]
`suggested the double-shift concept.
`We designed a computer program [21] for a systematic search of CORDIC
`I) with differences
`sequences. We allow shifts parameters S(i) (i = 0, l, · · · , n -
`S(i + l) - S( i) E { 0, I, 2} to provide more optimization freedom. For an efficient
`scaling correction, we require that the scaling factor K be corrected by a sequence of nk
`shift-add operations,
`"*
`2- T(O) II ( I +7J(j)2- T(j)) · K = I +t::i.K
`
`( T(j) integers, 7/U) = ± l ) .
`
`(28)
`
`j~ I
`
`These additional scaling iterations are parametrized by the set of signed integers
`{ T(0), 7/( l) T(l ), · · · , 7/(nk)T(nk)}. The total number of iterations is L = n + nk.
`In ( 28 ), t::i.K denotes the remaining relative scaling error after the scaling correction.
`We emphasize that this is a systematic error with a constant sign. By contrast, the other
`two types of CORDIC errors, the angular error due to the limited angle resolution and
`the rounding error, are of statistical nature because they may be positive or negative.
`The scaling error is thus more critical with respect to error accumulation when repeated
`CORDIC operations on the same data have to be computed as in SYD applications.
`Roughly speaking, the total scaling error after k CORDIC function calls increases linearly
`with k, a fact that has been verified by our numerical experiments. For this reason, we
`require I t::i.KI to be much smaller than 2 - p _
`We found catalogues of CORDIC sequences with complexity comparable to those
`of Cavallaro and Luk. In the following, five examples for different mantissa lengths p =
`16, 20, 24, 28, and 32, including the total number of iterations L = n + nk, the convergence
`region C, and the remaining scaling error t::i.K are given:
`
`p = 16:
`
`{ S(i)} = { 0 I 2 3 .. · 15 I 6},
`{7J(j)T(j)} = {l+2 - 5+9+10}, L = 17 + 4, C= 100°, !:::i.K=-2- 16
`
`01
`-
`
`,
`
`2
`
`p = 20: { S(i) } = { 0 l 2 3 · · · 19 20},
`{7J(j)T(j)} = {I +2-5+9+10+16},
`
`L=21 + 5, C= 100°,
`
`t::i.K = 2-23.os,
`
`'When replacing r p/41 by r(p -
`respectively.
`2 When appending an additional scaling iteration with 71( 5) T( 5) ; + 16, the scaling accuracy can be
`enhanced to t:J( = 2 - 23 •
`
`I ) /41 or r(p + I )/41, the upper bound in (27) becomes 2-• or 2-•- 2 ,
`
`V"l
`M
`r--i
`V"l
`N
`00
`:!
`00
`N
`
`

`

`720
`
`B. YANG AND J. F. BOHME
`
`p=24: {S( i)} ={ l 123345566788910 ··· 2324 },
`
`{17(j)T(j)} ={ 0 -2+6},
`
`L=29+2, C=91°,
`
`AK= - r29· 13 ,
`
`p = 28: { S(i)} = {I 12334 5 5 6 6 7 8 8 9 10 11 12 13 14 14 15 .. · 2728},
`
`{11(j)T(j)}={0-2+6 },
`
`L =34+2, C=9 1°,
`
`M =2-32.53,
`
`p=32: {S(i)}={001333456789910 .. . 3132 },
`
`{ 17(j) T(j)} = { 1 - 3 - 8 + 16 -25 - 27},
`
`L = 36+5, C= 145°,
`
`AK= _ 2 - 39.93.
`Remember that in order to meet the convergence condition ( 23) and to provide a
`convergence region C ~ 90°, the minimum number of CORD IC iterations is p + I. So,
`for all CORDIC sequences given above, the number L - (p + I) of additional iterations
`for scaling corrrection is p/4. Moreover, except for the first CORDIC sequence, the
`remaining scaling error I AKI is significantly smaller than 2 -p. This leads to improved
`numerical properties compared with other CORDIC sequences reported in the literature.
`We also remark that if the symmetric eigenvalue problem is considered for which a
`convergence region of C ~ 45 ° is sufficient [ 2], the total number of CORDIC iterations
`L can be further reduced. An example that is nearly identical to the last CORDIC sequence
`given above is
`
`p=32: {S( i) }={I 3334567899IO ... 3132 },
`
`L = 34 + 5,
`
`{11(j) T(j)} = {0-3-8+16-25-27 },
`AK= _ 2 - 39.93 .
`For comparison, Delosme [ 5] has also given an optimized CORD IC sequence for the
`same situation. His sequence requires one iteration more ( L = 40) and achieves a scaling
`accuracy of M = 2- 33
`16
`•
`.
`We suspect that similar results can also be obtained by using Deprettere's double(cid:173)
`shift concept. However, this method requires a slightly increased hardware complexity
`and will not be discussed in this paper.
`
`5. CORD IC implementation of the SVD PEs. For easy illustration, we first introduce
`a CORD IC processor symbol as shown in Fig. 2. The descriptions inside the box determine
`uniquely the function mode of the CORDIC processor according to Table I. The output
`data x and y are assumed to be scaling corrected.
`It is now simple to map the operations ( 8 )- ( 11 ) and ( 12 )- ( 15) of the TPR method
`onto a two CORDIC processor architecture. In Fig. 3, the diagonal PEs of the SVD array
`are implemented by two CORDIC processors and eight adders. The dotted inputs of the
`adders represent negated inputs. Because the diagonal PEs work in the generation mode,
`
`V'l
`M
`r--i
`V'l
`N
`00
`:!
`00
`N
`
`Xo
`
`Yo
`
`Zo
`
`Y_O
`z
`E=±l
`
`X
`
`y
`
`z
`
`FIG. 2. CORDIC symbol.
`
`

`

`REDUCED SYD COMPUTATIONS AND HOMOGENEOUS SYD ARRAY
`
`721
`
`FIG. 3. CORDIC implementation of the diagonal PE of the SYD array.
`
`both CORDIC processors are driven in the "y -+ O" mode for computing Cartesian-to(cid:173)
`polar coordinates conversions. In Fig. 4, the off-diagonal PEs working in the rotation
`mode are implemented by two CORDIC processors and ten adders. Here, the CORDIC
`processors are driven in the "z-+ O" mode for performing plane rotations.
`Obviously, both CORDIC implementations have nearly the same architecture. All
`PEs of the SVD array can thus be implemented by one unified CORDIC SYD module
`( Fig. 5) without considerably increased hardware cost. The different computation modes
`of the diagonal and off-diagonal PEs are easily "programmed" by one control bit. The
`resulting SYD array is similar to that in Fig. I, but homogeneous with identical PEs.
`We remark that Fig. 5 is more a "graphic program" describing the sequence of
`operations to be computed rather than a hardware block diagram. We show in the fol(cid:173)
`lowing that the 12 adders that are paired into three pre-butterflys and three post-butterflys
`can be integrated into the two CORDIC processors without separate hardware realizations.
`The Jacobi SYD method is a recursive method. Each PE of the SYD array has to exchange
`data with its diagonal neighbors. Because of this data dependency, only recursive CORDIC
`processors can be used here. This is an arithmetic unit consisting of mainly three adders
`and two barrel-shifters. It carries out the L iterations of the CORDIC algorithm in L
`cycles by using data feedback. The two CORDIC processors contained in one CORDIC
`SYD module require six adders altogether. So, it is natural to modify the CORDIC
`processor architecture slightly and to use the existing six adders for computing both the
`pre-butterfly and the post-butterfly operations. The resulting CORDIC SYD module has
`
`V'l
`M
`r--i
`V'l
`N
`00
`:!
`00
`N
`
`0
`
`,__e_+ _ _ .... +
`
`2e1
`
`FIG. 4. CORDIC implementation of the off-diagonal PE of the SYD array.
`
`

`

`722
`
`B. YANG AND J. F. BOHME
`
`pre-butterflys
`, - - - " - - - - - - .
`
`post- butterflys
`, - - - " - - - - - - .
`
`FIG. 5. A unified CORDIC SYD module for implementing all PEs of the SYD array.
`
`the hardware complexity of two recursive CORDIC processors and requires a total com(cid:173)
`putation time of L + 2 iterations.
`In Fig. 6, the principal architecture of such a two CORDIC processor SYD module
`is shown. The dashed lines and boxes represent the additional hardware components
`
`V"l
`M
`r--i
`V"l
`N
`00
`:!
`00
`N
`
`~ x o r
`x~ x / 2
`
`x~ x y
`or
`y
`y
`X
`
`FIG. 6. The principal architecture of the unified CORDIC SYD module.
`
`

`

`REDUCED SYD COMPUTATIONS AND HOMOOENEOUS SYD ARRAY
`
`723
`
`enabling the CORDIC processors to compute the butterfly operations. It is easily verified
`that the upper four adders devoted to x and y can perform the following types of operations
`2- 1 x ± 2- 1 y (pre-butterfly), x ± y (post-butterfly), x ± 2- sy (CORDIC iteration) and
`x ± 2- s x ( scaling iteration) while the lower two adders devoted to z can compute 2 - l z 1
`± 2- 1z2 (pre-butterfly), z1 ± z2 (post-butterfly), and z ± a (CORDIC iteration), re(cid:173)
`spectively. The cross switch between the registers "Y1" and "X2" is needed to exchange
`data when the CORDIC SYD module switches from the pre-butterfly operations into
`the CORDIC iterations or from the CORDIC iterations into the post-butterfly operations,
`respectively. Then, we see from Fig. 3 that the output data pairs of the pre-butterflys are
`(p1, Pi) and (q1 , Q2), while the desired input data pairs for the CORDIC iterations are
`(p1 , Q1) and (p2 , q2), respectively. So, p2 and Q1 have to be exchanged.
`
`6. Comparisons. We now compare the new CORDIC SYD module with those pro(cid:173)
`posed by Cavallaro and Luk [ 4] and Delosme [ 5]. Let Acsvd and Tcsvd denote the area
`and time complexity of a CORDIC SYD module and Acordic and Tcordic those of a CORDIC
`processor, respectively. Cavallaro and Luk have shown that their most efficient parallel
`diagonalization method requires Acsvd = 2Acordic and Tcsvd = 3Tcordic for the diagonal
`PEs and Acsvd = 2Acordic and T csvd = 2 Tcordic for the off-diagonal PEs. By using the TPR
`method, we require T csvd = 2Acordic and T csvd = Tcordic for all PEs. In other words, having
`approximately the same hardware complexity, the computation time is reduced by more
`than 50 percent.
`A comparison to Delosme's method is more difficult because he follows a quite
`different approach. Therefore, only rough performance estimates are given here. In our
`method, we compute the rotation angles explicitly. After these computations have been
`completed in the diagonal PEs, the angles propagate to the off-diagonal ones. We assume
`that the propagation from one PE to its neighbors takes one cycle Tcycte, the time required
`for computing one CORDIC iteration. This implies local communications without
`broadcasting data. At the beginning of the second propagation cycle, the angles reach
`the diagonal neighbors of the diagonal PEs which complete their computations after
`T csvd• This means that the diagonal PEs have to wait for a delay time Tdelay = Tcycte +
`T c,vd before they can exchange data with their diagonal neighbors.3 The total time elapsed
`between two adjacent activities at each PE is thus Tcsvd + Tdetay = 2 Tc,vd + Tcycle = 2T cordic
`because T cyclc is negligible with respect to T cordic = L · Tcycle •
`Delosme does not compute the rotation angles explicitly. He rather calculates en(cid:173)
`codings of the angles, i.e., sequences of signs± I, and sends them to the off-diagonal PEs.
`This enables overlap of diagonal and off-diagonal rotations because the encoding signs
`are recursively obtained and become available before the completion of diagonal oper(cid:173)
`ations. Accordingly, no delay time is required ( Tdetay = 0), provided that the SYD array
`size ( the half of the matrix size) is smaller than the number ofCORDIC scaling iterations
`nk ( for details, see [ 5]). The drawback is, however, that the TPR method cannot be
`applied to the off-diagonal PEs. Four plane rotations are hence required, resulting in
`Tcvsd = 2Tcordic for two CORDIC processors in one module. In other words, the time
`complexities T csvd + Tdelay of both methods are nearly identical and equal 2T cordic• If,
`however, multiple problems are interleaved, the fraction of idle time that is 50 percent
`in our case can be reduced to almost zero. In such a situation, our method provides the
`double speed compared with Delosme's one.
`
`3 If the propagation time is assumed to be T,.vd, we get the well-known result T00,., = 2Ta ,d given by Brent
`and Luk (2).
`
`V"l
`M
`r--i
`V"l
`N
`00
`:!
`00
`N
`
`

`

`724
`
`B. YANG AND J. F. BOHME
`
`In terms of area complexity, both CORDIC SYD modules contain two CORDIC
`processors. Our module consists of essentially six adders, four barrel shifters, and one
`ROM table containing n angle values. Delosme's architecture requires four carry-save
`adders, four adders, and eight barrel shifters. So, as a rough estimate, both SYD modules
`have the same order of area complexities.
`Perhaps the most important advantage of Delosme's approach is the 2-bit wide
`horizontal and vertical data connections for sending angle encodings serially rather than
`sending the full angle values in parallel. The prices are the upper bound of the SVD array
`size depending on the number ofCORDIC scaling iterations, the relatively complicated
`timing, and a nonregular CORDIC architecture design. We also mention that while
`Delosme's method presumes a CORDIC implementation, the TPR method is applicable
`to other computing architectures.
`
`7. Other applications of the TPR method. Another advantage of the TPR method
`seems to be the relatively wide range of applications. We indicate some of them in the
`following.
`For the SYD of a rectangular matrix, a well-known method is first to triangularize
`the matrix by QR decomposition and then to apply the Jacobi SYD procedure to the
`triangular factor. Luk [ 22] has shown that both steps can be implemented by one triangular
`systolic array. Each PE contains a 2 X 2 submatrix. It applies two plane rotations ( through
`the same angle) to the two column vectors at the QR step and a two-sided rotation at
`the SYD step. For computing the SYD step, the PE can be realized by the CORDIC
`SYD module, as before. On the other side, the two CORDIC processors contained in
`the module are also appropriate to perform the two-plane rotations of the QR step. The
`CORDIC SYD module presented in this paper thus provides a suitable PE for Luk's
`triangular SYD array.
`Stewart (23 ] has proposed a square systolic array for computing the Schur decom(cid:173)
`position (SD) of a non-Hermitian matrix which, for example, is useful for evaluating
`the eigenvalues of the matrix. His approach is similar to the Jacobi SYD method. It is
`based on applying a sequence of two-sided rotations to 2 X 2 submatrices, where the left
`and right rotation angles are identical to make the diagonal submatrices upper triangular.
`While the diagonal PEs perform operations different from those in SYD, the off-diagonal
`PEs have exactly the same computational task as in SYD computing. Therefore, the
`CORDIC SYD module can also be used in Stewart's SD array.
`Even in sequential computations on a uniprocessor system, one can still apply the
`TPR method to reduce the computational complexity of two-sided rotations.
`
`8. Conclusion. We have investigated a novel algorithm for computing two-sided
`rotations requiring only two plane rotations and a few additions. This results in signifi(c

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket