SIAM J. MATRIX ANAL. APPL.
Vol. 12, No. 4, pp. 713-725, October 1991

© 1991 Society for Industrial and Applied Mathematics
009

REDUCING THE COMPUTATIONS OF THE SINGULAR VALUE
DECOMPOSITION ARRAY GIVEN BY BRENT AND LUK*

B. YANG† AND J. F. BÖHME†

Abstract. A new, efficient, two-plane rotation (TPR) method for computing two-sided rotations involved in singular value decomposition (SVD) is presented. It is shown that a two-sided rotation can be evaluated by only two plane rotations and a few additions. This leads to significantly reduced computations. Moreover, if coordinate rotation digital computer (CORDIC) processors are used for realizing the processing elements (PEs) of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle calculations can be avoided. The resulting SVD array has a homogeneous structure with identical diagonal and off-diagonal PEs. Similar results can also be obtained if the TPR method is applied to Luk's triangular SVD array and to Stewart's Schur decomposition array.

Key words. singular value decomposition, systolic arrays, CORDIC, two-sided rotations, VLSI

AMS(MOS) subject classification. 15A18

1. Introduction. One important problem in linear algebra and digital signal processing is the singular value decomposition (SVD). Typical applications arise in beamforming and direction finding, spectrum analysis, digital image processing, etc. [1]. Recently, there has been massive interest in parallel architectures for computing the SVD because of the high computational complexity of the SVD, the growing importance of real-time signal processing, and the rapid advances in very large scale integration (VLSI) that make low-cost, high-density, and fast processing and memory devices available.
There are different numerically stable methods for computing complete singular value and singular vector systems of dense matrices, for example, the Jacobi SVD method, the QR method, and the one-sided Hestenes method. For parallel implementations, the Jacobi SVD method is far superior in terms of simplicity, regularity, and local communications. Brent, Luk, and Van Loan have shown how the Jacobi SVD method with parallel ordering can be implemented by a two-dimensional systolic array [2], [3]. Various coordinate rotation digital computer (CORDIC) realizations of the SVD array have been reported by Cavallaro and Luk [4] and Delosme [5], [6].
The Jacobi SVD method is based, as is common for all two-sided approaches, on applying a sequence of two-sided rotations to 2 × 2 submatrices of the original matrix. The computational complexity is thus determined by how the two-sided rotations are computed. In most previous works, a two-sided rotation is evaluated in a straightforward manner by four plane rotations, where two of them are applied from the left to the two column vectors of the 2 × 2 submatrix and the other two are applied from the right to the row vectors, respectively. In the diagonal processing elements (PEs), additional operations for calculating rotation angles are required. This leads to an inhomogeneous array architecture containing two different types of PEs.
In this paper, we develop a two-plane rotation (TPR) method for computing two-sided rotations. We show that the above computational complexity can be reduced significantly because each two-sided rotation can be evaluated by only two plane rotations and a few additions. Moreover, the SVD array given by Brent and Luk becomes homogeneous with identical diagonal and off-diagonal PEs when CORDIC processors are

* Received by the editors September 28, 1989; accepted for publication (in revised form) August 2, 1990.
† Department of Electrical Engineering, Ruhr-Universität Bochum, 4630 Bochum, Germany.
used. In a recent work [6], Delosme has independently indicated this possibility in connection with "rough rotations." He has taken, however, a different approach that is based on encoding the rotation angles. He still requires four plane rotations in the off-diagonal PEs, while diagonal and off-diagonal operations can be overlapped.
Our paper is organized as follows. In § 2, we briefly reexamine Jacobi's SVD method and Brent and Luk's SVD array. Then, we develop the TPR method in § 3. The CORDIC algorithm is described in § 4, where in particular CORDIC scaling correction techniques are discussed and examples of scaling-corrected CORDIC sequences are given. In § 5, a unified CORDIC SVD module for all PEs of the SVD array is presented. This module is compared to those proposed by Cavallaro, Luk, and Delosme in § 6. Finally, we stress the applicability of the TPR method to several other problems.

2. Jacobi SVD method. In this paper, we consider real, square, and nonsymmetric matrices. Let M ∈ R^(N×N) be a matrix of dimension N. The SVD is given by

(1) M = U Σ V^T,

where U ∈ R^(N×N) and V ∈ R^(N×N) are orthogonal matrices containing the left and right singular vectors, respectively, and Σ ∈ R^(N×N) is a diagonal matrix of singular values. The superscript T denotes matrix transpose. Based on an extension of the Jacobi eigenvalue algorithm [7], Kogbetliantz [8] and Forsythe and Henrici [9] proposed to diagonalize M by a sequence of two-sided rotations,

(2) M_0 = M,  M_{k+1} = U_k^T M_k V_k  (k = 0, 1, 2, ...).

U_k and V_k describe two rotations in the (i, j)-plane (1 ≤ i < j ≤ N), where the rotation angles are chosen to annihilate the elements of M_k at the positions (i, j) and (j, i). Usually, several sweeps are necessary to complete the SVD, where a sweep is a sequence of N(N-1)/2 two-sided rotations according to a special ordering of the N(N-1)/2 different index pairs (i, j).
For sequential computing on a uniprocessor system, possibly the most frequently used orderings are the cyclic orderings, namely, the cyclic row ordering

(3) (i, j) = (1, 2), (1, 3), ..., (1, N), (2, 3), ..., (2, N), ..., (N-1, N)

or the equivalent cyclic column ordering. Sameh [10] and Schwiegelshohn and Thiele [11] have shown how to implement the cyclic row ordering on a ring-connected or a mesh-connected processor array. Recently, a variety of parallel orderings have been developed. Luk and Park [12] have shown that these parallel orderings are essentially equivalent to the cyclic orderings and thus share the same convergence properties.
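One sweep of the cyclic row ordering (3) is easy to enumerate. The following sketch (ours, not from the paper; the function name is our own) generates the index pairs:

```python
# Sketch (our illustration): one sweep of the cyclic row ordering (3),
# using 1-based indices (i, j) with 1 <= i < j <= N as in the text.

def cyclic_row_ordering(N):
    """Return the N(N-1)/2 index pairs of one sweep in cyclic row order."""
    return [(i, j) for i in range(1, N + 1) for j in range(i + 1, N + 1)]

# For N = 4 a sweep visits 4*3/2 = 6 pairs:
# (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
```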
Brent and Luk have suggested a particular parallel ordering and developed a square systolic array consisting of ⌈N/2⌉ × ⌈N/2⌉ PEs for implementing the Jacobi SVD method (Fig. 1). To do this, the matrix M is partitioned into 2 × 2 submatrices. Each PE contains one submatrix and performs a two-sided rotation

(4) B = R(θ1)^T A R(θ2),

where

(5) A = [a11 a12; a21 a22]  and  B = [b11 b12; b21 b22]
FIG. 1. The SVD array given by Brent and Luk.

denote the submatrix before and after the two-sided rotation, respectively, and

(6) R(θ) = [cos θ  sin θ; -sin θ  cos θ]

describes a plane rotation through the angle θ. At first, the diagonal PEs (symbolized by a double square in Fig. 1) generate the rotation angles to diagonalize the 2 × 2 submatrices (b12 = b21 = 0) stored in them. This means that θ1 and θ2 are first calculated from the elements of A and then relation (4) is used to compute b11 and b22. We call this the generation mode. Then, the rotation angles are sent to all off-diagonal PEs in the following way: the angles associated with the left-side rotations propagate along the rows, while the angles associated with the right-side rotations propagate along the columns. Once these angles are received, the off-diagonal PEs perform the two-sided rotations (4) on their stored data. We call this the rotation mode. Clearly, if we compute the rotation mode straightforwardly, we require four plane rotations. For the generation mode, additional operations for calculating θ1 and θ2 are required.

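For reference, the straightforward rotation mode, evaluating (4) directly as a matrix product, can be sketched as follows (our illustration; helper names are ours):

```python
import math

def R(theta):
    """Plane rotation (6)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def mul(X, Y):
    """2 x 2 matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def T(X):
    """2 x 2 matrix transpose."""
    return [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]

def rotation_mode_direct(A, t1, t2):
    """B = R(t1)^T A R(t2), the two-sided rotation (4) of a 2 x 2 submatrix."""
    return mul(mul(T(R(t1)), A), R(t2))
```

Since both rotation factors are orthogonal, the Frobenius norm of the submatrix is preserved by (4).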
3. TPR method for computing two-sided rotations. In order to develop the TPR method for computing two-sided rotations more efficiently, we first discuss the commutative properties of two special types, the rotation-type and the reflection-type, of 2 × 2 matrices. We define

(7) M_rot = { [x y; -y x] : x, y ∈ R }  and  M_ref = { [x y; y -x] : x, y ∈ R }.

The former is called rotation-type because it has the same matrix structure as a 2 × 2 plane rotation matrix. Similarly, the latter is called reflection-type because it has the same matrix structure as a 2 × 2 Givens reflection matrix [13]. Note that x and y need not be normalized to x² + y² = 1. Using the above definitions, the following results can be shown by some elementary manipulations.
LEMMA 1. If A1 ∈ M_rot and A2 ∈ M_rot, then A1 A2 = A2 A1 ∈ M_rot.
LEMMA 2. If A1 ∈ M_rot and A2 ∈ M_ref, then A1 A2 = A2 A1^T ∈ M_ref.
In particular, if we consider two plane rotations, we know the following.
LEMMA 3. If R(θ1) and R(θ2) are plane rotations described by (6), then R(θ1) R(θ2) = R(θ1 + θ2) and R(θ1)^T R(θ2) = R(θ2 - θ1).
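The lemmas are easy to spot-check numerically. The following sketch (ours) verifies Lemmas 1 and 2 on sample matrices of the forms defined in (7):

```python
# Numerical spot-check (our illustration) of Lemmas 1 and 2 for the classes (7):
# rotation-type [x, y; -y, x] and reflection-type [x, y; y, -x].

def mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def T(X):
    return [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]

A1 = [[0.6, 0.8], [-0.8, 0.6]]   # rotation-type (need not be normalized)
A2 = [[1.5, -2.0], [2.0, 1.5]]   # rotation-type
B1 = [[0.3, 0.7], [0.7, -0.3]]   # reflection-type

P = mul(A1, A2)        # Lemma 1: commutes, product is rotation-type
Q = mul(A1, B1)        # Lemma 2: A1 B1 = B1 A1^T, product is reflection-type
```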
Now, we give a theorem describing the rotation mode of the TPR method.
THEOREM. If the 2 × 2 matrix A and the two rotation angles θ1 and θ2 are given, then the two-sided rotation (4) can be computed by two plane rotations, ten additions, and four scalings by 1/2:

(8) p1 = (a22 + a11)/2,  p2 = (a22 - a11)/2,
    q1 = (a21 - a12)/2,  q2 = (a21 + a12)/2,

(9) θ- = θ2 - θ1,  θ+ = θ2 + θ1,

(10) (r1, t1)^T = R(θ-) (p1, q1)^T,  (r2, t2)^T = R(θ+) (p2, q2)^T,

(11) b11 = r1 - r2,  b12 = -t1 + t2,
     b21 = t1 + t2,  b22 = r1 + r2.

Proof. Using (8), the matrix A can be reformulated as

A = A1 + A2 = [p1 -q1; q1 p1] + [-p2 q2; q2 p2].

Clearly, R(θ1), R(θ2) in (4) and A1 are elements of M_rot, while A2 belongs to M_ref. This leads to the following reformulation of the matrix B by using Lemmas 1-3:

B = R(θ1)^T A R(θ2)
  = R(θ1)^T A1 R(θ2) + R(θ1)^T A2 R(θ2)
  = R(θ1)^T R(θ2) A1 + R(θ1)^T R(θ2)^T A2
  = R(θ2 - θ1) A1 + R(θ2 + θ1)^T A2
  = R(θ-) [p1 -q1; q1 p1] + R(θ+)^T [-p2 q2; q2 p2]
  = [r1 - r2  -t1 + t2; t1 + t2  r1 + r2].

This completes the proof.
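The theorem can be checked numerically. The following sketch (ours) evaluates (8)-(11) with exactly two plane rotations and compares the result with the direct product R(θ1)^T A R(θ2):

```python
import math

def rot(p, q, theta):
    """Plane rotation (6) applied to a vector: R(theta) (p, q)^T."""
    c, s = math.cos(theta), math.sin(theta)
    return c * p + s * q, -s * p + c * q

def tpr_rotation_mode(A, t1, t2):
    """B = R(t1)^T A R(t2) via (8)-(11): two plane rotations, ten additions
    (counting subtractions), and four scalings by 1/2."""
    (a11, a12), (a21, a22) = A
    p1, q1 = (a22 + a11) / 2, (a21 - a12) / 2      # (8)
    p2, q2 = (a22 - a11) / 2, (a21 + a12) / 2
    tm, tp = t2 - t1, t2 + t1                      # (9)
    r1, u1 = rot(p1, q1, tm)                       # (10)
    r2, u2 = rot(p2, q2, tp)
    return [[r1 - r2, -u1 + u2],                   # (11)
            [u1 + u2,  r1 + r2]]
```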
The generation mode of the TPR method follows directly from the above theorem.
COROLLARY. If the 2 × 2 matrix A is given, we can diagonalize A and calculate the corresponding rotation angles θ1 and θ2 by two Cartesian-to-polar coordinates conversions, eight additions, and four scalings by 1/2:

(12) p1 = (a22 + a11)/2,  p2 = (a22 - a11)/2,
     q1 = (a21 - a12)/2,  q2 = (a21 + a12)/2,

(13) r1 = sign(p1) sqrt(p1² + q1²),  θ- = arctan(q1/p1),
     r2 = sign(p2) sqrt(p2² + q2²),  θ+ = arctan(q2/p2),

(14) θ1 = (θ+ - θ-)/2,  θ2 = (θ+ + θ-)/2,

(15) b11 = r1 - r2,  b22 = r1 + r2.

Proof. Regarding (11), b12 = b21 = 0 is equivalent to t1 = t2 = 0. Equation (13) then follows from (10). This completes the proof.
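Correspondingly, the generation mode (12)-(15) can be sketched as follows (our illustration; the handling of p1 = 0 or p2 = 0 reflects the arbitrary rotation direction for vectors on the y-axis discussed in the text):

```python
import math

def tpr_generation_mode(A):
    """Diagonalize the 2 x 2 matrix A via (12)-(15): two Cartesian-to-polar
    conversions, eight additions, four scalings by 1/2.
    Returns (theta1, theta2, b11, b22), with sign conventions as in (13)."""
    (a11, a12), (a21, a22) = A
    p1, q1 = (a22 + a11) / 2, (a21 - a12) / 2              # (12)
    p2, q2 = (a22 - a11) / 2, (a21 + a12) / 2
    r1 = math.copysign(math.hypot(p1, q1), p1 if p1 else 1.0)   # (13)
    r2 = math.copysign(math.hypot(p2, q2), p2 if p2 else 1.0)
    tm = math.atan(q1 / p1) if p1 else math.copysign(math.pi / 2, q1)
    tp = math.atan(q2 / p2) if p2 else math.copysign(math.pi / 2, q2)
    t1, t2 = (tp - tm) / 2, (tp + tm) / 2                  # (14)
    return t1, t2, r1 - r2, r1 + r2                        # (15)
```

Note that b11 · b22 = r1² - r2² = det A, as expected for a rotation-based diagonalization.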
In equation (13), we choose the rotation through the smaller angle. All vectors lying in the first or the fourth quadrant are rotated onto the positive x-axis, and all vectors lying in the second or the third quadrant are rotated onto the negative x-axis. For vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation angles θ- and θ+ satisfy |θ-|, |θ+| ≤ 90°. This results in

(16) |θ1| ≤ 90°  and  |θ2| ≤ 90°

due to (14).
Equation (16) is important with respect to the convergence of the Jacobi SVD method. Forsythe and Henrici [9] have proven the convergence for cyclic orderings if the rotation angles θ1 and θ2 are restricted to a closed interval inside the open interval (-90°, 90°). They have also demonstrated that this condition may fail to hold, i.e., θ1 and θ2 may be ±90°, if the off-diagonal elements b12 and b21 in (5) have to be exactly annihilated. As a remedy, they suggested an under- or overrotation by computing the two-sided rotation (4) with the angles (1 ± γ)θ1 and (1 ± γ)θ2 (-1 < γ < 1) and proved its convergence. In practice, however, the finite machine accuracy of real arithmetic allows only an approximate computation of the rotation angles and thus implies under- or overrotations. So the Jacobi SVD method converges without explicit under- or overrotations, as shown by the experimental results of Brent, Luk, and Van Loan [3]. In the case of CORDIC implementations, the effect of implicit under- or overrotations is more apparent. The angles ±90° can never be exactly calculated because of the limited angle resolution arctan(2^(-p)) of the CORDIC algorithm, where p denotes the mantissa length.

4. The CORDIC algorithm. In the previous section, we have seen that the main operations of the TPR method are plane rotations and Cartesian-to-polar coordinates conversions. These operations can be carried out by multiplier-adder-based processors supported by software or special hardware units. An alternative approach is the use of dedicated processors that usually map algorithms more effectively to hardware. The CORDIC processor is such a powerful one for calculating trigonometric functions.
The CORDIC algorithm was originally designed by Volder [14] as an iterative procedure for computing plane rotations and Cartesian-to-polar coordinates conversions. It was later generalized and unified by Walther [15], enabling a CORDIC processor to calculate more functions, including hyperbolic functions, as well as multiplications and divisions. In the following, we consider Volder's CORDIC algorithm because only trigonometric functions are involved in SVD applications.
The CORDIC algorithm consists of iterative shift-add operations on a three-component vector,

(17) x_{i+1} = x_i - σ_i δ_i y_i,  y_{i+1} = y_i + σ_i δ_i x_i,

(18) z_{i+1} = z_i - ε σ_i α_i   (0 < δ_i ≤ 1; σ_i = ±1; ε = ±1; i = 0, 1, ..., n-1),

in which the iteration stepsize δ_i is defined by

(19) δ_i = tan(α_i) = 2^(-S(i)).

The set of integers {S(i)} parametrizing the iterations is called the CORDIC sequence.
Equation (17) can be interpreted, except for a scaling factor of

(20) k_i = 1/cos(α_i) = sqrt(1 + δ_i²),

as a rotation of (x_i, y_i)^T through the angle α_i, where the sign σ_i = ±1 gives the rotation direction. After n iterations, the results are given by

(21) (x_n, y_n)^T = K [cos α  -sin α; sin α  cos α] (x_0, y_0)^T,

(22) z_n = z_0 - ε α,
with the overall scaling factor K = Π_i k_i and the total rotation angle α = Σ_i σ_i α_i. Now, if the CORDIC sequence satisfies the following convergence condition

(23) α_i - Σ_{j=i+1}^{n-1} α_j ≤ α_{n-1}   (i = 0, 1, ..., n-2),

we can choose the sign parameter

(24) σ_i = -sign(x_i y_i)  for y_n → 0,
     σ_i = ε sign(z_i)     for z_n → 0

to force y_n or z_n to zero, provided that the input data x_0, y_0, and z_0 lie in the convergence region

(25) C = Σ_{i=0}^{n-1} α_i ≥ |arctan(y_0/x_0)|  for y_n → 0,
     C ≥ |z_0|  for z_n → 0.

In this way, two different types of CORDIC trigonometric functions can be computed (Table 1). In the mode y_n → 0, the Cartesian coordinates (x_0, y_0) of a plane vector are converted to the polar representation, where the parameter ε = ±1 determines the sign of the phase angle calculated. When z_n → 0, a given plane vector is rotated through the angle z_0, where ε = ±1 controls the rotation direction.
In Table 1, the principal value |arctan(y_0/x_0)| ≤ 90° of the inverse tangent function is calculated when computing Cartesian-to-polar coordinates conversions. Correspondingly, x_n may be positive or negative according to the sign of x_0. So, it is guaranteed that a vector is always rotated through the smaller angle onto the x-axis, in accordance with (13). In this case, a convergence region of C ≥ 90° is sufficient for the generation mode of the two-sided rotation.

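A minimal software model of the iterations (17)-(19) in the y_n → 0 mode with ε = +1 and the Volder sequence (26) may clarify the data flow (our sketch; real CORDIC hardware uses integer shift-add operations, not floating-point multiplies):

```python
import math

def cordic_vectoring(x, y, p=24):
    """CORDIC y_n -> 0 mode (eps = +1) with the Volder sequence S(i) = i,
    n = p + 1 iterations; returns (x_n, y_n, z_n) for z_0 = 0 and x > 0."""
    z = 0.0
    for i in range(p + 1):
        delta = 2.0 ** (-i)                   # delta_i = tan(alpha_i) = 2^{-S(i)}
        sigma = -1.0 if x * y > 0 else 1.0    # sigma_i = -sign(x_i y_i), cf. (24)
        x, y = x - sigma * delta * y, y + sigma * delta * x
        z -= sigma * math.atan(delta)         # z_{i+1} = z_i - sigma_i alpha_i
    return x, y, z

K = 1.0
for i in range(25):                           # K = prod k_i, cf. (20); K ~ 1.64676
    K *= math.sqrt(1.0 + 2.0 ** (-2 * i))
```

For (x_0, y_0) = (3, 4) this drives y to zero and delivers x_n ≈ K·5 and z_n ≈ arctan(4/3), illustrating the first row of Table 1.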
One main drawback of the CORDIC algorithm is the need to correct the scaling factor K that arises during the iterations (17). For example, if we use Volder's CORDIC sequence

(26) {S(i)} = {0, 1, 2, 3, ..., p-1, p}

with n = p + 1 CORDIC iterations for a mantissa accuracy of 2^(-p), the scaling factor is K ≈ 1.64676. Compensating this undesired scaling effect with a minimum number of computations is of particular importance.
Clearly, multiplying x_n and y_n in Table 1 by K^(-1) will degrade the algorithm performance substantially. Most scaling correction techniques are based on shift-add operations. For a two-sided rotation that is implemented by four plane rotations, each matrix element undergoes two plane rotations, so that the total scaling factor to be corrected is K². In this case, Cavallaro and Luk [16] have pointed out that there is a simple systematic approach to scaling correction when using the CORDIC sequence (26). They proposed to use ⌈p/4⌉ scaling iterations of the type x ← x - 2^(-2j) x with j ∈ J = {1, 3, 5, ..., 2⌈p/4⌉ - 1} and one shift operation 2^(-1). The remaining scaling error is

TABLE 1
CORDIC trigonometric functions (ε = ±1).

y_n → 0:  x_n = K sign(x_0) sqrt(x_0² + y_0²),   z_n = z_0 + ε arctan(y_0/x_0)
z_n → 0:  x_n = K(x_0 cos z_0 - ε y_0 sin z_0),  y_n = K(ε x_0 sin z_0 + y_0 cos z_0)
bounded by 2^(-p-1):¹

(27) |1 - 2^(-1) Π_{j∈J} (1 - 2^(-2j)) · K²|
   = |1 - 2^(-1) Π_{j∈J} (1 - 2^(-2j)) · Π_{i=0}^{p} (1 + 2^(-2i))| < 2^(-p-1).

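The bound (27) can be evaluated numerically; the following check (ours) works the case p = 16:

```python
# Numerical check (our illustration) of the Cavallaro-Luk bound (27) for p = 16.
p = 16

K2 = 1.0
for i in range(p + 1):
    K2 *= 1.0 + 2.0 ** (-2 * i)          # K^2 for the Volder sequence (26)

ceil_p4 = -(-p // 4)                     # ceil(p/4) scaling iterations
J = [2 * j + 1 for j in range(ceil_p4)]  # J = {1, 3, ..., 2*ceil(p/4) - 1}

corr = 0.5                               # the extra shift 2^{-1}
for j in J:
    corr *= 1.0 - 2.0 ** (-2 * j)        # scaling iteration x <- x - 2^{-2j} x

residual = abs(1.0 - corr * K2)          # ~4e-6, below 2^{-17} ~ 7.6e-6
```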
This approach, however, fails in the TPR method. Here, each matrix element undergoes only one plane rotation. The scaling factor to be corrected is thus K rather than K². In order to solve this more difficult problem, different techniques have been developed in the literature. Haviland and Tuszynski [17] used scaling iterations similar to those of Cavallaro and Luk. Ahmed [18] repeated some CORDIC iterations to force K to a power of the machine radix. Delosme [19] combined the methods of Haviland, Tuszynski, and Ahmed to minimize the number of computations. Deprettere, Dewilde, and Udo [20] suggested the double-shift concept.
We designed a computer program [21] for a systematic search of CORDIC sequences. We allow shift parameters S(i) (i = 0, 1, ..., n-1) with differences S(i+1) - S(i) ∈ {0, 1, 2} to provide more optimization freedom. For an efficient scaling correction, we require that the scaling factor K be corrected by a sequence of n_k shift-add operations,

(28) 2^(-T(0)) Π_{j=1}^{n_k} (1 + η(j) 2^(-T(j))) · K = 1 + ΔK   (T(j) integers, η(j) = ±1).

These additional scaling iterations are parametrized by the set of signed integers {T(0), η(1)T(1), ..., η(n_k)T(n_k)}. The total number of iterations is L = n + n_k.
In (28), ΔK denotes the remaining relative scaling error after the scaling correction. We emphasize that this is a systematic error with a constant sign. By contrast, the other two types of CORDIC errors, the angular error due to the limited angle resolution and the rounding error, are of statistical nature because they may be positive or negative. The scaling error is thus more critical with respect to error accumulation when repeated CORDIC operations on the same data have to be computed, as in SVD applications. Roughly speaking, the total scaling error after k CORDIC function calls increases linearly with k, a fact that has been verified by our numerical experiments. For this reason, we require |ΔK| to be much smaller than 2^(-p).
We found catalogues of CORDIC sequences with complexity comparable to those of Cavallaro and Luk. In the following, five examples for different mantissa lengths p = 16, 20, 24, 28, and 32, including the total number of iterations L = n + n_k, the convergence region C, and the remaining scaling error ΔK, are given:

p = 16: {S(i)} = {0 1 2 3 ... 15 16},
        {T(0), η(j)T(j)} = {1, +2, -5, +9, +10},  L = 17 + 4,  C ≈ 100°,  ΔK ≈ -2^(-...)²

p = 20: {S(i)} = {0 1 2 3 ... 19 20},
        {T(0), η(j)T(j)} = {1, +2, -5, +9, +10, +16},  L = 21 + 5,  C ≈ 100°,  ΔK ≈ -2^(-...)

¹ When replacing ⌈p/4⌉ by ⌈(p-1)/4⌉ or ⌈(p+1)/4⌉, the upper bound in (27) becomes 2^(-p) or 2^(-p-2), respectively.
² When appending an additional scaling iteration with η(5)T(5) = +16, the scaling accuracy can be enhanced to ΔK ≈ 2^(-...).
p = 24: {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 ... 23 24},
        {T(0), η(j)T(j)} = {0, -2, +6},  L = 29 + 2,  C ≈ 91°,  ΔK ≈ 2^(-...),

p = 28: {S(i)} = {1 1 2 3 3 4 5 5 6 6 7 8 8 9 10 11 12 13 14 14 15 ... 27 28},
        {T(0), η(j)T(j)} = {0, -2, +6},  L = 34 + 2,  C ≈ 91°,  ΔK ≈ 2^(-...),

p = 32: {S(i)} = {0 0 1 3 3 3 4 5 6 7 8 9 9 10 ... 31 32},
        {T(0), η(j)T(j)} = {1, -3, -8, +16, -25, -27},  L = 36 + 5,  C ≈ 105°,  ΔK ≈ -2^(-39....).

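We do not know the internals of the search program [21]; as an illustration of the idea only, a crude greedy selection of scaling iterations of the form (28) for the factor K (here p = 16, n_k = 4, and T(0) = 1 fixed by hand, all our choices) can be sketched as follows:

```python
import math

p = 16
K = 1.0
for i in range(p + 1):
    K *= math.sqrt(1.0 + 2.0 ** (-2 * i))     # K ~ 1.64676 for the Volder sequence

g = K / 2.0                                   # T(0) = 1, i.e., one shift 2^{-1}
chosen = []
for _ in range(4):                            # n_k = 4 scaling iterations
    # Pick the signed shift term (1 + eta 2^{-T}) that brings g closest to 1.
    eta, T = min(((e, t) for e in (1, -1) for t in range(1, p + 2)),
                 key=lambda c: abs(math.log(g * (1.0 + c[0] * 2.0 ** (-c[1])))))
    g *= 1.0 + eta * 2.0 ** (-T)
    chosen.append(eta * T)

dK = g - 1.0                                  # remaining relative scaling error
```

A single greedy pass like this already reaches |ΔK| on the order of 10^(-5) and reproduces the leading terms +2, -5 of the published p = 16 set; the catalogued sequences are tuned further and also co-optimize the shifts S(i) themselves.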
Remember that in order to meet the convergence condition (23) and to provide a convergence region C ≥ 90°, the minimum number of CORDIC iterations is p + 1. So, for all CORDIC sequences given above, the number L - (p + 1) of additional iterations for scaling correction is p/4. Moreover, except for the first CORDIC sequence, the remaining scaling error |ΔK| is significantly smaller than 2^(-p). This leads to improved numerical properties compared with other CORDIC sequences reported in the literature.
We also remark that if the symmetric eigenvalue problem is considered, for which a convergence region of C ≥ 45° is sufficient [2], the total number of CORDIC iterations L can be further reduced. An example that is nearly identical to the last CORDIC sequence given above is

p = 32: {S(i)} = {1 3 3 3 4 5 6 7 8 9 9 10 ... 31 32},
        {T(0), η(j)T(j)} = {0, -3, -8, +16, -25, -27},  L = 34 + 5,  C ≈ 55°,  ΔK ≈ -2^(-39.93).

For comparison, Delosme [5] has also given an optimized CORDIC sequence for the same situation. His sequence requires one iteration more (L = 40) and achieves a scaling accuracy of ΔK ≈ 2^(-33.16).
We suspect that similar results can also be obtained by using Deprettere's double-shift concept. However, this method requires a slightly increased hardware complexity and will not be discussed in this paper.

5. CORDIC implementation of the SVD PEs. For easy illustration, we first introduce a CORDIC processor symbol as shown in Fig. 2. The descriptions inside the box uniquely determine the function mode of the CORDIC processor according to Table 1. The output data x and y are assumed to be scaling corrected.
It is now simple to map the operations (8)-(11) and (12)-(15) of the TPR method onto a two-CORDIC-processor architecture. In Fig. 3, the diagonal PEs of the SVD array are implemented by two CORDIC processors and eight adders. The dotted inputs of the adders represent negated inputs. Because the diagonal PEs work in the generation mode,

FIG. 2. CORDIC symbol.

FIG. 3. CORDIC implementation of the diagonal PE of the SVD array.

both CORDIC processors are driven in the "y → 0" mode for computing Cartesian-to-polar coordinates conversions. In Fig. 4, the off-diagonal PEs working in the rotation mode are implemented by two CORDIC processors and ten adders. Here, the CORDIC processors are driven in the "z → 0" mode for performing plane rotations.
Obviously, both CORDIC implementations have nearly the same architecture. All PEs of the SVD array can thus be implemented by one unified CORDIC SVD module (Fig. 5) without considerably increased hardware cost. The different computation modes of the diagonal and off-diagonal PEs are easily "programmed" by one control bit. The resulting SVD array is similar to that in Fig. 1, but homogeneous with identical PEs.
We remark that Fig. 5 is more a "graphic program" describing the sequence of operations to be computed than a hardware block diagram. We show in the following that the 12 adders, which are paired into three pre-butterflies and three post-butterflies, can be integrated into the two CORDIC processors without separate hardware realizations. The Jacobi SVD method is a recursive method. Each PE of the SVD array has to exchange data with its diagonal neighbors. Because of this data dependency, only recursive CORDIC processors can be used here. Such a processor is an arithmetic unit consisting mainly of three adders and two barrel shifters. It carries out the L iterations of the CORDIC algorithm in L cycles by using data feedback. The two CORDIC processors contained in one CORDIC SVD module require six adders altogether. So, it is natural to modify the CORDIC processor architecture slightly and to use the existing six adders for computing both the pre-butterfly and the post-butterfly operations. The resulting CORDIC SVD module has

FIG. 4. CORDIC implementation of the off-diagonal PE of the SVD array.
FIG. 5. A unified CORDIC SVD module for implementing all PEs of the SVD array.

the hardware complexity of two recursive CORDIC processors and requires a total computation time of L + 2 iterations.
In Fig. 6, the principal architecture of such a two-CORDIC-processor SVD module is shown. The dashed lines and boxes represent the additional hardware components

FIG. 6. The principal architecture of the unified CORDIC SVD module.
enabling the CORDIC processors to compute the butterfly operations. It is easily verified that the upper four adders devoted to x and y can perform the following types of operations: 2^(-1)x ± 2^(-1)y (pre-butterfly), x ± y (post-butterfly), x ± 2^(-S(i))y (CORDIC iteration), and x ± 2^(-T(j))x (scaling iteration), while the lower two adders devoted to z can compute 2^(-1)z1 ± 2^(-1)z2 (pre-butterfly), z1 ± z2 (post-butterfly), and z ± α_i (CORDIC iteration), respectively. The cross switch between the registers "Y1" and "X2" is needed to exchange data when the CORDIC SVD module switches from the pre-butterfly operations into the CORDIC iterations or from the CORDIC iterations into the post-butterfly operations, respectively. For instance, we see from Fig. 3 that the output data pairs of the pre-butterflies are (p1, p2) and (q1, q2), while the desired input data pairs for the CORDIC iterations are (p1, q1) and (p2, q2), respectively. So, p2 and q1 have to be exchanged.

6. Comparisons. We now compare the new CORDIC SVD module with those proposed by Cavallaro and Luk [4] and Delosme [5]. Let A_csvd and T_csvd denote the area and time complexity of a CORDIC SVD module and A_cordic and T_cordic those of a CORDIC processor, respectively. Cavallaro and Luk have shown that their most efficient parallel diagonalization method requires A_csvd ≈ 2A_cordic and T_csvd ≈ 3T_cordic for the diagonal PEs and A_csvd ≈ 2A_cordic and T_csvd ≈ 2T_cordic for the off-diagonal PEs. By using the TPR method, we require A_csvd ≈ 2A_cordic and T_csvd ≈ T_cordic for all PEs. In other words, having approximately the same hardware complexity, the computation time is reduced by more than 50 percent.
A comparison to Delosme's method is more difficult because he follows a quite different approach. Therefore, only rough performance estimates are given here. In our method, we compute the rotation angles explicitly. After these computations have been completed in the diagonal PEs, the angles propagate to the off-diagonal ones. We assume that the propagation from one PE to its neighbors takes one cycle T_cycle, the time required for computing one CORDIC iteration. This implies local communications without broadcasting data. At the beginning of the second propagation cycle, the angles reach the diagonal neighbors of the diagonal PEs, which complete their computations after T_csvd. This means that the diagonal PEs have to wait for a delay time T_delay = T_cycle + T_csvd before they can exchange data with their diagonal neighbors.³ The total time elapsed between two adjacent activities at each PE is thus T_csvd + T_delay = 2T_csvd + T_cycle ≈ 2T_cordic because T_cycle is negligible with respect to T_cordic = L·T_cycle.
Delosme does not compute the rotation angles explicitly. He rather calculates encodings of the angles, i.e., sequences of signs ±1, and sends them to the off-diagonal PEs. This enables an overlap of diagonal and off-diagonal rotations because the encoding signs are obtained recursively and become available before the completion of the diagonal operations. Accordingly, no delay time is required (T_delay = 0), provided that the SVD array size (half of the matrix size) is smaller than the number of CORDIC scaling iterations n_k (for details, see [5]). The drawback is, however, that the TPR method cannot be applied to the off-diagonal PEs. Four plane rotations are hence required, resulting in T_csvd = 2T_cordic for two CORDIC processors in one module. In other words, the time complexities T_csvd + T_delay of both methods are nearly identical and equal 2T_cordic. If, however, multiple problems are interleaved, the fraction of idle time, which is 50 percent in our case, can be reduced to almost zero. In such a situation, our method provides double speed compared with Delosme's.

³ If the propagation time is assumed to be T_csvd, we get the well-known result T_delay = 2T_csvd given by Brent and Luk [2].
In terms of area complexity, both CORDIC SVD modules contain two CORDIC processors. Our module consists essentially of six adders, four barrel shifters, and one ROM table containing n angle values. Delosme's architecture requires four carry-save adders, four adders, and eight barrel shifters. So, as a rough estimate, both SVD modules have the same order of area complexity.
Perhaps the most important advantage of Delosme's approach is the 2-bit-wide horizontal and vertical data connections for sending angle encodings serially rather than sending the full angle values in parallel. The prices are the upper bound on the SVD array size depending on the number of CORDIC scaling iterations, the relatively complicated timing, and a nonregular CORDIC architecture design. We also mention that while Delosme's method presumes a CORDIC implementation, the TPR method is applicable to other computing architectures.

7. Other applications of the TPR method. Another advantage of the TPR method
seems to be its relatively wide range of applications. We indicate some of them in the
following.
For the SVD of a rectangular matrix, a well-known method is first to triangularize
the matrix by QR decomposition and then to apply the Jacobi SVD procedure to the
triangular factor. Luk [22] has shown that both steps can be implemented by one triangular
systolic array. Each PE contains a 2 × 2 submatrix. It applies two plane rotations (through
the same angle) to the two column vectors at the QR step and a two-sided rotation at
the SVD step. For computing the SVD step, the PE can be realized by the CORDIC
SVD module, as before. On the other hand, the two CORDIC processors contained in
the module are also appropriate for performing the two plane rotations of the QR step. The
CORDIC SVD module presented in this paper thus provides a suitable PE for Luk's
triangular SVD array.
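To illustrate the two-sided rotation such a PE evaluates, the following numerical sketch computes R(θ_l)ᵀ M R(θ_r) for a 2 × 2 submatrix using only two plane rotations and a few additions, in the spirit of the TPR method (a floating-point illustration with assumed names, not the fixed-point CORDIC realization):

```python
import math

def two_sided_rotation_tpr(M, theta_l, theta_r):
    """Compute R(theta_l)^T @ M @ R(theta_r) for a 2x2 matrix M = ((a, b), (c, d)),
    where R(t) = ((cos t, -sin t), (sin t, cos t)), using two plane rotations
    and additions only."""
    (a, b), (c, d) = M
    # Split M into a part that commutes with rotations and one that anticommutes.
    p, q = (a + d) / 2, (b - c) / 2      # rotation-like part ((p, q), (-q, p))
    r, s = (a - d) / 2, (b + c) / 2      # reflection-like part ((r, s), (s, -r))
    td, ts = theta_r - theta_l, theta_r + theta_l
    # First plane rotation: (p, q) through the difference angle.
    p2 = p * math.cos(td) + q * math.sin(td)
    q2 = q * math.cos(td) - p * math.sin(td)
    # Second plane rotation: (r, s) through the sum angle.
    r2 = r * math.cos(ts) + s * math.sin(ts)
    s2 = s * math.cos(ts) - r * math.sin(ts)
    # Reassemble the 2x2 result with additions only.
    return ((p2 + r2, q2 + s2), (s2 - q2, p2 - r2))
```

A direct check against the explicit matrix product confirms the identity; in hardware, the two plane rotations map naturally onto the two CORDIC processors of the module.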
Stewart [23] has proposed a square systolic array for computing the Schur decom-
position (SD) of a non-Hermitian matrix which, for example, is useful for evaluating
the eigenvalues of the matrix. His approach is similar to the Jacobi SVD method. It is
based on applying a sequence of two-sided rotations to 2 × 2 submatrices, where the left
and right rotation angles are identical to make the diagonal submatrices upper triangular.
While the diagonal PEs perform operations different from those in SVD, the off-diagonal
PEs have exactly the same computational task as in SVD computing. Therefore, the
