throbber

`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;seehttp://www.siam.org/journals/ojsa.php
`
`
`
`
`
`
`
`
`
`SIAM J. MATRIX ANAL. APPL.
`Vol. 12, No. 4, pp. 713-725, October 1991
`
`© 1991 Society for Industrial and Applied Mathematics
`009
`
`REDUCING THE COMPUTATIONS OF THE SINGULAR VALUE
`DECOMPOSITION ARRAY GIVEN BY BRENT AND LUK*
`
`B. YANG} AND J. F. BOHMET
`
`Abstract. A new,efficient, two-plane rotation (TPR) method for computing two-sided rotations involved
`in singular value decomposition (SVD) is presented.It is shown that a two-sided rotation can be evaluated by
`only two plane rotations and a few additions. This leads to significantly reduced computations. Moreover, if
`coordinate rotation digital computer (CORDIC) processors are used for realizing the processing elements (PEs)
`of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle
`calculations can be avoided. The resulting SVD array has a homogeneousstructure with identical diagonal and
`off-diagonal PEs. Similar results can also be obtained if the TPR method is applied to Luk’s triangular SVD
`array and to Stewart’s Schur decomposition array.
`
`Key words. singular value decomposition, systolic arrays, CORDIC,two-sided rotations, VLSI
`
`AMS(MOS)subject classification. 15A18
`
`1. Introduction. One important problem in linear algebra and digital signal pro-
`cessing is the singular value decomposition (SVD). Typical applications arise in beam-
`forming and direction finding, spectrum analysis, digital image processing, etc. [1]. Re-
`cently, there has been a massive interest in parallel architectures for computing SVD
`because of the high computational complexity of SVD, the growing importance ofreal-
`timesignal processing, and the rapid advances in very large scale integration (VLSI) that
`make low-cost, high-density and fast processing memory devices available.
`There are different numerically stable methods for computing complete singular
`value and singular vector systems of dense matrices, for example, the Jacobi SVD method,
`the QR method, and the one-sided Hestenes method. For parallel implementations, the
`Jacobi SVD method is far superior in terms of simplicity, regularity, and local com-
`munications. Brent, Luk, and Van Loan have shown how the Jacobi SVD method with
`parallel ordering can be implemented by a two-dimensionalsystolic array [2], [3]. Various
`coordinate rotation digital computer (CORDIC) realizations ofthe SVD array have been
`reported by Cavallaro and Luk [4] and Delosme [5], [6].
`The Jacobi SVD method is based on, as common for all two-sided approaches,
`applying a sequence of two-sided rotations to 2 X 2 submatrices of the original matrix.
`The computational complexity is thus determined by how to compute the two-sided
`rotations. In most previous works, a two-sided rotation is evaluated in a straightforward
`manner by four plane rotations, where two of them are applied from left to the two
`column vectors of the 2 X 2 submatrix and the other ones are applied from right to the
`row vectors, respectively. In the diagonal processing elements (PEs), additional operations
`for calculating rotation angles are required. This leads to an inhomogeneousarray ar-
`chitecture containing twodifferent types of PEs.
`In this paper, we develop a two-plane rotation (TPR) method for computing two-
`sided rotations. We show that the above computational complexity can be reduced sig-
`nificantly because each two-sided rotation can be evaluated by only two plane rotations
`and a few additions. Moreover, the SVD array given by Brent and Luk becomes ho-
`mogeneouswith identical diagonal and off-diagonal PEs when CORDICprocessors are
`
`* Received by the editors September 28, 1989; accepted for publication (in revised form) August 2, 1990.
`+ DepartmentofElectrical Engineering, Ruhr-Universitat Bochum, 4630 Bochum, Germany.
`713
`
`1
`
`HUAWEI] 1008
`
`1
`
`HUAWEI 1008
`
`

`

`
`
`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`(2) Musi =ULMV,—(k=0,1,2, +++).Mo=M,
`
`
`
`
`
`
`
`714
`
`B. YANG AND J. F. BOHME
`
`used. In a recent work [6], Delosme has also indicated this possibility in connection
`with “rough rotations” independently. He has taken, however, a different approach that
`is based on encoding the rotation angles. He hasstill required four plane rotations on
`the off-diagonal PEs while diagonal and off-diagonal operations can be overlapped.
`Ourpaperis organized as follows. In § 2, we briefly reexamine Jacobi’s SVD method
`and Brent and Luk’s SVD array. Then, we develop the TPR methodin § 3. The CORDIC
`algorithm is described in § 4, where in particular CORDICscaling correction techniques
`are discussed and examples of scaling-corrected CORDIC sequencesare given. In § 5, a
`unified CORDIC SVD module for all PEs of the SVD array is presented. This module
`is compared to those proposed by Cavallaro, Luk, and Delosmein § 6. Finally, we stress
`the applicability of the TPR method to several other problems.
`
`2. Jacobi SVD method. In this paper, we considerreal, square, and nonsymmetric
`matrices. Let Af ¢ R***™ be a matrix of dimension N. The SVDis given by
`
`(1)
`
`M=UEV’,
`
`where U € R“*™ and V € R”™ are orthogonal matrices containing the left and right
`singular vectors, and 2 € R*** is a diagonal matrix of singular values, respectively. The
`superscript 7 denotes matrix transpose. Based on an extension of the Jacobi eigenvalue
`algorithm [7], Kogbetliantz [8] and Forsythe and Henrici [9] proposed to diagonalize
`M by a sequence of two-sided rotations,
`
`
`
`U, and V;, describe two rotations in the (7, /)-plane (1 S i<j SN), where the rotation
`angles are chosen to annihilate the elements of A4, at the positions (i, j) and (j, i).
`Usually, several sweeps are necessary to complete the SVD, where a sweep is a sequence
`of N(N — 1)/2 two-sided rotations according to a special ordering of the N(N — 1)/2
`different index pairs (i, /).
`For sequential computing on a uniprocessor system, possibly the most frequently
`used orderings are the cyclic orderings, namely, the cyclic row ordering
`
`(3)
`
`(i,j) = (1,2),C1, 3), +++ 5C1N)(2,3), 00° (2,N), 0 ONT)
`
`or the equivalent cyclic column ordering. Sameh [10] and Schwiegelshohn and Thiele
`[11] have shown how to implement the cyclic row ordering on a ring-connected or a
`mesh-connected processor array. Recently, a variety of parallel orderings have been de-
`veloped. Luk and Park [12] have shownthatthese parallel orderings are essentially equiv-
`alent to the cyclic orderings and thus share the same convergence properties.
`Brent and Luk have suggested a particular parallel ordering and developed a square
`systolic array consisting of [N/21X [N/21PEsfor implementing the Jacobi SVD method
`(Fig. 1). To dothis, the matrix V is partitioned into 2 * 2 submatrices. Each PE contains
`one submatrix and performs a two-sided rotation
`
`where
`
`(5)
`
`az,
`
`23
`
`a=(% a) and B= (5 0)
`
`ba,
`
`bap
`
`2
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`715
`
`REDUCED SVD COMPUTATIONS AND HOMOGENEOUS SVD ARRAY
`
`
`
`
`
`Fic. 1, The SVD array given by Brent and Luk.
`
`denote the submatrix before and after the two-sided rotation, respectively, and
`
`(6)
`
`Ra) =(
`
`cos@ sind
`
`—sin@ cosdé
`
`describes a plane rotation through the angle @. At first, the diagonal PEs (symbolized by
`a double squarein Fig.
`| ) generate the rotation angles to diagonalize the 2 X 2 submatrices
`(b,. = by, = 0) stored in them. This means that 6; and @, are first calculated from the
`elements of 4 and then relation (4) is used to compute 5;, and 2. We call this the
`generation mode. Then, the rotation angles are sentto all off-diagonal PEs in the following
`way: the angles associated to theleft-side rotations propagate along the rows while the
`angles associated to the right-side rotations propagate along the columns. Once these
`angles are received, the off-diagonal PEs perform the two-sided rotations (4) on their
`stored data. Wecall this the rotation mode. Clearly, if we compute the rotation mode
`straightforwardly, we require four plane rotations. For the generation mode, additional
`operations for calculating @, and @2 are required.
`
`3. TPR method for computing two-sided rotations. In order to develop the TPR
`method for computing two-sided rotations more efficiently, we first discuss the com-
`mutative properties of two special types, the rotation-type and the reflection-type, of
`2 X 2 matrices. We define
`
`(7) | * *)[rver} and wet =|(*
`
`
`~y x
`
`vy 7x
`
`”)
`
` xveR],
`
`The formeris called rotation-type because it has the same matrix structure as a 2 X 2
`plane rotation matrix. Similarly, the latter is called reflection-type because it has the
`same matrix structure as a 2 X 2 Givensreflection matrix [13]. Note that x and » must
`not be normalized to x? + y? = 1. Using the above definitions, the following results can
`be shown by some elementary manipulations.
`LEMMAI, If A, €.4™ and A, € 4", then A, Az = A.A, €.™,
`LEMMA 2. IfA; ¢.d™ and A, ¢ M™, then AjA = ATA, € M™,
`In particular, if we consider two plane rotations, we know the following.
`LEMMA 3. Jf R(6@,) and R(62) are plane rotations described by (6),
`R(0;)R(02) = RO; + 82) and R(0;)"R(02) = R(82 — 0,).
`Now, we give a theorem describing the rotation mode of the TPR method.
`THEOREM. Ifthe 2 X 2 matrix A and the two rotation angles 0, and 02 are given,
`then the two-sided rotation (4) can be computed by two planerotations, ten additions,
`
`then
`
`3
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`716
`
`B. YANG AND J. F. BOHME
`
`andfour scalings by }:
`
`(8)
`
`(9)
`
`(19)
`(11)
`
`Dy = (22 + a1) /2,
`
`Py = (dx2~— 4))/2,
`
`qi = (421 — @12)/2,
`
`G2 = (doi + @y2)/2,
`
`ry
`
`4)
`
`6_= 62-6),
`Pi
`
`6, =60.+6,,
`ro
`
`P2
`
`-nof2) (eet?)
`ace qi
`In
`(0+)
`d2
`by =ri-h,
`ba =-hth,
`by, =t) + fe,
`bo =P ths.
`
`Proof. Using (8), the matrix A can be reformulated as
`
`Ana tar=(? ele *),
`
`1
`
`Pi
`
`dz
`
`Pz
`
`Clearly, R(6,), R(82) in (4) and A, are elements of .4™while 4, belongs to... This
`leads to the following reformulation of the matrix B by using Lemmas 1-3:
`
`B= R(6,)7AR(02)
`
`= R(6,)"A,R(02)+ R(0;)"A2R(G2)
`
`= R(0,)7R(02)A, + R(9,)7R(G2)7Ao
`
`= R(0.—0,)A, + R(02+6,)7A>
`
`aq
`
`Pi
`
`=R(0("" f+R00)"
`_ ("
`eh e(% *).
`
`ty
`
`ry
`
`bo
`
`oth
`
`—P2 *G2 Pr
`
`This completes the proof.
`The generation mode of the TPR method follows directly from the above theorem.
`COROLLARY. Ifthe 2 * 2 matrix A is given, we can diagonalize A and calculate
`the corresponding rotation angles 6, and 6. by two Cartesian-to-polar coordinates con-
`versions, eight additions, andfour scalings by }:
`
`(12)
`
`(13)
`
`(14)
`
`(15)
`
`DP = (dx + a);)/2,
`
`Px = (@o2— 411)/2,
`
`dQ = (421 — ay2)/2,
`
`qz = (a2, + a)2)/2,
`
`r= sign (p:)Vpi+qi,
`
`6_ = arctan (qi/P1),
`
`r= sign (p2)Vp3+ 43,
`
`6, =arctan (q2/p2),
`
`6, = (6, —6_)/2,
`
`62 = (6,4 6_)/2,
`
`by=rn-n,
`
`by =r) tho.
`
`Proof. Regarding (11), bj. = bz, = 0 is equivalent to 7; = t = 0. Equation (13)
`follows then from (10), This completes the proof.
`In equation (13), we choose the rotation through the smaller angle. All vectors
`lyingin thefirst or the fourth quadrant are rotated onto the positive x-axis, and all vectors
`lying in the second and the third quadrant are rotated onto the negative x-axis. For
`vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation
`
`4
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`REDUCED SVD COMPUTATIONS AND HOMOGENEOUS SVD ARRAY
`
`T17
`
`angles 6_ and 6. satisfy |@_|,
`(16)
`
`|@,| = 90°. This results in
`|,)<90° and
`|@;| $90°,
`
`due to (14).
`Equation (16) is important with respect to the convergence of the Jacobi SVD
`method. Forsythe and Henrici [9] have proven the convergence for cyclic orderingsif
`the rotation angles #, and @, are restricted to a closed interval inside the open interval
`(—90°, 90°). They have also demonstrated that this condition may fail to hold, i.e., 6,
`and @, may be +90°, if the off-diagonal elements },2 and 2, in (5) have to be exactly
`annihilated. As a remedy, they suggested an under- or overrotation by computing the
`two-sided rotation (4) with angles (1 — y)@, and (1 — y)@2(—1 < y < 1) and proved
`its convergence. In practice, however, the finite machine accuracy in the real arithmetic
`allows only an approximative computation of the rotation angles and implies under- or
`overrotations. So the Jacobi SVD method converges without using under- or overrotations
`as shown by the experimental results of Brent, Luk, and Van Loan [3]. In case of CORDIC
`implementations, the effect of implicit under- or overrotations is more apparent. The
`angles +90° can never be exactly calculated because of the limited angle resolution arc-
`tan (2~”) of the CORDIC algorithm, where p denotes the mantissa length.
`
`4, The CORDICalgorithm. In the previous section, we have seen that the main
`operations of the TPR-method are plane rotations and Cartesian-to-polar coordinates
`conversions. These operations can be carried out by multiplier—-adder-based processors
`supported by software or special hardware units. An alternative approach is the use of
`dedicated processors that usually map algorithms more effectively to hardware. The
`CORDICprocessor is such a powerful one for calculating trigonometric functions.
`The CORDICalgorithm wasoriginally designed by Volder [14] as an iterative pro-
`cedure for computing plane rotations and Cartesian-to-polar coordinates conversions.It
`was later generalized and unified by Walther [15], enabling a CORDIC processor to
`calculate more functions, including hyperbolic functions, as well as multiplications and
`divisions. In the following, we consider Volder’s CORDICalgorithm because onlytrig-
`onometric functions are involved in SVD applications.
`The CORDICalgorithm consists of iterative shift-add operations on a three-com-
`ponent vector,
`
`(17) (* ) _ (* - i) __!
`
`Vir
`
`yi + 6;5;x;}
`
`cos (aj)
`
`(
`
`cos (aj) —o; sin eye)
`
`cos (a,)/\
`
`yy)?
`
`\o; sin (a)
`
`(18)
`
`Zi = Zp — €0;0;
`
`(0<6;< ljo,=+lje= 41;7=0,1,---,n— 1),
`
`in which the iteration stepsize 6, is defined by
`
`(19)
`
`§;= tan (a;)=2-,
`
`The set of integers {.S(i)} parametrizing the iterations is called CORDIC sequence.
`Equation (17) can be interpreted, except for a scaling factor of
`
`(20)
`k=
`:
`=V1+6?,
`cos (a;)
`
`as a rotation of(x;, y;)" through the angle a;, where the sign o; = +1 gives the rotation
`direction. After
`iterations, the results are given by
`
`(21)
`
`(22)
`
`(*)=«(Ore eee):
`
`sin @&
`
`cos a/\
`
`Yo
`
`Yn
`
`Zn = Zp — €Q,
`
`5
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`718
`
`B. YANG AND J. F. BOHME
`
`with the overall scaling factor K = [], &; and the total rotation angle a = 2, o;a;. Now,
`if the CORDIC sequence satisfies the following convergence condition
`a-1l
`
`(23)
`
`ai- >) aSan-1
`jriel
`
`(i=0,1, +++ ,n—-2),
`
`we can choose the sign parameter
`(24)
`o;=
`
`—sign (x;;)
`
`sign (ez;)
`
`fory,—>0,

`for z,—> 0
`
`to force y, or Zz, to zero, provided that the input data xy, yo, and Zp lie in the conver-
`gence region
`
`(25)
`
`n-1
`C= > az
`i=0
`
`Jarctan (¥o/Xo)|
`
`fory,—>9,
`
`|Zol
`
`for z,—> 0.
`
`In this way, two different types of CORDIC trigonometric functions can be computed
`(Table 1). In the mode y, —~ 0, the Cartesian coordinate (xo, ¥o) of a plane vectoris
`converted to its polar representation, where the parameter e = +1 determines the sign
`of the phase angle calculated. When z,, —> 0, a given plane vectoris rotated through the
`angle z), where e = +1 controls the rotation direction.
`In Table 1, the principal value | arctan (yo/Xo)| = 90° ofthe inverse tangent function
`is calculated when computing Cartesian-to-polar coordinates conversions. Correspond-
`ingly, x, may be positive or negative according to the sign of xp. So, it is guaranteed that
`a vector is always rotated through the smaller angle onto the x-axis in accordance with
`(13). In this case, a convergence region of C 2 90° is sufficient for the generation mode
`of the two-sided rotation.
`One main drawback of the CORDICalgorithm is the need of correcting the scal-
`ing factor K that arises during the iterations (17). For example, if we use Volder’s
`CORDIC sequence
`
`(26)
`
`{S(i)} = {0,1,2,3,---,p—l,p},
`
`with n = p+ 1 CORDICiterations for a mantissa accuracy of 2~’, the scaling factoris
`K = 1.64676. Compensating this undesired scaling effect with a minimum number of
`computationsis of particular importance.
`Clearly, multiplying x, and y, in Table 1 by K7~' will degrade the algorithm per-
`formance substantially. Most of the scaling correction issues are based on shift-add oper-
`ations. For a two-sided rotation that is implemented by four plane rotations, each matrix
`element undergoes two plane rotations so that the total scaling factor to be corrected
`is K?. In this case, Cavallaro and Luk [16] have pointed out that there is a simple
`systematic approach for scaling correction when using the CORDIC sequence (26).
`They proposed to use [p/41 scalingiterations of the type x < x — 2~7/x with j e J =
`{1, 3, 5, +++, 27.p/41— 1} and oneshift operation 2~', The remaining scaling erroris
`
`TABLE 1
`CORDICtrigonometricfunctions (e = +1).
`
`
`
` Yn > 0 Zn > 0
`
`Xn = K(Xo cos 29 — e¥y sin zy)
`Xn = K sign (x9) Vx6+ yo
`
`Zy = Zo + e arctan (¥o/X) Yn = K(eXq sin Zp + Yo COS Zo)
`
`6
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`REDUCED SVD COMPUTATIONS AND HOMOGENEOUS SVD ARRAY
`
`719
`
`bounded by 27?7',!
`
`(27)
`
`|1-27'[] (1 -277/)- K?
`jeJ
`
`
`
`jed
`
`=|1-2" JJa-2~)-[[ +27) <2771.
`
`P
`
`i=0
`
`This approach, however, fails in the TPR method. Here, each matrix element un-
`dergoes only one planerotation. The scaling factor to be corrected is thus K rather than
`K?. In orderto solve this moredifficult problem, different techniques have been developed
`in theliterature. Haviland and Tuszynski [17] used similar scaling iterations as Cavallaro
`and Luk. Ahmed [18] repeated some CORDICiterations to force K to a power of the
`machine radix. Delosme [19] combined both methods of Haviland, Tuszynski, and
`Ahmed for minimizing the number of computations. Deprettere, Dewilde, and Udo [20]
`suggested the double-shift concept.
`We designed a computer program [21] for a systematic search of CORDIC
`sequences. We allow shifts parameters S(i) (i = 0, 1,--:, m — 1) with differences
`S(i + 1) — S(i) €
`{0, 1, 2} to provide more optimization freedom. For an efficient
`scaling correction, we require that the scaling factor K be corrected by a sequenceof nm,
`shift-add operations,
`
`(28)
`
`2-7) II (1l+(j)27™)-K=1+AK
`f=
`
`(T(j)integers, n(j) = +1).
`
`These additional scaling iterations are parametrized by the set of signed integers
`{7T(0O), n(1)TC1), «+: . n(x) T(m,)}. The total numberofiterations is L = n + my.
`In (28), AK denotes the remainingrelative scaling error after the scaling correction.
`Weemphasize that this is a systematic error with a constant sign, By contrast, the other
`two types of CORDICerrors, the angular error due to the limited angle resolution and
`the rounding error, are of statistical nature because they may be positive or negative.
`Thescaling error is thus morecritical with respect to error accumulation when repeated
`CORDICoperations on the same data have to be computed as in SVD applications.
`Roughly speaking, the total scaling error after k CORDIC function calls increases linearly
`with k, a fact that has been verified by our numerical experiments. For this reason, we
`require | AK| to be much smaller than 27’.
`Wefound catalogues of CORDIC sequences with complexity comparable to those
`of Cavallaro and Luk. In the following, five examples for different mantissa lengths p =
`16, 20, 24, 28, and 32, including the total numberofiterations L = + n;,, the convergence
`region C, and the remaining scaling error AK are given:
`
`p=16:
`
`{S(i)}={0123-++ 1516},
`{n)TU)} = {(142-5494+10}, L=17+4, C= 100°, AK=—2~'60! 2
`
`p=20:
`
`{S(i)}={0123--- 1920},
`
`{n(J)TU)} = {1+2-549+10+16},
`L=21+5,
`Cx100°,
`
`AKx2773%,
`
`' Whenreplacing [p/41 by [(p — 1)/4lorf(p + 1)/41, the upper bound in (27) becomes 2° or 2-*~?,
`respectively.
`? When appending an additional scaling iteration with (5)7T(5) = +16, the scaling accuracy can be
`enhanced to AK = 2773,
`
`7
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`720
`
`p=24:
`
`B. YANG AND J. F. BOHME
`
`{S(i)}={1123345566788910 --- 2324},
`{n({)T()} ={0-24+6},
` L=294+2,
` Cx9l,
`
`AKe—2-3,
`
`p=28:
`
`p=32:
`
`{S(i)}={11233455667889 1011 1213141415 --- 2728},
`{n(j)T(/)} ={0-246},
`L=344+2, Ceol,
`AKH 27353,
`{S(i)}={001333456789910 --- 3132},
`
`{n(j) TG) } = {1-3-8 +16 —25 —27},
`L=36+5,
`Crl45°,
`
`AKw—2739%,
`
`Rememberthat in order to meet the convergence condition (23) and to provide a
`convergence region C = 90°, the minimum number of CORDICiterations is p + 1. So,
`for all CORDIC sequences given above, the number L — (p + 1) of additional iterations
`for scaling corrrection is p/4. Moreover, except for the first CORDIC sequence, the
`remaining scaling error | AK| is significantly smaller than 2~”. This leads to improved
`numerical properties compared with other CORDIC sequences reported in theliterature.
`Wealso remark that if the symmetric eigenvalue problem is considered for which a
`convergence region of C 2 45° is sufficient [2], the total number of CORDICiterations
`can be further reduced. An example thatis nearly identical to the last CORDIC sequence
`given aboveis
`
`p=32:
`
`{S(i)}={1333456789910 --- 3132},
`
`{ nj) T(s)} = {0-3 -8 +16 —25 -27},
`L=34+5,
`Cx 55°,
`
`AK = —273993,
`
`For comparison, Delosme [5] has also given an optimized CORDIC sequence for the
`samesituation. His sequence requires one iteration more ( = 40) and achieves a scaling
`accuracy of AK = 2733/6,
`Wesuspect that similar results can also be obtained by using Deprettere’s double-
`shift concept. However, this method requires a slightly increased hardware complexity
`and will not be discussed in this paper.
`
`5. CORDIC implementation of the SVD PEs. Foreasy illustration, we first introduce
`a CORDICprocessor symbolas shownin Fig. 2. The descriptionsinside the box determine
`uniquely the function mode of the CORDIC processor according to Table 1. The output
`data x and y are assumedto bescaling corrected.
`It is now simple to map the operations (8)—(11) and (12)-(15) of the TPR method
`onto a two CORDICprocessorarchitecture. In Fig. 3, the diagonal PEs of the SVD array
`are implemented by two CORDIC processors and eight adders. The dotted inputs of the
`adders represent negated inputs. Because the diagonal PEs work in the generation mode,
`
`Xo
`
`Yo
`
`Zo
`
`Y
`z 79
`
`e=+1
`
`x
`
`y
`
`Z
`
`Fic. 2. CORDIC symbol,
`
`8
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`REDUCED SVD COMPUTATIONS AND HOMOGENEOUS SVD ARRAY
`
`721
`
`
`
`Fic, 3, CORDIC implementation ofthe diagonal PE of the SVD array.
`
`both CORDICprocessors are driven in the “y — 0” mode for computing Cartesian-to-
`polar coordinates conversions. In Fig. 4, the off-diagonal PEs working in the rotation
`mode are implemented by two CORDICprocessors and ten adders. Here, the CORDIC
`processors are driven in the “‘z ~ 0” mode for performing planerotations.
`Obviously, both CORDIC implementations have nearly the same architecture. All
`PEs of the SVDarray can thus be implemented by one unified CORDIC SVD module
`(Fig. 5) without considerably increased hardware cost. The different computation modes
`of the diagonal and off-diagonal PEsare easily “programmed” by one control bit. The
`resulting SVD array is similar to that in Fig. 1, but homogeneous with identical PEs.
`We remark that Fig. 5 is more a “graphic program” describing the sequence of
`operations to be computed rather than a hardware block diagram. We show inthefol-
`lowing that the 12 adders that are paired into three pre-butterflys and three post-butterflys
`can be integrated into the two CORDIC processors without separate hardware realizations.
`The Jacobi SVD method is a recursive method. Each PE of the SVD array has to exchange
`data with its diagonal neighbors. Because of this data dependency, only recursive CORDIC
`processors can be used here. This is an arithmetic unit consisting of mainly three adders
`and two barrel-shifters. It carries out the L iterations of the CORDIC algorithm in L
`cycles by using data feedback. The two CORDICprocessors contained in one CORDIC
`SVD module require six adders altogether. So, it is natural to modify the CORDIC
`processor architecture slightly and to use the existing six adders for computing both the
`pre-butterfly and the post-butterfly operations. The resulting CORDIC SVD module has
`
`
`
`Fic. 4, CORDIC implementation ofthe off-diagonal PE ofthe SVD array.
`
`9
`
`

`

`rsena
`
`
`
`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`pre-butterflys
`
`post—butterflys
`
`722
`
`B. YANG AND J. F. BOHME
`
`Fic. 5. A unified CORDIC SVD module for implementing all PEs of the SVD array.
`
`the hardware complexity of two recursive CORDIC processors and requires a total com-
`putation time of L + 2 iterations.
`In Fig. 6, the principal architecture of such a two CORDIC processor SVD module
`is shown. The dashed lines and boxes represent the additional hardware components
`
`
`
`
`
`Fic, 6, The principal architecture ofthe unified CORDIC SVD module.
`
`10
`
`10
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`REDUCED SVD COMPUTATIONS AND HOMOGENEOUS SVD ARRAY
`
`723
`
`enabling the CORDIC processors to compute the butterfly operations.It is easily verified
`that the upper four adders devoted to x and y can perform the following types of operations
`2~-'x + 27'y (pre-butterfly), x + y (post-butterfly), x + 2~*y (CORDIC iteration) and
`x + 27~‘x (scaling iteration) while the lower two adders devoted to z can compute 27 !z,
`+ 27'z, (pre-butterfly), z; + z> (post-butterfly), and z + a (CORDICiteration), re-
`spectively. The cross switch between the registers “Y,"’ and ‘‘ X2” is needed to exchange
`data when the CORDIC SVD module switches from the pre-butterfly operations into
`the CORDICiterations or from the CORDICiterations into the post-butterfly operations,
`respectively. Then, we see from Fig. 3 that the output data pairs of the pre-butterflys are
`(P1, P2) and (41, G2), while the desired input data pairs for the CORDICiterations are
`(Pi. G1) and (p2, g2), respectively. So, p. and g, have to be exchanged.
`
`6. Comparisons. We now compare the new CORDIC SVD module with those pro-
`posed by Cavallaro and Luk [4] and Delosme [5]. Let Aggyq and Tava denote the area
`and time complexity ofa CORDIC SVD module and A,oaic ANd TMeordic those of a CORDIC
`processor, respectively. Cavallaro and Luk have shown that their most efficient parallel
`diagonalization method requires Aggya = 2Acordic ANd Tosa = 3Tcordic for the diagonal
`PEs and Aggy & 2Acoraic ANd Tosva & 2 Teoraic for the off-diagonal PEs. By using the TPR
`method, we require Toya = 2Acordic ANG Tesva = Toric for all PEs. In other words, having
`approximately the same hardware complexity, the computation time is reduced by more
`than 50 percent.
`A comparison to Delosme’s method is more difficult because he follows a quite
`different approach. Therefore, only rough performance estimates are given here. In our
`method, we compute the rotation angles explicitly. After these computations have been
`completed in the diagonal PEs, the angles propagate to the off-diagonal ones. We assume
`that the propagation from one PEtoits neighbors takes one cycle Tyyce, the time required
`for computing one CORDIC iteration. This implies local communications without
`broadcasting data. At the beginning of the second propagation cycle, the angles reach
`the diagonal neighbors of the diagonal PEs which complete their computations after
`T.sva. This means that the diagonal PEs have to wait for a delay time Taciay = Teycte +
`Tesva before they can exchange data with their diagonal neighbors.* Thetotal time elapsed
`between two adjacent activities at each PE is thus Toya + Tactay = 2Tesva + Teycie
`2 Teordic
`because Teycie is negligible with respect to Teoraic = L* Teyce-
`Delosme does not compute the rotation angles explicitly. He rather calculates en-
`codingsof the angles,i.e., sequences ofsigns +1, and sends them to the off-diagonal PEs.
`This enables overlap of diagonal and off-diagonal rotations because the encoding signs
`are recursively obtained and become available before the completion of diagonal oper-
`ations. Accordingly, no delay timeis required ( Tyeay = 0), provided that the SVD array
`size (the half of the matrix size) is smaller than the number of CORDICscalingiterations
`n, (for details, see [5]). The drawback is, however, that the TPR method cannot be
`applied to the off-diagonal PEs. Four plane rotations are hence required, resulting in
`Tevsa = 2Toraic for two CORDIC processors in one module. In other words, the time
`complexities Toya + Taetay of both methods are nearly identical and equal 27yordic. If,
`however, multiple problems are interleaved, the fraction of idle time that is 50 percent
`in our case can be reduced to almost zero. In such a situation, our method provides the
`double speed compared with Delosme’s one.
`
`3 If the propagation time is assumed to be T..,4, we get the well-knownresult Tyga, = 27psy given by Brent
`and Luk [2].
`
`11
`
`11
`
`

`

`
`
`
`
`
`
`Downloaded01/04/13to128.148.252.35.RedistributionsubjecttoSIAMlicenseorcopyright;http://www.siam.org/journals/ojsa.phpsee
`
`
`
`
`
`
`
`
`
`724
`
`B. YANG AND J. F. BOHME
`
`In terms of area complexity, both CORDIC SVD modules contain two CORDIC
`processors. Our module consists of essentially six adders, four barrel shifters, and one
`ROMtable containing # angle values. Delosme’s architecture requires four carry-save
`adders, four adders, and eight barrel shifters. So, as a rough estimate, both SVD modules
`have the same order of area complexities.
`Perhaps the most important advantage of Delosme’s approach is the 2-bit wide
`horizontal and vertical data connections for sending angle encodings serially rather than
`sending the full angle valuesin parallel. The prices are the upper boundof the SVDarray
`size depending on the number of CORDICscalingiterations, the relatively complicated
`timing, and a nonregular CORDICarchitecture design. We also mention that while
`Delosme’s method presumes a CORDIC implementation, the TPR method is applicable
`to other computing architectures.
`
`7. Other applications of the TPR method. Another advantage of the TPR method
`seemsto be the relatively wide range of applications. We indicate some of them in the
`following.
`For the SVD ofa rectangular matrix, a well-known methodis first to triangularize
`the matrix by QR decomposition and then to apply the Jacobi SVD procedure to the
`triangular factor. Luk [22] has shownthat both steps can be implemented by onetriangular
`systolic array. Each PE contains a 2 X 2 submatrix. It applies two plane rotations (through
`the same angle) to the two column vectors at the QR step and a two-sided rotation at
`the SVD step. For computing the SVD step, the PE can be realized by the CORDIC
`SVD module, as before. On the other side, the two CORDIC processors contained in
`the module are also appropriate to perform the two-plane rotations of the QR step. The
`CORDIC SVD module presented in this paper thus provides a suitable PE for Luk’s
`triangular SVD array.
`Stewart [23] has proposed a square systolic array for computing the Schur decom-
`position (SD) of a non-Hermitian matrix which, for example, is useful for evaluating
`the eigenvalues of the matrix. His approachis similar to the Jacobi SVD method. It is
`based on applying a sequence of two-sided rotations to 2 X 2 submatrices, where the left
`and right rotation angles are identical to make the diagonal submatrices uppertriangular.
`While the diagonal PEs perform operationsdifferent from those in SVD,the off-diagonal
`PEs have exactly the same computational task as in SVD computing. Therefore, the
`CORDIC SVD module canalso be used in Stewart’s SD array.
`Even in sequential computations on a uniprocessor system, onecanstill apply the
`TPR method to reduce the computational complexity of two-sided rotations.
`
`8. Conclusion. We have investigated a novel algorithm for computing two-sided
`rotations requiring only two plane rotations and a few additions. This results in signifi-
`cantly red

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket