Downloaded 01/04/13 to 128.148.252.35. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
SIAM J. MATRIX ANAL. APPL.
Vol. 12, No. 4, pp. 713-725, October 1991
© 1991 Society for Industrial and Applied Mathematics
009

REDUCING THE COMPUTATIONS OF THE SINGULAR VALUE DECOMPOSITION ARRAY GIVEN BY BRENT AND LUK*

B. YANG† AND J. F. BÖHME†
Abstract. A new, efficient, two-plane rotation (TPR) method for computing two-sided rotations involved in singular value decomposition (SVD) is presented. It is shown that a two-sided rotation can be evaluated by only two plane rotations and a few additions. This leads to significantly reduced computations. Moreover, if coordinate rotation digital computer (CORDIC) processors are used for realizing the processing elements (PEs) of the SVD array given by Brent and Luk, the computational overhead of the diagonal PEs due to angle calculations can be avoided. The resulting SVD array has a homogeneous structure with identical diagonal and off-diagonal PEs. Similar results can also be obtained if the TPR method is applied to Luk's triangular SVD array and to Stewart's Schur decomposition array.
Key words. singular value decomposition, systolic arrays, CORDIC, two-sided rotations, VLSI

AMS(MOS) subject classification. 15A18
1. Introduction. One important problem in linear algebra and digital signal processing is the singular value decomposition (SVD). Typical applications arise in beamforming and direction finding, spectrum analysis, digital image processing, etc. [1]. Recently, there has been a massive interest in parallel architectures for computing the SVD because of the high computational complexity of the SVD, the growing importance of real-time signal processing, and the rapid advances in very large scale integration (VLSI) that make low-cost, high-density, and fast processing and memory devices available.

There are different numerically stable methods for computing complete singular value and singular vector systems of dense matrices, for example, the Jacobi SVD method, the QR method, and the one-sided Hestenes method. For parallel implementations, the Jacobi SVD method is far superior in terms of simplicity, regularity, and local communications. Brent, Luk, and Van Loan have shown how the Jacobi SVD method with parallel ordering can be implemented by a two-dimensional systolic array [2], [3]. Various coordinate rotation digital computer (CORDIC) realizations of the SVD array have been reported by Cavallaro and Luk [4] and Delosme [5], [6].

The Jacobi SVD method is based on, as is common for all two-sided approaches, applying a sequence of two-sided rotations to $2 \times 2$ submatrices of the original matrix. The computational complexity is thus determined by how the two-sided rotations are computed. In most previous works, a two-sided rotation is evaluated in a straightforward manner by four plane rotations, where two of them are applied from the left to the two column vectors of the $2 \times 2$ submatrix and the other two are applied from the right to the row vectors, respectively. In the diagonal processing elements (PEs), additional operations for calculating rotation angles are required. This leads to an inhomogeneous array architecture containing two different types of PEs.

In this paper, we develop a two-plane rotation (TPR) method for computing two-sided rotations. We show that the above computational complexity can be reduced significantly because each two-sided rotation can be evaluated by only two plane rotations and a few additions. Moreover, the SVD array given by Brent and Luk becomes homogeneous with identical diagonal and off-diagonal PEs when CORDIC processors are
* Received by the editors September 28, 1989; accepted for publication (in revised form) August 2, 1990.
† Department of Electrical Engineering, Ruhr-Universität Bochum, 4630 Bochum, Germany.
used. In a recent work [6], Delosme has also indicated this possibility in connection with "rough rotations" independently. He has taken, however, a different approach that is based on encoding the rotation angles. He has still required four plane rotations in the off-diagonal PEs, while diagonal and off-diagonal operations can be overlapped.

Our paper is organized as follows. In § 2, we briefly reexamine Jacobi's SVD method and Brent and Luk's SVD array. Then, we develop the TPR method in § 3. The CORDIC algorithm is described in § 4, where in particular CORDIC scaling correction techniques are discussed and examples of scaling-corrected CORDIC sequences are given. In § 5, a unified CORDIC SVD module for all PEs of the SVD array is presented. This module is compared to those proposed by Cavallaro, Luk, and Delosme in § 6. Finally, we stress the applicability of the TPR method to several other problems.
2. Jacobi SVD method. In this paper, we consider real, square, and nonsymmetric matrices. Let $M \in \mathbb{R}^{N \times N}$ be a matrix of dimension $N$. The SVD is given by

(1)    $M = U \Sigma V^T,$

where $U \in \mathbb{R}^{N \times N}$ and $V \in \mathbb{R}^{N \times N}$ are orthogonal matrices containing the left and right singular vectors, and $\Sigma \in \mathbb{R}^{N \times N}$ is a diagonal matrix of singular values, respectively. The superscript $T$ denotes matrix transpose. Based on an extension of the Jacobi eigenvalue algorithm [7], Kogbetliantz [8] and Forsythe and Henrici [9] proposed to diagonalize $M$ by a sequence of two-sided rotations,

(2)    $M_{k+1} = U_k^T M_k V_k \quad (k = 0, 1, 2, \dots), \qquad M_0 = M.$

$U_k$ and $V_k$ describe two rotations in the $(i, j)$-plane $(1 \le i < j \le N)$, where the rotation angles are chosen to annihilate the elements of $M_k$ at the positions $(i, j)$ and $(j, i)$. Usually, several sweeps are necessary to complete the SVD, where a sweep is a sequence of $N(N-1)/2$ two-sided rotations according to a special ordering of the $N(N-1)/2$ different index pairs $(i, j)$.
For sequential computing on a uniprocessor system, possibly the most frequently used orderings are the cyclic orderings, namely, the cyclic row ordering

(3)    $(i, j) = (1,2), (1,3), \dots, (1,N), (2,3), \dots, (2,N), \dots, (N-1,N)$

or the equivalent cyclic column ordering. Sameh [10] and Schwiegelshohn and Thiele [11] have shown how to implement the cyclic row ordering on a ring-connected or a mesh-connected processor array. Recently, a variety of parallel orderings have been developed. Luk and Park [12] have shown that these parallel orderings are essentially equivalent to the cyclic orderings and thus share the same convergence properties.
Brent and Luk have suggested a particular parallel ordering and developed a square systolic array consisting of $\lceil N/2 \rceil \times \lceil N/2 \rceil$ PEs for implementing the Jacobi SVD method (Fig. 1). To do this, the matrix $M$ is partitioned into $2 \times 2$ submatrices. Each PE contains one submatrix and performs a two-sided rotation

(4)    $B = R(\theta_1)^T A R(\theta_2),$

where

(5)    $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ and $B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$
FIG. 1. The SVD array given by Brent and Luk.
denote the submatrix before and after the two-sided rotation, respectively, and

(6)    $R(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$

describes a plane rotation through the angle $\theta$. At first, the diagonal PEs (symbolized by a double square in Fig. 1) generate the rotation angles to diagonalize the $2 \times 2$ submatrices ($b_{12} = b_{21} = 0$) stored in them. This means that $\theta_1$ and $\theta_2$ are first calculated from the elements of $A$ and then relation (4) is used to compute $b_{11}$ and $b_{22}$. We call this the generation mode. Then, the rotation angles are sent to all off-diagonal PEs in the following way: the angles associated with the left-side rotations propagate along the rows while the angles associated with the right-side rotations propagate along the columns. Once these angles are received, the off-diagonal PEs perform the two-sided rotations (4) on their stored data. We call this the rotation mode. Clearly, if we compute the rotation mode straightforwardly, we require four plane rotations. For the generation mode, additional operations for calculating $\theta_1$ and $\theta_2$ are required.
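To fix ideas, the straightforward evaluation of (4) can be sketched as follows (our own illustrative code, not from the paper; NumPy is assumed):

```python
import numpy as np

def plane_rotation(theta):
    """Plane rotation R(theta) as defined in (6)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def two_sided_rotation(A, theta1, theta2):
    """Straightforward rotation mode: B = R(theta1)^T A R(theta2).
    The left factor rotates the two column vectors of A and the right
    factor rotates the two row vectors: four plane rotations of
    2-vectors in total."""
    return plane_rotation(theta1).T @ A @ plane_rotation(theta2)
```

Since both factors are orthogonal, the Frobenius norm of the submatrix is preserved, and the transformation is undone by applying the negated angles.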
3. TPR method for computing two-sided rotations. In order to develop the TPR method for computing two-sided rotations more efficiently, we first discuss the commutative properties of two special types, the rotation-type and the reflection-type, of $2 \times 2$ matrices. We define

(7)    $\mathcal{M}^{rot} = \left\{ \begin{pmatrix} x & y \\ -y & x \end{pmatrix} : x, y \in \mathbb{R} \right\}$ and $\mathcal{M}^{ref} = \left\{ \begin{pmatrix} x & y \\ y & -x \end{pmatrix} : x, y \in \mathbb{R} \right\}.$

The former is called rotation-type because it has the same matrix structure as a $2 \times 2$ plane rotation matrix. Similarly, the latter is called reflection-type because it has the same matrix structure as a $2 \times 2$ Givens reflection matrix [13]. Note that $x$ and $y$ need not be normalized to $x^2 + y^2 = 1$. Using the above definitions, the following results can be shown by some elementary manipulations.

LEMMA 1. If $A_1 \in \mathcal{M}^{rot}$ and $A_2 \in \mathcal{M}^{rot}$, then $A_1 A_2 = A_2 A_1 \in \mathcal{M}^{rot}$.

LEMMA 2. If $A_1 \in \mathcal{M}^{rot}$ and $A_2 \in \mathcal{M}^{ref}$, then $A_1 A_2 = A_2 A_1^T \in \mathcal{M}^{ref}$.

In particular, if we consider two plane rotations, we know the following.

LEMMA 3. If $R(\theta_1)$ and $R(\theta_2)$ are plane rotations described by (6), then $R(\theta_1) R(\theta_2) = R(\theta_1 + \theta_2)$ and $R(\theta_1)^T R(\theta_2) = R(\theta_2 - \theta_1)$.
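The lemmas are easy to check numerically; a small sketch (our own illustration, with the matrix types of (7)):

```python
import numpy as np

def rot_type(x, y):
    """Rotation-type matrix from (7)."""
    return np.array([[x, y], [-y, x]])

def ref_type(x, y):
    """Reflection-type matrix from (7)."""
    return np.array([[x, y], [y, -x]])

A1 = rot_type(0.3, -1.2)
A2 = rot_type(2.0, 0.7)
H = ref_type(-0.5, 1.4)

# Lemma 1: rotation-type matrices commute, and the product stays rotation-type.
assert np.allclose(A1 @ A2, A2 @ A1)
# Lemma 2: a rotation-type factor passes through a reflection-type one transposed.
assert np.allclose(A1 @ H, H @ A1.T)
```

Note that the entries are deliberately not normalized, in line with the remark after (7).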
Now, we give a theorem describing the rotation mode of the TPR method.

THEOREM. If the $2 \times 2$ matrix $A$ and the two rotation angles $\theta_1$ and $\theta_2$ are given, then the two-sided rotation (4) can be computed by two plane rotations, ten additions,
and four scalings by $\frac{1}{2}$:

(8)    $p_1 = (a_{22} + a_{11})/2, \quad p_2 = (a_{22} - a_{11})/2, \quad q_1 = (a_{21} - a_{12})/2, \quad q_2 = (a_{21} + a_{12})/2,$

(9)    $\theta_- = \theta_2 - \theta_1, \qquad \theta_+ = \theta_2 + \theta_1,$

(10)    $\begin{pmatrix} r_1 \\ t_1 \end{pmatrix} = R(\theta_-) \begin{pmatrix} p_1 \\ q_1 \end{pmatrix}, \qquad \begin{pmatrix} r_2 \\ t_2 \end{pmatrix} = R(\theta_+) \begin{pmatrix} p_2 \\ q_2 \end{pmatrix},$

(11)    $b_{11} = r_1 - r_2, \quad b_{12} = -t_1 + t_2, \quad b_{21} = t_1 + t_2, \quad b_{22} = r_1 + r_2.$
Proof. Using (8), the matrix $A$ can be reformulated as

$A = A_1 + A_2 = \begin{pmatrix} p_1 & -q_1 \\ q_1 & p_1 \end{pmatrix} + \begin{pmatrix} -p_2 & q_2 \\ q_2 & p_2 \end{pmatrix}.$

Clearly, $R(\theta_1)$, $R(\theta_2)$ in (4) and $A_1$ are elements of $\mathcal{M}^{rot}$ while $A_2$ belongs to $\mathcal{M}^{ref}$. This leads to the following reformulation of the matrix $B$ by using Lemmas 1-3:

$B = R(\theta_1)^T A R(\theta_2)$
$\phantom{B} = R(\theta_1)^T A_1 R(\theta_2) + R(\theta_1)^T A_2 R(\theta_2)$
$\phantom{B} = R(\theta_1)^T R(\theta_2) A_1 + R(\theta_1)^T R(\theta_2)^T A_2$
$\phantom{B} = R(\theta_2 - \theta_1) A_1 + R(\theta_2 + \theta_1)^T A_2$
$\phantom{B} = R(\theta_-) \begin{pmatrix} p_1 & -q_1 \\ q_1 & p_1 \end{pmatrix} + R(\theta_+)^T \begin{pmatrix} -p_2 & q_2 \\ q_2 & p_2 \end{pmatrix}$
$\phantom{B} = \begin{pmatrix} r_1 & -t_1 \\ t_1 & r_1 \end{pmatrix} + \begin{pmatrix} -r_2 & t_2 \\ t_2 & r_2 \end{pmatrix} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}.$

This completes the proof.
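In code, the rotation mode reads as follows (our own NumPy sketch of (8)-(11), not from the paper; it can be checked against the direct four-rotation evaluation of (4)):

```python
import numpy as np

def plane_rotation(theta):
    """Plane rotation R(theta) as defined in (6)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def tpr_rotation_mode(A, theta1, theta2):
    """TPR rotation mode, eqs. (8)-(11): two plane rotations of 2-vectors,
    ten additions, and four scalings by 1/2."""
    (a11, a12), (a21, a22) = A
    p1, p2 = (a22 + a11) / 2, (a22 - a11) / 2        # (8)
    q1, q2 = (a21 - a12) / 2, (a21 + a12) / 2
    tm, tp = theta2 - theta1, theta2 + theta1        # (9)
    r1, t1 = plane_rotation(tm) @ np.array([p1, q1])  # (10)
    r2, t2 = plane_rotation(tp) @ np.array([p2, q2])
    return np.array([[r1 - r2, -t1 + t2],            # (11)
                     [t1 + t2, r1 + r2]])
```

The two matrix-vector products above are the only plane rotations; everything else is additions and halvings, as the theorem states.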
The generation mode of the TPR method follows directly from the above theorem.

COROLLARY. If the $2 \times 2$ matrix $A$ is given, we can diagonalize $A$ and calculate the corresponding rotation angles $\theta_1$ and $\theta_2$ by two Cartesian-to-polar coordinates conversions, eight additions, and four scalings by $\frac{1}{2}$:

(12)    $p_1 = (a_{22} + a_{11})/2, \quad p_2 = (a_{22} - a_{11})/2, \quad q_1 = (a_{21} - a_{12})/2, \quad q_2 = (a_{21} + a_{12})/2,$

(13)    $r_1 = \operatorname{sign}(p_1)\sqrt{p_1^2 + q_1^2}, \quad r_2 = \operatorname{sign}(p_2)\sqrt{p_2^2 + q_2^2}, \quad \theta_- = \arctan(q_1/p_1), \quad \theta_+ = \arctan(q_2/p_2),$

(14)    $\theta_1 = (\theta_+ - \theta_-)/2, \qquad \theta_2 = (\theta_+ + \theta_-)/2,$

(15)    $b_{11} = r_1 - r_2, \qquad b_{22} = r_1 + r_2.$

Proof. Regarding (11), $b_{12} = b_{21} = 0$ is equivalent to $t_1 = t_2 = 0$. Equation (13) follows then from (10). This completes the proof.
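The generation mode can be sketched similarly (our own illustration; we assume $p_1, p_2 \neq 0$ so that the arctangents and signs in (13) are well defined):

```python
import numpy as np

def plane_rotation(theta):
    """Plane rotation R(theta) as defined in (6)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def tpr_generation_mode(A):
    """TPR generation mode, eqs. (12)-(15): diagonalize a 2x2 matrix A and
    return (B, theta1, theta2) such that R(theta1)^T A R(theta2) = B."""
    (a11, a12), (a21, a22) = A
    p1, p2 = (a22 + a11) / 2, (a22 - a11) / 2      # (12)
    q1, q2 = (a21 - a12) / 2, (a21 + a12) / 2
    r1 = np.sign(p1) * np.hypot(p1, q1)            # (13), assumes p1, p2 != 0
    r2 = np.sign(p2) * np.hypot(p2, q2)
    tm, tp = np.arctan(q1 / p1), np.arctan(q2 / p2)
    theta1, theta2 = (tp - tm) / 2, (tp + tm) / 2  # (14)
    B = np.diag([r1 - r2, r1 + r2])                # (15)
    return B, theta1, theta2
```

The diagonal entries $r_1 \mp r_2$ are signed singular values of $A$; applying (4) with the returned angles annihilates the off-diagonal elements.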
In equation (13), we choose the rotation through the smaller angle. All vectors lying in the first or the fourth quadrant are rotated onto the positive x-axis, and all vectors lying in the second and the third quadrant are rotated onto the negative x-axis. For vectors on the y-axis, the rotation direction is arbitrary. Thus, the generated rotation angles $\theta_-$ and $\theta_+$ satisfy $|\theta_-|, |\theta_+| \le 90°$. This results in

(16)    $|\theta_1| \le 90°$ and $|\theta_2| \le 90°,$

due to (14).
Equation (16) is important with respect to the convergence of the Jacobi SVD method. Forsythe and Henrici [9] have proven the convergence for cyclic orderings if the rotation angles $\theta_1$ and $\theta_2$ are restricted to a closed interval inside the open interval $(-90°, 90°)$. They have also demonstrated that this condition may fail to hold, i.e., $\theta_1$ and $\theta_2$ may be $\pm 90°$, if the off-diagonal elements $b_{12}$ and $b_{21}$ in (5) have to be exactly annihilated. As a remedy, they suggested an under- or overrotation by computing the two-sided rotation (4) with angles $(1-\gamma)\theta_1$ and $(1-\gamma)\theta_2$ $(-1 < \gamma < 1)$ and proved its convergence. In practice, however, the finite machine accuracy of the real arithmetic allows only an approximate computation of the rotation angles and implies under- or overrotations. So the Jacobi SVD method converges without using under- or overrotations, as shown by the experimental results of Brent, Luk, and Van Loan [3]. In case of CORDIC implementations, the effect of implicit under- or overrotations is more apparent. The angles $\pm 90°$ can never be exactly calculated because of the limited angle resolution $\arctan(2^{-p})$ of the CORDIC algorithm, where $p$ denotes the mantissa length.
4. The CORDIC algorithm. In the previous section, we have seen that the main operations of the TPR method are plane rotations and Cartesian-to-polar coordinates conversions. These operations can be carried out by multiplier-adder-based processors supported by software or special hardware units. An alternative approach is the use of dedicated processors that usually map algorithms more effectively to hardware. The CORDIC processor is such a powerful one for calculating trigonometric functions.

The CORDIC algorithm was originally designed by Volder [14] as an iterative procedure for computing plane rotations and Cartesian-to-polar coordinates conversions. It was later generalized and unified by Walther [15], enabling a CORDIC processor to calculate more functions, including hyperbolic functions, as well as multiplications and divisions. In the following, we consider Volder's CORDIC algorithm because only trigonometric functions are involved in SVD applications.

The CORDIC algorithm consists of iterative shift-add operations on a three-component vector,
(17)    $\begin{pmatrix} x_{i+1} \\ y_{i+1} \end{pmatrix} = \begin{pmatrix} x_i - \sigma_i \delta_i y_i \\ y_i + \sigma_i \delta_i x_i \end{pmatrix} = \frac{1}{\cos(\alpha_i)} \begin{pmatrix} \cos(\alpha_i) & -\sigma_i \sin(\alpha_i) \\ \sigma_i \sin(\alpha_i) & \cos(\alpha_i) \end{pmatrix} \begin{pmatrix} x_i \\ y_i \end{pmatrix},$

(18)    $z_{i+1} = z_i - \epsilon \sigma_i \alpha_i \qquad (0 < \delta_i \le 1;\ \sigma_i = \pm 1;\ \epsilon = \pm 1;\ i = 0, 1, \dots, n-1),$

in which the iteration stepsize $\delta_i$ is defined by

(19)    $\delta_i = \tan(\alpha_i) = 2^{-S(i)}.$
The set of integers $\{S(i)\}$ parametrizing the iterations is called the CORDIC sequence. Equation (17) can be interpreted, except for a scaling factor of

(20)    $k_i = \frac{1}{\cos(\alpha_i)} = \sqrt{1 + \delta_i^2},$

as a rotation of $(x_i, y_i)^T$ through the angle $\alpha_i$, where the sign $\sigma_i = \pm 1$ gives the rotation direction. After $n$ iterations, the results are given by

(21)    $\begin{pmatrix} x_n \\ y_n \end{pmatrix} = K \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} x_0 \\ y_0 \end{pmatrix},$

(22)    $z_n = z_0 - \epsilon\alpha,$
with the overall scaling factor $K = \prod_i k_i$ and the total rotation angle $\alpha = \sum_i \sigma_i \alpha_i$. Now, if the CORDIC sequence satisfies the following convergence condition

(23)    $\alpha_i - \sum_{j=i+1}^{n-1} \alpha_j \le \alpha_{n-1} \qquad (i = 0, 1, \dots, n-2),$

we can choose the sign parameter

(24)    $\sigma_i = \begin{cases} -\operatorname{sign}(x_i y_i) & \text{for } y_n \to 0, \\ \operatorname{sign}(\epsilon z_i) & \text{for } z_n \to 0 \end{cases}$

to force $y_n$ or $z_n$ to zero, provided that the input data $x_0$, $y_0$, and $z_0$ lie in the convergence region

(25)    $C = \sum_{i=0}^{n-1} \alpha_i \ge \begin{cases} |\arctan(y_0/x_0)| & \text{for } y_n \to 0, \\ |z_0| & \text{for } z_n \to 0. \end{cases}$

In this way, two different types of CORDIC trigonometric functions can be computed (Table 1). In the mode $y_n \to 0$, the Cartesian coordinates $(x_0, y_0)$ of a plane vector are converted to the polar representation, where the parameter $\epsilon = \pm 1$ determines the sign of the phase angle calculated. When $z_n \to 0$, a given plane vector is rotated through the angle $z_0$, where $\epsilon = \pm 1$ controls the rotation direction.
In Table 1, the principal value $|\arctan(y_0/x_0)| \le 90°$ of the inverse tangent function is calculated when computing Cartesian-to-polar coordinates conversions. Correspondingly, $x_n$ may be positive or negative according to the sign of $x_0$. So, it is guaranteed that a vector is always rotated through the smaller angle onto the x-axis in accordance with (13). In this case, a convergence region of $C \ge 90°$ is sufficient for the generation mode of the two-sided rotation.
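For illustration, the iterations (17)-(19) with Volder's sequence $S(i) = i$ can be sketched in floating point as follows (our own code; a hardware CORDIC would use fixed-point shift-add operations instead of the multiplications written here):

```python
import math

def cordic(x, y, z, mode, eps=1, p=32):
    """Volder-type CORDIC, iterations (17)-(19) with S(i) = i, i = 0..p.
    mode='y' forces y_n -> 0 (Cartesian-to-polar conversion);
    mode='z' forces z_n -> 0 (plane rotation).
    x and y come out scaled by K ~ 1.64676 (no scaling correction here)."""
    for i in range(p + 1):
        delta = 2.0 ** (-i)              # (19): delta_i = 2^-S(i)
        alpha = math.atan(delta)
        if mode == 'y':                  # (24); sign of 0 treated as +1
            sigma = -1.0 if x * y > 0 else 1.0
        else:
            sigma = 1.0 if eps * z >= 0 else -1.0
        x, y = x - sigma * delta * y, y + sigma * delta * x  # (17)
        z = z - eps * sigma * alpha                          # (18)
    return x, y, z

K = 1.6467602581  # overall scaling factor for this sequence, cf. (20)
```

In the $y_n \to 0$ mode with $(x_0, y_0) = (3, 4)$ and $z_0 = 0$, this returns $x_n \approx 5K$ and $z_n \approx \arctan(4/3)$, matching Table 1.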
One main drawback of the CORDIC algorithm is the need of correcting the scaling factor $K$ that arises during the iterations (17). For example, if we use Volder's CORDIC sequence

(26)    $\{S(i)\} = \{0, 1, 2, 3, \dots, p-1, p\},$

with $n = p + 1$ CORDIC iterations for a mantissa accuracy of $2^{-p}$, the scaling factor is $K = 1.64676$. Compensating this undesired scaling effect with a minimum number of computations is of particular importance.
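The quoted value of $K$ follows from (20); a quick numerical check (ours):

```python
import math

def scaling_factor(S):
    """Overall CORDIC scaling factor K = prod_i sqrt(1 + 2^(-2 S(i))), cf. (20)."""
    return math.prod(math.sqrt(1.0 + 4.0 ** (-s)) for s in S)

# Volder's sequence (26) for p = 16: S(i) = i, i = 0, ..., 16.
K = scaling_factor(range(17))
```

$K \approx 1.64676$ and is essentially independent of $p$, since the factors for large $S(i)$ are negligibly close to 1.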
Clearly, multiplying $x_n$ and $y_n$ in Table 1 by $K^{-1}$ will degrade the algorithm performance substantially. Most of the scaling correction techniques are based on shift-add operations. For a two-sided rotation that is implemented by four plane rotations, each matrix element undergoes two plane rotations so that the total scaling factor to be corrected is $K^2$. In this case, Cavallaro and Luk [16] have pointed out that there is a simple systematic approach for scaling correction when using the CORDIC sequence (26). They proposed to use $\lceil p/4 \rceil$ scaling iterations of the type $x \leftarrow x - 2^{-j}x$ with $j \in J = \{1, 3, 5, \dots, 2\lceil p/4 \rceil - 1\}$ and one shift operation $2^{-1}$. The remaining scaling error is
TABLE 1
CORDIC trigonometric functions ($\epsilon = \pm 1$).

$y_n \to 0$:  $x_n = K \operatorname{sign}(x_0) \sqrt{x_0^2 + y_0^2}$,  $z_n = z_0 + \epsilon \arctan(y_0/x_0)$
$z_n \to 0$:  $x_n = K(x_0 \cos z_0 - \epsilon y_0 \sin z_0)$,  $y_n = K(\epsilon x_0 \sin z_0 + y_0 \cos z_0)$
bounded by $2^{-p-1}$:¹

(27)    $\left| 1 - 2^{-1} \prod_{j \in J} (1 - 2^{-j}) \cdot K^2 \right| = \left| 1 - 2^{-1} \prod_{j \in J} (1 - 2^{-j}) \prod_{i=0}^{p} (1 + 2^{-2i}) \right| \le 2^{-p-1}.$
This approach, however, fails in the TPR method. Here, each matrix element undergoes only one plane rotation. The scaling factor to be corrected is thus $K$ rather than $K^2$. In order to solve this more difficult problem, different techniques have been developed in the literature. Haviland and Tuszynski [17] used similar scaling iterations as Cavallaro and Luk. Ahmed [18] repeated some CORDIC iterations to force $K$ to a power of the machine radix. Delosme [19] combined the methods of Haviland, Tuszynski, and Ahmed for minimizing the number of computations. Deprettere, Dewilde, and Udo [20] suggested the double-shift concept.

We designed a computer program [21] for a systematic search of CORDIC sequences. We allow shift parameters $S(i)$ $(i = 0, 1, \dots, n-1)$ with differences $S(i+1) - S(i) \in \{0, 1, 2\}$ to provide more optimization freedom. For an efficient scaling correction, we require that the scaling factor $K$ be corrected by a sequence of $n_s$ shift-add operations,
(28)    $2^{-T(0)} \prod_{j=1}^{n_s} \left(1 + \eta(j)\, 2^{-T(j)}\right) \cdot K = 1 + \Delta K \qquad (T(j) \text{ integers},\ \eta(j) = \pm 1).$

These additional scaling iterations are parametrized by the set of signed integers $\{T(0), \eta(1)T(1), \dots, \eta(n_s)T(n_s)\}$. The total number of iterations is $L = n + n_s$.

In (28), $\Delta K$ denotes the remaining relative scaling error after the scaling correction. We emphasize that this is a systematic error with a constant sign. By contrast, the other two types of CORDIC errors, the angular error due to the limited angle resolution and the rounding error, are of statistical nature because they may be positive or negative. The scaling error is thus more critical with respect to error accumulation when repeated CORDIC operations on the same data have to be computed as in SVD applications. Roughly speaking, the total scaling error after $k$ CORDIC function calls increases linearly with $k$, a fact that has been verified by our numerical experiments. For this reason, we require $|\Delta K|$ to be much smaller than $2^{-p}$.

We found catalogues of CORDIC sequences with complexity comparable to those of Cavallaro and Luk. In the following, five examples for different mantissa lengths $p = 16$, 20, 24, 28, and 32, including the total number of iterations $L = n + n_s$, the convergence region $C$, and the remaining scaling error $\Delta K$, are given:
p = 16:
    $\{S(i)\} = \{0\ 1\ 2\ 3\ \cdots\ 15\ 16\}$,
    $\{\eta(j)T(j)\} = \{1\ {+}2\ {-}5\ {+}9\ {+}10\}$,  $L = 17 + 4$,  $C \approx 100°$,  $\Delta K \approx -2^{-16.01}$.²

p = 20:
    $\{S(i)\} = \{0\ 1\ 2\ 3\ \cdots\ 19\ 20\}$,
    $\{\eta(j)T(j)\} = \{1\ {+}2\ {-}5\ {+}9\ {+}10\ {+}16\}$,  $L = 21 + 5$,  $C \approx 100°$,  $\Delta K \approx 2^{-23.0}$.
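Equation (28) can be checked numerically; the following sketch (ours) evaluates it using our reading of the p = 16 entry above:

```python
import math

def scaling_factor(S):
    """K = prod_i sqrt(1 + 2^(-2 S(i))), cf. (20)."""
    return math.prod(math.sqrt(1.0 + 4.0 ** (-s)) for s in S)

def corrected(S, T0, corrections):
    """Left-hand side of (28): 2^(-T(0)) * prod_j (1 + eta(j) 2^(-T(j))) * K.
    `corrections` lists the signed integers eta(j)*T(j)."""
    factor = 2.0 ** (-T0)
    for t in corrections:
        factor *= 1.0 + math.copysign(2.0 ** (-abs(t)), t)
    return factor * scaling_factor(S)

# p = 16 entry: S(i) = 0..16, {T(0), eta(j)T(j)} = {1, +2, -5, +9, +10}.
dK = corrected(range(17), 1, [+2, -5, +9, +10]) - 1.0
```

The residual $\Delta K$ comes out negative and of magnitude close to $2^{-16}$, i.e., the entry is only marginally scaling corrected, which is why the footnoted extra iteration is of interest.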
¹ When replacing $\lceil p/4 \rceil$ by $\lceil (p-1)/4 \rceil$ or $\lceil (p+1)/4 \rceil$, the upper bound in (27) becomes $2^{-p}$ or $2^{-p-2}$, respectively.
² When appending an additional scaling iteration with $\eta(5)T(5) = +16$, the scaling accuracy can be enhanced to $\Delta K \approx 2^{-23.0}$.
p = 24:
    $\{S(i)\} = \{1\ 1\ 2\ 3\ 3\ 4\ 5\ 5\ 6\ 6\ 7\ 8\ 8\ 9\ 10\ \cdots\ 23\ 24\}$,
    $\{\eta(j)T(j)\} = \{0\ {-}2\ {+}6\}$,  $L = 29 + 2$,  $C \approx 91°$,  $\Delta K \approx -2^{-3\dots}$.

p = 28:
    $\{S(i)\} = \{1\ 1\ 2\ 3\ 3\ 4\ 5\ 5\ 6\ 6\ 7\ 8\ 8\ 9\ 10\ 11\ 12\ 13\ 14\ 14\ 15\ \cdots\ 27\ 28\}$,
    $\{\eta(j)T(j)\} = \{0\ {-}2\ {+}6\}$,  $L = 34 + 2$,  $C \approx 91°$,  $\Delta K \approx 2^{-35.3}$.

p = 32:
    $\{S(i)\} = \{0\ 0\ 1\ 3\ 3\ 3\ 4\ 5\ 6\ 7\ 8\ 9\ 9\ 10\ \cdots\ 31\ 32\}$,
    $\{\eta(j)T(j)\} = \{1\ {-}3\ {-}8\ {+}16\ {-}25\ {-}27\}$,  $L = 36 + 5$,  $C \approx 145°$,  $\Delta K \approx -2^{-39.9}$.
Remember that in order to meet the convergence condition (23) and to provide a convergence region $C \ge 90°$, the minimum number of CORDIC iterations is $p + 1$. So, for all CORDIC sequences given above, the number $L - (p + 1)$ of additional iterations for scaling correction is $p/4$. Moreover, except for the first CORDIC sequence, the remaining scaling error $|\Delta K|$ is significantly smaller than $2^{-p}$. This leads to improved numerical properties compared with other CORDIC sequences reported in the literature. We also remark that if the symmetric eigenvalue problem is considered, for which a convergence region of $C \ge 45°$ is sufficient [2], the total number of CORDIC iterations can be further reduced. An example that is nearly identical to the last CORDIC sequence given above is
p = 32:
    $\{S(i)\} = \{1\ 3\ 3\ 3\ 4\ 5\ 6\ 7\ 8\ 9\ 9\ 10\ \cdots\ 31\ 32\}$,
    $\{\eta(j)T(j)\} = \{0\ {-}3\ {-}8\ {+}16\ {-}25\ {-}27\}$,  $L = 34 + 5$,  $C \approx 55°$,  $\Delta K \approx -2^{-39.9}$.
For comparison, Delosme [5] has also given an optimized CORDIC sequence for the same situation. His sequence requires one iteration more ($L = 40$) and achieves a scaling accuracy of $\Delta K \approx 2^{-33.6}$.

We suspect that similar results can also be obtained by using Deprettere's double-shift concept. However, this method requires a slightly increased hardware complexity and will not be discussed in this paper.
5. CORDIC implementation of the SVD PEs. For easy illustration, we first introduce a CORDIC processor symbol as shown in Fig. 2. The descriptions inside the box determine uniquely the function mode of the CORDIC processor according to Table 1. The output data x and y are assumed to be scaling corrected.

It is now simple to map the operations (8)-(11) and (12)-(15) of the TPR method onto a two-CORDIC-processor architecture. In Fig. 3, the diagonal PEs of the SVD array are implemented by two CORDIC processors and eight adders. The dotted inputs of the adders represent negated inputs. Because the diagonal PEs work in the generation mode,
FIG. 2. CORDIC symbol.
FIG. 3. CORDIC implementation of the diagonal PE of the SVD array.
both CORDIC processors are driven in the "y → 0" mode for computing Cartesian-to-polar coordinates conversions. In Fig. 4, the off-diagonal PEs working in the rotation mode are implemented by two CORDIC processors and ten adders. Here, the CORDIC processors are driven in the "z → 0" mode for performing plane rotations.

Obviously, both CORDIC implementations have nearly the same architecture. All PEs of the SVD array can thus be implemented by one unified CORDIC SVD module (Fig. 5) without considerably increased hardware cost. The different computation modes of the diagonal and off-diagonal PEs are easily "programmed" by one control bit. The resulting SVD array is similar to that in Fig. 1, but homogeneous with identical PEs.

We remark that Fig. 5 is more a "graphic program" describing the sequence of operations to be computed than a hardware block diagram. We show in the following that the 12 adders that are paired into three pre-butterflys and three post-butterflys can be integrated into the two CORDIC processors without separate hardware realizations. The Jacobi SVD method is a recursive method. Each PE of the SVD array has to exchange data with its diagonal neighbors. Because of this data dependency, only recursive CORDIC processors can be used here. Such a processor is an arithmetic unit consisting mainly of three adders and two barrel shifters. It carries out the L iterations of the CORDIC algorithm in L cycles by using data feedback. The two CORDIC processors contained in one CORDIC SVD module require six adders altogether. So, it is natural to modify the CORDIC processor architecture slightly and to use the existing six adders for computing both the pre-butterfly and the post-butterfly operations. The resulting CORDIC SVD module has
FIG. 4. CORDIC implementation of the off-diagonal PE of the SVD array.
FIG. 5. A unified CORDIC SVD module for implementing all PEs of the SVD array.
the hardware complexity of two recursive CORDIC processors and requires a total computation time of L + 2 iterations.

In Fig. 6, the principal architecture of such a two-CORDIC-processor SVD module is shown. The dashed lines and boxes represent the additional hardware components
FIG. 6. The principal architecture of the unified CORDIC SVD module.
enabling the CORDIC processors to compute the butterfly operations. It is easily verified that the upper four adders devoted to $x$ and $y$ can perform the following types of operations: $2^{-1}x \pm 2^{-1}y$ (pre-butterfly), $x \pm y$ (post-butterfly), $x \pm 2^{-S(i)}y$ (CORDIC iteration), and $x \pm 2^{-T(j)}x$ (scaling iteration), while the lower two adders devoted to $z$ can compute $2^{-1}z_1 \pm 2^{-1}z_2$ (pre-butterfly), $z_1 \pm z_2$ (post-butterfly), and $z \pm \alpha_i$ (CORDIC iteration), respectively. The cross switch between the registers "$Y_1$" and "$X_2$" is needed to exchange data when the CORDIC SVD module switches from the pre-butterfly operations into the CORDIC iterations or from the CORDIC iterations into the post-butterfly operations, respectively. From Fig. 3 we see that the output data pairs of the pre-butterflys are $(p_1, p_2)$ and $(q_1, q_2)$, while the desired input data pairs for the CORDIC iterations are $(p_1, q_1)$ and $(p_2, q_2)$, respectively. So, $p_2$ and $q_1$ have to be exchanged.
6. Comparisons. We now compare the new CORDIC SVD module with those proposed by Cavallaro and Luk [4] and Delosme [5]. Let $A_{csvd}$ and $T_{csvd}$ denote the area and time complexity of a CORDIC SVD module and $A_{cordic}$ and $T_{cordic}$ those of a CORDIC processor, respectively. Cavallaro and Luk have shown that their most efficient parallel diagonalization method requires $A_{csvd} \approx 2A_{cordic}$ and $T_{csvd} \approx 3T_{cordic}$ for the diagonal PEs and $A_{csvd} \approx 2A_{cordic}$ and $T_{csvd} \approx 2T_{cordic}$ for the off-diagonal PEs. By using the TPR method, we require $A_{csvd} \approx 2A_{cordic}$ and $T_{csvd} \approx T_{cordic}$ for all PEs. In other words, having approximately the same hardware complexity, the computation time is reduced by more than 50 percent.
A comparison to Delosme's method is more difficult because he follows a quite different approach. Therefore, only rough performance estimates are given here. In our method, we compute the rotation angles explicitly. After these computations have been completed in the diagonal PEs, the angles propagate to the off-diagonal ones. We assume that the propagation from one PE to its neighbors takes one cycle $T_{cycle}$, the time required for computing one CORDIC iteration. This implies local communications without broadcasting data. At the beginning of the second propagation cycle, the angles reach the diagonal neighbors of the diagonal PEs, which complete their computations after $T_{csvd}$. This means that the diagonal PEs have to wait for a delay time $T_{delay} = T_{cycle} + T_{csvd}$ before they can exchange data with their diagonal neighbors.³ The total time elapsed between two adjacent activities at each PE is thus $T_{csvd} + T_{delay} = 2T_{csvd} + T_{cycle} \approx 2T_{cordic}$ because $T_{cycle}$ is negligible with respect to $T_{cordic} = L \cdot T_{cycle}$.

Delosme does not compute the rotation angles explicitly. He rather calculates encodings of the angles, i.e., sequences of signs ±1, and sends them to the off-diagonal PEs. This enables overlap of diagonal and off-diagonal rotations because the encoding signs are recursively obtained and become available before the completion of diagonal operations. Accordingly, no delay time is required ($T_{delay} = 0$), provided that the SVD array size (half of the matrix size) is smaller than the number of CORDIC scaling iterations $n_s$ (for details, see [5]). The drawback is, however, that the TPR method cannot be applied to the off-diagonal PEs. Four plane rotations are hence required, resulting in $T_{csvd} \approx 2T_{cordic}$ for two CORDIC processors in one module. In other words, the time complexities $T_{csvd} + T_{delay}$ of both methods are nearly identical and equal $2T_{cordic}$. If, however, multiple problems are interleaved, the fraction of idle time, which is 50 percent in our case, can be reduced to almost zero. In such a situation, our method provides double speed compared with Delosme's.
³ If the propagation time is assumed to be $T_{csvd}$, we get the well-known result $T_{delay} = 2T_{csvd}$ given by Brent and Luk [2].
In terms of area complexity, both CORDIC SVD modules contain two CORDIC processors. Our module consists of essentially six adders, four barrel shifters, and one ROM table containing the $n$ angle values. Delosme's architecture requires four carry-save adders, four adders, and eight barrel shifters. So, as a rough estimate, both SVD modules have the same order of area complexity.

Perhaps the most important advantage of Delosme's approach is the 2-bit wide horizontal and vertical data connections for sending angle encodings serially rather than sending the full angle values in parallel. The prices are the upper bound on the SVD array size depending on the number of CORDIC scaling iterations, the relatively complicated timing, and a nonregular CORDIC architecture design. We also mention that while Delosme's method presumes a CORDIC implementation, the TPR method is applicable to other computing architectures.
7. Other applications of the TPR method. Another advantage of the TPR method seems to be the relatively wide range of applications. We indicate some of them in the following.

For the SVD of a rectangular matrix, a well-known method is first to triangularize the matrix by QR decomposition and then to apply the Jacobi SVD procedure to the triangular factor. Luk [22] has shown that both steps can be implemented by one triangular systolic array. Each PE contains a $2 \times 2$ submatrix. It applies two plane rotations (through the same angle) to the two column vectors at the QR step and a two-sided rotation at the SVD step. For computing the SVD step, the PE can be realized by the CORDIC SVD module, as before. On the other hand, the two CORDIC processors contained in the module are also appropriate to perform the two plane rotations of the QR step. The CORDIC SVD module presented in this paper thus provides a suitable PE for Luk's triangular SVD array.

Stewart [23] has proposed a square systolic array for computing the Schur decomposition (SD) of a non-Hermitian matrix which, for example, is useful for evaluating the eigenvalues