This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

University Microfilms International
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700
800/521-0600
`
Page 1 of 106

SAMSUNG EXHIBIT 1042

Order Number 1346319

Architectural, numerical and implementation issues in the VLSI design of an integrated CORDIC-SVD processor

Kota, Kishore, M.S.

Rice University, 1991

U-M-I
300 N. Zeeb Rd.
Ann Arbor, MI 48106

RICE UNIVERSITY

Architectural, Numerical and Implementation
Issues in the VLSI Design of an Integrated
CORDIC-SVD Processor

by

Kishore Kota

A THESIS SUBMITTED
IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE

Master of Science

APPROVED, THESIS COMMITTEE:

Dr. Joseph R. Cavallaro, Chairman
Assistant Professor
Electrical and Computer Engineering

Dr. John K. Bennett
Assistant Professor
Electrical and Computer Engineering

Dr. Peter Varman
Associate Professor
Electrical and Computer Engineering

Houston, Texas
April, 1991

Architectural, Numerical and Implementation
Issues in the VLSI Design of an Integrated
CORDIC-SVD Processor

Kishore Kota

Abstract

This thesis describes the design of a systolic array for computing the Singular Value Decomposition (SVD) based on the Brent, Luk and Van Loan array. The use of COordinate Rotation DIgital Computer (CORDIC) arithmetic results in an efficient VLSI implementation of the processor that forms the basic unit of the array. A six-chip custom VLSI chip set for the processor was initially designed, fabricated in a 2.0 μm CMOS n-well process, and tested. The CORDIC Array Processor Element (CAPE), a single-chip implementation, incorporates several enhancements based on a detailed error analysis of fixed-point CORDIC. The analysis indicates a need to normalize input values for inverse tangent computations. This scheme was implemented using a novel method that has O(n^2) hardware complexity. Use of previous techniques to implement such a normalization would require O(n^3) hardware. Enhanced architectures, which reduce idle time in the array either through pipelining or by improving on a broadcast technique, are also presented.

Acknowledgments

Special thanks are due to Dr. Joe Cavallaro for being the motivation and a constant driving force behind this work. I am especially thankful to him for being a good friend and just being there when I needed him most. I would also like to thank Dr. John Bennett and Dr. Peter Varman for serving on the committee and having the patience to deal with me.

My friend and constant companion Hemkumar deserves special praise for all those enlightening discussions and valuable comments, which on more than one occasion prevented me from making the wrong decisions. Thanks are due to my roommate, Raghu, for his constant, but often futile, efforts at waking me up. But for him I would have slept through the entire semester without a care. Thanks to the Common Pool for allowing me to work through even the most difficult times without compromising on the food. Special thanks to my office-mates Jay Greenwood and Jim Carson. The office would not be half as lively without them.

Finally, I am eternally indebted to my mother and father. Even when they are halfway across the world, nothing is possible without their blessings.

To my Grandfather,

For his one statement that has been the driving force through the years:

". . . hope for the best, but be prepared for the worst . . ."

Contents

Abstract
Acknowledgments
List of Tables
List of Illustrations

1 Introduction
   1.1 Systolic Arrays for SVD
   1.2 Contributions of the thesis
   1.3 Overview of the thesis

2 SVD Array Architecture
   2.1 Introduction
   2.2 The SVD-Jacobi Method
   2.3 Direct 2-Angle Method
   2.4 Architecture for 2 x 2 SVD

3 CORDIC Techniques
   3.1 Introduction
   3.2 CORDIC Algorithms
   3.3 CORDIC Operation in the Circular Mode
      3.3.1 CORDIC Z-Reduction
      3.3.2 CORDIC Y-Reduction
      3.3.3 Modified Y-Reduction for the SVD processor
   3.4 Scale Factor Correction
   3.5 Error Analysis of CORDIC Iterations
   3.6 Error Analysis of CORDIC Z-Reduction
      3.6.1 Error Introduced by X and Y Iterations
      3.6.2 Error Introduced by Z Iterations
      3.6.3 Effects of perturbations in θ
   3.7 Error Analysis of CORDIC Y-Reduction
      3.7.1 Error Introduced by X and Y Iterations
      3.7.2 Error Introduced by Z Iterations
      3.7.3 Perturbation in the Inverse Tangent
   3.8 Normalization for CORDIC Y-Reduction
      3.8.1 Single Cycle Normalization
      3.8.2 Partial Normalization
      3.8.3 Choice of a Minimal Set of Shifts
      3.8.4 Cost of Partial Normalization Implementation
   3.9 Summary

4 The CORDIC-SVD Processor Architecture
   4.1 Introduction
   4.2 Architecture of the Prototype
      4.2.1 Design of the XY-Chip
      4.2.2 Design of the Z-Chip
      4.2.3 Design of the Intra-Chip
      4.2.4 Design of the Inter-Chip
   4.3 The CORDIC Array Processor Element
   4.4 Issues in Loading the Array
      4.4.1 Data Loading in CAPE

5 Architectures to Improve Processor Utilization
   5.1 Introduction
   5.2 Architecture for Systolic Starting
   5.3 Architecture for fast SVD
   5.4 Architecture for Collection of U and V Matrices

6 VLSI Implementation Issues
   6.1 Design Methodology
   6.2 Reducing Skews in Control Signals
   6.3 Design of an Adder
   6.4 Testing

7 Conclusions
   7.1 Summary
   7.2 Future Work

Bibliography

Tables

2.1 Parallel Ordering
3.1 CORDIC Functionality in the Circular Mode
3.2 Total Shifts Using a Novel Normalization Scheme
3.3 Summary and Comparison of Error Analysis
6.1 A list of some of the problems in the 6-chip prototype
6.2 Breakdown of clock cycles for the various sub-tasks in the 6-chip prototype
6.3 Breakdown of clock cycles for the various sub-tasks in CAPE

Illustrations

2.1 The Brent-Luk-Van Loan SVD Array
2.2 Snapshots of Brent-Luk-Van Loan SVD Array
3.1 Rotation of a vector
3.2 Implementation of the CORDIC Iterations in Hardware
3.3 Rotation of a vector in Y-Reduction
3.4 Error in Y-Reduction Without Normalization
3.5 Algorithm to Invoke Modified CORDIC Iterations
3.6 Implementation of the CORDIC Iterations in Hardware
3.7 Error in Y-Reduction With Partial Normalization
4.1 Block Diagram of the 6-chip Prototype CORDIC-SVD Processor
4.2 Die Photograph of the XY-Chip
4.3 Die Photograph of the Z-Chip
4.4 Block Diagram of the Intra-Chip
4.5 Die Photograph of the Intra-Chip
4.6 Block Diagram of Inter-Chip
4.7 Die Photograph of the Inter-Chip
4.8 Internal Architecture of CAPE
4.9 A Combined DSP-VLSI Array for Robotics
5.1 Interconnection of Control Signals for Systolic Starting
5.2 Interconnection of Data Lines for Systolic Starting
5.3 Snapshots of a Systolic Starting scheme
5.4 Hardware Required for Fast SVD
5.5 Snapshots of an Array Computing U and V
6.1 Plot of the completed Single Chip Implementation
6.2 Clock Generation with Large Skew
6.3 Clock Generation with Small Skew

Chapter 1

Introduction

Rapid advances made in VLSI and WSI technologies have led the way to an entirely different approach to computer design for real-time applications, using special-purpose architectures with custom chips. By migrating some of the highly compute-intensive tasks to special-purpose hardware, performance which is typically in the realm of supercomputers can be obtained at only a fraction of the cost. High-level Computer-Aided Design (CAD) tools for VLSI silicon compilation allow fast prototyping of custom processors by reducing the tedium of VLSI design. Special-purpose architectures are not burdened by the problems associated with general computers and can map the algorithmic needs of a problem to hardware. Extensive parallelism, pipelining and use of special arithmetic techniques tailored for the specific application lead to designs very different from conventional computers.

Systolic and wavefront arrays form a class of special-purpose architectures that hold considerable promise for real-time computations. Systolic arrays are characterized by "multiple use of each data item, extensive concurrency, a few types of simple cells and a simple and regular data and control flow between neighbors" [21]. Systolic arrays typically consist of only a few types of simple processors, allowing easy prototyping. The near-neighbor communication associated with systolic arrays results in short interconnections that allow higher operating speeds even in large arrays. Thus, systolic arrays offer the potential for scaling to very large arrays. Numerical problems
involving large matrices benefit from this property of systolic arrays.

Many real-time signal processing, image processing and robotics applications require fast computations involving large matrices. The high throughput required in these applications can, in general, be obtained only through special-purpose architectures. A computationally complex numerical problem which has evoked a lot of interest is the Singular Value Decomposition (SVD) [15]. It is generally acknowledged that the SVD is the only generally reliable method for determining the rank of a matrix numerically. Solutions to the complex problems encountered in real-time processing, which use the SVD, exhibit a high degree of numerical stability. SVD techniques handle rank deficiency and ill-conditioning of matrices elegantly, obviating the need for special handling of these cases. The SVD is a very useful tool, for example, in analyzing data matrices from sensor arrays for adaptive beamforming [30], and low rank approximations to matrices in image enhancement [2]. The wide variety of applications for the SVD coupled with its computational complexity justify dedicating hardware to this computation.

1.1 Systolic Arrays for SVD

The SVD problem has evoked a lot of attention in the past decade. Numerous architectures and algorithms were proposed for computing the SVD in an efficient manner. The Hestenes' method using Jacobi transformations has been a popular choice among researchers, due to the concurrency it allows in the computation of the SVD. Numerous algorithms and architectures for the SVD were proposed by Luk [24, 26, 25, 23]. Brent, Luk and Van Loan [5] presented an expandable square array of simple processors to compute the SVD. A processor architecture for the Brent, Luk and Van Loan array, using CORDIC techniques, was presented by Cavallaro [8].

1.2 Contributions of the thesis

A detailed study of the numerical accuracy of fixed-point CORDIC modules is provided in this thesis. This analysis indicates a need to normalize the values for CORDIC Y-reduction in order to achieve meaningful results. A novel normalization scheme for fixed-point CORDIC Y-reduction that reduces the hardware complexity is developed.

This thesis is chiefly concerned with the issues relating to the VLSI implementation of the CORDIC-SVD array proposed by Cavallaro. Many of the lower-level architectural issues that are not relevant at the higher levels assume a new significance when trying to implement the array. Elegant solutions have been found for all the problems encountered in the design process.

An integrated VLSI chip to implement the 2 x 2 SVD processor element, which serves as the basic unit of the SVD array, has been designed. A new internal architecture has been developed within the constraints imposed by VLSI.

Numerous array architectures that improve on the previous architectures were developed as part of this research. These schemes can be used in future implementations as improvements to the current design.

1.3 Overview of the thesis

Chapter 2 presents the SVD algorithm and the array proposed by Brent, Luk and Van Loan [5]. A discussion of the CORDIC-SVD processor is also provided. Chapter 3 reviews the CORDIC arithmetic technique and then presents an improved analysis of the numerical accuracy of fixed-point CORDIC. Chapter 4 presents the improvements made to the architecture of the single-chip VLSI implementation of the SVD processor over the various sub-units that constitute the prototype SVD processor. Chapter 5 presents improvements to the array architecture and the impact of these on the processor design. These improvements include schemes for systolic starting and architectures that reduce idle time in the array, to achieve higher throughput.

Chapter 2

SVD Array Architecture

2.1 Introduction

This chapter introduces the principles that govern the efficient computation of the Singular Value Decomposition (SVD) in a parallel array. A major portion of the work in this thesis is based on the Brent, Luk and Van Loan [5] array for computing the SVD. Later sections of this chapter provide insight into the operation of this array.

The SVD has proved to be an important matrix decomposition with theoretical and practical significance. The canonical representation of a matrix provided by the SVD is very useful in determining its properties. The SVD [15] of a p x p matrix A is defined as

    A = U Σ V^T,        (2.1)

where U and V are orthogonal matrices of dimension p x p, and Σ = diag(σ_1, σ_2, ..., σ_p) is a diagonal matrix of singular values. The columns of U and V are called the left and the right singular vectors, respectively.
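As a small numerical illustration of this definition (a pure-Python sketch, not part of the thesis): for a 2 x 2 matrix, the singular values are the square roots of the eigenvalues of the symmetric matrix A^T A, and their product equals |det A|.

```python
import math

def singular_values_2x2(a, b, c, d):
    """Singular values of A = [[a, b], [c, d]]: square roots of the
    eigenvalues of the symmetric matrix S = A^T A, in closed form."""
    s11 = a * a + c * c          # entries of S = A^T A
    s12 = a * b + c * d
    s22 = b * b + d * d
    mean = (s11 + s22) / 2.0     # eigenvalues of [[s11, s12], [s12, s22]]
    radius = math.hypot((s11 - s22) / 2.0, s12)
    return math.sqrt(mean + radius), math.sqrt(mean - radius)

sigma1, sigma2 = singular_values_2x2(4.0, 0.0, 3.0, -5.0)
# sigma1 >= sigma2 >= 0, and sigma1 * sigma2 = |det A| = |4*(-5) - 0*3| = 20.
assert sigma1 >= sigma2 >= 0.0
assert abs(sigma1 * sigma2 - 20.0) < 1e-9
```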
The SVD is a compute-intensive operation requiring O(p^3) time on a sequential computer. For example, the SVD of an 8 x 8 matrix on a Sun 3/60 requires about 1 second of CPU time using LINPACK [12], a library of linear algebra routines. A typical real-time application, the inverse kinematics engine of a robot, requires the computation of the SVD of an 8 x 8 matrix every 400 μsec. This is precisely the class of applications where the use of dedicated arrays is attractive. With a square array of processors it is possible to reduce the time complexity for the SVD to O(p log p).

2.2 The SVD-Jacobi Method

The Jacobi method to compute the SVD is a popular algorithm for parallel implementation, due to the concurrency it introduces to the problem. The algorithm is based on the use of orthogonal transformations, called Jacobi Rotations¹, to selectively reduce a given matrix element to zero. A Jacobi rotation by an angle θ in the ij plane is a p x p matrix, J(i, j, θ), where

    j_kk = 1,        k ≠ i, j,
    j_ii = cos θ,    j_ij = sin θ,        (2.2)
    j_ji = −sin θ,   j_jj = cos θ,

and all other j_kl = 0. Intuitively, a Jacobi rotation is an identity matrix with four of its elements replaced by a vector rotation matrix, as shown below:

    J(i, j, θ) =
        [ 1                               ]
        [    .                            ]
        [      cos θ   ...    sin θ       ]   row i
        [        :      .       :         ]        (2.3)
        [     −sin θ   ...    cos θ       ]   row j
        [                           .     ]
        [                              1  ]

The parameters i and j select the matrix element that is to be reduced to zero, while the parameter θ is computed using a few matrix elements. Since very few matrix elements are required to compute the entire transformation, local computation is possible by storing all the required data in the registers of a single processor. In addition, pre-multiplication with a Jacobi rotation matrix, called a left-sided rotation, modifies only the i-th and j-th rows, while a right-sided rotation affects only the i-th and j-th columns. This introduces concurrency, since computations which use the unmodified data elements can be performed in parallel.

¹also called Givens Rotations or simply plane rotations
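The structure of J(i, j, θ) of equation 2.2 is easy to check in a few lines (an illustrative pure-Python sketch, not an implementation from the thesis):

```python
import math

def jacobi_rotation(p, i, j, theta):
    """Build the p x p Jacobi rotation J(i, j, theta) as nested lists."""
    J = [[1.0 if r == c else 0.0 for c in range(p)] for r in range(p)]
    J[i][i] = math.cos(theta)
    J[i][j] = math.sin(theta)
    J[j][i] = -math.sin(theta)
    J[j][j] = math.cos(theta)
    return J

def matmul(A, B):
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

# J is orthogonal (J^T J = I), so it is a valid plane rotation.
J = jacobi_rotation(4, 1, 3, 0.7)
Jt = [list(row) for row in zip(*J)]          # transpose
I = matmul(Jt, J)
assert all(abs(I[r][c] - (1.0 if r == c else 0.0)) < 1e-12
           for r in range(4) for c in range(4))
```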
The SVD algorithm consists of a sequence of orthogonal² Jacobi transformations, chosen to diagonalize the given matrix. The algorithm to compute the SVD is iterative and can be written as

    A_0 = A,        (2.4)
    A_{k+1} = U_k^T A_k V_k = J^T(i, j, θ_l) A_k J(i, j, θ_r).        (2.5)

Appropriate choice of the 2-sided rotations can force A_k to converge to the diagonal matrix of singular values Σ. The matrix of left singular vectors is obtained as a product of the left rotations U_k, while the matrix of right singular vectors is obtained as a product of the right-sided rotations. At each iteration k, a left rotation by θ_l and a right rotation by θ_r in the plane ij are chosen to reduce the two off-diagonal elements a_ij and a_ji of A_k to zero. A sweep consists of n(n − 1)/2 iterations that reduce every pair of off-diagonal elements to zero. Several sweeps are required to actually reduce the initial matrix to a diagonal matrix. The sequence in which the off-diagonal pairs are annihilated within each sweep is chosen a priori according to some fixed ordering. Forsythe and Henrici [14] proved that the algorithm converges for the cyclic-by-rows and cyclic-by-columns orderings. However, these orderings do not allow decoupling of successive iterations, thereby restricting the parallelism achievable. Brent and Luk [4] introduced the parallel ordering that does not suffer from this drawback.

²A matrix is orthogonal when its transpose equals the inverse

    (1,2)  (3,4)  (5,6)  (7,8)
    (1,4)  (2,6)  (3,8)  (5,7)
    (1,6)  (4,8)  (2,7)  (3,5)
    (1,8)  (6,7)  (4,5)  (2,3)
    (1,7)  (8,5)  (6,3)  (4,2)
    (1,5)  (7,3)  (8,2)  (6,4)
    (1,3)  (5,2)  (7,4)  (8,6)

Table 2.1: Parallel Ordering data exchange with 8 elements. This is the ordering used when eight chess players have to play a round-robin tournament, each player playing one game a day.

An iteration that annihilates a_12 and a_21 does not affect any of the terms required for annihilating a_34 and a_43, allowing the two sets of rotation angles to be computed concurrently. Thus parallel ordering allows decoupling of the computation of the Jacobi transformations of all the iterations in a single row of the ordering (Table 2.1). Brent, Luk and Van Loan [5] proposed a square array of processors based on this scheme. The ordering has an additional property that it requires only near-neighbor communication in an array and is hence amenable to systolic implementation.
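The parallel ordering is exactly a round-robin tournament schedule, and one standard way to generate such a schedule, the "circle method", fixes element 1 and rotates the rest. The sketch below is illustrative and not from the thesis; its rounds cover the same pairs as Table 2.1, though not necessarily in the same printed order.

```python
def parallel_ordering(p):
    """Round-robin schedule for even p: p - 1 rounds, each pairing all
    p elements, with every unordered pair appearing exactly once."""
    assert p % 2 == 0
    others = list(range(2, p + 1))   # elements 2..p rotate; element 1 stays fixed
    rounds = []
    for _ in range(p - 1):
        row = [1] + others
        rounds.append([(row[k], row[p - 1 - k]) for k in range(p // 2)])
        others = [others[-1]] + others[:-1]   # rotate by one position
    return rounds

rounds = parallel_ordering(8)
seen = {frozenset(pair) for rnd in rounds for pair in rnd}
assert len(rounds) == 7 and all(len(rnd) == 4 for rnd in rounds)
assert len(seen) == 8 * 7 // 2   # all 28 pairs of 8 elements occur once
```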
2.3 Direct 2-Angle Method

The Brent, Luk and Van Loan systolic array [5] consists of a p/2 x p/2 array of processors (Figure 2.1) to compute the SVD of a p x p matrix. The matrix dimension, p, is assumed to be even. Each processor stores a 2 x 2 submatrix and has the ability to compute the SVD of a 2 x 2 matrix. The special structure of the Jacobi rotation matrix allows computing the angles for a p x p two-sided rotation in terms of a basic 2 x 2 rotation. The application of the transformation on a p x p matrix can also be expressed as a set of 2 x 2 transformations. Thus the 2 x 2 SVD forms the basic step for the p x p SVD.

Figure 2.1: The Brent-Luk-Van Loan SVD Array for Computing the SVD of an 8 x 8 Matrix

A 2 x 2 SVD can be described as

    R(θ_l)^T [ a  b ] R(θ_r) = [ ψ_1   0  ],        (2.6)
             [ c  d ]          [  0   ψ_2 ]

where θ_l and θ_r are the left and right rotation angles, respectively. The rotation matrix is

    R(θ) = [  cos θ   sin θ ],        (2.7)
           [ −sin θ   cos θ ]

and the input matrix is

    M = [ a  b ].
        [ c  d ]

The angles θ_l and θ_r necessary to diagonalize the matrix M can be shown to be the inverse tangents of the data elements of M:

    θ_r + θ_l = tan^-1 [ (c + b) / (d − a) ],
    θ_r − θ_l = tan^-1 [ (c − b) / (d + a) ].        (2.8)

The left and right rotation angles in equation 2.5 can be computed using equations 2.8 on the 2 x 2 matrix formed by [ a_ii  a_ij ; a_ji  a_jj ]. It may be noted that the p/2 diagonal processors store precisely the matrices required for p/2 successive iterations, corresponding to a single row of parallel ordering in Table 2.1. All the transformations that annihilate the off-diagonal elements in the diagonal processors are independent and hence can be computed in parallel. An update of the entire array requires the left angles to be propagated from a diagonal processor to all the processors on the same row, and the right angles to be propagated to all the processors on the same column.
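As a floating-point sketch of the diagonal processors' angle computation (illustrative only; the hardware uses fixed-point CORDIC rather than library arctangents), equations 2.8 give the angle sum and difference, from which θ_l and θ_r follow:

```python
import math

def svd_2x2_angles(a, b, c, d):
    """Left and right rotation angles that diagonalize [[a, b], [c, d]]."""
    angle_sum = math.atan2(c + b, d - a)    # theta_r + theta_l
    angle_diff = math.atan2(c - b, d + a)   # theta_r - theta_l
    return (angle_sum - angle_diff) / 2.0, (angle_sum + angle_diff) / 2.0

def rot(theta):                              # R(theta) of equation 2.7
    return ((math.cos(theta), math.sin(theta)),
            (-math.sin(theta), math.cos(theta)))

def matmul2(X, Y):
    return tuple(tuple(sum(X[r][k] * Y[k][c] for k in range(2))
                       for c in range(2)) for r in range(2))

# Check that R(theta_l)^T M R(theta_r) is diagonal for a sample matrix.
M = ((1.0, 2.0), (3.0, 4.0))
theta_l, theta_r = svd_2x2_angles(1.0, 2.0, 3.0, 4.0)
Rl_t = tuple(zip(*rot(theta_l)))             # transpose of R(theta_l)
D = matmul2(matmul2(Rl_t, M), rot(theta_r))
assert abs(D[0][1]) < 1e-12 and abs(D[1][0]) < 1e-12
```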
This propagation is systolic. The update of the matrix is followed by a permutation of the rows and the columns of the matrix according to the parallel ordering. An 8-connected mesh is required to perform this data exchange. The permutation results in the next set of sub-matrices, required to compute the new set of angles, being formed at the diagonal processors. Since the diagonal processors require data only from their diagonal neighbors, new computation can start as soon as these neighbors complete their portion of the computation. Thus, the updates of the matrix corresponding to separate iterations in equation 2.5 are pipelined. This leads to diagonal waves of activity originating at the main diagonal and propagating outwards. This is shown in Figure 2.2.

Figure 2.2: Snapshots of the Brent-Luk-Van Loan SVD Array Showing Diagonal Waves of Activity

The various steps in the operation of this array are listed below:

- The p/2 diagonal processors compute the required transformation angles using the 2 x 2 sub-matrix stored in them.

- Each diagonal processor propagates the left angle, which it computed, to its east and west neighbors. Similarly, the right rotation angles are propagated north and south.

- The non-diagonal processors receive the left and right angles and apply the 2-sided transformation to their sub-matrices. After the computation, they forward the left and right rotation angles to the processor next in the row or column.

- Processor P22 in Figure 2.1 receives its next set of data elements from processors P11, P33, P13 and P31. Hence, it has to wait till processors P31 and P13 finish their computations. Similarly, all the other diagonal processors need to wait for their diagonal neighbors to complete the transformations before exchanging data. Exchange of data in different parts of the array is staggered in time.

- After the data exchange, a new set of angles is computed by the diagonal processors and the same sequence of operations is repeated.

Each diagonal processor requires (p − 1) iterations to complete a sweep. The number of sweeps required for convergence is O(log p). For most matrices up to a size of 100 x 100, the algorithm has been observed to converge within 10 sweeps. Since each iteration for a diagonal processor is completed in constant time, the total SVD requires O(p log p).

2.4 Architecture for 2 x 2 SVD

The array proposed by Brent, Luk and Van Loan was converted into an architecture suitable for VLSI implementation by Cavallaro and Luk [8]. By using CORDIC arithmetic, a simple architecture suitable for systolic implementation was developed. The basic functionality required in each processor is:

- the ability to compute inverse tangents, required for the computation of the 2-sided transformations from matrix data,

- the ability to perform 2-sided rotations of a 2 x 2 matrix, required when performing the transformation,

- control to perform the parallel ordering data exchange.

CORDIC allows efficient implementation of the required arithmetic operations in hardware. Chapter 3 discusses the CORDIC algorithm in depth. The architecture of the individual processor element is discussed in Chapter 4.

Chapter 3

CORDIC Techniques

3.1 Introduction

In a special-purpose VLSI processor it is not necessary to include a general-purpose Arithmetic and Logic Unit (ALU). An efficient design with optimal area and time complexity can be obtained by using special arithmetic techniques that map the desired computations to simplified hardware. The CORDIC technique allows efficient implementation of functions, like vector rotations and inverse tangents, in hardware. These constitute the principal computations in the SVD algorithm. Additional ALU operations (addition, subtraction and divide-by-2) required for the SVD are more basic operations and can reuse the adders and shifters that constitute the CORDIC modules.

The Coordinate Rotation Digital Computer (CORDIC) technique was initially developed by Volder [35] as an algorithm to solve the trigonometric relationships that arise in navigation problems. Involving only a fixed sequence of additions or subtractions, and binary shifts, this scheme was used to quickly and systematically approximate the value of a trigonometric function or its inverse. This algorithm was later unified for several elementary functions by Walther [36]. The algorithm has been used extensively in a number of calculators [16] to perform multiplications and divisions, to calculate square roots, to evaluate sine, cosine, tangent, arctangent, sinh, cosh, tanh, arctanh, ln and exp functions, and to convert between binary and mixed-radix number systems [10].

3.2 CORDIC Algorithms

A variety of functions are computed in CORDIC by approximating a given angle in terms of a sequence of fixed angles. The algorithm for such an approximation is iterative and always converges in a fixed number of iterations. The basic result of CORDIC concerns the convergence of the algorithm and is discussed in depth by Walther [36]. The convergence theorem states the conditions that affect convergence and essentially gives an algorithm for such an approximation. The theorem is restated here; Walther [36] gives the proof.

Theorem 3.1 (Convergence of CORDIC) Suppose

    φ_0 ≥ φ_1 ≥ φ_2 ≥ ... ≥ φ_{n-1} > 0

is a finite sequence of real numbers such that

    φ_i ≤ φ_{i+1} + φ_{i+2} + ... + φ_{n-1} + φ_{n-1},   for 0 ≤ i ≤ n − 1,

and suppose r is a real number such that

    |r| ≤ φ_0 + φ_1 + ... + φ_{n-1}.

If s_0 = 0 and s_{i+1} = s_i + δ_i φ_i for 0 ≤ i ≤ n − 1, where

    δ_i = +1 if r ≥ s_i,   δ_i = −1 if r < s_i,

then

    |r − s_k| ≤ φ_k + φ_{k+1} + ... + φ_{n-1} + φ_{n-1},   for 0 ≤ k ≤ n − 1,

and so in particular |r − s_n| < φ_{n-1}.

If the φ's represent angles, the theorem provides a simple algorithm for approximating any angle r in terms of a set of fixed angles φ_i, with a small error φ_{n-1}. The versatility of the CORDIC algorithm allows its use in several different modes: linear, circular and hyperbolic. The principal mode of interest in the CORDIC-SVD processor is the circular mode of CORDIC, due to its ability to compute vector rotations and inverse tangents efficiently.
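For the circular mode the fixed angles are φ_i = tan^-1(2^-i), which satisfy the theorem's hypotheses; the greedy choice of δ_i is easy to exercise numerically (an illustrative sketch with an arbitrary n and target r):

```python
import math

n = 16
phis = [math.atan(2.0 ** -i) for i in range(n)]   # non-increasing fixed angles

def approximate(r, phis):
    """Greedy decomposition of Theorem 3.1: s_{i+1} = s_i + delta_i * phi_i."""
    s = 0.0
    for phi in phis:
        s += phi if r >= s else -phi   # delta_i = +1 if r >= s_i, else -1
    return s

r = 0.5                                # any |r| <= sum(phis) (about 1.74 here)
s_n = approximate(r, phis)
# Theorem 3.1 bounds the final error by phi_{n-1} = tan^-1(2^-15).
assert abs(r - s_n) <= phis[-1]
```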
3.3 CORDIC Operation in the Circular Mode

The anticlockwise rotation of a vector [x_i, y_i]^T through an angle α_i (Figure 3.1) is given by

    [ x_{i+1} ]   [  cos α_i   sin α_i ] [ x_i ]
    [ y_{i+1} ] = [ −sin α_i   cos α_i ] [ y_i ].        (3.1)

The same relation can be rewritten as

    [ x_{i+1} sec α_i ]   [    1       tan α_i ] [ x_i ]
    [ y_{i+1} sec α_i ] = [ −tan α_i      1    ] [ y_i ].        (3.2)

All computations in the circular mode of CORDIC involve a fixed number of iterations, which are similar to equation 3.2 but for a scaling:

    x_{i+1} = x_i + δ_i y_i tan α_i,        (3.3)
    y_{i+1} = y_i − δ_i x_i tan α_i,        (3.4)

where x_i and y_i are the states of the variables x and y at the start of the i-th iteration, 0 ≤ i < n, and δ_i ∈ {−1, +1}. The values x_n and y_n are the values obtained at the end of the CORDIC iterations and differ from the desired values x̂ and ŷ due to a scaling, as shown in Figure 3.1.

Figure 3.1: Rotation of a vector

Along with the x and y iterations, iterations of the form

    z_{i+1} = z_i − δ_i α_i        (3.5)

are performed on a variable z to store different states of an angle. The angles α_i are chosen to be tan^-1(2^-i), which reduces equation 3.3 and equation 3.4 to

    x_{i+1} = x_i + δ_i y_i 2^-i,        (3.6)
    y_{i+1} = y_i − δ_i x_i 2^-i.        (3.7)

These iterations, called the x and the y iterations, can be implemented as simple additions or subtractions, and shifts. The angles α_i are stored in a ROM, allowing the z iterations, given by equation 3.5, to be implemented as additions or subtractions. The actual hardware required to implement these iterations is shown in Figure 3.2.
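A floating-point behavioral sketch of these iterations (illustrative only; the processor realizes them as fixed-point adders, shifters and an angle ROM) drives z to zero and recovers a scaled rotation:

```python
import math

N = 24
ALPHAS = [math.atan(2.0 ** -i) for i in range(N)]  # angle ROM: alpha_i = tan^-1(2^-i)
K = 1.0
for a in ALPHAS:
    K *= math.cos(a)        # scale factor introduced by the unscaled iterations

def cordic_rotate(x0, y0, theta):
    """Drive z from theta to 0 via eqs. 3.5-3.7; returns unscaled x_n, y_n."""
    x, y, z = x0, y0, theta
    for i in range(N):
        delta = 1.0 if z >= 0.0 else -1.0            # reduce |z| at every step
        x, y, z = (x + delta * y * 2.0 ** -i,        # eq. 3.6
                   y - delta * x * 2.0 ** -i,        # eq. 3.7
                   z - delta * ALPHAS[i])            # eq. 3.5
    return x, y

# Starting from (1, 0): K * x_n ~ cos(theta) and K * y_n ~ -sin(theta).
x_n, y_n = cordic_rotate(1.0, 0.0, 0.3)
assert abs(K * x_n - math.cos(0.3)) < 1e-6
assert abs(K * y_n + math.sin(0.3)) < 1e-6
```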
Figure 3.2: Implementation of the CORDIC Iterations in Hardware

                      Z-Reduction                            Y-Reduction
    Initial values    x_0, y_0, z_0 = θ                      x_0, y_0, z_0
    Final values      x_n = (x_0 cos θ + y_0 sin θ)/K_n      x_n = (x_0^2 + y_0^2)^(1/2)/K_n
                      y_n = (−x_0 sin θ + y_0 cos θ)/K_n     y_n = 0
                      z_n = 0                                z_n = z_0 + tan^-1(y_0/x_0)
    Functions         Vector rotations, sine, cosine         Inverse tangents

Table 3.1: CORDIC Functionality in the Circular Mode

Rotation of a vector through any angle θ is achieved by decomposing the angle as a sum of α_i's, using the CORDIC convergence theorem, and rotating the initial vector through this sequence of angles. Different functions are implemented by a different choice of δ_i at each iteration. If δ_i is chosen to force the initial z_0 to 0, it is termed Rotation or Z-reduction. This is used to perform efficient vector rotations given an initial angle. A choice of δ_i to reduce the initial y_0 to 0 is called