This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

University Microfilms International
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700
800/521-0600
SAMSUNG EXHIBIT 1042
Order Number 1845319

Architectural, numerical and implementation issues in the VLSI design of an integrated CORDIC-SVD processor

Kota, Kishore, M.S.

Rice University, 1991

U-M-I
300 N. Zeeb Rd.
Ann Arbor, MI 48106
RICE UNIVERSITY

Architectural, Numerical and Implementation Issues in the VLSI Design of an Integrated CORDIC-SVD Processor

by

Kishore Kota

A THESIS SUBMITTED
IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE

Master of Science

APPROVED, THESIS COMMITTEE:

Joseph R. Cavallaro, Chairman
Assistant Professor
Electrical and Computer Engineering

John K. Bennett
Assistant Professor
Electrical and Computer Engineering

Peter Varman
Associate Professor
Electrical and Computer Engineering

Houston, Texas

April, 1991
Architectural, Numerical and Implementation Issues in the VLSI Design of an Integrated CORDIC-SVD Processor

Kishore Kota

Abstract
This thesis describes the design of a systolic array for computing the Singular Value Decomposition (SVD) based on the Brent, Luk and Van Loan array. The use of COordinate Rotation DIgital Computer (CORDIC) arithmetic results in an efficient VLSI implementation of the processor that forms the basic unit of the array. A six-chip custom VLSI chip set for the processor was initially designed, fabricated in a 2.0 µm CMOS n-well process, and tested. The CORDIC Array Processor Element (CAPE), a single chip implementation, incorporates several enhancements based on a detailed error analysis of fixed-point CORDIC. The analysis indicates a need to normalize input values for inverse tangent computations. This scheme was implemented using a novel method that has O(n^1.5) hardware complexity. Use of previous techniques to implement such a normalization would require O(n^2) hardware. Enhanced architectures, which reduce idle time in the array either through pipelining or by improving on a broadcast technique, are also presented.
Acknowledgments

Special thanks are due to Dr. Joe Cavallaro for being the motivation and a constant driving force behind this work. I am especially thankful to him for being a good friend and just being there when I needed him most. I would also like to thank Dr. John Bennett and Dr. Peter Varman for serving on the committee and having the patience to deal with me.

My friend and constant companion “yo” Hemkumar deserves special praise for all those enlightening discussions and valuable comments, which on more than one occasion prevented me from making the wrong decisions. Thanks are due to my roommate, Raghu, for his constant, but often futile, efforts at waking me up. But for him I would have slept through the entire semester without a care. Thanks to the Common Pool for allowing me to work through even the most difficult times without compromising on the food. Special thanks to my office-mates Jay Greenwood and Jim Carson. The office would not be half as lively without them.

Finally, I am eternally indebted to my mother and father. Even when they are halfway across the world, nothing is possible without their blessings.
To my Grandfather,

For his one statement that has been the driving force through the years:

“... hope for the best, but be prepared for the worst ...”
Contents

Abstract
Acknowledgments
List of Tables
List of Illustrations

1 Introduction
  1.1 Systolic Arrays for SVD
  1.2 Contributions of the thesis
  1.3 Overview of the thesis

2 SVD Array Architecture
  2.1 Introduction
  2.2 The SVD-Jacobi Method
  2.3 Direct 2-Angle Method
  2.4 Architecture for 2 × 2 SVD

3 CORDIC Techniques
  3.1 Introduction
  3.2 CORDIC Algorithms
  3.3 CORDIC Operation in the Circular Mode
    3.3.1 CORDIC Z-Reduction
    3.3.2 CORDIC Y-Reduction
    3.3.3 Modified Y-Reduction for the SVD processor
  3.4 Scale Factor Correction
  3.5 Error Analysis of CORDIC Iterations
  3.6 Error Analysis of CORDIC Z-Reduction
    3.6.1 Error Introduced by X and Y Iterations
    3.6.2 Error Introduced by Z Iterations
    3.6.3 Effects of perturbations in θ
  3.7 Error Analysis of CORDIC Y-Reduction
    3.7.1 Error Introduced by X and Y Iterations
    3.7.2 Error Introduced by Z Iterations
    3.7.3 Perturbation in the Inverse Tangent
  3.8 Normalization for CORDIC Y-Reduction
    3.8.1 Single Cycle Normalization
    3.8.2 Partial Normalization
    3.8.3 Choice of a Minimal Set of Shifts
    3.8.4 Cost of Partial Normalization Implementation
  3.9 Summary

4 The CORDIC-SVD Processor Architecture
  4.1 Introduction
  4.2 Architecture of the Prototype
    4.2.1 Design of the XY-Chip
    4.2.2 Design of the Z-Chip
    4.2.3 Design of the Intra-Chip
    4.2.4 Design of the Inter-Chip
  4.3 The CORDIC Array Processor Element
  4.4 Issues in Loading the Array
    4.4.1 Data Loading in CAPE

5 Architectures to Improve Processor Utilization
  5.1 Introduction
  5.2 Architecture for Systolic Starting
  5.3 Architecture for fast SVD
  5.4 Architecture for Collection of U and V Matrices

6 VLSI Implementation Issues
  6.1 Design Methodology
  6.2 Reducing Skews in Control Signals
  6.3 Design of an Adder
  6.4 Testing

7 Conclusions
  7.1 Summary
  7.2 Future Work

Bibliography
Tables

2.1 Parallel Ordering
3.1 CORDIC Functionality in the Circular Mode
3.2 Total Shifts Using a Novel Normalization Scheme
3.3 Summary and Comparison of Error Analysis
6.1 A list of some of the problems in the 6-chip prototype
6.2 Breakdown of clock cycles for the various sub-tasks in the 6-chip prototype
6.3 Breakdown of clock cycles for the various sub-tasks in CAPE
Illustrations

2.1 The Brent-Luk-Van Loan SVD Array
2.2 Snapshots of Brent-Luk-Van Loan SVD Array
3.1 Rotation of a vector
3.2 Implementation of the CORDIC Iterations in Hardware
3.3 Rotation of a vector in Y-Reduction
3.4 Error in Y-Reduction Without Normalization
3.5 Algorithm to Invoke Modified CORDIC Iterations
3.6 Implementation of the CORDIC Iterations in Hardware
3.7 Error in Y-Reduction With Partial Normalization
4.1 Block Diagram of the 6-chip Prototype CORDIC-SVD Processor
4.2 Die Photograph of the XY-Chip
4.3 Die Photograph of the Z-Chip
4.4 Block Diagram of the Intra-Chip
4.5 Die Photograph of the Intra-Chip
4.6 Block Diagram of Inter-Chip
4.7 Die Photograph of the Inter-Chip
4.8 Internal Architecture of CAPE
4.9 A Combined DSP-VLSI Array for Robotics
5.1 Interconnection of Control Signals for Systolic Starting
5.2 Interconnection of Data Lines for Systolic Starting
5.3 Snapshots of a Systolic Starting scheme
5.4 Hardware Required for Fast SVD
5.5 Snapshots of an Array Computing U and V
6.1 Plot of the completed Single Chip Implementation
6.2 Clock Generation with Large Skew
6.3 Clock Generation with Small Skew
Chapter 1

Introduction
Rapid advances made in VLSI and WSI technologies have led the way to an entirely different approach to computer design for real-time applications, using special-purpose architectures with custom chips. By migrating some of the highly compute-intensive tasks to special-purpose hardware, performance that is typically in the realm of supercomputers can be obtained at only a fraction of the cost. High-level Computer-Aided Design (CAD) tools for VLSI silicon compilation allow fast prototyping of custom processors by reducing the tedium of VLSI design. Special-purpose architectures are not burdened by the problems associated with general computers and can map the algorithmic needs of a problem to hardware. Extensive parallelism, pipelining and use of special arithmetic techniques tailored for the specific application lead to designs very different from conventional computers.

Systolic and wavefront arrays form a class of special-purpose architectures that hold considerable promise for real-time computations. Systolic arrays are characterized by “multiple use of each data item, extensive concurrency, a few types of simple cells and a simple and regular data and control flow between neighbors” [21]. Systolic arrays typically consist of only a few types of simple processors, allowing easy prototyping. The near-neighbor communication associated with systolic arrays results in short interconnections that allow higher operating speeds even in large arrays. Thus, systolic arrays offer the potential for scaling to very large arrays. Numerical problems involving large matrices benefit from this property of systolic arrays.
Many real-time signal processing, image processing and robotics applications require fast computations involving large matrices. The high throughput required in these applications can, in general, be obtained only through special-purpose architectures. A computationally complex numerical problem which has evoked a lot of interest is the Singular Value Decomposition (SVD) [15]. It is generally acknowledged that the SVD is the only generally reliable method for determining the rank of a matrix numerically. Solutions to the complex problems encountered in real-time processing which use the SVD exhibit a high degree of numerical stability. SVD techniques handle rank deficiency and ill-conditioning of matrices elegantly, obviating the need for special handling of these cases. The SVD is a very useful tool, for example, in analyzing data matrices from sensor arrays for adaptive beamforming [30], and low rank approximations to matrices in image enhancement [2]. The wide variety of applications for the SVD coupled with its computational complexity justify dedicating hardware to this computation.
1.1 Systolic Arrays for SVD
The SVD problem has evoked a lot of attention in the past decade. Numerous architectures and algorithms were proposed for computing the SVD in an efficient manner. The Hestenes’ method using Jacobi transformations has been a popular choice among researchers, due to the concurrency it allows in the computation of the SVD. Numerous algorithms and architectures for the SVD were proposed by Luk [24, 26, 25, 23]. Brent, Luk and Van Loan [5] presented an expandable square array of simple processors to compute the SVD. A processor architecture for the Brent, Luk and Van Loan array, using CORDIC techniques, was presented by Cavallaro [8].
1.2 Contributions of the thesis

A detailed study of the numerical accuracy of fixed-point CORDIC modules is provided in this thesis. This analysis indicates a need to normalize the values for CORDIC Y-reduction in order to achieve meaningful results. A novel normalization scheme for fixed-point CORDIC Y-reduction that reduces the hardware complexity is developed.

This thesis is chiefly concerned with the issues relating to the VLSI implementation of the CORDIC-SVD array proposed by Cavallaro. Many of the lower-level architectural issues that are not relevant at the higher levels assume a new significance when trying to implement the array. Elegant solutions have been found for all the problems encountered in the design process.

An integrated VLSI chip to implement the 2 × 2 SVD processor element, which serves as the basic unit of the SVD array, has been designed. A new internal architecture has been developed within the constraints imposed by VLSI.

Numerous array architectures that improve on the previous architectures were developed as part of this research. These schemes can be used in future implementations as improvements to the current design.
1.3 Overview of the thesis

Chapter 2 presents the SVD algorithm and the array proposed by Brent, Luk and Van Loan [5]. A discussion of the CORDIC-SVD processor is also provided. Chapter 3 reviews the CORDIC arithmetic technique and then presents an improved analysis of the numerical accuracy of fixed-point CORDIC. Chapter 4 presents the improvements made to the architecture of the single chip VLSI implementation of the SVD processor over the various sub-units that constitute the prototype SVD processor. Chapter 5 presents improvements to the array architecture and the impact of these on the processor design. These improvements include schemes for systolic starting and architectures that reduce idle time in the array, to achieve higher throughput.
Chapter 2

SVD Array Architecture

2.1 Introduction
This chapter introduces the principles that govern the efficient computation of the Singular Value Decomposition (SVD) in a parallel array. A major portion of the work in this thesis is based on the Brent, Luk and Van Loan [5] array for computing the SVD. Later sections of this chapter provide insight into the operation of this array.

The SVD has proved to be an important matrix decomposition with theoretical and practical significance. The canonical representation of a matrix provided by the SVD is very useful in determining its properties. The SVD [15] of a p × p matrix A is defined as

    A = U Σ V^T,                                      (2.1)

where U and V are orthogonal matrices of dimension p × p, and Σ = diag(σ_1, σ_2, ..., σ_p) is a diagonal matrix of singular values.
The columns of U and V are called the left and the right singular vectors respectively. The SVD is a compute-intensive operation requiring O(p^3) time on a sequential computer. For example, the SVD of an 8 × 8 matrix on a SUN 3/60 requires about 1 second of CPU time using LINPACK [12], a library of linear algebra routines. A typical real-time application, the inverse kinematics engine of a robot, requires the computation of the SVD of an 8 × 8 matrix every 400 µs. This is precisely the class of applications where the use of dedicated arrays is attractive. With a square array of processors it is possible to reduce the time complexity for the SVD to O(p log p).
2.2 The SVD-Jacobi Method

The Jacobi method to compute the SVD is a popular algorithm for parallel implementation, due to the concurrency it introduces to the problem. The algorithm is based on the use of orthogonal transformations, called Jacobi rotations¹, to selectively reduce a given matrix element to zero. A Jacobi rotation by an angle θ in the (i, j) plane is a p × p matrix J(i, j, θ), where

    j_kk = 1          for all k ≠ i, j,
    j_ii = cos θ,     j_ij = sin θ,                   (2.2)
    j_ji = −sin θ,    j_jj = cos θ,

and all other entries j_kl = 0.

¹ also called Givens Rotations or simply plane rotations

Intuitively, a Jacobi rotation is an identity matrix with four of its elements replaced by a vector rotation matrix, as shown below:

                 [ 1                            ]
                 [   ...                        ]
                 [        cos θ  ...   sin θ    ]
    J(i, j, θ) = [         ...          ...     ]     (2.3)
                 [       −sin θ  ...   cos θ    ]
                 [                        ...   ]
                 [                           1  ]

The parameters i and j select the matrix element that is to be reduced to zero, while the parameter θ is computed using a few matrix elements. Since very few
matrix elements are required to compute the entire transformation, local computation is possible by storing all the required data in the registers of a single processor. In addition, pre-multiplication with a Jacobi rotation matrix, called a left-sided rotation, modifies only the i-th and j-th rows, while a right-sided rotation affects only the i-th and j-th columns. This introduces concurrency, since computations which use the unmodified data elements can be performed in parallel.
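The row and column locality of one-sided rotations can be checked numerically. The following sketch (illustrative Python, not from the thesis; the function names are made up for this example) builds J(i, j, θ) as in equation 2.3 and verifies that a left-sided rotation modifies only rows i and j, while a right-sided rotation modifies only columns i and j.

```python
import math

def jacobi_rotation(p, i, j, theta):
    # Identity matrix with the (i,i), (i,j), (j,i), (j,j) entries replaced
    # by a 2x2 rotation, as in equation 2.3 (0-based indices here).
    J = [[1.0 if r == c else 0.0 for c in range(p)] for r in range(p)]
    c, s = math.cos(theta), math.sin(theta)
    J[i][i], J[i][j] = c, s
    J[j][i], J[j][j] = -s, c
    return J

def matmul(X, Y):
    return [[sum(X[r][t] * Y[t][c] for t in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

p = 6
A = [[float(r * p + c) for c in range(p)] for r in range(p)]
J = jacobi_rotation(p, 1, 4, 0.3)

B = matmul(transpose(J), A)   # left-sided rotation J^T A
C = matmul(A, J)              # right-sided rotation A J
for r in range(p):
    if r not in (1, 4):
        assert B[r] == A[r]   # untouched rows survive exactly
for c in range(p):
    if c not in (1, 4):
        assert all(C[r][c] == A[r][c] for r in range(p))  # untouched columns
```

This is the property that lets computations on the unmodified rows and columns proceed in parallel.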
The SVD algorithm consists of a sequence of orthogonal² Jacobi transformations, chosen to diagonalize the given matrix. The algorithm to compute the SVD is iterative and can be written as

    A_0 = A,                                          (2.4)
    A_{k+1} = U_k^T A_k V_k = J_l(i, j, θ_l)^T A_k J_r(i, j, θ_r).    (2.5)

Appropriate choice of the 2-sided rotations can force A_k to converge to the diagonal matrix of singular values Σ. The matrix of left singular vectors is obtained as a product of the left rotations U_k, while the matrix of right singular vectors is obtained as a product of the right-sided rotations. At each iteration k, a left rotation of θ_l and a right rotation of θ_r in the plane ij are chosen to reduce the two off-diagonal elements a_ij and a_ji of A_k to zero. A sweep consists of n(n − 1)/2 iterations that reduce every pair of off-diagonal elements to zero. Several sweeps are required to actually reduce the initial matrix to a diagonal matrix. The sequence in which the off-diagonal pairs are annihilated within each sweep is chosen a priori according to some fixed ordering. Forsythe and Henrici [14] proved that the algorithm converges for the cyclic-by-rows and cyclic-by-columns orderings. However, these orderings do not allow decoupling of successive iterations, thereby restricting the parallelism achievable. Brent and Luk [4] introduced the parallel ordering, which does not suffer from this drawback.

² A matrix is orthogonal when its transpose equals its inverse.
    (1,2)  (3,4)  (5,6)  (7,8)
    (1,4)  (2,6)  (3,8)  (5,7)
    (1,6)  (4,8)  (2,7)  (3,5)
    (1,8)  (6,7)  (4,5)  (2,3)
    (1,7)  (8,5)  (6,3)  (4,2)
    (1,5)  (7,3)  (8,2)  (6,4)
    (1,3)  (5,2)  (7,4)  (8,6)

Table 2.1: Parallel Ordering data exchange with 8 elements. This is the ordering used when eight chess players have to play a round-robin tournament, each player playing one game a day.
An iteration that annihilates a_12 and a_21 does not affect any of the terms required for annihilating a_34 and a_43, allowing the two sets of rotation angles to be computed concurrently. Thus parallel ordering allows decoupling of the computation of the Jacobi transformations of all the iterations in a single row of the ordering (Table 2.1). Brent, Luk and Van Loan [5] proposed a square array of processors based on this scheme. The ordering has an additional property that it requires only near-neighbor communication in an array and is hence amenable to systolic implementation.
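The round-robin analogy in Table 2.1 can be sketched with the standard "circle method" schedule. This is an illustrative construction in Python, not necessarily the exact permutation Brent and Luk use, but it shares the essential properties: p − 1 steps, each step pairing all p indices disjointly, and every pair appearing exactly once.

```python
def round_robin_pairs(p):
    # "Circle method": keep index 1 fixed and rotate the rest by one
    # position each round; pair the k-th and (p-1-k)-th entries.
    players = list(range(1, p + 1))
    rounds = []
    for _ in range(p - 1):
        rounds.append([(players[k], players[p - 1 - k]) for k in range(p // 2)])
        players = [players[0]] + [players[-1]] + players[1:-1]
    return rounds

rounds = round_robin_pairs(8)
all_pairs = {frozenset(pr) for rnd in rounds for pr in rnd}
assert len(rounds) == 7                      # p - 1 steps
assert all(len(r) == 4 for r in rounds)      # p/2 disjoint pairs per step
assert len(all_pairs) == 28                  # every pair met exactly once
```

Any schedule with these properties lets p/2 independent 2 × 2 problems be attacked per step, which is what the array exploits.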
2.3 Direct 2-Angle Method

The Brent, Luk and Van Loan systolic array [5] consists of a p/2 × p/2 array of processors (Figure 2.1) to compute the SVD of a p × p matrix. The matrix dimension, p, is assumed to be even. Each processor stores a 2 × 2 submatrix and has the ability to compute the SVD of a 2 × 2 matrix. The special structure of the Jacobi rotation matrix allows computing the angles for a p × p two-sided rotation in terms of a basic 2 × 2 rotation. The application of the transformation on a p × p matrix can also be expressed as a set of 2 × 2 transformations. Thus the 2 × 2 SVD forms the basic step for the p × p SVD.
Figure 2.1: The Brent-Luk-Van Loan SVD Array for Computing the SVD of an 8 × 8 Matrix
A 2 × 2 SVD can be described as

    R(θ_l)^T [ a  b ] R(θ_r) = [ σ_1   0  ]
             [ c  d ]          [  0   σ_2 ] ,         (2.6)

where θ_l and θ_r are the left and right rotation angles, respectively. The rotation matrix is

    R(θ) = [  cos θ   sin θ ]
           [ −sin θ   cos θ ] ,                       (2.7)

and the input matrix is

    M = [ a  b ]
        [ c  d ] .

The angles θ_l and θ_r necessary to diagonalize the matrix M can be shown to be the inverse tangents of the data elements of M:

    θ_r + θ_l = tan^{-1}[(c + b)/(d − a)],
    θ_r − θ_l = tan^{-1}[(c − b)/(d + a)].            (2.8)
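Equations 2.8 can be exercised numerically. The sketch below (illustrative Python, not part of the thesis; the helper names are made up) recovers θ_l and θ_r from a 2 × 2 matrix and checks that the two-sided rotation drives the off-diagonal entries to zero.

```python
import math

def rot2(theta):
    # R(theta) as in equation 2.7
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def matmul2(X, Y):
    return [[X[0][0]*Y[0][0] + X[0][1]*Y[1][0], X[0][0]*Y[0][1] + X[0][1]*Y[1][1]],
            [X[1][0]*Y[0][0] + X[1][1]*Y[1][0], X[1][0]*Y[0][1] + X[1][1]*Y[1][1]]]

def transpose2(X):
    return [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]

def svd_2x2_angles(a, b, c, d):
    # Equation 2.8: sum and difference of the angles are inverse tangents.
    sum_angle = math.atan2(c + b, d - a)    # theta_r + theta_l
    diff_angle = math.atan2(c - b, d + a)   # theta_r - theta_l
    return (sum_angle - diff_angle) / 2, (sum_angle + diff_angle) / 2

M = [[3.0, 1.0], [2.0, 4.0]]
theta_l, theta_r = svd_2x2_angles(M[0][0], M[0][1], M[1][0], M[1][1])

# R(theta_l)^T M R(theta_r) should be diagonal (equation 2.6).
D = matmul2(matmul2(transpose2(rot2(theta_l)), M), rot2(theta_r))
assert abs(D[0][1]) < 1e-12 and abs(D[1][0]) < 1e-12
# Rotations preserve the determinant: |d11 * d22| = |det M| = 10.
assert abs(abs(D[0][0] * D[1][1]) - 10.0) < 1e-9
```

The CORDIC modules of Chapter 3 compute exactly these inverse tangents and rotations in hardware.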
The left and right rotation angles in equation 2.5 can be computed using equations 2.8 on the 2 × 2 matrix formed by

    [ a_ii  a_ij ]
    [ a_ji  a_jj ] .

It may be noted that the p/2 diagonal processors store precisely the matrices required for p/2 successive iterations, corresponding to a single row of parallel ordering in Table 2.1. All the transformations that annihilate the off-diagonal elements in the diagonal processors are independent and hence can be computed in parallel. An update of the entire array requires the left angles to be propagated from a diagonal processor to all the processors on the same row, and the right angle to be propagated to all the processors on the same column. This propagation is systolic. The update of the matrix is followed by a permutation of the rows and the columns of the matrix according to the parallel ordering. An 8-connected mesh is required to perform this data exchange. The permutation results in the next set of sub-matrices, required to compute the new set of angles, to be formed
Figure 2.2: Snapshots of the Brent-Luk-Van Loan SVD Array Showing Diagonal Waves of Activity
at the diagonal processors. Since the diagonal processors require data only from their diagonal neighbors, new computation can start as soon as these neighbors complete their portion of the computation. Thus, the updates of the matrix corresponding to separate iterations in equation 2.5 are pipelined. This leads to diagonal waves of activity originating at the main diagonal and propagating outwards. This is shown in Figure 2.2. The various steps in the operation of this array are listed below:
• The p/2 diagonal processors compute the required transformation angles using the 2 × 2 sub-matrix stored in them.

• Each diagonal processor propagates the left angle, which it computed, to its east and west neighbors. Similarly, the right rotation angles are propagated north and south.

• The non-diagonal processors receive the left and right angles and apply the 2-sided transformation to their sub-matrices. After the computation, they forward the left and right rotation angles to the processor next in the row or column.

• Processor P22 in Figure 2.1 receives its next set of data elements from processors P11, P33, P13 and P31. Hence, it has to wait till processors P31 and P13 finish their computations. Similarly, all the other diagonal processors need to wait for their diagonal neighbors to complete the transformations before exchanging data. Exchange of data in different parts of the array is staggered in time.

• After the data exchange, a new set of angles is computed by the diagonal processors and the same sequence of operations is repeated.
Each diagonal processor requires (p − 1) iterations to complete a sweep. The number of sweeps required for convergence is O(log p). For most matrices up to a size of 100 × 100, the algorithm has been observed to converge within 10 sweeps. Since each iteration for a diagonal processor is completed in constant time, the total SVD requires O(p log p) time.
2.4 Architecture for 2 × 2 SVD

The array proposed by Brent, Luk and Van Loan was converted into an architecture suitable for VLSI implementation by Cavallaro and Luk [8]. By using CORDIC arithmetic, a simple architecture suitable for systolic implementation was developed. The basic functionality required in each processor is:

• Ability to compute inverse tangents, which is required for the computation of the 2-sided transformations from matrix data,

• Ability to perform 2-sided rotations of a 2 × 2 matrix, which is required when performing the transformation,

• Control to perform the parallel ordering data exchange.

CORDIC allows efficient implementation of the required arithmetic operations in hardware. Chapter 3 discusses the CORDIC algorithm in depth. The architecture of the individual processor element is discussed in Chapter 4.
Chapter 3

CORDIC Techniques

3.1 Introduction
In a special-purpose VLSI processor it is not necessary to include a general purpose Arithmetic and Logic Unit (ALU). An efficient design with optimal area and time complexity can be obtained by using special arithmetic techniques that map the desired computations to simplified hardware. The CORDIC technique allows efficient implementation of functions, like vector rotations and inverse tangents, in hardware. These constitute the principal computations in the SVD algorithm. Additional ALU operations (addition, subtraction and divide-by-2) required for the SVD are more basic operations and can reuse the adders and shifters that constitute the CORDIC modules.
The Coordinate Rotation Digital Computer (CORDIC) technique was initially developed by Volder [35] as an algorithm to solve the trigonometric relationships that arise in navigation problems. Involving only a fixed sequence of additions or subtractions, and binary shifts, this scheme was used to quickly and systematically approximate the value of a trigonometric function or its inverse. This algorithm was later unified for several elementary functions by Walther [36]. The algorithm has been used extensively in a number of calculators [16] to perform multiplications and divisions, to calculate square roots, to evaluate sine, cosine, tangent, arctangent, sinh, cosh, tanh, arctanh, ln and exp functions, and to convert between binary and mixed radix number systems [10].
3.2 CORDIC Algorithms

A variety of functions are computed in CORDIC by approximating a given angle in terms of a sequence of fixed angles. The algorithm for such an approximation is iterative and always converges in a fixed number of iterations. The basic result of CORDIC concerns the convergence of the algorithm and is discussed in depth by Walther [36]. The convergence theorem states the conditions that affect convergence and essentially gives an algorithm for such an approximation. The theorem is restated here; Walther [36] gives the proof for this theorem.
Theorem 3.1 (Convergence of CORDIC) Suppose

    φ_0 ≥ φ_1 ≥ φ_2 ≥ ... ≥ φ_{n−1} > 0

is a finite sequence of real numbers such that

    φ_i ≤ Σ_{j=i+1}^{n−1} φ_j + φ_{n−1},    for 0 ≤ i ≤ n − 1,

and suppose r is a real number such that

    |r| ≤ Σ_{j=0}^{n−1} φ_j.

If s_0 = 0, and s_{i+1} = s_i + δ_i φ_i for 0 ≤ i ≤ n − 1, where

    δ_i =  1,   if r ≥ s_i,
          −1,   if r < s_i,

then

    |r − s_k| ≤ Σ_{j=k}^{n−1} φ_j + φ_{n−1},    for 0 ≤ k ≤ n − 1,

and so in particular |r − s_n| ≤ φ_{n−1}.
If the φ_j represent angles, the theorem provides a simple algorithm for approximating any angle r, in terms of a set of fixed angles φ_j, with a small error φ_{n−1}. The versatility of the CORDIC algorithm allows its use in several different modes: linear, circular and hyperbolic. The principal mode of interest in the CORDIC-SVD processor is the circular mode of CORDIC, due to its ability to compute vector rotations and inverse tangents efficiently.
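Theorem 3.1 can be exercised directly with the circular-mode angles φ_i = tan^{-1}(2^{-i}). The Python sketch below (illustrative, in floating point rather than the fixed-point arithmetic studied in this thesis) runs the greedy update s_{i+1} = s_i + δ_i φ_i and checks the final error bound |r − s_n| ≤ φ_{n−1}.

```python
import math

def greedy_angle_approx(r, n):
    # Theorem 3.1 with phi_i = arctan(2^-i):
    #   s_0 = 0, s_{i+1} = s_i + delta_i * phi_i,
    #   delta_i = +1 if r >= s_i else -1.
    phis = [math.atan(2.0 ** -i) for i in range(n)]
    s = 0.0
    for phi in phis:
        s += phi if r >= s else -phi
    return s, phis[-1]

# |r| must lie within the convergence range, sum(phi_i) ~ 1.74 radians.
for r in (0.0, 0.5, -1.0, 1.2):
    s, last_phi = greedy_angle_approx(r, 24)
    assert abs(r - s) <= last_phi   # |r - s_n| <= phi_{n-1}
```

The arctangent sequence satisfies the theorem's hypothesis φ_i ≤ Σ_{j>i} φ_j + φ_{n−1}, which is what guarantees that the greedy sign choices never strand the partial sum.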
3.3 CORDIC Operation in the Circular Mode

The anticlockwise rotation of a vector [x̃_i, ỹ_i]^T through an angle α_i (Figure 3.1) is given by

    [ x̃_{i+1} ]   [  cos α_i   sin α_i ] [ x̃_i ]
    [ ỹ_{i+1} ] = [ −sin α_i   cos α_i ] [ ỹ_i ] .    (3.1)

The same relation can be rewritten as

    [ x̃_{i+1} sec α_i ]   [    1        tan α_i ] [ x̃_i ]
    [ ỹ_{i+1} sec α_i ] = [ −tan α_i      1     ] [ ỹ_i ] .    (3.2)

All computations in the circular mode of CORDIC involve a fixed number of iterations, which are similar to equation 3.2, but for a scaling:

    x_{i+1} = x_i + δ_i y_i tan α_i,                  (3.3)
    y_{i+1} = y_i − δ_i x_i tan α_i,                  (3.4)

where x_i and y_i are the states of the variables x and y at the start of the i-th iteration, 0 ≤ i < n, and δ_i ∈ {−1, +1}. The values x_n and y_n are the values obtained at the end of the CORDIC iterations and differ from the desired values x̃_n and ỹ_n due to a scaling, as shown in Figure 3.1. Along with the x and y iterations, iterations of the form

    z_{i+1} = z_i + δ_i α_i                           (3.5)
Figure 3.1: Rotation of a vector
are performed on a variable z to store different states of an angle. The angles α_i are chosen to be tan^{-1}(2^{-i}), which reduces equation 3.3 and equation 3.4 to

    x_{i+1} = x_i + δ_i y_i 2^{-i},                   (3.6)
    y_{i+1} = y_i − δ_i x_i 2^{-i}.                   (3.7)

These iterations, called the x and the y iterations, can be implemented as simple additions or subtractions, and shifts. The angles α_i are stored in a ROM, allowing the z iterations, given by equation 3.5, to be implemented as additions or subtractions. The actual hardware required to implement these iterations is shown in Figure 3.2.

Rotation of a vector through any angle θ is achieved by decomposing the angle as a sum of α_i's, using the CORDIC convergence theorem, and rotating the initial vector through this sequence of angles. Different functions are implemented by a different choice of δ_i at each iteration. If δ_i is chosen to force the initial z_0 to 0, it
Figure 3.2: Implementation of the CORDIC Iterations in Hardware
    Operation     Initial values   Final values                          Functions
    -----------   --------------   -----------------------------------   -----------------
    Z-Reduction   z_0 = θ          x_n = (x_0 cos θ + y_0 sin θ)/K_n     Vector rotations,
                                   y_n = (−x_0 sin θ + y_0 cos θ)/K_n    sine, cosine
                                   z_n = 0
    Y-Reduction   z_0 = 0          x_n = sqrt(x_0² + y_0²)/K_n           Inverse tangents
                                   y_n = 0
                                   z_n = tan^{-1}(y_0/x_0)

Table 3.1: CORDIC Functionality in the Circular Mode
is termed Rotation or Z-reduction. This is used to perform efficient vector rotations given an initial angle. A choice of δ_i's to reduce the initial y_0 to 0 is called Vectoring or Y-reduction. Iterations of this form allow the computation of the inverse tangent function. Table 3.1 summarizes the various operations performed in the CORDIC circular mode, as they pertain to the SVD processor.
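The Y-reduction entries of Table 3.1 can be checked with a short sketch of iterations 3.5 through 3.7. This is illustrative Python in floating point, assuming x_0 > 0; the fixed-point precision and normalization issues that the thesis analyzes are deliberately ignored here.

```python
import math

def cordic_y_reduction(x0, y0, n=32):
    # Vectoring (Y-reduction) with the iterations of equations 3.5-3.7:
    #   x_{i+1} = x_i + d*y_i*2^-i,  y_{i+1} = y_i - d*x_i*2^-i,
    #   z_{i+1} = z_i + d*alpha_i,   alpha_i = arctan(2^-i),
    # choosing d = sign(y_i) to drive y toward zero (assumes x0 > 0).
    x, y, z = x0, y0, 0.0
    K = 1.0                        # K_n = prod cos(alpha_i)
    for i in range(n):
        alpha = math.atan(2.0 ** -i)
        K *= math.cos(alpha)
        d = 1 if y >= 0 else -1
        x, y, z = x + d * y * 2.0 ** -i, y - d * x * 2.0 ** -i, z + d * alpha
    return x, y, z, K

x_n, y_n, z_n, K = cordic_y_reduction(3.0, 1.0)
assert abs(y_n) < 1e-6                               # y driven to zero
assert abs(z_n - math.atan2(1.0, 3.0)) < 1e-6        # z_n = arctan(y0/x0)
assert abs(x_n - math.hypot(3.0, 1.0) / K) < 1e-6    # x_n = sqrt(x0^2+y0^2)/K_n
```

Each pass through the loop costs only two shift-and-add operations and one table-lookup addition, which is what makes the hardware of Figure 3.2 so compact.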
3.3.1 CORDIC Z-Reduction

Given an initial vector [x̃_0, ỹ_0]^T = [x_0, y_0]^T and an angle z_0 = θ through which to rotate, the ro