throbber
INFORMATION TO USERS
`
`This manuscript has been reproduced from the microfilm master. UMI
`films the text directly from the original or copy submitted. Thus, some
`thesis and dissertation copies are in typewriter face, while others may
`be from any type of computerprinter.
`
`The quality of this reproduction is dependent upon the quality of the
`copy submitted. Broken or indistinct print, colored or poor quality
`illustrations and photographs, print bleedthrough, substandard margins,
`and improper alignmentcan adversely affect reproduction.
`
`In the unlikely event that the author did not send UMI a complete
`manuscript and there are missing pages, these will be noted. Also, if
`unauthorized copyright material had to be removed,a note will indicate
`the deletion.
`
`Oversize materials (e.g., maps, drawings, charts) are reproduced by
`sectioning the original, beginning at the upperleft-hand corner and
`continuing from left to right in equal sections with small overlaps. Each
`original is also photographed in one exposure andis included in
`reduced form at the back of the book.
`
`Photographs includedin the original manuscript have been reproduced
`xerographically in this copy. Higher quality 6" x 9" black and white
`photographic prints are available for any photographsorillustrations
`appearingin this copy for an additional charge. Contact UMI directly
`to order.
`
`University Microfilms International
`A Bell & Howell Information Company
`300 North Zeeb Road. Ann Arbor, MI 48106-1346 USA
`313/761-4700
`800/521-0600
`
`Page 1 of 106
`
`SAMSUNG EXHIBIT 1042
`
`Page 1 of 106
`
`SAMSUNG EXHIBIT 1042
`
`

`

`Page 2 of 106
`
`Page 2 of 106
`
`

`

`Order Number 1845319
`
`Architectural, numerical and implementation issues in the VLSI
`design of an integrated CORDIC-SVD processor
`
`Kota, Kishore, M.S.
`
`Rice University, 1991
`
`U-M-I
`
`300 N, Zeeb Rd.
`Ann Arbor, MI 48106
`
`Page 3 of 106
`
`Page 3 of 106
`
`

`

`Page 4 of 106
`
`Page 4 of 106
`
`

`

`RICE UNIVERSITY
`
`Architectural, Numerical and Implementation
`Issues in the VLSI Design of an Integrated
`CORDIC-SVD Processor
`
`by
`
`Kishore Kota
`
`A THESIS SUBMITTED
`IN PARTIAL FULFILLMENT OF THE
`REQUIREMENTS FOR THE DEGREE
`
`Master of Science
`
`APPROVED, THESIS COMMITTEE:
`
`aJwatt R. Cavallaro, Chairman
`
`Assistant Professor
`Electrical and Computer Engineering
`K Bett
`John K. Bennett,
`Drj
`Assistant Professor
`Electrical and
`Computer Engineering
`
`
`
`Associate Professor
`Electrical and Computer Engineering
`
`Houston, Texas
`
`April, 1991
`
`Page 5 of 106
`
`Page 5 of 106
`
`

`

`Architectural, Numerical and Implementation
`Issues in the VLSI Design of an Integrated
`CORDIC-SVD Processor
`
`Kishore Kota
`
`Abstract
`
`This thesis describes the design of a systolic array for computing the Singular Value
`
`Decomposition (SVD) based on the Brent, Luk, Van Loan array. The use of COor-
`
`dinate Rotation Dlgital Computer (CORDIC) arithmetic results in an efficient VLSI
`
`implementation of the processor that forms the basic unit of the array. A six-chip
`
`custom VLSI chip set for the processor was initially designed, fabricated in a 2.0
`
`CMOS n-well process, and tested. The CORDIC Array Process Element (CAPE), a
`
`single chip implementation, incorporates several enhancements based on a detailed
`
`error analysis of fixed-point CORDIC. The analysis indicates a need to normalizein-
`
`put values for inverse tangent computations. This scheme was implemented using a
`
`novel method that has O(n'*) hardware complexity. Use of previous techniques to
`
`implement such a normalization would require O(n?) hardware. Enhanced architec-
`
`tures, which reduce idle time in the array either through pipelining or by improving
`
`on a broadcast technique, are also presented.
`
`Page 6 of 106
`
`Page 6 of 106
`
`

`

`Acknowledgments
`
`Special thanks are due to Dr. Joe Cavallaro for being the motivation and a constant
`
`driving force behind this work.
`
`I am especially thankful to him for being a good
`
`friend and just being there when I needed him most. I would also like to thank Dr.
`
`John Bennett and Dr. Peter Varman for serving on the committee and having the
`
`patience to deal with me.
`
`Myfriend and constant companion “yo” Hemkumar deserves special praise for
`
`all those enlightening discussions and valuable comments, which on more than one
`
`occasion prevented me from taking the wrong decisions. Thanks are due to my
`
`roommate, Raghu, for his constant, but often futile efforts at waking me up. But for
`
`him I would have slept through the entire semester without a care. Thanks to the
`
`Common Pool for allowing me to work through even the most difficult times without
`
`compromising on the food. Special thanks to my office-mates Jay Greenwood and
`Jim Carson. The office would not be half as lively without them.
`
`Finally I am eternally indebted to my mother and father. Even when they are
`
`halfway across the world, nothing is possible without their blessings.
`
`- Page 7 of 106
`
`Page 7 of 106
`
`

`

`To my Grandfather,
`
`For his one statement that has been the driving force through the years:
`
`“|. hope for the best, but be prepared for the worst...”
`
`Page 8 of 106
`
`Page 8 of 106
`
`

`

`Contents
`
`Abstract
`
`Acknowledgments
`
`|
`
`List of Tables
`
`List of Illustrations
`
`Introduction
`
`ii
`
`iti
`
`viii
`
`ix
`
`1
`
`1.1
`
`Systolic ArraysforSVD 2... ee ee ee 2
`
`1.2 Contributions of the thesis... 2... ee ee ee eee
`
`1.3 Overview of the thesis
`
`.. 0... 0... eee ee ee ee
`
`SVD Array Architecture
`
`2.1
`
`Introduction... ..... ee ee ee ee ee
`
`3
`
`3
`
`5
`
`5
`
`2.2 TheSVD-Jacobi Method ........... 0.0. ee eee eee 6
`
`2.3 Direct 2-Angle Method ... 1... 2. eee ee ee es
`
`2.4 Architecturefor2x2SVD......... ee eee ee ee ee ee es
`
`CORDIC Techniques
`
`8
`
`12
`
`14
`
`3.1
`
`Introduction... .. ee ee es 14
`
`3.2 CORDIC Algorithms .. 0... ee ee ee ee ees
`
`3.3 CORDIC Operation in the Circular Mode
`
`.. 1.2... 20+ eevee
`
`15
`
`16
`
`3.3.1 CORDIC Z-Reduction .. 0... 0... 2 ee eee ee ee 19
`
`~ Page 9 of 106
`
`Page 9 of 106
`
`

`

`3.3.2
`
`CORDIC Y-Reduction .... 2.2.0... 0. eee ee eee
`
`3.3.3 Modified Y-Reduction for the SVD processor. .........
`
`3.4
`
`Scale Factor Correction... . 0. ee ee ee ee
`
`3.5 Error Analysis of CORDIC Iterations ............22000
`
`3.6 Error Analysis of CORDIC Z-Reduction ...........0. 0005
`
`3.6.1 Error Introduced by X and Y Iterations ............
`
`3.6.2 Error Introduced by Z Iterations
`
`...........2.2048.%
`
`3.6.3
`
`Effects of perturbations in@ ...... 0. ee ee
`
`3.7 Error Analysis of CORDIC Y-Reduction ..........-.0 0005
`
`3.7.1 Error Introduced by X and Y Iterations ............
`
`3.7.2 Error Introduced by Z Iterations ....... 0.00. eens
`
`3.7.3 Perturbation in the Inverse Tangent
`
`..........200-
`
`3.8 Normalization for CORDIC Y-Reduction...............,
`
`3.8.1
`
`Single Cycle Normalization... ............00005
`
`3.8.2 Partial Normalization... ..........---2--4008,
`
`3.8.3 Choice of a Minimal Set of Shifts ..............24.
`
`3.8.4 Cost of Partial Normalization Implementation .........
`
`vi
`
`20
`
`22
`
`23
`
`23
`
`25
`
`26
`
`28
`
`29
`
`30
`
`31
`
`32
`
`32
`
`33
`
`35
`
`36
`
`39
`
`43
`
`3.9 Summary 2... . ec ee te ee eee
`
`43
`
`4 The CORDIC-SVD Processor Architecture
`
`46
`
`4.1
`
`Introduction. 2... ee ee ee
`
`46
`
`4.2 Architecture of the Prototype ............0. 00000 e eae
`
`4.2.1 Designofthe XY-Chip....... ce ue ue renee ewes
`
`4.2.2 Design of the Z-Chip .... 1... eee eee ee et ee
`
`4.2.3 Design of the Intra-Chip ..... 2... 00 eee eee ee
`
`4.2.4 Design of the Inter-Chip .........0 0000 eee eeee
`
`46
`
`48
`
`50
`
`50
`
`54
`
`Page 10 of 106
`
`Page 10 of 106
`
`

`

`4.3 The CORDIC Array Processor Element..............+.-.
`
`4.4
`
`Issues in Loading the Array . 1... eee ee ee ee
`
`4.4.1 Data Loading in CAPE............ 0.000202 eee
`
`5 Architectures to Improve Processor Utilization
`5.1
`Introduction. 2... ee eee
`
`5.2 Architecture for Systolic Starting 2... ..... ee eee eee
`
`5.3 Architecturefor fast SVD .. 1... ee ee ee ee te et ee
`
`5.4 Architecture for Collection of U and V Matrices ........00 00s
`
`6 VLSI Implementation Issues
`
`6.1 Design Methodology ...... 0... eee ee ee ee es
`
`6.2 Reducing Skews in Control Signals 2.1... .. eee eee eens
`
`6.3 Design ofan Adder... 1... ee ee es
`
`Vil
`
`55
`
`60
`
`61
`
`64
`
`64
`
`65
`
`70
`
`73
`
`75
`
`75
`
`18
`
`79
`
`6.4 Testing... 1... ee ee ee ee ee ees
`
`80
`
`7 Conclusions
`
`7.1 Summary ..we ee ee ee
`
`7.2 Future Work... 2... ee te ee ees
`
`Bibliography
`
`85
`
`85
`
`87
`
`88
`
`: Page 11 of 106
`
`Page 11 of 106
`
`

`

`Tables
`
`2.1
`
`Parallel Ordering ©... ce ee ee ee es
`
`8
`
`3.1 CORDIC Functionality in the Circular Mode. ..........004.
`
`3.2 Total Shifts Using a Novel Normalization Scheme ...........
`
`19
`
`41
`
`3.3 Summary and Comparison of Error Analysis ...........005 44
`
`6.1 A list of some of the problemsin the 6-chip prototype.........
`
`81
`
`6.2 Breakdown of clock cycles for the various sub-tasks in the 6-chip
`
`prototype 2.ee 83
`
`6.3 Breakdown of clock cycles for the various sub-tasks in CAPE .....
`
`83
`
`Vili
`
`. Page 12 of 106
`
`Page 12 of 106
`
`

`

`Illustrations
`
`2.1
`
`2.2
`
`The Brent-Luk-Van Loan SVD Array... .. 0.2.0... 00000048
`
`Snapshots of Brent-Luk-Van Loan SVD Array ..........0.04
`
`9
`
`ll
`
`3.1
`
`Rotation of a vector 6... ee
`
`3.2
`
`Implementation of the CORDIC Iterations in Hardware........
`
`18
`
`3.3
`
`Rotation of a vector in Y-Reduction. .. 6... eevee eee es
`
`3.4
`
`Error in Y-Reduction Without Normalization ..........00..
`
`3.5
`
`3.6
`
`Algorithm to Invoke Modified CORDIC Iterations ...........
`
`Implementation of the CORDICIterations in Hardware........
`
`3.7
`
`Error in Y-Reduction With Partial Normalization ...,......0..
`
`21
`
`34
`
`40
`
`4]
`
`42
`
`4.1
`
`4.2
`
`Block Diagram of the 6-chip Prototype CORDIC-SVD Processor .
`
`.
`
`.
`
`47
`
`Die Photograph of the XY-Chip .. 2... 6. ee ee ee ee es
`
`4.3
`
`Die Photograph of the Z-Chip .. 1... ce ee ee eee ee ee
`
`4.4
`
`Block Diagram of the Intra-Chip
`
`.. 1... 0. ee eee ee eens
`
`4.5
`
`4.6
`
`Die Photograph of the Intra-Chip .. 1... 2... - eee eee eee
`
`Block Diagram of Inter-Chip... 0... ee ee eee ee eee
`
`4.7
`
`Die Photograph of the Inter-Chip ........ 0... 00 e uae
`
`4.8
`
`Internal Architectureof CAPE... 1.1... eee eee ees
`
`ix
`
`
`
`Page 13 of 106
`
`49
`
`51
`
`52
`
`53
`
`55
`
`56
`
`58
`
`Page 13 of 106
`
`

`

`4.9
`
`A Combined DSP-VLSI Array for Robotics... .. ‘beeen wanes
`
`62
`
`5.1
`
`5.2
`
`5.3
`
`5.4
`
`5.5
`
`6.1
`
`6.2
`
`Interconnection of Control Signals for Systolic Starting ........
`
`Interconnection of Data Lines for Systolic Starting. ..........
`
`Snapshots of a Systolic Starting scheme... ....... 2.0000 ee
`
`66
`
`67
`
`69
`
`Hardware Required for Fast SVD .. 2... 2... ee ee ee eee 71
`
`Snapshots of an Array Computing U andV ..........-44.
`
`73
`
`Plot of the completed Single Chip Implementation. ..........
`
`77
`
`Clock Generation with Large Skew ...... 0. eee uuu eee 78
`
`6.3
`
`Clock Generation with Small Skew ..........0. 000 eee 78
`
`Page 14 of 106
`
`Page 14 of 106
`
`

`

`Chapter 1
`
`Introduction
`
`Rapid advances made in VLSI and WSI technologies have led the way to an entirely
`
`different approach to computerdesignforreal-time applications, using special-purpose
`
`architectures with custom chips. By migrating some of the highly compute-intensive
`
`tasks to special-purpose hardware, performance which is typically in the realm of
`
`supercomputers, can be obtained at only a fraction of the cost. High-level Computer-
`
`Aided Design (CAD) tools for VLSI silicon compilation, allow fast prototyping of
`
`custom processors by reducing the tedium of VLSI design. Special-purpose architec-
`
`tures are not burdened by the problemsassociated with general computers and can
`
`map the algorithmic needs of a problem to hardware. Extensive parallelism, pipelin-
`
`ing and use of special arithmetic techniques tailored for the specific application lead
`
`to designs very different from conventional computers.
`
`Systolic and wavefront arrays form a class of special-purpose architectures that
`
`hold considerable promise for real-time computations. Systolic arrays are character-
`
`ized by “multiple use of each data item, extensive concurrency, a few types of simple
`
`cells and a simple and regular data and controlflow between neighbors” [21]. Systolic
`
`arrays typically consist of only a few types of simple processors, allowing easy proto-
`
`typing. The near-neighbor communication associated with systolic arrays results in
`
`short interconnections that allow higher operating speeds even in large arrays. Thus,
`
`systolic arrays offer the potential for scaling to very large arrays. Numerical problems
`
`. Page 15 of 106
`
`Page 15 of 106
`
`

`

`involving large matrices benefit from this property of systolic arrays.
`
`Manyreal-time signal processing, image processing and robotics applicationsre-
`
`quire fast computations involving large matrices. The high throughput required in
`
`these applications can in general, be obtained only through special-purpose archi-
`
`tectures. A computationally complex numerical problem, which has evoked a lot of
`
`interest is the Singular Value Decomposition (SVD) [15].
`
`It is generally acknowl-
`
`edged that the SVD is the only generally reliable method for determining the rank
`
`of a matrix numerically. Solutions to the complex problems encountered in real-time
`
`processing, which use the SVD,exhibit a high degree of numerical stability. SVD tech-
`
`niques handle rank deficiency andill-conditioning of matrices elegantly, obviating the
`
`need for special handling of these cases. The SVD is a very useful tool, for example,
`
`in analyzing data matrices from sensor arrays for adaptive beamforming [30], and low
`
`rank approximations to matrices in image enhancement [2]. The wide variety of ap-
`
`plications for the SVD coupled with its computational complexity justify dedicating
`
`hardware to this computation.
`
`1.1
`
`Systolic Arrays for SVD
`
`The SVD problem has evoked a lot of attention in the past decade. Numerousarchi-
`
`tectures and algorithms were proposed for computing the SVD in an efficient manner.
`
`The Hestenes’ method using Jacobi transformations has been a popular choice among
`
`researchers, due to the concurrencyit allows in the computation of the SVD. Numerous
`
`algorithms and architectures for the SVD were proposed by Luk [24, 26, 25, 23]. Brent,
`
`Luk and Van Loan [5] presented an expandable square array of simple processors to
`
`compute the SVD. A processor architecture for the Brent, Luk and Van Loan array,
`
`using CORDIC techniques, was presented by Cavallaro [8].
`
`Page 16 of 106
`
`Page 16 of 106
`
`

`

`1.2 Contributions of the thesis
`
`A detailed study of the numerical accuracy of fixed-point CORDIC modules is provided
`
`in this thesis. This analysis indicates a need to normalize the values for CORDIC Y-
`
`reduction in order to achieve meaningful results. A novel normalization scheme for
`
`fixed-point CORDIC Y-reduction that reduces the hardware complexity is developed.
`
`This thesis is chiefly concerned with the issues relating to the VLSI implemen-
`
`tation of the CORDIC-SVD array proposed by Cavallaro. Many of the lower-level
`
`architectural issues that are not relevant at the higher levels assume a new signifi-
`
`cance when trying to implement the array. Elegant solutions have been foundforall
`
`the problems encountered in the design process.
`
`An integrated VLSI chip to implement the 2 x 2 SVD processor element, which
`
`serves as the basic unit of the SVD array, has been designed. A new internal archi-
`
`tecture has been developed within the constraints imposed by VLSI.
`
`Numerous array architectures that improve on the previous architectures were
`
`developed as part of this research. These schemes can used in future implementations
`
`as improvements to the current design.
`
`1.3. Overview of the thesis
`
`Chapter 2 presents the SVD algorithm and the array proposed by Brent, Luk and Van
`
`Loan [5]. A discussion of the CORDIC-SVD processor is also provided. Chapter 3
`
`reviews the CORDIC arithmetic technique and then presents an improved analysis
`
`of the numerical accuracy of fixed-point CORDIC. Chapter 4 presents the improve-
`
`ments made to the architecture of the single chip VLSI implementation of the SVD
`
`processor over the various sub-units that constitute the prototype SVD processor.
`
`Chapter 5 presents improvements to the array architecture and the impact of these
`
`Page 17 of 106
`
`Page 17 of 106
`
`

`

`on the processor design. These improvements include schemes for systolic starting
`
`and architectures that reduce idle time in the array, to achieve higher throughput.
`
`~Page 18 of 106
`
`Page 18 of 106
`
`

`

`Chapter 2
`
`SVD Array Architecture
`
`2.1
`
`Introduction
`
`This chapter introduces the principles that govern the efficient computation of the
`
`Singular Value Decomposition (SVD)in a parallel array. A major portion of the work
`in this thesis is based on the Brent, Luk and Van Loan [5] array for computing the
`
`SVD. Later sections of this chapter provideinsight into the operation ofthis array.
`
`The SVD has proved to be an important matrix decomposition with theoretical
`
`and practical significance. The canonical representation of a matrix provided by the
`
`SVD is very useful in determining its properties. The SVD [15] of a p x p matrix A is
`
`defined as
`
`where,
`
`A=ULV?,
`
`(2.1)
`
`U and V are orthogonal matrices of dimension p x p, and
`
`& = diag(o1,02,-++, ap) is a diagonal matrix of singular values.
`
`The columnsof U andVare called the left and the right singular vectors respectively.
`
`The SVD is a compute intensive operation requiring O(p*) time on a sequential
`
`computer, For example, the SVD of an 8 x 8 matrix on a SUN 3/60 requires about
`
`1 second of CPU time using LINPACK [12], a library of linear algebra routines. A
`
`typical real-time application, the inverse kinematics engine of a robot, requires the
`
`- Page 19 of 106
`
`Page 19 of 106
`
`

`

`computation of the SVD of an 8 x 8 matrix every 400ysec. This is precisely the class
`
`of applications where the use of dedicated arrays is attractive. With a square array
`
`of processors it is possible to reduce the time complexity for the SVD to O(plog p).
`
`2.2 The SVD-Jacobi Method
`
`The Jacobi method to compute the SVD is a popular algorithm for parallel imple-
`
`mentation, due to the concurrency it introduces to the problem. Thealgorithm is
`
`based on the use of orthogonal transformations, called Jacobi Rotations *
`
`to selec-
`
`tively reduce a given matrix element to zero. A Jacobi rotation by an angle @ in the
`
`ij plane, is a p x p matrix, J(2, 7,9), where,
`
`Jaa = 1
`
`ji = cos,
`
`V(a # 23),
`tI
`
`sin6 ,
`
`fiz
`
`(2.2)
`
`jit = —sind,
`
`jj; = cosé,
`
`and all other j,, = 0.
`
`Intuitively, a Jacobi rotation is an identity matrix with four of its elements replaced
`
`by a vector rotation matrix as shown below:
`
`J(i,j,0) =
`(155,9)
`
`0
`
`~— sin @
`
`:
`cos @
`
`0]
`
`7
`
`2.3
`(2.3)
`
`0
`J
`i
`The parameters i and jselect the matrix element that is to be reduced to zero,
`
`0
`
`0
`
`1
`
`while the parameter @ is computed using a few matrix elements. Since very few
`
`lalso called Givens Rotations or simply plane rotations
`
`- Page 20 of 106
`
`Page 20 of 106
`
`

`

`matrix elements are required to compute the entire transformation, local computation
`
`is possible by storing all the required data in the registers of a single processor.
`
`In
`
`addition, pre-multiplication with a Jacobi rotation matrix, called left-sided rotation,
`
`modifies only the i** and j** rows, while a right-sided rotation affects only the it*
`
`and 7** columns. This introduces concurrency, since computations which use the
`
`unmodified data elements can be performed in parallel.
`
`The SVDalgorithm consists of a sequence of orthogonal ? Jacobi transformations,
`
`chosen to diagonalize the given matrix. The algorithm to compute the SVDis iterative
`
`and can be written as
`
`Ao = A,
`
`Angry = UPAV, = Juli, 7,01)?Ande(i, J, Or)»
`
`(2.4)
`
`(2.5)
`
`Appropriate choice of the 2-sided rotations can force A; to converge to the diagonal
`
`matrix of singular values £. The matrix of left singular vectors is obtained as a
`
`product of the left rotations U;,, while the matrix of right singular vectors is obtained
`
`as a productof the right-sided rotations. At each iteration k,a left rotation of 4 and
`
`a right rotation of 9, in the plane 7j are chosen to reduce the twooff-diagonal elements
`
`a;; and a;; of A, to zero. A sweep consists of n(n — 1)/2 iterations that reduce every
`
`pair of off-diagonal elements to zero. Several sweeps are required to actually reduce
`
`the initial matrix to a diagonal matrix. The sequence, in which the off-diagonal pairs
`
`are annihilated within each sweep is chosen apriori according to some fixed ordering.
`
`Forsythe-Henrici [14] proved that the algorithm converges for the cyclic-by-rows and
`
`cyclic-by-columns ordering. However,
`
`these orderings do not allow decoupling of
`
`successive iterations, thereby restricting the parallelism achievable. Brent and Luk [4]
`
`introduced the parallel ordering that does not suffer from this drawback.
`
`24 matrix is orthogonal when its transpose equals the inverse
`
`
`
`Page 21 of 106
`
`Page 21 of 106
`
`

`

`(1,2)
`(1,4)
`(1,6)
`(1,8)
`(1,7)
`(1,5)
`(1,3)
`
`(3,4)
`(2,6)
`(4,8)
`(6,7)
`(8,5)
`(7,3)
`(5,2)
`
`(5,6)
`(3,8)
`(2,7)
`(4,5)
`(6,3)
`(8,2)
`(7,4)
`
`(7,8)
`(5,7)
`(3,5)
`(2,3)
`(4,2)
`(6,4)
`(8,6)
`
`Table 2.1: Parallel Ordering data exchange with 8 elements. This is the
`ordering used when eight chess players have to play a round-robin tour-
`nament, each player playing one gamea day.
`
`An iteration that annihilates a1, and az does not affect any of the terms required
`
`for annihilating a33 and a4q, allowing the two sets of rotation angles to be computed
`
`concurrently. Thus parallel ordering allows decoupling of the computation of the
`
`Jacobi transformationsofall the iterations in a single row of the ordering (Table 2.1).
`
`Brent, Luk and Van Loan [5] proposed a square array of processors based on this
`
`scheme. The ordering has an additional property that it requires only near-neighbor
`
`communication in an array and is hence amenable to systolic implementation.
`
`2.3 Direct 2-Angle Method
`
`The Brent, Luk and Van Loan systolic array [5] consists of a p/2 x p/2 array of
`
`processors (Figure 2.1) to compute the SVD of a px p matrix. The matrix dimension,
`
`p, is assumed to be even. Each processor stores a 2 x 2 submatrix and has the ability
`
`to compute the SVD of a 2 x 2 matrix. The special structure of the Jacobi rotation
`
`matrix allows computing the angles for a p x p two-sided rotation in termsof a basic
`
`2 x 2 rotation. The application of the transformation on a p x p matrix can also be
`
`expressed as a set of 2 x 2 transformations. Thus the 2 x 2 SVD formsthe basic step
`
`for the p x p SVD.
`
`Page 22 of 106
`
`Page 22 of 106
`
`

`

`
`
`Figure 2.1: The Brent-Luk-Van Loan SVD Array for Computing the SVD
`of an 8 x 8 Matrix
`
`Page 23 of 106
`
`Page 23 of 106
`
`

`

`A 2x2 SVD can be described as
`
`R(O1)"
`
`a
`
`b
`
`
`od RG) =
`
`10
`
`(2.6)
`
`vi
`
`0
`
`0
`
`|
`
`where 9 and 6, are the left and right rotation angles, respectively. The rotation
`
`matrix is
`
`and the input matrix is
`
`R(0) =
`
`cos@ sin#@
`
`-—sin@ cos
`
`;
`
`e(]
`
`(2.7)
`
`The angles 6; and 6, necessary to diagonalize the matrix M can be shown to be
`
`the inverse tangents of the data elements of W:
`
`
`
`4, + A = tan7! [=] ’
`6,—0; = tan =|
`
`d—a
`
`(2.8)
`
`Theleft and right rotation angles in equation 2.5 can be computed using the
`ay
`ai;
`
`equations 2.8on a 2 x2 matrix formed by |
`
`| It may benoted that thep/2
`
`aj, 3;
`diagonal processors store precisely the matrices required for p/2 successive iterations,
`
`correspondingto a single row of parallel ordering in Table 2.1. All the transformations
`
`that annihilate the off-diagonal elements in the diagonal processors are independent
`
`and hence can be computed in parallel. An update of the entire array requires theleft
`
`angles to be propagated from a diagonal processor to all the processors on the same
`
`row, and the right angle to be propagated to all the processors on the same column.
`
`This propagation is systolic. The update of the matrix is followed by a permutation
`
`of the rows and the columnsof the matrix according to the parallel ordering. An 8-
`
`connected meshis required to perform this data exchange. The permutation results in
`
`the next set of sub-matrices, required to compute the new set of angles, to be formed
`
`Page 24 of 106
`
`Page 24 of 106
`
`

`

`ll
`
`
`
`Figure 2.2: Snapshots of the Brent-Luk-Van Loan SVD Array Showing
`Diagonal Wavesof Activity
`
`at the diagonal processors. Since the diagonal processors require data only from their
`
`diagonal neighbors, new computation can start as soon as these neighbors complete
`
`their portion of the computation. Thus, the updates of the matrix, corresponding
`
`to separate iterations in equation 2.5 are pipelined. This leads to diagonal waves of
`
`activity originating at the main diagonal and propagating outwards. This is shown
`
`in Figure 2.2. The various steps in the operation of this array are listed below:
`
`e The p/2 diagonal processors compute the required transformation angles using
`
`the 2 x 2 sub-matrix stored in them.
`
`e Each diagonal processor propagates the left angle, which it computed, to its east
`
`and west neighbor. Similarly the right rotation angles are propagated north and
`
`south,
`
`Page 25 of 106
`
`Page 25 of 106
`
`

`

`12
`
`e The non-diagonal processors receive the left and right angles and apply the 2-
`
`sided transformation to their sub-matrices. After the computation, they forward
`
`the left and right rotation angles to the processor next in the row or column.
`
`e Processor P22 in Figure 2.1 receivesits next set of data elements from processors
`
`P11, P33, P13 and P31. Hence, it has to wait till processors P31 and P13 finish
`
`their computations. Similarly, all the other diagonal processors need to wait
`
`for their diagonal neighbors to complete the transformations before exchanging
`
`data. Exchangeof data in different parts of the array is staggered in time.
`
`e After the data exchange, a new set of angles is computed by the diagonal pro-
`
`cessors and the same sequence of operations is repeated.
`
`Each diagonal processor requires (p-1) iterations to complete a sweep. The number
`
`of sweeps required for convergence is O(log p). For most matrices up to a size of
`
`100 x 100, the algorithm has been observed to converge within 10 sweeps. Since
`each iteration for a diagonal processor is completed in constant time, the total SVD
`requires O(plog p).
`
`2.4 Architecture for 2x 2SVD
`
`The array proposed by Brent, Luk and Van Loan was converted into an architecture
`
`suitable for VLSI implementation by Cavallaro and Luk [8]. By using CORDICarith-
`
`metic, a simple architecture suitable for systolic implementation was developed. The
`
`basic functionality required in each processoris:
`
`e Ability to compute inverse tangents is required for the computation of the 2-
`
`sided transformations from matrix data,
`
`Page 26 of 106
`
`Page 26 of 106
`
`

`

`13
`
`e Ability to perform 2-sided rotations of a 2 x 2 matrix that is required when
`
`performing the transformation,
`
`e Control to perform the parallel ordering data exchange.
`
`CORDIC allowsefficient implementation of the required arithmetic operations in
`
`hardware. Chapter 3 discusses the CORDIC algorithm in depth. The architecture of
`
`the individual processor element is discussed in Chapter 4.
`
`Page 27 of 106
`
`

`

`Chapter 3
`
`CORDIC Techniques
`
`3.1
`
`Introduction
`
`In a special-purpose VLSI processor it is not necessary to include a general purpose
`
`Arithmetic and Logic Unit (ALU). An efficient design with optimal area and time
`
`complexity can be obtained by using special arithmetic techniques that map the
`
`desired computations to simplified hardware. The CORDIC technique allowsefficient
`
`implementation of functions, like vector rotations and inverse tangents, in hardware.
`
`These constitute the principal computations in the SVD algorithm. Additional ALU
`
`operations (addition, subtraction and divide-by-2) required for the SVD are more
`
`basic operations and can reuse the adders andshifters that constitute the CORDIC
`
`modules.
`
`The Coordinate Rotation Digital Computer (CORDIC) technique wasinitially de-
`
`veloped by Volder [35] as an algorithm to solve the trignometric relationships that
`
`arise in navigation problems. Involving only a fixed sequence of additions or subtrac-
`
`tions, and binary shifts, this scheme was used to quickly and systematically approx-
`
`imate the value of a trignometric function or its inverse. This algorithm was later
`
`unified for several elementary functions by Walther [36]. The algorithm has been used
`
`extensively in a numberofcalculators [16] to perform multiplications, divisions, to
`
`calculate square roots, to evaluate sine, cosine, tangent, arctangent, sinh, cosh, tanh,
`
`arctanh, In and exp functions, and to convert between binary and mixed radix num-
`
`14
`
`Page 28 of 106
`
`Page 28 of 106
`
`

`

`15
`
`ber systems [10].
`
`3.2 CORDIC Algorithms
`
`A variety of functions are computed in CORDIC by approximating a given angle
`
`in terms of a sequence of fixed angles. The algorithm for such an approximation
`
`is iterative and always converges in a fixed numberofiterations. The basic result
`
`of CORDIC concerns the convergence of the algorithm and is discussed in depth by
`
`Walther [36]. The convergence theorem states the conditions that affect convergence
`
`and essentially gives an algorithm for such an approximation. The theorem is restated
`
`here. Walther [36] gives the proof for this theorem.
`
`Theorem 3.1 (Convergence of CORDIC) Suppose
`
`Yo 2 ¥1 2 $2 2°+-2 Yn > 0,
`
`is a finite sequence of real numbers such that
`
`n=1
`gis DY) opt n1 for0Sign—l,
`jeitl
`
`and supposer is a real number such that
`n-1
`Irs oe;
`j=0
`
`If so = 0, and si41 = 3; + d:y;, for 0 <i <n —1, where
`
`1,
`
`ifr>s3;
`
`-l,
`
`ifr<4s;,
`
`then,
`
`n-1
`lr — se] S$ D0 ps +on-1, forOSksn-l,
`j=k
`and so in particular |r — sn] < ppt.
`
`~ Page 29 of 106
`
`Page 29 of 106
`
`

`

`16
`
`If js represent angles, the theorem provides a simple algorithm for approximating
`
`any angler, in termsof a set of fixed angles y;, with a small error (~,_1. The versatility
`
`of the CORDIC algorithm allows its use in several different modes:
`
`linear, circular
`
`and hyperbolic. The principal mode of interest in the CORDIC-SVD processoris the
`
`circular mode of CORDIC, duetoits ability to compute vector rotations and inverse
`
`tangents efficiently.
`
`3.3>CORDIC Operation in the Circular Mode
`
`The anticlockwise rotation of a vector [Z;, 9; 7 through an angle a; (Figure 3.1)
`
`is given by
`
`Eigy
`
`cosa;
`
`sina;
`
`—sina;
`Fiat
`The same relation can be rewritten as
`
`cosa;
`
`=
`
`Z;
`
`OF
`
`.
`
`(3.1)
`
`
`
`
`
`Ti41 SEC A; tana;||z;1
`=
`.
`(3.2)
`
`Fi-1 SEC Q;
`
`— tan a;
`
`1
`
`Vi
`
`All computations in the circular mode of CORDIC involve a fized number ofiter-
`
`ations, which are similar to equation 3.2, but for a scaling,
`
`Tit1
`
`Vir
`
`x; + 6;y; tan a;
`
`y; — 6:2; tana;,
`
`(3.3)
`
`(3.4)
`
`where 2; and y; are states of variables x and y at the start of the 7 iteration,
`
`0<2z<n,and 6;
`
`€ {-1,+1}. The values z, and y, are the values obtained at the
`
`end of the CORDIC iterations and differ from the desired values Z, and g, due to a
`
`scaling as shown in Figure 3.1. Along with the z and y iterations, iterations of the
`
`form
`
`Zinn = 2; + 6:0;
`
`(3.5)
`
`Page 30 of 106
`
`Page 30 of 106
`
`

`

`17
`
`(Xy, Yn)
`
`Figure 3.1: Rotation of a vector
`
`are performed on a variable z to store different states of an angle. The angles a; are
`
`chosen to be tan~!(2-'), which reduces equation 3.3 and equation 3.4 to
`
`Ti+. = ay + 6:y;27'
`
`Yinr = Ye — 52,274
`
`(3.6)
`
`(3.7)
`
`These iterations, called the z and the y iterations, can be implemented as simple
`
`additions or subtractions, and shifts. The angles a; are stored in a ROM,allowing the
`
`z iterations, given by equation 3.5, to be implemented as additions or subtractions.
`
`The actual hardware required to implement these iterations in hardware is shown in
`
`Figure 3.2.
`
`Rotation of a vector through any angle @ is achieved by decomposing the angle
`
`as a sum of ajs, using the CORDIC convergence theorem, and rotating the initial
`
`vector through this sequence of angles. Different functions are implemented by a
`
`different choice of 6; at each iteration. If 6;
`
`is chosen to force theinitial zg to 0, it
`
`~Page 31 of 106
`
`Page 31 of 106
`
`

`

`18
`
`
`
`Control
`Unit
`PLA
`
`Figure 3.2: Implementation of the CORDIC Iterations in Hardware
`
`: Page 32 of 106
`
`Page 32 of 106
`
`

`

`|Operation|for Variables Final values for variables
`
`.
`
`Functions
`
`Y-Reduction
`
`
`
`.
`
`Initial Values
`
`.
`
`19
`
`Z-Reduction
`
`=
`
`Ip = (x9 cos 0 + yocos9)/Ky
`Yn =
`(—aosin@ + yocos 6)/K,,
`(
`Zn = O
`
`Vector
`Rotations,
`
`sine, cosine
`
`= 03 + y3/Kn
`= 0
`
`=
`
`= tan7? (#To
`
`)
`
`Inverse
`
`Tangents
`
`Table 3.1: CORDIC Functionality in the Circular Mode
`
`is termed Rotation or Z-reduction. This is used to perform efficient vector rotations
`
`given an initial angle. A choice of 6;s to reduce the initial yo to 0 is called Vectoring
`
`or Y-reduction. Iterations of this form allow the computation of the inverse tangent
`
`function. Table 3.1 summarizes the various operations performed in the CORDIC
`
`circular mode, as they pertain to the SVD processor.
`
`3.3.1 CORDIC Z-Reduction
`
`Given an initial vector [%0, fo]” = [o, yol” and an angle zo = 0, through which to
`
`rotate, the ro

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket