INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

University Microfilms International
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700  800/521-0600
SAMSUNG EXHIBIT 1042
Order Number 1346319

Architectural, numerical and implementation issues in the VLSI design of an integrated CORDIC-SVD processor

Kota, Kishore, M.S.

Rice University, 1991

U-M-I
300 N. Zeeb Rd.
Ann Arbor, MI 48106
RICE UNIVERSITY

Architectural, Numerical and Implementation Issues in the VLSI Design of an Integrated CORDIC-SVD Processor

by

Kishore Kota

A THESIS SUBMITTED
IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE

Master of Science

APPROVED, THESIS COMMITTEE:

Dr. Joseph R. Cavallaro, Chairman
Assistant Professor
Electrical and Computer Engineering

Dr. John K. Bennett
Assistant Professor
Electrical and Computer Engineering

Dr. Peter Varman
Associate Professor
Electrical and Computer Engineering

Houston, Texas
April, 1991
Architectural, Numerical and Implementation Issues in the VLSI Design of an Integrated CORDIC-SVD Processor

Kishore Kota

Abstract

This thesis describes the design of a systolic array for computing the Singular Value Decomposition (SVD) based on the Brent, Luk, Van Loan array. The use of COordinate Rotation DIgital Computer (CORDIC) arithmetic results in an efficient VLSI implementation of the processor that forms the basic unit of the array. A six-chip custom VLSI chip set for the processor was initially designed, fabricated in a 2.0 μm CMOS n-well process, and tested. The CORDIC Array Processor Element (CAPE), a single chip implementation, incorporates several enhancements based on a detailed error analysis of fixed-point CORDIC. The analysis indicates a need to normalize input values for inverse tangent computations. This scheme was implemented using a novel method that has O(n^1.5) hardware complexity. Use of previous techniques to implement such a normalization would require O(n^2) hardware. Enhanced architectures, which reduce idle time in the array either through pipelining or by improving on a broadcast technique, are also presented.
Acknowledgments

Special thanks are due to Dr. Joe Cavallaro for being the motivation and a constant driving force behind this work. I am especially thankful to him for being a good friend and just being there when I needed him most. I would also like to thank Dr. John Bennett and Dr. Peter Varman for serving on the committee and having the patience to deal with me.

My friend and constant companion Hemkumar deserves special praise for all those enlightening discussions and valuable comments, which on more than one occasion prevented me from making the wrong decisions. Thanks are due to my roommate, Raghu, for his constant, but often futile efforts at waking me up. But for him I would have slept through the entire semester without a care. Thanks to the Common Pool for allowing me to work through even the most difficult times without compromising on the food. Special thanks to my office-mates Jay Greenwood and Jim Carson. The office would not be half as lively without them.

Finally I am eternally indebted to my mother and father. Even when they are halfway across the world, nothing is possible without their blessings.
To my Grandfather,

For his one statement that has been the driving force through the years:

"...hope for the best, but be prepared for the worst..."
Contents

Abstract
Acknowledgments
List of Tables
List of Illustrations

1 Introduction
  1.1 Systolic Arrays for SVD
  1.2 Contributions of the thesis
  1.3 Overview of the thesis

2 SVD Array Architecture
  2.1 Introduction
  2.2 The SVD-Jacobi Method
  2.3 Direct 2-Angle Method
  2.4 Architecture for 2 x 2 SVD

3 CORDIC Techniques
  3.1 Introduction
  3.2 CORDIC Algorithms
  3.3 CORDIC Operation in the Circular Mode
    3.3.1 CORDIC Z-Reduction
    3.3.2 CORDIC Y-Reduction
    3.3.3 Modified Y-Reduction for the SVD processor
  3.4 Scale Factor Correction
  3.5 Error Analysis of CORDIC Iterations
  3.6 Error Analysis of CORDIC Z-Reduction
    3.6.1 Error Introduced by X and Y Iterations
    3.6.2 Error Introduced by Z Iterations
    3.6.3 Effects of perturbations in θ
  3.7 Error Analysis of CORDIC Y-Reduction
    3.7.1 Error Introduced by X and Y Iterations
    3.7.2 Error Introduced by Z Iterations
    3.7.3 Perturbation in the Inverse Tangent
  3.8 Normalization for CORDIC Y-Reduction
    3.8.1 Single Cycle Normalization
    3.8.2 Partial Normalization
    3.8.3 Choice of a Minimal Set of Shifts
    3.8.4 Cost of Partial Normalization Implementation
  3.9 Summary

4 The CORDIC-SVD Processor Architecture
  4.1 Introduction
  4.2 Architecture of the Prototype
    4.2.1 Design of the XY-Chip
    4.2.2 Design of the Z-Chip
    4.2.3 Design of the Intra-Chip
    4.2.4 Design of the Inter-Chip
  4.3 The CORDIC Array Processor Element
  4.4 Issues in Loading the Array
    4.4.1 Data Loading in CAPE

5 Architectures to Improve Processor Utilization
  5.1 Introduction
  5.2 Architecture for Systolic Starting
  5.3 Architecture for fast SVD
  5.4 Architecture for Collection of U and V Matrices

6 VLSI Implementation Issues
  6.1 Design Methodology
  6.2 Reducing Skews in Control Signals
  6.3 Design of an Adder
  6.4 Testing

7 Conclusions
  7.1 Summary
  7.2 Future Work

Bibliography
Tables

2.1 Parallel Ordering
3.1 CORDIC Functionality in the Circular Mode
3.2 Total Shifts Using a Novel Normalization Scheme
3.3 Summary and Comparison of Error Analysis
6.1 A list of some of the problems in the 6-chip prototype
6.2 Breakdown of clock cycles for the various sub-tasks in the 6-chip prototype
6.3 Breakdown of clock cycles for the various sub-tasks in CAPE
Illustrations

2.1 The Brent-Luk-Van Loan SVD Array
2.2 Snapshots of Brent-Luk-Van Loan SVD Array
3.1 Rotation of a vector
3.2 Implementation of the CORDIC Iterations in Hardware
3.3 Rotation of a vector in Y-Reduction
3.4 Error in Y-Reduction Without Normalization
3.5 Algorithm to Invoke Modified CORDIC Iterations
3.6 Implementation of the CORDIC Iterations in Hardware
3.7 Error in Y-Reduction With Partial Normalization
4.1 Block Diagram of the 6-chip Prototype CORDIC-SVD Processor
4.2 Die Photograph of the XY-Chip
4.3 Die Photograph of the Z-Chip
4.4 Block Diagram of the Intra-Chip
4.5 Die Photograph of the Intra-Chip
4.6 Block Diagram of Inter-Chip
4.7 Die Photograph of the Inter-Chip
4.8 Internal Architecture of CAPE
4.9 A Combined DSP-VLSI Array for Robotics
5.1 Interconnection of Control Signals for Systolic Starting
5.2 Interconnection of Data Lines for Systolic Starting
5.3 Snapshots of a Systolic Starting scheme
5.4 Hardware Required for Fast SVD
5.5 Snapshots of an Array Computing U and V
6.1 Plot of the completed Single Chip Implementation
6.2 Clock Generation with Large Skew
6.3 Clock Generation with Small Skew
Chapter 1

Introduction

Rapid advances made in VLSI and WSI technologies have led the way to an entirely different approach to computer design for real-time applications, using special-purpose architectures with custom chips. By migrating some of the highly compute-intensive tasks to special-purpose hardware, performance that is typically in the realm of supercomputers can be obtained at only a fraction of the cost. High-level Computer-Aided Design (CAD) tools for VLSI silicon compilation allow fast prototyping of custom processors by reducing the tedium of VLSI design. Special-purpose architectures are not burdened by the problems associated with general computers and can map the algorithmic needs of a problem to hardware. Extensive parallelism, pipelining and use of special arithmetic techniques tailored for the specific application lead to designs very different from conventional computers.

Systolic and wavefront arrays form a class of special-purpose architectures that hold considerable promise for real-time computations. Systolic arrays are characterized by "multiple use of each data item, extensive concurrency, a few types of simple cells and a simple and regular data and control flow between neighbors" [21]. Systolic arrays typically consist of only a few types of simple processors, allowing easy prototyping. The near-neighbor communication associated with systolic arrays results in short interconnections that allow higher operating speeds even in large arrays. Thus, systolic arrays offer the potential for scaling to very large arrays. Numerical problems involving large matrices benefit from this property of systolic arrays.

Many real-time signal processing, image processing and robotics applications require fast computations involving large matrices. The high throughput required in these applications can, in general, be obtained only through special-purpose architectures. A computationally complex numerical problem which has evoked a lot of interest is the Singular Value Decomposition (SVD) [15]. It is generally acknowledged that the SVD is the only generally reliable method for determining the rank of a matrix numerically. Solutions to the complex problems encountered in real-time processing, which use the SVD, exhibit a high degree of numerical stability. SVD techniques handle rank deficiency and ill-conditioning of matrices elegantly, obviating the need for special handling of these cases. The SVD is a very useful tool, for example, in analyzing data matrices from sensor arrays for adaptive beamforming [30], and low rank approximations to matrices in image enhancement [2]. The wide variety of applications for the SVD coupled with its computational complexity justify dedicating hardware to this computation.
1.1 Systolic Arrays for SVD

The SVD problem has evoked a lot of attention in the past decade. Numerous architectures and algorithms were proposed for computing the SVD in an efficient manner. The Hestenes method using Jacobi transformations has been a popular choice among researchers, due to the concurrency it allows in the computation of the SVD. Numerous algorithms and architectures for the SVD were proposed by Luk [24, 26, 25, 23]. Brent, Luk and Van Loan [5] presented an expandable square array of simple processors to compute the SVD. A processor architecture for the Brent, Luk and Van Loan array, using CORDIC techniques, was presented by Cavallaro [8].
1.2 Contributions of the thesis

A detailed study of the numerical accuracy of fixed-point CORDIC modules is provided in this thesis. This analysis indicates a need to normalize the values for CORDIC Y-reduction in order to achieve meaningful results. A novel normalization scheme for fixed-point CORDIC Y-reduction that reduces the hardware complexity is developed.

This thesis is chiefly concerned with the issues relating to the VLSI implementation of the CORDIC-SVD array proposed by Cavallaro. Many of the lower-level architectural issues that are not relevant at the higher levels assume a new significance when trying to implement the array. Elegant solutions have been found for all the problems encountered in the design process.

An integrated VLSI chip to implement the 2 x 2 SVD processor element, which serves as the basic unit of the SVD array, has been designed. A new internal architecture has been developed within the constraints imposed by VLSI.

Numerous array architectures that improve on the previous architectures were developed as part of this research. These schemes can be used in future implementations as improvements to the current design.
1.3 Overview of the thesis

Chapter 2 presents the SVD algorithm and the array proposed by Brent, Luk and Van Loan [5]. A discussion of the CORDIC-SVD processor is also provided. Chapter 3 reviews the CORDIC arithmetic technique and then presents an improved analysis of the numerical accuracy of fixed-point CORDIC. Chapter 4 presents the improvements made to the architecture of the single chip VLSI implementation of the SVD processor over the various sub-units that constitute the prototype SVD processor. Chapter 5 presents improvements to the array architecture and the impact of these on the processor design. These improvements include schemes for systolic starting and architectures that reduce idle time in the array, to achieve higher throughput.
Chapter 2

SVD Array Architecture

2.1 Introduction

This chapter introduces the principles that govern the efficient computation of the Singular Value Decomposition (SVD) in a parallel array. A major portion of the work in this thesis is based on the Brent, Luk and Van Loan [5] array for computing the SVD. Later sections of this chapter provide insight into the operation of this array.

The SVD has proved to be an important matrix decomposition with theoretical and practical significance. The canonical representation of a matrix provided by the SVD is very useful in determining its properties. The SVD [15] of a p x p matrix A is defined as

    A = U Σ V^T,                                         (2.1)

where U and V are orthogonal matrices of dimension p x p, and Σ = diag(σ_1, σ_2, ..., σ_p) is a diagonal matrix of singular values. The columns of U and V are called the left and the right singular vectors, respectively.

The SVD is a compute intensive operation requiring O(p^3) time on a sequential computer. For example, the SVD of an 8 x 8 matrix on a SUN 3/60 requires about 1 second of CPU time using LINPACK [12], a library of linear algebra routines. A typical real-time application, the inverse kinematics engine of a robot, requires the computation of the SVD of an 8 x 8 matrix every 400 μsec. This is precisely the class of applications where the use of dedicated arrays is attractive. With a square array of processors it is possible to reduce the time complexity for the SVD to O(p log p).
2.2 The SVD-Jacobi Method

The Jacobi method to compute the SVD is a popular algorithm for parallel implementation, due to the concurrency it introduces to the problem. The algorithm is based on the use of orthogonal transformations, called Jacobi Rotations¹, to selectively reduce a given matrix element to zero. A Jacobi rotation by an angle θ in the ij plane is a p x p matrix J(i, j, θ), where

    j_kk = 1,        k ≠ i, j,
    j_ii = cos θ,    j_ij = sin θ,                       (2.2)
    j_ji = −sin θ,   j_jj = cos θ,

and all other j_kl = 0.
Intuitively, a Jacobi rotation is an identity matrix with four of its elements replaced by a vector rotation matrix as shown below:

                   [ 1                                ]
                   [    .                             ]
                   [       cos θ   . . .    sin θ     ]   <- row i
    J(i, j, θ)  =  [         .       .        .       ]              (2.3)
                   [      −sin θ   . . .    cos θ     ]   <- row j
                   [                          .       ]
                   [                             1    ]
The parameters i and j select the matrix element that is to be reduced to zero, while the parameter θ is computed using a few matrix elements. Since very few matrix elements are required to compute the entire transformation, local computation is possible by storing all the required data in the registers of a single processor. In addition, pre-multiplication with a Jacobi rotation matrix, called a left-sided rotation, modifies only the i-th and j-th rows, while a right-sided rotation affects only the i-th and j-th columns. This introduces concurrency, since computations which use the unmodified data elements can be performed in parallel.

¹ Also called Givens Rotations or simply plane rotations.
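The structure of equation 2.2 and the locality property just described can be illustrated directly. The sketch below is not part of the thesis's hardware design; the helper names (`jacobi_rotation`, `matmul`, `transpose`) are illustrative only. It embeds a 2 x 2 vector rotation into an identity matrix and checks that a right-sided rotation touches only columns i and j.

```python
import math

def jacobi_rotation(p, i, j, theta):
    """Build the p x p Jacobi (Givens) rotation J(i, j, theta):
    an identity matrix whose (i,i), (i,j), (j,i), (j,j) entries
    are replaced by a 2 x 2 vector rotation (equation 2.2)."""
    J = [[1.0 if r == c else 0.0 for c in range(p)] for r in range(p)]
    J[i][i] = math.cos(theta)
    J[i][j] = math.sin(theta)
    J[j][i] = -math.sin(theta)
    J[j][j] = math.cos(theta)
    return J

def matmul(A, B):
    p = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(p)) for c in range(p)]
            for r in range(p)]

def transpose(A):
    return [list(row) for row in zip(*A)]
```

A right-sided rotation A·J(i, j, θ) leaves every column other than i and j bit-for-bit unchanged, which is exactly why computations using the unmodified elements can proceed in parallel.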
The SVD algorithm consists of a sequence of orthogonal² Jacobi transformations, chosen to diagonalize the given matrix. The algorithm to compute the SVD is iterative and can be written as

    A_0 = A,                                                      (2.4)
    A_{k+1} = U_k^T A_k V_k = J^T(i, j, θ_l) A_k J(i, j, θ_r).    (2.5)

Appropriate choice of the 2-sided rotations can force A_k to converge to the diagonal matrix of singular values Σ. The matrix of left singular vectors is obtained as a product of the left rotations, while the matrix of right singular vectors is obtained as a product of the right-sided rotations. At each iteration k, a left rotation of θ_l and a right rotation of θ_r in the plane ij are chosen to reduce the two off-diagonal elements a_ij and a_ji of A_k to zero. A sweep consists of n(n − 1)/2 iterations that reduce every pair of off-diagonal elements to zero. Several sweeps are required to actually reduce the initial matrix to a diagonal matrix. The sequence in which the off-diagonal pairs are annihilated within each sweep is chosen a priori according to some fixed ordering. Forsythe and Henrici [14] proved that the algorithm converges for the cyclic-by-rows and cyclic-by-columns orderings. However, these orderings do not allow decoupling of successive iterations, thereby restricting the parallelism achievable. Brent and Luk [4] introduced the parallel ordering that does not suffer from this drawback.
² A matrix is orthogonal when its transpose equals its inverse.
    (1,2)  (3,4)  (5,6)  (7,8)
    (1,4)  (2,6)  (3,8)  (5,7)
    (1,6)  (4,8)  (2,7)  (3,5)
    (1,8)  (6,7)  (4,5)  (2,3)
    (1,7)  (8,5)  (6,3)  (4,2)
    (1,5)  (7,3)  (8,2)  (6,4)
    (1,3)  (5,2)  (7,4)  (8,6)

Table 2.1: Parallel Ordering data exchange with 8 elements. This is the ordering used when eight chess players have to play a round-robin tournament, each player playing one game a day.
An iteration that annihilates a_12 and a_21 does not affect any of the terms required for annihilating a_34 and a_43, allowing the two sets of rotation angles to be computed concurrently. Thus parallel ordering allows decoupling of the computation of the Jacobi transformations of all the iterations in a single row of the ordering (Table 2.1). Brent, Luk and Van Loan [5] proposed a square array of processors based on this scheme. The ordering has an additional property that it requires only near-neighbor communication in an array and is hence amenable to systolic implementation.
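A round-robin schedule of the kind in Table 2.1 can be generated with the classical "circle method": one player stays fixed while the others rotate around the table. The sketch below is illustrative (the function name is mine, and the exact row order and pair orientation it produces need not match Table 2.1), but it realizes the same tournament property: every pair meets exactly once, and every round pairs all players.

```python
def parallel_ordering(n):
    """Round-robin (chess tournament) schedule for n players, n even:
    n-1 rounds, each pairing all n players, every pair meeting once.
    Player 1 stays fixed while the others rotate (the circle method)."""
    others = list(range(2, n + 1))
    rounds = []
    for _ in range(n - 1):
        line = [1] + others                      # seats around the table
        pairs = [(line[k], line[n - 1 - k]) for k in range(n // 2)]
        rounds.append(pairs)
        others = [others[-1]] + others[:-1]      # rotate the movable players
    return rounds
```

For n = 8 this yields 7 rounds of 4 pairs, covering all C(8,2) = 28 pairs, which is exactly the n(n − 1)/2 iterations of one sweep.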
2.3 Direct 2-Angle Method

The Brent, Luk and Van Loan systolic array [5] consists of a p/2 x p/2 array of processors (Figure 2.1) to compute the SVD of a p x p matrix. The matrix dimension, p, is assumed to be even. Each processor stores a 2 x 2 submatrix and has the ability to compute the SVD of a 2 x 2 matrix. The special structure of the Jacobi rotation matrix allows computing the angles for a p x p two-sided rotation in terms of a basic 2 x 2 rotation. The application of the transformation on a p x p matrix can also be expressed as a set of 2 x 2 transformations. Thus the 2 x 2 SVD forms the basic step for the p x p SVD.
Figure 2.1: The Brent-Luk-Van Loan SVD Array for Computing the SVD of an 8 x 8 Matrix
A 2 x 2 SVD can be described as

    R(θ_l)^T [ a  b ] R(θ_r) = [ ψ_1   0  ],             (2.6)
             [ c  d ]          [  0   ψ_2 ]

where θ_l and θ_r are the left and right rotation angles, respectively. The rotation matrix is

    R(θ) = [  cos θ   sin θ ],                           (2.7)
           [ −sin θ   cos θ ]

and the input matrix is

    M = [ a  b ].
        [ c  d ]

The angles θ_l and θ_r necessary to diagonalize the matrix M can be shown to be the inverse tangents of the data elements of M:

    θ_r + θ_l = tan⁻¹[ (c + b) / (d − a) ],
    θ_r − θ_l = tan⁻¹[ (c − b) / (d + a) ].              (2.8)
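The two-angle computation of equations 2.8 can be checked numerically. The sketch below is an illustration only (the function names are mine, and floating-point `atan2` stands in for the processor's CORDIC inverse-tangent units): it recovers θ_l and θ_r from the sum and difference angles and verifies that the two-sided rotation of equation 2.6 annihilates both off-diagonal elements.

```python
import math

def svd_2x2_angles(a, b, c, d):
    """Left and right rotation angles theta_l, theta_r for the 2 x 2
    matrix [[a, b], [c, d]], from the sum and difference inverse
    tangents of equation 2.8."""
    sum_angle = math.atan2(c + b, d - a)      # theta_r + theta_l
    diff_angle = math.atan2(c - b, d + a)     # theta_r - theta_l
    theta_l = (sum_angle - diff_angle) / 2.0
    theta_r = (sum_angle + diff_angle) / 2.0
    return theta_l, theta_r

def two_sided_rotation(a, b, c, d, theta_l, theta_r):
    """R(theta_l)^T [[a, b], [c, d]] R(theta_r) with
    R(t) = [[cos t, sin t], [-sin t, cos t]] as in equation 2.6."""
    cl, sl = math.cos(theta_l), math.sin(theta_l)
    cr, sr = math.cos(theta_r), math.sin(theta_r)
    # left-sided rotation: T = R(theta_l)^T M
    t11, t12 = cl * a - sl * c, cl * b - sl * d
    t21, t22 = sl * a + cl * c, sl * b + cl * d
    # right-sided rotation: T R(theta_r)
    return (t11 * cr - t12 * sr, t11 * sr + t12 * cr,
            t21 * cr - t22 * sr, t21 * sr + t22 * cr)
```

Using `atan2` rather than a plain arctangent of the quotient also covers the d − a = 0 and d + a = 0 cases.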
`
The left and right rotation angles in equation 2.5 can be computed using the equations 2.8 on the 2 x 2 matrix formed by [ a_ii  a_ij ; a_ji  a_jj ]. It may be noted that the p/2 diagonal processors store precisely the matrices required for p/2 successive iterations, corresponding to a single row of parallel ordering in Table 2.1. All the transformations that annihilate the off-diagonal elements in the diagonal processors are independent and hence can be computed in parallel. An update of the entire array requires the left angles to be propagated from a diagonal processor to all the processors on the same row, and the right angle to be propagated to all the processors on the same column. This propagation is systolic. The update of the matrix is followed by a permutation of the rows and the columns of the matrix according to the parallel ordering. An 8-connected mesh is required to perform this data exchange. The permutation results in the next set of sub-matrices, required to compute the new set of angles, to be formed at the diagonal processors.

Since the diagonal processors require data only from their diagonal neighbors, new computation can start as soon as these neighbors complete their portion of the computation. Thus, the updates of the matrix, corresponding to separate iterations in equation 2.5, are pipelined. This leads to diagonal waves of activity originating at the main diagonal and propagating outwards. This is shown in Figure 2.2.

Figure 2.2: Snapshots of the Brent-Luk-Van Loan SVD Array Showing Diagonal Waves of Activity

The various steps in the operation of this array are listed below:
• The p/2 diagonal processors compute the required transformation angles using the 2 x 2 sub-matrix stored in them.

• Each diagonal processor propagates the left angle, which it computed, to its east and west neighbors. Similarly the right rotation angles are propagated north and south.

• The non-diagonal processors receive the left and right angles and apply the 2-sided transformation to their sub-matrices. After the computation, they forward the left and right rotation angles to the processor next in the row or column.

• Processor P22 in Figure 2.1 receives its next set of data elements from processors P11, P33, P13 and P31. Hence, it has to wait till processors P31 and P13 finish their computations. Similarly, all the other diagonal processors need to wait for their diagonal neighbors to complete the transformations before exchanging data. Exchange of data in different parts of the array is staggered in time.

• After the data exchange, a new set of angles is computed by the diagonal processors and the same sequence of operations is repeated.
Each diagonal processor requires (p − 1) iterations to complete a sweep. The number of sweeps required for convergence is O(log p). For most matrices up to a size of 100 x 100, the algorithm has been observed to converge within 10 sweeps. Since each iteration for a diagonal processor is completed in constant time, the total SVD requires O(p log p) time.
2.4 Architecture for 2 x 2 SVD

The array proposed by Brent, Luk and Van Loan was converted into an architecture suitable for VLSI implementation by Cavallaro and Luk [8]. By using CORDIC arithmetic, a simple architecture suitable for systolic implementation was developed. The basic functionality required in each processor is:

• Ability to compute inverse tangents, which is required for the computation of the 2-sided transformations from matrix data,

• Ability to perform 2-sided rotations of a 2 x 2 matrix, which is required when performing the transformation,

• Control to perform the parallel ordering data exchange.
CORDIC allows efficient implementation of the required arithmetic operations in hardware. Chapter 3 discusses the CORDIC algorithm in depth. The architecture of the individual processor element is discussed in Chapter 4.
Chapter 3

CORDIC Techniques

3.1 Introduction

In a special-purpose VLSI processor it is not necessary to include a general purpose Arithmetic and Logic Unit (ALU). An efficient design with optimal area and time complexity can be obtained by using special arithmetic techniques that map the desired computations to simplified hardware. The CORDIC technique allows efficient implementation of functions, like vector rotations and inverse tangents, in hardware. These constitute the principal computations in the SVD algorithm. Additional ALU operations (addition, subtraction and divide-by-2) required for the SVD are more basic operations and can reuse the adders and shifters that constitute the CORDIC modules.

The Coordinate Rotation Digital Computer (CORDIC) technique was initially developed by Volder [35] as an algorithm to solve the trigonometric relationships that arise in navigation problems. Involving only a fixed sequence of additions or subtractions, and binary shifts, this scheme was used to quickly and systematically approximate the value of a trigonometric function or its inverse. This algorithm was later unified for several elementary functions by Walther [36]. The algorithm has been used extensively in a number of calculators [16] to perform multiplications and divisions, to calculate square roots, to evaluate sine, cosine, tangent, arctangent, sinh, cosh, tanh, arctanh, ln and exp functions, and to convert between binary and mixed radix number systems [10].
3.2 CORDIC Algorithms

A variety of functions are computed in CORDIC by approximating a given angle in terms of a sequence of fixed angles. The algorithm for such an approximation is iterative and always converges in a fixed number of iterations. The basic result of CORDIC concerns the convergence of the algorithm and is discussed in depth by Walther [36]. The convergence theorem states the conditions that affect convergence and essentially gives an algorithm for such an approximation. The theorem is restated here; Walther [36] gives the proof.
Theorem 3.1 (Convergence of CORDIC) Suppose

    φ_0 ≥ φ_1 ≥ φ_2 ≥ ... ≥ φ_{n−1} > 0

is a finite sequence of real numbers such that

    φ_i ≤ Σ_{j=i+1}^{n−1} φ_j + φ_{n−1},    for 0 ≤ i ≤ n − 1,

and suppose r is a real number such that

    |r| ≤ Σ_{j=0}^{n−1} φ_j.

If s_0 = 0, and s_{i+1} = s_i + δ_i φ_i for 0 ≤ i ≤ n − 1, where

    δ_i =  1,   if r ≥ s_i,
    δ_i = −1,   if r < s_i,

then

    |r − s_k| ≤ Σ_{j=k}^{n−1} φ_j + φ_{n−1},    for 0 ≤ k ≤ n − 1,

and so in particular |r − s_n| < φ_{n−1}.
If the φ_j's represent angles, the theorem provides a simple algorithm for approximating any angle r, in terms of a set of fixed angles φ_j, with a small error φ_{n−1}. The versatility of the CORDIC algorithm allows its use in several different modes: linear, circular and hyperbolic. The principal mode of interest in the CORDIC-SVD processor is the circular mode of CORDIC, due to its ability to compute vector rotations and inverse tangents efficiently.
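The greedy sign selection of Theorem 3.1 can be exercised directly with the circular-mode angles φ_j = arctan 2^{−j}. The following is a minimal floating-point sketch (the function name is mine; it is not the fixed-point hardware): it applies s_{i+1} = s_i + δ_i φ_i and returns the final approximation together with the theorem's error bound φ_{n−1}.

```python
import math

def cordic_approximate(r, n):
    """Greedy angle approximation of Theorem 3.1 with the
    circular-mode angles phi_j = arctan(2**-j):
    s_{i+1} = s_i + delta_i * phi_i, delta_i = +1 if r >= s_i else -1."""
    phis = [math.atan(2.0 ** -j) for j in range(n)]
    s = 0.0
    for phi in phis:
        s = s + phi if r >= s else s - phi
    return s, phis[-1]          # approximation and bound phi_{n-1}
```

For any |r| within the domain of convergence Σ φ_j (about 1.743 rad), the final error stays below φ_{n−1} = arctan 2^{−(n−1)}, exactly as the theorem guarantees.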
3.3 CORDIC Operation in the Circular Mode

The anticlockwise rotation of a vector $[x_i \;\; y_i]^T$ through an angle $\alpha_i$ (Figure 3.1) is given by

$$\begin{bmatrix} x_{i+1} \\ y_{i+1} \end{bmatrix} =
\begin{bmatrix} \cos\alpha_i & \sin\alpha_i \\ -\sin\alpha_i & \cos\alpha_i \end{bmatrix}
\begin{bmatrix} x_i \\ y_i \end{bmatrix}. \qquad (3.1)$$

The same relation can be rewritten as

$$\begin{bmatrix} x_{i+1}\sec\alpha_i \\ y_{i+1}\sec\alpha_i \end{bmatrix} =
\begin{bmatrix} 1 & \tan\alpha_i \\ -\tan\alpha_i & 1 \end{bmatrix}
\begin{bmatrix} x_i \\ y_i \end{bmatrix}. \qquad (3.2)$$
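Equation 3.2 is simply equation 3.1 with both sides multiplied by $\sec\alpha_i$, which can be checked numerically (a quick sketch; the sample values are arbitrary):

```python
import math

# Verify that the tangent form (3.2) reproduces the rotation (3.1)
# once the sec(alpha) scaling is divided out.
alpha = 0.3
x, y = 2.0, 1.0

# Equation 3.1: plain rotation.
x_rot = math.cos(alpha) * x + math.sin(alpha) * y
y_rot = -math.sin(alpha) * x + math.cos(alpha) * y

# Equation 3.2: shift-and-add friendly form, scaled by sec(alpha).
x_tan = x + math.tan(alpha) * y
y_tan = -math.tan(alpha) * x + y

assert abs(x_tan * math.cos(alpha) - x_rot) < 1e-12
assert abs(y_tan * math.cos(alpha) - y_rot) < 1e-12
```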
All computations in the circular mode of CORDIC involve a fixed number of iterations, which are similar to equation 3.2 but for a scaling:

$$x_{i+1} = x_i + \delta_i\, y_i \tan\alpha_i \qquad (3.3)$$
$$y_{i+1} = y_i - \delta_i\, x_i \tan\alpha_i \qquad (3.4)$$

where $x_i$ and $y_i$ are the states of the variables $x$ and $y$ at the start of the $i$th iteration, $0 \le i < n$, and $\delta_i \in \{-1, +1\}$. The values $x_n$ and $y_n$ obtained at the end of the CORDIC iterations differ from the desired values $\hat{x}_n$ and $\hat{y}_n$ by a scaling, as shown in Figure 3.1. Along with the $x$ and $y$ iterations, iterations of the form

$$z_{i+1} = z_i - \delta_i \alpha_i \qquad (3.5)$$
Figure 3.1: Rotation of a vector
are performed on a variable $z$ to store the successive states of the angle. The angles $\alpha_i$ are chosen to be $\tan^{-1}(2^{-i})$, which reduces equation 3.3 and equation 3.4 to

$$x_{i+1} = x_i + \delta_i\, y_i 2^{-i} \qquad (3.6)$$
$$y_{i+1} = y_i - \delta_i\, x_i 2^{-i} \qquad (3.7)$$

These iterations, called the $x$ and the $y$ iterations, can be implemented with simple additions or subtractions, and shifts. The angles $\alpha_i$ are stored in a ROM, allowing the $z$ iterations, given by equation 3.5, to be implemented as additions or subtractions as well. The hardware required to implement these iterations is shown in Figure 3.2.
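In software, one step of the datapath of Figure 3.2 can be sketched as the following shift-and-add update (the function and variable names are ours):

```python
import math

def cordic_step(x, y, z, delta, i, alpha):
    """One CORDIC iteration: equations 3.5-3.7.

    The 2**-i factors are the shifts; alpha = atan(2**-i)
    would be read from the angle ROM in hardware.
    """
    return (x + delta * y * 2.0 ** -i,
            y - delta * x * 2.0 ** -i,
            z - delta * alpha)

# With every delta_i = +1, n steps rotate (1, 0) through
# -sum(atan(2**-i)) while scaling its length by prod(sec(alpha_i)).
x, y, z = 1.0, 0.0, 0.0
for i in range(10):
    x, y, z = cordic_step(x, y, z, +1, i, math.atan(2.0 ** -i))

theta = -z               # total rotation angle accumulated in z
gain = math.hypot(x, y)  # CORDIC scale factor, about 1.6468
```

The loop makes the scaling in equations 3.3 and 3.4 concrete: the final vector has the direction of the exact rotation but a fixed, iteration-count-dependent length.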
Rotation of a vector through any angle $\theta$ is achieved by decomposing the angle as a sum of $\alpha_i$'s, using the CORDIC convergence theorem, and rotating the initial vector through this sequence of angles. Different functions are implemented by a different choice of $\delta_i$ at each iteration.

Figure 3.2: Implementation of the CORDIC Iterations in Hardware

  Mode          Functions                        Final values for variables
  Z-reduction   Vector rotations, sine, cosine   $x_n = (x_0\cos\theta + y_0\sin\theta)/K_n$,
                                                 $y_n = (-x_0\sin\theta + y_0\cos\theta)/K_n$
  Y-reduction   Inverse tangents

Table 3.1: CORDIC Functionality in the Circular Mode

If $\delta_i$ is chosen to force the initial $z_0$ to 0, it is termed Rotation or Z-reduction. This is used to perform efficient vector rotations given an initial angle. A choice of $\delta_i$ to reduce the initial $y_0$ to 0 is called Y-reduction.
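Z-reduction can be sketched end to end in software (our code, using the sign conventions of equations 3.5–3.7, under which the $y$ recursion accumulates $-\sin\theta$): driving $z$ from $\theta$ to 0 and dividing out the constant scale factor $K_n$ recovers the sine and cosine.

```python
import math

def z_reduction(theta, n=32):
    """Z-reduction (rotation mode): drive z from theta to 0.

    Starting from (x0, y0) = (1, 0), the iterations of
    equations 3.5-3.7 give x_n = cos(theta)/K_n and
    y_n = -sin(theta)/K_n, where K_n = prod(cos(alpha_i)).
    """
    x, y, z = 1.0, 0.0, theta
    for i in range(n):
        delta = 1 if z >= 0 else -1          # force z toward 0
        x, y, z = (x + delta * y * 2.0 ** -i,
                   y - delta * x * 2.0 ** -i,
                   z - delta * math.atan(2.0 ** -i))
    # Scale factor K_n depends only on n, so it can be precomputed.
    K = 1.0
    for i in range(n):
        K *= math.cos(math.atan(2.0 ** -i))
    return x * K, -y * K                     # (cos theta, sin theta)
```

Because $K_n$ is a constant for a fixed iteration count, the correction is a single precomputed multiply rather than part of the iteration loop.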
