Proceedings of the
International Conference on

APPLICATION SPECIFIC
ARRAY PROCESSORS

September 5-7, 1990
Princeton, New Jersey

Sponsored by
Princeton University

Co-Sponsored by
Industrial Development Board for Northern Ireland
NEC Research Institute

In Cooperation with
IEEE Computer Society

Edited by

Sun-Yuan Kung
Princeton University

Jose A.B. Fortes
Purdue University

Earl E. Swartzlander, Jr.
University of Texas at Austin

K. Wojtek Przytula
Hughes Research Laboratories

IEEE Computer Society Press
Los Alamitos, California

Washington - Brussels - Tokyo
SAMSUNG EXHIBIT 1041
The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and are published as presented and without change, in the interests of timely dissemination. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society Press, or The Institute of Electrical and Electronics Engineers, Inc.

Published by

IEEE Computer Society Press
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

Copyright © 1990 by The Institute of Electrical and Electronics Engineers, Inc.

Cover designed by Jack I. Ballestero

Printed in United States of America

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 29 Congress Street, Salem, MA 01970. Instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. For other copying, reprint or republication permission, write to Director, Publishing Services, IEEE, 345 East 47th Street, New York, NY 10017. All rights reserved.

IEEE Computer Society Press Order Number 2089
Library of Congress Number 90-82602
IEEE Catalog Number 90CH2920-7
ISBN 0-8186-9089-5 (case)
ISBN 0-8186-6089-9 (microfiche)
SAN 264-620X

Additional copies can be ordered from:

IEEE Computer Society Press
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama, Minato-ku
Tokyo 107, JAPAN

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331

THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, INC.
Table of Contents

General Chair's Message .......... v
Program Co-Chairs' Message .......... vi
Program Committee .......... vii
Referees .......... viii

Keynote Address: Application-Oriented High Speed Processors: Experiences and Perspectives .......... 1
Yasuo Kato

Design Methodology

Calculus of Space-Optimal Mappings of Systolic Algorithms on Processor Arrays .......... 4
P. Clauss, C. Mongenet, and G.R. Perrin
A Processor-Time Minimal Systolic Array for Transitive Closure .......... 19
P.R. Cappello and C.J. Scheiman
Systolic Array Implementation of Nested Loop Programs .......... 31
J. Bu, E.F. Deprettere, and L. Thiele

Potpourri I

The Bit-Serial Systolic Back-Projection Engine (BSSBPE) .......... 43
A Database Machine Based on Surrogate Files .......... 55
S.M. Chung
Systolic Architectures for Decoding Reed-Solomon Codes .......... 67
J. Nelson, A. Rahman, and E. McQuade
Mapping High-Dimension Wavefront Computations to Silicon .......... 78
C.-M. Wu, R.M. Owens, and M.J. Irwin
Systolic Architecture for 2-D Rank Order Filtering .......... 90
L.-W. Chang and J.-M. Jong
Scheduling Affine Parameterized Recurrences by Means of Variable Dependent Timing Functions .......... 100
P. Quinton, C. Mauras, S. Rajopadhye, and Y. Saouter
The Logic Description Generator .......... 111
M.B. Gokhale, A. Kopser, S.P. Lucas, and R.G. Minnich
Recursive Algorithms for AR Spectral Estimation and Their Array Realizations .......... 121
C.-Y. Chan and C.-M. Liu
Analyzing Parameterized Designs by Non-Standard Interpretation .......... 133
Systolic-VLSI Compiler (SVC) for High Performance Vector Quantization Chips .......... 145
J.V. McCanny, Y. Hu, and M. Yan
Extensions to Linear Mapping for Regular Arrays with Complex Processing Elements .......... 156
J. Rosseel, F. Catthoor, and H. De Man
Design of Run-Time Fault-Tolerant Arrays of Self-Checking Processing Elements .......... 168
J. Franzen

Special-Purpose Systems

GRAPE: A Special-Purpose Computer for N-Body Problems .......... 180
J. Makino, T. Ito, T. Ebisuzaki, and D. Sugimoto
Building Blocks for a New Generation of Application-Specific Computing Systems .......... 190
B. Baxter, G. Cox, T. Gross, H.T. Kung, D. O'Hallaron, C. Peterson, J. Webb, and P. Wiley
Reconfigurable Vector Register Windows for Fast Matrix Computation on the Orthogonal Multiprocessor .......... 202
D.K. Panda and K. Hwang
Massively Parallel Architecture: Application to Neural Net Emulation and Image Reconstruction .......... 214
B. Faure, D. Wei, and G. Mazare
A Real-Time Software Programmable Processor for HDTV and Stereoscopic Signals .......... 226
T. Nishitani, I. Tamitani, H. Harasaki, and M. Yano

Mapping Applications onto Architectures

Mapping Algorithms Onto the TUT Cellular Array Processor .......... 235
J. Viitanen, T. Korpiharju, J. Takala, and H. Kiminkinen
A 3-D Wafer Scale Architecture for Early Vision Processing .......... 247
S.T. Toborg
Algorithmic Mapping of Neural Network Models onto Parallel SIMD Machines .......... 259
V.K. Prasanna Kumar and K.W. Przytula
Implementation of Systolic Algorithms Using Pipelined Functional Units .......... 272
M. Valero-Garcia, J.J. Navarro, J.M. Llaberia, and M. Valero
Array Processing on Finite Polynomial Rings .......... 284
N. Wigley and G.A. Jullien

Potpourri II

The RAP: A Ring Array Processor for Layered Network Calculations .......... 296
N. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman, and J. Beer
Linear Arrays for Residue Mapping .......... 309
A. Skavantzos and Z.B.
A Fault-Tolerant Two-Dimensional Sorting Network .......... 317
L.G. Kramer and H. Ard
Channel Complexity Analysis for Reconfigurable VLSI/WSI Processor Arrays .......... 329
P.K. Rhee and H.J. Kim
Digit-Serial DSP Architectures .......... 341
K.K. Parhi and C.-Y. Wang
PASIC: A Sensor/Processor Array for Computer Vision .......... 352
K. Chen, P.-E. Danielsson, and A. Astrom
An Analog VLSI Array Processor for Classical and Connectionist AI .......... 367
J.W. Mills and C.A. Daffinger
Systolic Two-Port Adaptor for High Performance Wave Digital Filtering .......... 379
J.V. Eich and R.J. Singh
A Multilayer Neural Model and Array Processor Implementation .......... 389
H.C. Fu and C.C. Chiang
Reconfiguration of WSI Arrays: A Flow-Driven Approach .......... 401
A. Antola and N. Scarabottolo
Towards the Automated Design of Application Specific Array Processors (ASAPs) .......... 414
A.P. Marriott, A.W.G. Duller, R.H. Storer, A.R. Thomson, and M.R. Pout
Fault-Tolerant Array Processors Using N-and-Half-Track Switches .......... 426
J.S.N. Jean
Domain Flow and Streaming Architectures .......... 433
E.T.L. Omtzigt
An Improved Systolic Extended Euclidean Algorithm for Reed-Solomon Decoding: Design and Implementation .......... 448
R. Doyle, P. Fitzpatrick, and J. Nelson

System Building Blocks

Digit-Serial VLSI Microarchitecture .......... 457
S.G. Smith, J.G. Payne, and R.W. Morgan
CMOS VLSI Lukasiewicz Logic Arrays .......... 469
J.W. Mills and C.A. Daffinger
Dynamic Systolic Associative Memory Chip .......... 481
ASP Modules: Building Blocks for Application-Specific Massively Parallel Processors .......... 493
R.M. Lea
Designing Specific Systolic Arrays with the API15C Chip .......... 505
P. Frison, E. Gautrin, D. Lavenier, and J.L. Scharbarg

Special-Purpose Systems 2

A Prototype for a Fault-Tolerant Parallel Digital Signal Processor .......... 518
B.R. Musicus, A. Aliphas, and A.J. Wei
Byte-Serial Convolvers .......... 530
L. Dadda
A VLSI Architecture for Simplified Arithmetic Fourier Transform Algorithm .......... 542
I.S. Reed, M.T. Shih, E. Hendon, T.K. Truong, and D.W. Tufts
Fine Grain System Architectures for Systolic Emulation of Neural Algorithms .......... 554
U. Ramacher and W. Raab
Programming Environment for a Line Processor SYMPATI-2 .......... 567
R. Fernandez, P. Adam, D. Juvin, and J.-L. Basille

Potpourri III

A Feedback Concentrator for the Image Understanding Architecture .......... 579
D. Rana and C.C. Weems
A Design Methodology for Fixed-Size Systolic Arrays .......... 591
J. Bu, E.F. Deprettere, and P. Dewilde
A Formal Design Methodology for Parallel Architectures .......... 603
M.A. Bayoumi and K.M. Elleithy
A Multiple-Level Heterogeneous Architecture for Image Understanding .......... 615
D.B. Shu, J.G. Nash, and C.C. Weems
Application Specific VLSI Architectures Based on De Bruijn Graphs .......... 628
D.K. Pradhan
A Graph-Based Approach to Map Matrix Algorithms onto Local-Access Processor Arrays .......... 641
J.H. Moreno and T. Lang
Application Specific Coprocessor Computer Architecture .......... 653
Embedding Pyramids in Array Processors with Pipelined Busses .......... 665
Z. Guo and R.G. Melhem
Implementation of ANN on RISC Processor Array .......... 677
A. Hiraiwa, M. Fujita, S. Kurosu, S. Arisawa, and M. Inoue
Systolic-Based Computing Machinery for Radar Signal Processing Studies .......... 689
S. Haykin, P. Weber, B. Chen, T. Greenlay, J. Orlando, C. Deng, and R. Mann
A Systolic Array for Nonlinear Adaptive Filtering and Pattern Recognition .......... 700
J.G. McWhirter, D.S. Broomhead, and T.J. Shepherd
Parallel Algorithm for Traveling Salesman Problem on SIMD Machines Using Simulated Annealing .......... 712
C.S. Jeong and M.H. Kim
The Design of a High-Performance Scalable Architecture for Image Processing Applications .......... 722
C.T. Gray, W. Liu, T. Hughes, and R. Cavin
Testing a Motion Estimator Array .......... 734
P. Montana and W.R. Moore

Systolic Arrays

Space-Time-Minimal Systolic Architectures for Gaussian Elimination and the Algebraic Path Problem .......... 746
A. Benaini and Y. Robert
Two-Level Pipelined Implementation of Systolic Block Householder Transformation with Application to RLS Algorithm .......... 758
K.J.R. Liu, S.F. Hsieh, and K. Yao
Bit-Level Systolic Algorithm for the Symmetric Eigenvalue Problem .......... 770
J.-M. Delosme
A Practical Runtime Test Method for Parallel Lattice-Gas Automata .......... 782
R. Squier and K. Steiglitz
A Systolic Array Programming Language .......... 794
P.S. Tseng

Author Index .......... 805
`
BIT-LEVEL SYSTOLIC ALGORITHM
FOR THE SYMMETRIC EIGENVALUE PROBLEM

JEAN-MARC DELOSME

Department of Electrical Engineering
Yale University
`
An arithmetic algorithm is presented which speeds up the parallel Jacobi method for the eigen-decomposition of real symmetric matrices. The matrices to which the plane Jacobi rotations are applied are decomposed into even and odd parts, enabling the application of the rotations from a single side and thus removing some sequentiality from the original method. The rotations are evaluated and applied in a fully concurrent fashion with the help of an implicit CORDIC algorithm. In addition, the CORDIC algorithm can perform rotations with variable resolution, which leads to a significant reduction in the total computation time.
`
I. INTRODUCTION

The eigenvalue decomposition of a real symmetric matrix or the singular value decomposition (SVD) of an arbitrary real matrix may be obtained by the Jacobi method (Jacobi/Kogbetliantz). This method can be parallelized to a high degree [1], [2], resulting in a computation time that is approximately linear in the smallest dimension of the matrix. Furthermore the ratio of parallel to sequential hardware cost is of the same order as the gain in computation time. Thus, the parallel hardware is exercised with a fairly high efficiency, essentially independent of the matrix (smallest) dimension. Although the Jacobi method requires more operations than the justly popular QR method (Francis/Golub and Kahan), a significantly higher degree of parallelism may be extracted from it, making it the method of choice for the fast diagonalization, via orthogonal transformations, of unstructured dense matrices. Our objective is to determine extremely fast ways of performing this diagonalization in the context of signal and image processing. This entails, starting from the Jacobi method and the parallelization scheme of Brent and Luk, the design of algorithms at a detailed level and the development of the associated, application-specific, array architectures. In this paper, after analyzing the elementary mathematical operations in the Jacobi method (i.e. the evaluation and application of Jacobi rotations), we devise arithmetic algorithms that effect these mathematical operations with few primitive operations (i.e. few shifts and adds) and enable the most efficient use of the parallel hardware. Moreover we modify the Jacobi algorithm in order to reduce the total number of primitive operations for achieving matrix diagonalization.

By targeting an implementation that is as fast as can be found, we are led to exploring and exploiting as much as possible the mathematical structure of the problem, hence to finding arithmetic algorithms better adapted to the problem at hand. Implementations which match lower data rates can then be generated by a fairly standard process of sequentialization. By considering the SVD problem, which may be viewed as a generalization of the symmetric eigenvalue problem, we first direct our search for structure toward fundamental objects and properties. The special features due to symmetry are exploited in a second phase. Our approach to algorithm design is thus hierarchical: first the main structure, then the refinements. This way we avoid focusing early on non-fundamental features and, as a result, being trapped in a local optimum. In fact, in order to uncover a global solution, we have embedded the problem into a further
`
CH2920-7/90/0000/0770 $01.00 © 1990 IEEE
generalization: the SVD problem for complex matrices. Although this generalization is not discussed in this paper (it is presented in [6]), it did guide us in our search.

The parallel Jacobi algorithm of Brent and Luk is briefly described in Section II. This algorithm exhibits close to maximal parallelism and does it at a close to minimal communication cost, where 'close to' means 'up to a small constant multiplicative factor' [7]. It provides the starting point for the process of refinement, taking the above multiplicative factors closer to unity, that brings forth our array for the eigen-decomposition of real symmetric matrices. The Jacobi method is a succession of parallel steps, starting from an initial square matrix and converging to a diagonal matrix, in which plane rotations are applied on both sides of the 2x2 submatrices of the current iterate. In Section III a mathematical property, the existence of the decomposition of Clifford numbers into even and odd parts, is shown to enable the application of the rotations from the same side, thus leading to a parallel procedure for the evaluation and the application of the Jacobi rotations. While this procedure would cost many operations if the rotations were evaluated using the standard arithmetic operations, ±, ×, /, √, it becomes cheap if CORDIC arithmetic, based on shifts and adds and reviewed in Section IV, is employed. Our 'implicit' CORDIC algorithm for the symmetric eigenvalue problem, which does not compute rotation angles explicitly, is presented in Section V. Evaluation and application of the rotations may be fully overlapped with this algorithm, a feat which cannot be achieved with an explicit CORDIC algorithm.
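Since the paper leans on CORDIC arithmetic throughout, a brief illustration may help. The sketch below is a plain floating-point rendition of the classical explicit CORDIC rotation; the function name and iteration count are our own choices, and the implicit, bit-level variant of Section V differs in that it never forms the angle explicitly.

```python
import math

def cordic_rotate(x, y, theta, iterations=40):
    """Rotate (x, y) by theta (|theta| < ~1.74 rad) with the classical
    CORDIC iteration: each micro-rotation needs only additions and
    multiplications by 2**-i, i.e. shifts in fixed-point hardware."""
    elementary = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # per-step length growth
        d = 1.0 if theta >= 0.0 else -1.0          # steering decision
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        theta -= d * elementary[i]
    return x / gain, y / gain                      # undo accumulated gain
```

Rotating (1, 0) by π/4 gives (√2/2, √2/2) to many decimal places with 40 iterations; in hardware the gain correction is typically folded into a constant multiply.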
`
II. ARRAY ARCHITECTURE

The method of Jacobi, first applied to the eigen-decomposition of real symmetric matrices B = U Λ U^T, with U orthogonal and Λ real diagonal, was generalized by Kogbetliantz (1955) to the computation of the SVD of a real rectangular matrix A = V Σ U^T, where U and V have orthonormal columns and Σ is real diagonal with positive entries. We shall first expose the general case and then turn to the symmetric case, which can be viewed as a special case.

Without loss of generality, A may be assumed to have more rows than columns. By applying plane (Givens) rotations from the left, A may be decomposed into QB where Q has orthonormal columns and B is square. Thus the computation of the SVD of a rectangular matrix A reduces to the SVD computation of an associated square matrix B, and we can from now on consider only the decomposition of square matrices B.

Starting from a general real n×n matrix B, the Jacobi method performs a short sequence of sweeps to bring the matrix to diagonal form. In each sweep n(n−1)/2 pairs of plane rotations are applied to both sides of the matrix to annihilate each of the n(n−1) off-diagonal elements once. Each pair of rotations may be represented by two n×n matrices, J_ij, applied to the matrix from the right, and J'_ij, applied to the matrix from the left, where the couple ij, with 1 ≤ i < j ≤ n, is distinct for each pair in the sweep. Both rotation matrices differ from the identity matrix of order n by the principal submatrix formed at the intersection of the row and column pairs corresponding to i and j. These principal submatrices have the form

    [ cos θ_ij  sin θ_ij ; −sin θ_ij  cos θ_ij ]   for J_ij

and

    [ cos θ'_ij  sin θ'_ij ; −sin θ'_ij  cos θ'_ij ]   for J'_ij,

and the angles θ_ij and θ'_ij are selected to zero out simultaneously the ij-th and ji-th entries of the matrix to which J_ij and J'_ij are applied.
`
The simultaneous application of p non-conflicting pairs of rotations, zeroing out 2p entries of the matrix to which they are applied, is called a (parallel) step. For ease of presentation we shall assume that n is even and refer to [1] and [7] for n odd. Since any partition of the set {1, ..., n} into pairs {i, j} has n/2 parts, a maximally parallel step would apply n/2 pairs of rotations simultaneously. If a sweep is decomposed into a sequence of such steps, each forcing n matrix entries to 0, and if for all the steps in the same sweep the indices ij of the pairs of rotations are distinct, the number of steps in a sweep is minimal, equal to n−1. Such a scheme may be constructed by selecting a cyclic permutation, P, of the indices {2, 3, ..., n}. By partitioning into contiguous pairs the ordered concatenation {1} ∪ S, where S is the ordered set {2, 3, ..., n}, a set of pairs, {12; 34; ...; n−1,n}, is obtained whose order is induced from the order {1} ∪ S. These are the pairs of indices for the pairs of rotations applied in the first step of a sweep. Next the ordered concatenation {1} ∪ PS is partitioned into contiguous pairs, with order induced by the order {1} ∪ PS, defining the pairs of indices for the second step. The order {1} ∪ P²S defines the pairs of indices for the third step, and so on until {1} ∪ P^(n−2)S for the (n−1)-th step. The following order is {1} ∪ P^(n−1)S = {1} ∪ S; indeed this is the beginning of the next sweep. We shall index the pairs at a given step k by a single number I, 1 ≤ I ≤ n/2; thus J_ij will alternately be written J_I and, in particular, J_34 and J_2 represent the same matrix at step 1. Brent and Luk have selected the cyclic permutation

    2 → 3 → 5 → ... → n−3 → n−1 → n → n−2 → ... → 6 → 4 → 2,

which has the desirable property that the indices ij in the I-th pair at step k come from the neighboring pairs at step k−1, with indices I−1, I, or I+1.
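The pairing scheme just described is purely combinatorial and can be generated mechanically. Here is a sketch (function names ours) that builds the n−1 steps of a sweep from the Brent and Luk permutation:

```python
def brent_luk_P(x, n):
    """The cyclic permutation of Brent and Luk on {2, ..., n} (n even):
    2 -> 3 -> 5 -> ... -> n-3 -> n-1 -> n -> n-2 -> ... -> 4 -> 2."""
    if x == 2:
        return 3
    if x % 2 == 1:                          # odd indices climb by 2,
        return x + 2 if x <= n - 3 else n   # with n-1 mapping to n
    return x - 2                            # even indices >= 4 descend by 2

def sweep_pairs(n):
    """Index pairs for the n-1 steps of one sweep: at step k the pairs are
    the contiguous pairs of the ordered concatenation {1} U P^k S."""
    S = list(range(2, n + 1))
    steps = []
    for _ in range(n - 1):
        order = [1] + S
        steps.append([(order[2 * i], order[2 * i + 1]) for i in range(n // 2)])
        S = [brent_luk_P(x, n) for x in S]
    return steps
```

For n = 10 the first step pairs indices (1,2), (3,4), ..., (9,10), and over the nine steps every one of the 45 unordered pairs ij occurs exactly once, as the text requires.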
`
The matrix B is transformed into a diagonal matrix through a sequence of steps, starting with B_0 = B and computing at step k

    B_k = ( ∏_{I=1}^{n/2} J'_I ) B_{k−1} ( ∏_{I=1}^{n/2} J_I ).
`
Although this is not written explicitly, the two sets of rotations {J_I, 1 ≤ I ≤ n/2} and {J'_I, 1 ≤ I ≤ n/2} depend on the step k. Moreover, since the rotations within each set are disjoint, each set of rotations is applied in parallel. At a high level, the scheme of Brent and Luk leads directly to a parallel architecture for the SVD, taking the form of a square array with n/2 processors on a side. Processor IJ holds at the beginning of step k the 2x2 submatrix of B_{k−1} sitting at the intersection of rows i and j and of columns r and s, where ij and rs are respectively the I-th and J-th pairs of indices at step k. Each diagonal processor, such as processor II or processor JJ, evaluates the two plane rotations, J_I and J'_I or J_J and J'_J, that zero out the two off-diagonal entries of the submatrix it holds, and updates accordingly the two diagonal entries. From each diagonal processor a representation of each of the two rotations is sent either along the column to which the processor belongs, for the rotations applied from the right such as J_I and J_J, or along the corresponding row, for the rotations applied from the left such as J'_I and J'_J. (The choice of representation per se will be discussed in Section V.) Each off-diagonal processor, IJ with I ≠ J, applies J_J from the right and J'_I from the left to the submatrix it holds. Then the entries are exchanged between neighboring processors in the array in such a way that processor IJ holds at the beginning of step k+1 the submatrix whose row indices and column indices are respectively the I-th and J-th pair of indices at step k+1.
`
For the symmetric eigenvalue problem, at every step J'_I is imposed to be equal to J_I, 1 ≤ I ≤ n/2. This reduces the amount of computation to be performed by the diagonal processors. Moreover, since all the iterates B_k are symmetric, the array is truncated to a triangular array, i.e. all the processors below the diagonal are removed. Such an array is displayed in Figure 1 for n = 10; the arrows indicate the communications taking place during the exchanges while the horizontal and vertical links that carry the representations of the rotations are not shown. (Note that the amount of data communicated during the exchanges could be reduced to about half, which can be argued to be minimal [7], if the order used so far, {1} ∪ P^(k−1)S with P the cyclic permutation of Brent and Luk and k the step number, is kept when k is odd and is replaced by P̃({1} ∪ P^(k−1)S), where P̃{1, 2, ..., n} = {3, 4, 1, 2, 7, 8, 5, 6, ...}, when k is even. However this scheme is more complicated to implement.)
`
Figure 1. Array for the eigen-decomposition of a symmetric matrix of order 10.
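As a numerical illustration of Section II, the following sketch runs a few sweeps of two-sided Jacobi on a small symmetric matrix, using the textbook tangent formula for the rotation angle (a convention of ours; the paper evaluates rotations quite differently, at the bit level). The pairs within each step are disjoint, which is what allows an array to apply them in parallel:

```python
import numpy as np

def rotation(a_ii, a_ij, a_jj):
    """Cosine/sine zeroing the (i, j) entry of a symmetric 2x2 block
    under the congruence J^T B J (textbook tangent formula)."""
    if a_ij == 0.0:
        return 1.0, 0.0
    tau = (a_jj - a_ii) / (2.0 * a_ij)
    t = 1.0 if tau == 0.0 else np.sign(tau) / (abs(tau) + np.hypot(1.0, tau))
    c = 1.0 / np.sqrt(1.0 + t * t)
    return c, t * c

def jacobi_sweeps(B, steps, n_sweeps=10):
    """Apply n_sweeps sweeps; pairs within a step are disjoint, so an
    array could process them simultaneously (here they run serially)."""
    B = B.astype(float).copy()
    n = B.shape[0]
    for _ in range(n_sweeps):
        for step in steps:
            for i, j in step:
                c, s = rotation(B[i, i], B[i, j], B[j, j])
                J = np.eye(n)
                J[i, i] = J[j, j] = c
                J[i, j], J[j, i] = s, -s
                B = J.T @ B @ J
    return B

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2.0                       # random symmetric test matrix
steps = [[(0, 1), (2, 3), (4, 5)], [(0, 2), (1, 4), (3, 5)],
         [(0, 3), (1, 5), (2, 4)], [(0, 4), (1, 3), (2, 5)],
         [(0, 5), (1, 2), (3, 4)]]        # a 5-step round-robin sweep
D = jacobi_sweeps(A, steps)
```

After a handful of sweeps the iterate is diagonal to machine precision and its diagonal carries the eigenvalues; the orthogonal similarity leaves them unchanged.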
`
III. CLIFFORD ALGEBRA

We shall consider throughout this section the SVD problem; specialization of the results to the symmetric eigenvalue problem will come in Section V. A diagonal processor, II, evaluates the left and right rotations, J'_I and J_I, which diagonalize the 2x2 matrix it contains, and computes the new diagonal entries:

    [ cos θ'_I  −sin θ'_I ; sin θ'_I  cos θ'_I ] [ a  b ; c  d ] [ cos θ_I  sin θ_I ; −sin θ_I  cos θ_I ] = [ â  0 ; 0  d̂ ].

An off-diagonal processor, IJ, applies from the left the rotation J'_I received from processor II to the matrix it holds, and applies from the right the rotation J_J received from processor JJ:

    [ cos θ'_I  −sin θ'_I ; sin θ'_I  cos θ'_I ] [ a  b ; c  d ] [ cos θ_J  sin θ_J ; −sin θ_J  cos θ_J ].
`
In a search for the most parallel way of evaluating the rotations and updating the diagonal entries, and also of applying the rotations, we shall study the structure of the space of real 2x2 matrices. The underlying structure to be exploited is that of a Clifford algebra: the Clifford algebra of order 2, C_2. To introduce this structure, we start from a vector space over the reals, E_2, of dimension 2; this vector space is a subspace of C_2. The reason for the notation E_2 is that a Euclidean norm is defined on the vector space, given by the quadratic form u_1² + u_2², where u = (u_1, u_2) belongs to E_2. (Note that a Clifford algebra could also be defined starting with a pseudo-Euclidean form, u_1² − u_2², in two dimensions.) The scalar product of two vectors, u and v, is also an element of C_2, defined as u · v = (uv + vu)/2 where, clearly, uv + vu = (u + v)² − u² − v² is a scalar. We have not yet defined uv, the Clifford product of the vectors u and v. Since uv = (uv + vu)/2 + (uv − vu)/2, if the exterior product u ∧ v = (uv − vu)/2 is defined, then the (Clifford) product is defined: uv = u · v + u ∧ v. It follows from the definition of the exterior product that v ∧ u = −u ∧ v. Therefore, selecting an orthogonal basis {e_1, e_2} of E_2, u ∧ v = (u_1 e_1 + u_2 e_2) ∧ (v_1 e_1 + v_2 e_2) = (u_1 v_2 − u_2 v_1) e_1 ∧ e_2. Thus, geometrically, u ∧ v defines the area of the parallelogram with sides u and v. Furthermore the product uv is equal to (u_1 v_1 + u_2 v_2) 1 + (u_1 v_2 − u_2 v_1) e_1 ∧ e_2, the linear combination of a scalar (proportional to the scalar unit, denoted by 1) and a bivector (proportional to e_1 ∧ e_2).
The Clifford algebra C_2 is defined as the vector space over the reals which is the closure of E_2 under Clifford multiplication. Already we have found two subspaces of C_2: E_2 and the two-dimensional subspace of the products of vectors. To go further we would have to formally define, by induction, the product of arbitrary elements of C_2. Because of a lack of space we shall instead define the product via the shortcut of an isomorphism. The isomorphism results from the identification:

    1 = [ 1  0 ; 0  1 ],    e_1 = [ 1  0 ; 0  −1 ],    e_2 = [ 0  1 ; 1  0 ],

and Clifford product = matrix product. The reader may check that

    e_1² = e_1 · e_1 = [ 1  0 ; 0  1 ] = 1,    e_2² = 1,    e_1 · e_2 = (e_1 e_2 + e_2 e_1)/2 = 0;

hence e_1 and e_2 form an orthonormal basis of E_2. Moreover

    e_1 ∧ e_2 = (e_1 e_2 − e_2 e_1)/2 = [ 0  1 ; −1  0 ].

Hence linear combinations of scalars and bivectors are of the form

    p = p_1 1 + p_2 e_1 ∧ e_2 = [ p_1  p_2 ; −p_2  p_1 ],

and vectors are of the form

    q = q_1 e_1 + q_2 e_2 = [ q_1  q_2 ; q_2  −q_1 ].

Now we observe that any 2x2 real matrix m may be decomposed as p + q:

    m = [ a  b ; c  d ] = p + q = [ p_1+q_1  p_2+q_2 ; −p_2+q_2  p_1−q_1 ],

with p_1 = (a+d)/2, p_2 = (b−c)/2, q_1 = (a−d)/2, and q_2 = (b+c)/2. Thus the space of linear combinations of scalars, vectors and bivectors is isomorphic to the linear space of real 2x2 matrices. Since the set of real 2x2 matrices is closed under matrix multiplication, the space of linear combinations of scalars, vectors and bivectors is also closed under Clifford multiplication and is therefore the whole of the Clifford algebra C_2.
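The even/odd split is immediate to compute from the four entries of a matrix; a small sketch (names ours) using the formulas for p_1, p_2, q_1, q_2 above:

```python
import numpy as np

def even_odd_parts(m):
    """Split a real 2x2 matrix into even part p (scalar + bivector) and
    odd part q (vector): p1=(a+d)/2, p2=(b-c)/2, q1=(a-d)/2, q2=(b+c)/2."""
    (a, b), (c, d) = m
    p1, p2 = (a + d) / 2.0, (b - c) / 2.0
    q1, q2 = (a - d) / 2.0, (b + c) / 2.0
    p = np.array([[p1, p2], [-p2, p1]])    # form [p1 p2; -p2 p1]
    q = np.array([[q1, q2], [q2, -q1]])    # form [q1 q2; q2 -q1]
    return p, q

m = np.array([[1.0, 2.0], [3.0, 4.0]])
p, q = even_odd_parts(m)
```

Here p = [2.5 −0.5; 0.5 2.5] and q = [−1.5 2.5; 2.5 1.5], and p + q recovers m; note that p is a scaled plane rotation while q is a scaled reflection.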
`
`
Upon identifying the real 2x2 matrices with the Clifford algebra C_2, a decomposition of the real 2x2 matrices into two parts, p and q, has been brought to the fore. This is an instance of the so-called decomposition of Clifford numbers into even and odd parts. Indeed a Clifford algebra C_m, built from a vector space E_m, has for elements real linear combinations of scalars or 0-vectors, vectors or 1-vectors, bivectors or 2-vectors, and so on up to m-vectors. Thus any element may be decomposed in a unique way as the sum of an even part (a linear combination of even vectors) and an odd part (a linear combination of odd vectors). In other words C_m is the direct sum of two subspaces, an 'even' subspace denoted C_m⁺ and an 'odd' subspace denoted C_m⁻. The even subspace is closed under Clifford multiplication, written symbolically C_m⁺ C_m⁺ = C_m⁺; consequently it is a subalgebra of C_m. The odd subspace satisfies C_m⁻ C_m⁻ = C_m⁺ and C_m⁺ C_m⁻ = C_m⁻ C_m⁺ = C_m⁻. (Of interest for the SVD computation of complex matrices is the Clifford algebra C_3, isomorphic to the 2³ = 8 dimensional space (over the reals) of complex 2x2 matrices, and with even subspace the quaternions and odd subspace the antiquaternions [6].) The even subspace of C_2 is the algebra of complex numbers; by extension, the odd subspace of C_2, E_2, may be called the subspace of anticomplex numbers.

The units of the even subspace of C_2 are elements u = p_1 1 + p_2 e_1 ∧ e_2 with unit norm, where the norm is naturally defined as (p_1² + p_2²)^(1/2). Therefore they can be written under the form u(θ) = cos θ 1 + sin θ e_1 ∧ e_2, with 0 ≤ θ < 2π, and hence they are the plane rotations. It is easy to check (and this may also be derived as a special case of a property of quaternions) that plane rotations commute with even Clifford numbers and anticommute with odd Clifford numbers:

    p u(θ) = u(θ) p,    q u(θ) = u(−θ) q.
`
`to the left, both for the
`This enaqu us to pull the Jacobi rotations from the right
`evaluation of the rotations in the diagonal processors and for their application in the
`off-diagonal processors:
`- evaluation in processor H ,
`
`“(-39!“ “(91) = “(-90? “(90+ “(-939 “(91)
`
`= “(—9i)“(51)P + “(-Qilui—srm .
`
`hence, using the property that u (0)
`
`is isomorphic to the complex number exp(r‘ 9),
`
`"(43m um = u(—9:'+9;)p ”Hi—01):: = r1: =
`
`E
`
`0
`
`0
`_
`d
`
`- application in processor H ,
`
`“(-9”!!! 1.(it!) = Ill-9})? “(91H “Hilq “(9.1)
`
`= “(-9!+9;)p +u(-9i-9J)q
`
`hence, by pulling the rotations from the right to the left and exploiting the fact that p
`and q are fully defined by their first column, the Jacobi rotations may be applied with
`2 two-dimensional vector rotations instead of 4.
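This single-sided identity is easy to confirm numerically. The sketch below (names ours) checks that u(−θ'_I) m u(θ_J) equals u(−θ'_I + θ_J) p + u(−θ'_I − θ_J) q, and that p and q indeed commute and anticommute with rotations:

```python
import numpy as np

def u(theta):
    """Unit even Clifford number cos(theta) 1 + sin(theta) e1^e2,
    i.e. a plane rotation in the matrix representation used above."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def even_odd_parts(m):
    """Even part [p1 p2; -p2 p1] and odd part [q1 q2; q2 -q1]."""
    (a, b), (c, d) = m
    return (np.array([[(a + d) / 2, (b - c) / 2], [(c - b) / 2, (a + d) / 2]]),
            np.array([[(a - d) / 2, (b + c) / 2], [(b + c) / 2, (d - a) / 2]]))

m = np.array([[1.0, 2.0], [3.0, 4.0]])
p, q = even_odd_parts(m)
theta_i, theta_j = 0.7, 0.3   # stand in for theta'_I and theta_J
two_sided = u(-theta_i) @ m @ u(theta_j)                           # 4 rotations
one_sided = u(-theta_i + theta_j) @ p + u(-theta_i - theta_j) @ q  # 2 rotations
```

Since p and q are each determined by their first column, the one-sided form halves the number of two-dimensional vector rotations, as the text states.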
`
Representations of u(θ_I) and u(−θ'_I) must be computed as intermediate forms in order to apply the Jacobi rotations and build up the matrices of singular vectors (or eigenvectors if B is symmetric) U and V^T as the products, accumulated over the steps, of the matrices ∏_{I=1}^{n/2} J_I and ∏_{I=1}^{n/2} J'_I, respectively. The evaluation of these representations may be performed by finding the rotations u(θ_I⁻) and u(−θ_I⁺), where θ_I⁻ ≜ θ'_I − θ_I and θ_I⁺ ≜ θ'_I + θ_I, that force the second component of the vectors
(p₁, −p₂)ᵀ and (q₁, q₂)ᵀ, respectively, to 0. This approach, exploiting the decomposition
of Clifford numbers into even and odd parts, has been used on general-purpose
computers, using standard arithmetic, from very early on (e.g., Forsythe and Henrici,
1960). However, the use of that decomposition for the application of the rotations did
not follow. To understand why, we have to place ourselves in the context of machines
using standard arithmetic. We first note that in this context the passage from
S_d ≜ {u(θ_l⁻), u(−θ_l⁺), 1 ≤ l ≤ n/2} to S_ev ≜ {u(θ_l), u(−θ_l′), 1 ≤ l ≤ n/2}, and
the passage from S_ev to S_ap ≜ {u(θ_ij⁻), u(−θ_ij⁺), 1 ≤ i ≠ j ≤ n/2}, where
θ_ij⁻ ≜ θ_i′ − θ_j and θ_ij⁺ ≜ θ_i′ + θ_j, are done via the trigonometric formulas for the
tangents of the rotation angles, using the ±, × and / operations and also, for the first
passage, √ operations, since tangents of half-angles must then be computed. The next
observation is that a rotation is ultimately represented by its cosine and sine in this
context and, given an intermediate tangent representation, generating the cosine/sine
representation requires +, ×, /, and √ operations. The reason for not exploiting the
decomposition in a sequential setting is now clear: the computation of the cosine/sine
representation of S_ap given the tangent representation of S_ev requires O(n²) ±,
×, /, and √ operations, while the computation of the cosine/sine representation of S_ev
given its tangent representation costs only O(n) ±, ×, /, and √ operations; the halving
of multiplications obtained when performing the two-dimensional vector rotations
using the decomposition does not offset the large increase in ±, ×, /, and √ operations
needed to find the representations of the rotations.
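The evaluation step described above can be sketched numerically: in a 2×2 matrix representation of C₂, the angles θ_l⁻ and θ_l⁺ are found by zeroing the second components of the even and odd coefficient vectors, after which θ_l and θ_l′ follow from half sums and differences. The function names, the coefficient formulas, and the sign conventions here are our own choices.

```python
import numpy as np

def R(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def jacobi_angles(B):
    """Angles (theta_l', theta_l) of the two-sided rotation diagonalizing a
    2x2 block B, via the even/odd decomposition.

    theta_minus = theta_l' - theta_l annihilates the e1^e2 component of the
    even part; theta_plus = theta_l' + theta_l annihilates the e2 component
    of the odd part (sign conventions are ours).
    """
    p1, p2 = (B[0, 0] + B[1, 1]) / 2, (B[1, 0] - B[0, 1]) / 2   # even part
    q1, q2 = (B[0, 0] - B[1, 1]) / 2, (B[0, 1] + B[1, 0]) / 2   # odd part
    theta_minus = np.arctan2(p2, p1)
    theta_plus = np.arctan2(q2, q1)
    theta = (theta_plus - theta_minus) / 2       # right angle theta_l
    theta_p = (theta_plus + theta_minus) / 2     # left angle theta_l'
    return theta_p, theta

B = np.array([[1.0, 2.0], [3.0, 4.0]])
theta_p, theta = jacobi_angles(B)
D = R(-theta_p) @ B @ R(theta)
# the rotated block is diagonal; |d1|, |d2| are the singular values of B
assert np.allclose(D[0, 1], 0) and np.allclose(D[1, 0], 0)
```

Note the half-angle step in `jacobi_angles`: recovering θ_l and θ_l′ from θ_l⁻ and θ_l⁺ is exactly the "first passage" from S_d to S_ev discussed in the operation-count argument.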
