Proceedings of the
International Conference on

APPLICATION SPECIFIC
ARRAY PROCESSORS

September 5-7, 1990

Princeton, New Jersey

Sponsored by
Princeton University

Co-Sponsored by
Industrial Development Board for Northern Ireland
NEC Research Institute

In Cooperation with
IEEE Computer Society

Edited by

Sun-Yuan Kung
Princeton University

Jose A.B. Fortes
Purdue University

Earl E. Swartzlander, Jr.
University of Texas at Austin

K. Wojtek Przytula
Hughes Research Laboratories

IEEE Computer Society Press
Los Alamitos, California

Washington • Brussels • Tokyo

Page 1 of 18

SAMSUNG EXHIBIT 1041
The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and are published as presented and without change, in the interests of timely dissemination. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society Press, or The Institute of Electrical and Electronics Engineers, Inc.

Published by

IEEE Computer Society Press
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

Copyright © 1990 by the Institute of Electrical and Electronics Engineers, Inc.
Cover designed by Jack I. Ballestero

Printed in the United States of America

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 29 Congress Street, Salem, MA 01970. Instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. For other copying, reprint or republication permission, write to Director, Publishing Services, IEEE, 345 East 47th Street, New York, NY 10017. All rights reserved.

IEEE Computer Society Press Order Number 2089
Library of Congress Number 90-82602
IEEE Catalog Number 90CH2920-7
ISBN 0-8186-9089-5 (case)
ISBN 0-8186-6089-9 (microfiche)
SAN 264-620X

Additional copies can be ordered from:

IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama, Minato-ku
Tokyo 107, JAPAN

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331

THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, INC.

iv
Table of Contents

Foreword . . . v
General Chair's Message . . . vi
Program Chair's Message . . . vi
Conference Committees . . . vii
Referees . . . viii

Keynote Address: Application-Oriented High Speed Processors: Experiences and
Perspectives . . . 1
    Yasuo Kato

Design Methodology

Calculus of Space-Optimal Mappings of Systolic Algorithms on Processor Arrays . . . 7
    P. Clauss, C. Mongenet, and G.R. Perrin
A Processor-Time Minimal Systolic Array for Transitive Closure . . . 19
    P.R. Cappello and C.J. Scheiman
Systolic Array Implementation of Nested Loop Programs . . . 31
    J. Bu, E.F. Deprettere, and L. Thiele

Potpourri I

The … Systolic Back-Projection Engine (BSSBPE) . . . 43
    … Bayfor…
A Database Machine Based on Surrogate Files . . . 55
    S.M. Chung
Systolic Architectures for Decoding Reed-Solomon Codes . . . 67
    J. Nelson, A. Rahman, and E. McQuade
Mapping High-Dimension Wavefront Computations to Silicon . . . 78
    C.-M. Wu, R.M. Owens, and M.J. Irwin
Systolic Architecture for 2-D Rank Order Filtering . . . 90
    J.-N. Hwang and J.-M. Jong
Scheduling Affine Parameterized Recurrences by Means of Variable Dependent
Timing Functions . . . 100
    P. Quinton, C. Mauras, S. Rajopadhye, and Y. Saouter
Logic Description Generator . . . 111
    B. Gokhale, A. Kopser, S.P. Lucas, and R.G. Minnich
Recursive Algorithms for AR Spectral Estimation and Their Array Realizations . . . 121
    C.-W. Jen and C.-M. Liu
… Designs by Non-Standard Interpretation . . . 133
Systolic VLSI Compiler (SVC) for High Performance Vector Quantisation Chips . . . 145
    J.V. McCanny, Y. Hu, and M. Yan
Extensions to Linear Mapping for Regular Arrays with Complex
Processing Elements . . . 156
    J. Rosseel, F. Catthoor, and H. De Man
Design of Run-Time Fault-Tolerant Arrays of Self-Checking Processing Elements . . . 168
    J. Franzen

Special-Purpose Systems

GRAPE: A Special-Purpose Computer for N-Body Problems . . . …
    J. Makino, T. Ito, T. Ebisuzaki, and D. Sugimoto
Building Blocks for a New Generation of Application-Specific Computing
Systems . . . 190
    B. Baxter, G. Cox, T. Gross, H.T. Kung, D. O'Hallaron, C. Peterson,
    J. Webb, and P. Wiley
Reconfigurable Vector Register Windows for Fast Matrix Computation on the
… . . . 202
    D.K. Panda and K. Hwang
Massively Parallel Architecture: Application to Neural Net Emulation and Image
Reconstruction . . . 214
    B. Faure, D. Lattard, and G. Mazare
A Real-Time Software Programmable Processor for HDTV and Stereo
… . . . 226
    T. Nishitani, I. Tamitani, H. Harasaki, and M. Yano

Mapping Applications onto Architectures

Mapping Algorithms Onto the TUT Cellular Array Processor . . . 235
    J. Viitanen, T. Korpiharju, J. Takala, and H. Kiminkinen
A 3-D Wafer Scale Architecture for Early Vision Processing . . . 247
    S.T. Toborg
Algorithmic Mapping of Neural Network Models onto Parallel SIMD Machines . . . 259
    V.K. Prasanna Kumar and K.W. Przytula
Implementation of Systolic Algorithms Using Pipelined Functional Units . . . 272
    M. Valero-García, J.J. Navarro, J.M. Llabería, and M. Valero
Array Processing on Finite Polynomial Rings . . . 284
    N. Wigley and G.A. Jullien

Potpourri II

The RAP: A Ring Array Processor for Layered Network Calculations . . . 296
    N. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman, and J. Beer
Linear … for Residue Mapping… . . . 309
    A. Skavantzos and Z.B. …
A Fault-Tolerant Two-Dimensional … . . . 317
    J.G. Krammer and H. Arif
Channel Complexity Analysis for Reconfigurable VLSI/WSI Processor Arrays . . . 329
    P.K. Rhee and J.H. …
Digit-Serial DSP Architectures . . . 341
    K.K. Parhi and C.-Y. Wang
PASIC: A Sensor/Processor Array for … Vision . . . 352
    K. Chen, P.-E. Danielsson, and A. …
An Analog VLSI Array Processor for Classical and Connectionist AI . . . 367
    J.W. Mills and C.A. Daffinger
Systolic Two-Port Adaptor for High Performance Wave Digital Filtering . . . 379
    J.V. McCanny and R.J. Singh
A Multilayer … Model and Array Processor Implementation . . . 389
    … Fu and C.C. Chia
Reconfiguration of FFT Arrays: A Flow-Driven Approach . . . 401
    … and N. Scarabottolo
Towards the Automated Design of Application Specific Array
Processors (ASAPs) . . . 414
    A.P. Marriott, A.W.G. Duller, R.H. Storer, A.R. Thomson, and M.R. Pout
Fault-Tolerant Array Processors Using N-and-Half-Track Switches . . . 426
    J.S.N. Jean
Domain Flow and Streaming Architectures . . . 438
    E.T.L. Omtzigt
An Improved Systolic Extended Euclidean Algorithm for Reed-Solomon
Decoding: Design and Implementation . . . 448
    R. Doyle, P. Fitzpatrick, and J. Nelson

System Building Blocks

Digit-Serial VLSI Microarchitecture . . . 457
    S.G. Smith, J.G. Payne, and R.R.W. Morgan
CMOS VLSI Lukasiewicz … . . . 469
… . . . 481
    J.W. Mills and C.A. Daffinger
ASP Modules: Building Blocks for Application-Specific Massively Parallel
Processors . . . 493
    R.M. Lea
Designing Specific Systolic Arrays with the API15C Chip . . . 505
    P. Frison, E. Gautrin, D. Lavenier, and J.L. Scharbarg

Special-Purpose Systems 2

A Prototype for a Fault-Tolerant Parallel Digital Signal Processor . . . 518
    B.R. Musicus, A. …, and A.J. Wei…
… . . . 530
A VLSI Architecture for Simplified Arithmetic Fourier Transform Algorithm . . . 542
    I.S. Reed, M.T. Shih, E. Hendon, T.K. …, and D.W. Tufts
Fine Grain System Architectures for Systolic Emulation of Neural … . . . 554
    U. Ramacher and W. Raab
Programming Environment for a Line Processor SYMPATI-2 . . . 567
    P. Fernandez, P. Adam, D. Juvin, and J.-L. Basille

Potpourri III

A Feedback Concentrator for the Image Understanding Architecture . . . 579
    D. Rana and C.C. Weems
A Design Methodology for Fixed-Size Systolic Arrays . . . 591
    J. Bu, E.F. Deprettere, and P. Dewilde
A Formal Design Methodology for Parallel Architectures . . . 603
    M.A. Bayoumi and K.M. Elleithy
A Multiple-Level Heterogeneous Architecture for Image Understanding . . . 615
    D.B. Shu, J.G. Nash, and C.C. Weems
Application-Specific VLSI Architectures Based on De Bruijn Graphs . . . 628
… Matrix … onto Local-Access Processor Arrays . . . …
    J.H. Moreno and T. Lang
… Coprocessor Computer Architecture . . . 653
…ing Pyramids in Array Processors with Pipelined Busses . . . 665
    Z. Guo and R.G. Melhem
Implementation of ANN on RISC Processor Array . . . 677
    A. Hiraiwa, M. Fujita, S. Kurosu, S. …, and M. Inoue
Systolic-Based Computing Machinery for Radar Signal Processing Studies . . . 689
    … B. Cho, T. Greenlay, J. Orlando, C. Deng, …
A Systolic Array for Nonlinear Adaptive Filtering and Pattern Recognition . . . 700
    J.G. McWhirter, D.S. Broomhead, and T.J. Shepherd
Parallel Algorithm for Traveling Salesman Problem on SIMD Machines Using
… . . . 712
    C.S. Jeong and M.H. Kim
The Design of a High-Performance Scalable Architecture for … . . . 722
    C.T. …, W. Liu, T. …, and R. Cavin
Testing a Motion Estimator Array . . . 734
    W.P. Marnane and W.R. Moore

Systolic Arrays

Spacetime-Minimal Systolic Architectures for Gaussian Elimination and the
Algebraic Path Problem . . . 746
    A. Benaini and Y. Robert
Two-Level Pipelined Implementation of Systolic Block Householder
Transformation with Application to RLS Algorithm . . . 758
    K.J.R. Liu, S.F. Hsieh, and K. Yao
Bit-Level Systolic Algorithm for the Symmetric Eigenvalue Problem . . . 770
    J.-M. Delosme
A Practical Runtime Test Method for Parallel Lattice-Gas Automata . . . 782
    R. Squier and K. Steiglitz
… . . . 794
Author Index . . . 805

xii
This material may be protected by Copyright law [Title 17 U.S. Code]

BIT-LEVEL SYSTOLIC ALGORITHM
FOR THE SYMMETRIC EIGENVALUE PROBLEM

JEAN-MARC DELOSME
Department of Electrical Engineering
Yale University
An arithmetic algorithm is presented which speeds up the parallel Jacobi method for the eigen-decomposition of real symmetric matrices. The matrices to which the plane Jacobi rotations are applied are decomposed into even and odd parts, enabling the application of the rotations from a single side and thus removing some sequentiality from the original method. The rotations are evaluated and applied in a fully concurrent fashion with the help of an implicit CORDIC algorithm. In addition, the CORDIC algorithm can perform rotations with variable resolution, which leads to a significant reduction in the total computation time.

I. INTRODUCTION
The eigenvalue decomposition of a real symmetric matrix or the singular value decomposition (SVD) of an arbitrary real matrix may be obtained by the Jacobi method (Jacobi/Kogbetliantz). This method can be parallelized to a high degree [1], [2], resulting in a computation time that is approximately linear in the smallest dimension of the matrix. Furthermore the ratio of parallel to sequential hardware cost is of the same order as the gain in computation time. Thus, the parallel hardware is exercised with a fairly high efficiency, essentially independent of the matrix (smallest) dimension. Although the Jacobi method requires more operations than the justly popular QR method (Francis/Golub and Kahan), a significantly higher degree of parallelism may be extracted from it, making it the method of choice for the fast diagonalization, via orthogonal transformations, of unstructured dense matrices. Our objective is to determine extremely fast ways of performing this diagonalization in the context of signal and image processing. This entails, starting from the Jacobi method and the parallelization scheme of Brent and Luk, the design of algorithms at a detailed level and the development of the associated, application-specific, array architectures. In this paper, after analyzing the elementary mathematical operations in the Jacobi method (i.e. the evaluation and application of Jacobi rotations), we devise arithmetic algorithms that effect these mathematical operations with few primitive operations (i.e. few shifts and adds) and enable the most efficient use of the parallel hardware. Moreover we modify the Jacobi algorithm in order to reduce the total number of primitive operations for achieving matrix diagonalization.

By targeting an implementation that is as fast as can be found, we are led to exploring and exploiting as much as possible the mathematical structure of the problem, hence to finding arithmetic algorithms better adapted to the problem at hand. Implementations which match lower data rates can then be generated by a fairly standard process of sequentialization. By considering the SVD problem, which may be viewed as a generalization of the symmetric eigenvalue problem, we first direct our search for structure toward fundamental objects and properties. The special features due to symmetry are exploited in a second phase. Our approach to algorithm design is thus hierarchical: first the main structure, then the refinements. This way we avoid focusing early on non-fundamental features and, as a result, being trapped in a local optimum. In fact, in order to uncover a global solution, we have embedded the problem into a further

CH2920-7/90/0000/0770$01.00 © 1990 IEEE
generalization: the SVD problem for complex matrices. Although this generalization is not discussed in this paper (it is presented in [6]), it did guide us in our search.

The parallel Jacobi algorithm of Brent and Luk is briefly described in Section II. This algorithm exhibits close to maximal parallelism and does it at a close to minimal communication cost, where 'close to' means 'up to a small constant multiplicative factor' [7]. It provides the starting point for the process of refinement, taking the above multiplicative factors closer to unity, that brings forth our array for the eigen-decomposition of real symmetric matrices. The Jacobi method is a succession of parallel steps, starting from an initial square matrix and converging to a diagonal matrix, in which plane rotations are applied on both sides of the 2x2 submatrices of the current iterate. In Section III a mathematical property, the existence of the decomposition of Clifford numbers into even and odd parts, is shown to enable the application of the rotations from the same side, thus leading to a parallel procedure for the evaluation and the application of the Jacobi rotations. While this procedure would cost many operations if the rotations were evaluated using the standard arithmetic operations, +, ×, /, √, it becomes cheap if CORDIC arithmetic, based on shifts and adds and reviewed in Section IV, is employed. Our 'implicit' CORDIC algorithm for the symmetric eigenvalue problem, which does not compute rotation angles explicitly, is presented in Section V. Evaluation and application of the rotations may be fully overlapped with this algorithm, a feat which cannot be achieved with an explicit CORDIC algorithm.
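CORDIC arithmetic is reviewed in Section IV (beyond this excerpt); as a reminder of its shift-and-add flavor, here is the classic explicit rotation-mode CORDIC, a sketch only — the implicit algorithm of Section V specifically avoids computing the angle accumulator z explicitly, and the function name here is ours:

```python
import math

def cordic_rotate(x, y, theta, n=32):
    """Rotate (x, y) by angle theta using only scalings by 2**-i
    (shifts in hardware) and additions: the explicit, textbook CORDIC."""
    angles = [math.atan(2.0 ** -i) for i in range(n)]  # elementary angles
    K = 1.0                                            # cumulative gain
    for i in range(n):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = theta
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0    # steer the residual angle toward 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return K * x, K * y                # undo the gain of the micro-rotations

# Rotating (1, 0) by pi/6 should give (cos pi/6, sin pi/6):
cx, cy = cordic_rotate(1.0, 0.0, math.pi / 6)
```

Each iteration is one shift-and-add per coordinate, which is what makes the method attractive for bit-level systolic implementation.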
II. ARRAY ARCHITECTURE
The method of Jacobi, first applied to the eigen-decomposition of real symmetric matrices B = U Λ U^T, with U orthogonal and Λ real diagonal, was generalized by Kogbetliantz (1955) to the computation of the SVD of a real rectangular matrix A = V Σ U^T, where U and V have orthonormal columns and Σ is real diagonal with positive entries. We shall first expose the general case and then turn to the symmetric case, which can be viewed as a special case.

Without loss of generality, A may be assumed to have more rows than columns. By applying plane, Givens, rotations from the left, A may be decomposed into QB where Q has orthonormal columns and B is square. Thus the computation of the SVD of a rectangular matrix A reduces to the SVD computation of an associated square matrix B, and we can from now on consider only the decomposition of square matrices B.

Starting from a general real n×n matrix B, the Jacobi method performs a short sequence of sweeps to bring the matrix to diagonal form. In each sweep n(n-1)/2 pairs of plane rotations are applied to both sides of the matrix to annihilate each of the n(n-1) off-diagonal elements once. Each pair of rotations may be represented by two n×n matrices, J_ij, applied to the matrix from the right, and J'_ij, applied to the matrix from the left, where the couple ij, with 1 ≤ i < j ≤ n, is distinct for each pair in the sweep. Both rotation matrices differ from the identity matrix of order n by the principal submatrix formed at the intersection of the row and column pairs corresponding to i and j. These principal submatrices have the form

    [  cos θ_ij   sin θ_ij ]                [  cos θ'_ij   sin θ'_ij ]
    [ -sin θ_ij   cos θ_ij ]  for J_ij and  [ -sin θ'_ij   cos θ'_ij ]  for J'_ij,

and the angles θ_ij and θ'_ij are selected to zero out simultaneously the ij-th and ji-th entries of the matrix to which J_ij and J'_ij are applied.
The simultaneous application of p non-conflicting pairs of rotations, zeroing out 2p entries of the matrix to which they are applied, is called a (parallel) step. For ease of presentation we shall assume that n is even and refer to [1] and [7] for n odd. Since any partition of the set {1,...,n} into pairs {i,j} has n/2 parts, a maximally parallel step would apply n/2 pairs of rotations simultaneously. If a sweep is decomposed into a sequence of such steps, each forcing n matrix entries to 0, and if for all the steps in the same sweep the indices ij of the pairs of rotations are distinct, the number of steps in a sweep is minimal, equal to n-1. Such a scheme may be constructed by selecting a cyclic permutation, P, of the indices {2,3,...,n}. By partitioning into contiguous pairs the ordered concatenation {1} ∪ S, where S is the ordered set {2,3,...,n}, a set of pairs, {12; 34; ...; n-1,n}, is obtained whose order is induced from the order {1} ∪ S. These are the pairs of indices for the pairs of rotations applied in the first step of a sweep. Next the ordered concatenation {1} ∪ PS is partitioned into contiguous pairs, with order induced by the order {1} ∪ PS, defining the pairs of indices for the second step. The order {1} ∪ P^2 S defines the pairs of indices for the third step, and so on until {1} ∪ P^{n-2} S for the (n-1)-th step. The following order is {1} ∪ P^{n-1} S = {1} ∪ S; indeed this is the beginning of the next sweep. We shall index the pairs at a given step k by a single number I, 1 ≤ I ≤ n/2; thus J_ij will alternately be written J_I and, in particular, J_12 and J_1 represent the same matrix at step 1. Brent and Luk have selected the cyclic permutation

    2 → 3 → 5 → ··· → n-3 → n-1 → n → n-2 → ··· → 6 → 4 → 2 ,

which has the desirable property that the indices ij in the I-th pair at step k come from the neighboring pairs at step k-1, with indices I-1, I, or I+1.
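The ordering just described is easy to generate. A sketch (function names are ours) that builds, for even n, the n-1 steps of a sweep from the Brent-Luk permutation:

```python
from itertools import combinations

def brent_luk_schedule(n):
    """Index pairs for the n-1 parallel steps of one sweep (n even),
    using the cyclic permutation 2 -> 3 -> 5 -> ... -> n-1 -> n -> n-2
    -> ... -> 4 -> 2 described in the text."""
    assert n % 2 == 0 and n >= 4

    def P(v):                        # the cyclic permutation on {2, ..., n}
        if v == 2:
            return 3
        if v % 2 == 1:               # odd: 3 -> 5 -> ... -> n-3 -> n-1 -> n
            return v + 2 if v < n - 1 else n
        return v - 2                 # even: n -> n-2 -> ... -> 4 -> 2

    s = list(range(2, n + 1))        # the ordered set S = {2, 3, ..., n}
    steps = []
    for _ in range(n - 1):
        order = [1] + s              # the ordered concatenation {1} U P^(k-1) S
        steps.append([(order[i], order[i + 1]) for i in range(0, n, 2)])
        s = [P(v) for v in s]        # advance to the next step's ordering
    return steps

sched = brent_luk_schedule(8)
```

For n = 8 this yields 7 steps of 4 disjoint pairs each, and over the sweep every unordered pair {i, j} appears exactly once, as the text requires.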
The matrix B is transformed into a diagonal matrix through a sequence of steps, starting with B_0 = B and computing at step k

    B_k = ( ∏_{I=1}^{n/2} J'_I ) B_{k-1} ( ∏_{I=1}^{n/2} J_I ) .

Although this is not written explicitly, the two sets of rotations {J_I, 1 ≤ I ≤ n/2} and {J'_I, 1 ≤ I ≤ n/2} depend on the step k. Moreover, since the rotations within each set are disjoint, each set of rotations is applied in parallel. At a high level, the scheme of Brent and Luk leads directly to a parallel architecture for the SVD, taking the form of a square array with n/2 processors on a side. Processor IJ holds at the beginning of step k the 2×2 submatrix of B_{k-1} sitting at the intersection of rows i and j and of columns r and s, where ij and rs are respectively the I-th and J-th pairs of indices at step k. Each diagonal processor, such as processor II or processor JJ, evaluates the two plane rotations, J_I and J'_I or J_J and J'_J, that zero out the two off-diagonal entries of the submatrix it holds, and updates accordingly the two diagonal entries. From each diagonal processor a representation of each of the two rotations is sent either along the column to which the processor belongs, for the rotations applied from the right such as J_I and J_J, or along the corresponding row, for the rotations applied from the left such as J'_I and J'_J. (The choice of representation per se will be discussed in Section V.) Each off-diagonal processor, IJ with I ≠ J, applies J_J from the right and J'_I from the left to the submatrix it holds. Then the entries are exchanged between neighboring processors in the array in such a way that processor IJ holds at the beginning of step k+1 the submatrix whose row indices and column indices are respectively the I-th and J-th pair of indices at step k+1.
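The behavior of the scheme can be emulated serially: one sweep visits every pair (i, j) once and applies the two-sided rotation that zeroes B_ij (here in the symmetric case, where the left rotation is the transpose of the right one, and with the standard Jacobi angle formula tan 2θ = 2 b_ij / (b_ii - b_jj)). A sketch, with the n(n-1)/2 rotations of a sweep executed one after another rather than as n-1 parallel steps, and with names of our own choosing:

```python
import math, random

def jacobi_sweep(B):
    """One serial sweep of the symmetric Jacobi method on B (list of lists)."""
    n = len(B)
    for i in range(n - 1):
        for j in range(i + 1, n):
            theta = 0.5 * math.atan2(2 * B[i][j], B[i][i] - B[j][j])
            c, s = math.cos(theta), math.sin(theta)
            for k in range(n):       # rows i and j:  B <- G B
                B[i][k], B[j][k] = (c * B[i][k] + s * B[j][k],
                                    -s * B[i][k] + c * B[j][k])
            for k in range(n):       # columns i and j:  B <- B G^T
                B[k][i], B[k][j] = (c * B[k][i] + s * B[k][j],
                                    -s * B[k][i] + c * B[k][j])
    return B

def off_norm(B):
    """Frobenius norm of the off-diagonal part."""
    n = len(B)
    return math.sqrt(sum(B[i][j] ** 2 for i in range(n)
                         for j in range(n) if i != j))

random.seed(0)
n = 8
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        A[i][j] = A[j][i] = random.uniform(-1, 1)
before = off_norm(A)
for _ in range(6):
    jacobi_sweep(A)
after = off_norm(A)   # a few sweeps drive the off-diagonal mass toward 0
```

The rotations within one step of the schedule touch disjoint row/column pairs, which is why the array may apply them simultaneously; the serial emulation above computes the same iterates.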
For the symmetric eigenvalue problem, at every step J'_I is imposed to be equal to J_I^T, 1 ≤ I ≤ n/2. This reduces the amount of computation to be performed by the diagonal processors. Moreover, since all the iterates B_k are symmetric, the array is truncated to a triangular array, i.e. all the processors below the diagonal are removed. Such an array is displayed in Figure 1 for n = 10; the arrows indicate the communications taking place during the exchanges while the horizontal and vertical links that carry the representations of the rotations are not shown. (Note that the amount of data communicated during the exchanges could be reduced to about half, which can be argued to be minimal [7], if the order used so far, {1} ∪ P^{k-1} S with P the cyclic permutation of Brent and Luk and k the step number, is kept when k is odd and is replaced by P̃({1} ∪ P^{k-1} S), where P̃{1,2,...,n} = {3,4,1,2,7,8,5,6,···}, when k is even. However this scheme is more complicated to implement.)

Figure 1. Array for the eigen-decomposition of a symmetric matrix of order 10.
III. CLIFFORD ALGEBRA

We shall consider throughout this section the SVD problem; specialization of the results to the symmetric eigenvalue problem will come in Section V. A diagonal processor, II, evaluates the left and right rotations, J'_I and J_I, which diagonalize the 2×2 matrix it contains, and computes the new diagonal entries:

    [ cos θ'_I  -sin θ'_I ] [ a  b ] [  cos θ_I   sin θ_I ]   [ a'  0  ]
    [ sin θ'_I   cos θ'_I ] [ c  d ] [ -sin θ_I   cos θ_I ] = [ 0   d' ]

An off-diagonal processor, IJ, applies from the left the rotation J'_I received from processor II to the matrix it holds, and applies from the right the rotation J_J received from processor JJ:

    [ cos θ'_I  -sin θ'_I ] [ a  b ] [  cos θ_J   sin θ_J ]
    [ sin θ'_I   cos θ'_I ] [ c  d ] [ -sin θ_J   cos θ_J ]

In a search for the most parallel way of evaluating the rotations and updating the diagonal entries, and also of applying the rotations, we shall study the structure of the space of real 2×2 matrices. The underlying structure to be exploited is that of a Clifford algebra: the Clifford algebra of order 2, C_2. To introduce this structure, we start from a vector space over the reals, E_2, of dimension 2; this vector space is a subspace of C_2. The reason for the notation E_2 is that an Euclidean norm is defined on the vector space, given by the quadratic form u^2 ≜ u_1^2 + u_2^2, where u = (u_1, u_2) belongs to E_2. (Note that a Clifford algebra could also be defined starting with a pseudo-Euclidean form, u_1^2 - u_2^2, in two dimensions.) The scalar product of two vectors, u and v, is also an element of C_2, defined as u·v ≜ (uv + vu)/2 where, clearly, uv + vu = (u+v)^2 - u^2 - v^2 is a scalar.

We have not yet defined uv, the Clifford product of the vectors u and v. Since uv = (uv + vu)/2 + (uv - vu)/2, if the exterior product (uv - vu)/2 ≜ u ∧ v is defined, then the (Clifford) product is defined: uv = u·v + u ∧ v. It follows from the definition of the exterior product that v ∧ u = -u ∧ v. Therefore, selecting an orthogonal basis {e_1, e_2} of E_2, u ∧ v = (u_1 e_1 + u_2 e_2) ∧ (v_1 e_1 + v_2 e_2) = (u_1 v_2 - u_2 v_1) e_1 ∧ e_2. Thus, geometrically, u ∧ v defines the area of the parallelogram with sides u and v. Furthermore the product uv is equal to (u_1 v_1 + u_2 v_2) I + (u_1 v_2 - u_2 v_1) e_1 ∧ e_2, the linear combination of a scalar (proportional to the scalar unit, denoted by I) and a bivector (proportional to e_1 ∧ e_2).
The Clifford algebra C_2 is defined as the vector space over the reals which is the closure of E_2 under Clifford multiplication. Already we have found two subspaces of C_2: E_2 and the two-dimensional subspace of the products of vectors. To go further we would have to formally define, by induction, the product of arbitrary elements of C_2. Because of a lack of space we shall instead define the product via the shortcut of an isomorphism. The isomorphism results from the identification:

    e_1 = [ 1   0 ]      e_2 = [ 0  1 ]
          [ 0  -1 ]            [ 1  0 ]

and Clifford product = matrix product. The reader may check that

    e_1^2 = e_2^2 = [ 1  0 ] = I ,    e_1·e_2 = (e_1 e_2 + e_2 e_1)/2 = 0·I ,
                    [ 0  1 ]

hence e_1 and e_2 form an orthonormal basis of E_2. Moreover

    e_1 ∧ e_2 = (e_1 e_2 - e_2 e_1)/2 = [  0  1 ]
                                        [ -1  0 ]

Hence linear combinations of scalars and bivectors are of the form

    p = p_1 I + p_2 e_1 ∧ e_2 = [  p_1  p_2 ]
                                [ -p_2  p_1 ]

and vectors are of the form

    q = q_1 e_1 + q_2 e_2 = [ q_1   q_2 ]
                            [ q_2  -q_1 ]

Now we observe that any 2×2 real matrix m may be decomposed as p + q:

    m = [ a  b ] = p + q = [ p_1 + q_1    p_2 + q_2 ]
        [ c  d ]           [ -p_2 + q_2   p_1 - q_1 ]

with p_1 = (a+d)/2, p_2 = (b-c)/2, q_1 = (a-d)/2 and q_2 = (b+c)/2. Thus the space of linear combinations of scalars, vectors and bivectors is isomorphic to the linear space of real 2×2 matrices. Since the set of real 2×2 matrices is closed under matrix multiplication, the space of linear combinations of scalars, vectors and bivectors is also closed under Clifford multiplication and is therefore the whole of the Clifford algebra C_2.
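The identification and the even/odd split can be checked mechanically. A small sketch (helper names are ours):

```python
def mat(a, b, c, d):
    return [[a, b], [c, d]]

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

# The identification of Section III:
I   = mat(1, 0, 0, 1)
e1  = mat(1, 0, 0, -1)
e2  = mat(0, 1, 1, 0)
e12 = mat(0, 1, -1, 0)           # e1 ∧ e2

assert mul(e1, e1) == I and mul(e2, e2) == I           # orthonormal basis
assert add(mul(e1, e2), mul(e2, e1)) == mat(0, 0, 0, 0)  # e1 · e2 = 0
assert mul(e1, e2) == e12                              # e1 e2 = e1 ∧ e2

# Even/odd decomposition m = p + q of an arbitrary 2x2 matrix:
a, b, c, d = 3.0, 1.0, -2.0, 5.0
p1, p2 = (a + d) / 2, (b - c) / 2
q1, q2 = (a - d) / 2, (b + c) / 2
p = mat(p1, p2, -p2, p1)         # even part: p1 I + p2 e1∧e2
q = mat(q1, q2, q2, -q1)         # odd part:  q1 e1 + q2 e2
assert add(p, q) == mat(a, b, c, d)
```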
Upon identifying the real 2×2 matrices with the Clifford algebra C_2, a decomposition of the real 2×2 matrices into two parts, p and q, has been brought to the fore. This is an instance of the so-called decomposition of Clifford numbers into even and odd parts. Indeed a Clifford algebra C_m, built from a vector space E_m, has for elements real linear combinations of scalars or 0-vectors, vectors or 1-vectors, bivectors or 2-vectors, and so on up to m-vectors. Thus any element may be decomposed in a unique way as the sum of an even part (a linear combination of even vectors) and an odd part (a linear combination of odd vectors). In other words C_m is the direct sum of two subspaces, an 'even' subspace denoted C_m^+ and an 'odd' subspace denoted C_m^-. The even subspace is closed under Clifford multiplication, written symbolically C_m^+ C_m^+ = C_m^+; consequently it is a subalgebra of C_m. The odd subspace satisfies C_m^- C_m^- = C_m^+ and C_m^- C_m^+ = C_m^+ C_m^- = C_m^-. (Of interest for the SVD computation of complex matrices is the Clifford algebra C_3, isomorphic to the 2^3 = 8 dimensional space, over the reals, of complex 2×2 matrices, and with even subspace the quaternions and odd subspace the antiquaternions [6].) The even subspace of C_2 is the algebra of complex numbers; by extension, the odd subspace of C_2, E_2, may be called the subspace of anticomplex numbers.

The units of the even subspace of C_2 are elements p = p_1 I + p_2 e_1 ∧ e_2 with unit norm, where the norm is naturally defined as (p_1^2 + p_2^2)^(1/2). Therefore they can be written under the form u(θ) = cos θ I + sin θ e_1 ∧ e_2 with 0 ≤ θ < 2π, and hence they are the plane rotations. It is easy to check (and this may also be derived as a special case of a property of quaternions) that plane rotations commute with even Clifford numbers and anticommute with odd Clifford numbers:

    p u(θ) = u(θ) p ,    q u(θ) = u(-θ) q .

This enables us to pull the Jacobi rotations from the right to the left, both for the evaluation of the rotations in the diagonal processors and for their application in the off-diagonal processors:

- evaluation in processor II,

    u(-θ'_I) m u(θ_I) = u(-θ'_I) p u(θ_I) + u(-θ'_I) q u(θ_I)
                      = u(-θ'_I) u(θ_I) p + u(-θ'_I) u(-θ_I) q ,

hence, using the property that u(θ) is isomorphic to the complex number exp(iθ),

    u(-θ'_I) m u(θ_I) = u(-θ'_I + θ_I) p + u(-θ'_I - θ_I) q = m' = [ a'  0  ]
                                                                   [ 0   d' ]

- application in processor IJ,

    u(-θ'_I) m u(θ_J) = u(-θ'_I) p u(θ_J) + u(-θ'_I) q u(θ_J)
                      = u(-θ'_I + θ_J) p + u(-θ'_I - θ_J) q ,

hence, by pulling the rotations from the right to the left and exploiting the fact that p and q are fully defined by their first column, the Jacobi rotations may be applied with 2 two-dimensional vector rotations instead of 4.
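The one-sided evaluation just derived can be sketched numerically: decompose the 2×2 matrix into its even part p and odd part q, choose the combined angles θ⁻ = θ_I - θ'_I and θ⁺ = θ_I + θ'_I that make the rotated even and odd parts diagonal, and recover θ_I and θ'_I. A sketch in standard floating-point arithmetic (helper names are ours; the paper's point is that CORDIC replaces this explicit trigonometry):

```python
import math

def two_sided_angles(a, b, c, d):
    """Angles (theta, theta_p) such that u(-theta_p) M u(theta) is
    diagonal for M = [[a, b], [c, d]], with u(t) the rotation
    cos t I + sin t e1∧e2 = [[cos t, sin t], [-sin t, cos t]]."""
    p1, p2 = (a + d) / 2, (b - c) / 2     # even ("complex") part
    q1, q2 = (a - d) / 2, (b + c) / 2     # odd ("anticomplex") part
    theta_minus = -math.atan2(p2, p1)     # makes the rotated even part diagonal
    theta_plus = -math.atan2(q2, q1)      # makes the rotated odd part diagonal
    theta = (theta_plus + theta_minus) / 2
    theta_p = (theta_plus - theta_minus) / 2
    return theta, theta_p

def u(t):
    return [[math.cos(t), math.sin(t)], [-math.sin(t), math.cos(t)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

theta, theta_p = two_sided_angles(2.0, 1.0, 0.0, 1.0)
D = matmul(u(-theta_p), matmul([[2.0, 1.0], [0.0, 1.0]], u(theta)))
# D is diagonal up to rounding
```

Only two plane rotations, of (p_1, p_2) and (q_1, q_2), are needed to determine both angles, which is the parallelism the decomposition buys.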
`Representations of u(@,;) and u(-0;) must be computed as intermediate forms
`in order to apply the Jacobi rotations and build up the matrices of singular vectors (or
`eigenvectors if Bis symmetric) U and V7? as the products, accumulated over the
`n
`n/2
`steps, of the matrices
`J;
`and
`il J;',
`respectively. The evaluation of these
`Isl
`cag
`representations may be performed by finding the rotations u(@;) and u(-6;‘), where
`6; 46,-0;
`and 6+ 46, +6},
`that
`force the second component of the vectors
`
`Page 12 of 18
`
`Page 12 of 18
`
`

`

`776
`
`International Conference on Application Specific Array Processors
`
(p1, -p2)^T and (q1, q2)^T, respectively, to 0. This approach, exploiting the decomposition of Clifford numbers into even and odd parts, has been used on general-purpose computers, using standard arithmetic, from very early on (e.g. Forsythe and Henrici, 1960). However, the use of that decomposition for the application of the rotations did not follow. To understand why, we have to place ourselves in the context of machines using standard arithmetic. We first note that in this context the passage from S_int ≜ {u(θ_I^-), u(-θ_I^+), 1 ≤ I ≤ n/2} to S_diag ≜ {u(θ_r^I), u(-θ_l^I), 1 ≤ I ≤ n/2}, and the passage from S_diag to S_app ≜ {u(θ_IJ^-), u(-θ_IJ^+), 1 ≤ I ≠ J ≤ n/2}, where θ_IJ^- ≜ θ_l^I - θ_r^J and θ_IJ^+ ≜ θ_l^I + θ_r^J, are done via the trigonometric formulas for the tangents of the rotation angles, using the +, × and / operations and also, for the first passage, √ operations, since tangents of half-angles must then be computed. The next
observation is that a rotation is ultimately represented by its cosine and sine in this context and, given an intermediate tangent representation, generating the cosine/sine representation requires +, ×, / and √ operations. The reason for not exploiting the decomposition in a sequential setting is now clear: the computation of the cosine/sine representation of S_app given the tangent representation of S_diag requires O(n²) +, ×, / and √ operations, while the computation of the cosine/sine representation of S_diag given its tangent representation costs only O(n) +, ×, / and √ operations; the halving of the number of multiplications obtained when performing the two-dimensional vector rotations using the decomposition does not offset the large increase in +, ×, / and √ operations needed to find the representations of the rotations. The bottom line in a parallel setting is that the use of the decomposition is not advantageous either, because it saves the time of the rotation of a two-dimensional vector, i.e. a multiply and an add, at the expense of the time of the computation of tan θ_IJ^- or tan θ_IJ^+ given tan θ_l^I and tan θ_r^J, i.e. a multiply, an add and a divide. Yet the existence of the decomposition signals something significant.
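The operation counts above can be made concrete with a short sketch (mine, not the paper's; `tan_sum`, `half_angle_tan` and `cos_sin_from_tan` are illustrative names): a sum-angle tangent costs an add, a multiply and a divide, while a half-angle tangent and a cosine/sine pair each additionally cost a square root.

```python
import math

def tan_sum(ta, tb):
    # tangent addition formula: one add, one multiply, one divide
    return (ta + tb) / (1 - ta * tb)

def half_angle_tan(t):
    # tan(a/2) from tan(a): this is the square root that makes the
    # passage via half-angles cost sqrt operations as well
    return t / (1 + math.sqrt(1 + t * t))

def cos_sin_from_tan(t):
    # cosine/sine representation from a tangent representation
    # (valid for angles in (-pi/2, pi/2), where the cosine is positive)
    c = 1 / math.sqrt(1 + t * t)
    return c, t * c

a, b = 0.4, 0.1
c, s = cos_sin_from_tan(tan_sum(math.tan(a), math.tan(b)))
assert abs(c - math.cos(a + b)) < 1e-12 and abs(s - math.sin(a + b)) < 1e-12
assert abs(half_angle_tan(math.tan(a)) - math.tan(a / 2)) < 1e-12
```

Applying `tan_sum` to all ordered angle pairs, then converting each result with `cos_sin_from_tan`, is exactly the O(n²)-pair cost that the paragraph above attributes to the decomposition route on a sequential machine.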
It removes some sequentiality at the level of the rotation operations and, if an arithmetic implementation of the rotations is employed that is better adapted to these operations than the traditional decomposition into +
