`
`UNCLASSIFIEO
`
`NOOOII-‘rO-c-OSTO
`
`CARNEGIE-HELLO?! UNIV PITTSBLRSH Pl DEPT OF CMTEI “ETC FIG 9’2
`STSTOLIC ARRAYS FOR (VLSI).IU)
`MC 7!
`H 'I' KWII C E LEISEISON
`CMS-Wl.’
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 1
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 1
`
`
`
`J
`
`"Ills "”TFFE EEEE1 -IEr
`I--Mno'0
`E
`“NE Eml]: EM3l-
`
`NI
`
`lefiRl
`MILHUCOPY RISULLIHUN fiSI'
`Mllrmn'. 111mm] nI ulmmlau. IHI.I &
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 2
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 2
`
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 3
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 3
`
`
`
`
`
`‘
`
`_
`
`I
`names-79 - 103
`
`.
`
`SYSTOLIC ARRAYS FOR (VLSI)
`
`H. T. Kong and Chart-a E. Lainraon
`
`Doparlmoni a! Complain Stine.
`Carnegie-Minn Univ-wily
`Pittsburgh. Pomyivania 15213
`
`April 1378
`
`Ill! “vi-Id Datum 19781
`
`
`
`
`
`_—.——_.,—.-.4
`
`i
`
`
`[in the iorthcomim book Iniroduclion Io VLSI System by C. A. Mud md L. A. Conway]
`
`Copyright -c- 1979 by HT. Rum and Charla: E. Loiuraon
`All Right: Rum
`
`'
`
`rauarch is support“ in part by the National Sci-ma Foundation under Grant
`This
`M38 75-22245! and the Oflico of Naval Ros-arch and-r Contract MOM-7904370.
`MOM-422. Charm ELelsarum is annual-tad in part by the Fun-Ii- and him Hort:
`Foundation.
`-
`
`Th!- document has been (gram-d
`for public release and Ialo; It!
`III-tribunal: la unlimihd.
`
`.Qla..ua.-.I'I'l
`
`
`
`._..__..-__.._...u-Hud'-I-.
`
`i
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 4
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 4
`
`
`
`*r-I—I.TWMW
`
`
`
`~1--_v.
`-so!"
`
`,
`.
`
`'
`
`I
`.
`
`.
`
`
`
`
`
`1
`
`r
`I
`
`And now I see with eye serene
`The very put-re of the machine.
`-'Wllllarn Wordsworth
`
`Abstract
`
`3)
`
`A systolic system is a network of processors which rhythmically compute and pass
`data through the system.
`Physiologists use the word ”systole“ to refer to the
`rhythmically recurrent contraction of
`the heart and arteries which pulses blood
`through the body.
`in e systolic computing system. the function ot a processor is
`analogous to that of the heart. Every processor regularly pumps data in and out,
`each time perlormlne some short computation, so that a regular Hour at date is kept
`Up In the network.
`
`Many bssic matrix computations can be pipelined elegantly and efficiently on
`systolic networks having an array structure. As an example, hesagonally connected
`processors can optimally perform matrix multiplication.
`Surprisingly,
`a similar
`systolic array can compute the LU-decomposilion of a matrix. These systolic arrays
`enjoy simple and regular cornmunication paths, and almost all processors used in the
`networks are identical. As a result. special purpose hardware devices based on
`systolic arrays can be built inexpensively using the VLSI technology.
`
`- 1. Introduction
`
`Developments in microelectronics have revolutionized computer design. Integrated
`circuit technology has increased the number and complexity of components that can
`ill on a chip or a printed circuit board. Companent density has been doubling every
`one-to-two years and already. a multiplier can ill on a very large scale integrator!
`M51) circuit chip. As a result.
`the new technology makes it feasible to build
`low-cost special purpose, peripheral devices to rapidly solve sophisticated problems.
`
`Reflecting the changing technology.
`
`this paper proposes new multiprocessor
`
`‘
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 5
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 5
`
`
`
`1
`r-
`
`
`
`structures for processing some basic matrix computations.
`
`that m a
`'slrugture;
`parallel
`interested in high-performance
`fl! 9;;
`implemented directly g lowiogt hardware M By performance. we are not
`
`refering to the traditional operation counts that characterize classical analyses of
`algorithms, but rather, the throughput obtainable when a special purpose peripheral
`device is attached to a general purpose host computer. This implies that time spent
`
`in U0. control, and data movement as well as arithmetic must all be considered. VLSl
`
`offers excellent opportunities for inexpensive implementation ot high performance
`
`devices (Mead and Conway [1978]}. Thus, in this paper the cost of a device will be
`
`determined by the expense of a VLSl implementation.
`
`'Fit the job to the bargain
`
`components“ -- Blakeslee [1975. p. 4].
`
`VLSI technology has made one thing clear. Simple and regular interconnections
`
`lead to cheap implementations and high densities. and high density implies both high
`
`perlormence and low overhead for support components.
`
`(Sutherland and Mead
`
`[1977] has a good discussion on the importance of having simple and regular
`
`geometries for data paths.) For
`
`these reasons. 3;; fig intgrgstgg in m
`
`WWMMMQMM—Mnit-lionm iris
`
`mmmrs: omit.“Iinin usmorlmmmmm
`
`W By pipelining, computation may proceed concurrently with input and
`
`Pipelining plus
`output. and consequently overall execution time is minimized.
`multiprocessing at each stage of
`a pipeline should lead to the best-possible
`porformence.
`
`W m provide a realistic model of computation which captures the
`concepts of pipelining. parallelism and interconnection structures. We do not want to
`
`give a formal definition ot systolic systems here. For the purpose at this paper, it
`suffices to view a systolic system as a network of processors which rhythmically
`
`to the rhythmic
`The analogy is
`compute and pass data through the system.
`contraction of the heart which pulses blood through the circulatory system of the
`body. Each processor in a systolic network can be thought of as a heart that pumps
`
`multiple streams of data through itself.
`
`The regular beating of
`
`these parallel
`
`processors keeps up a constant flow of data throughout the entire nelworIL As a
`
`processor pumps data items through.
`and may update some of the items.
`
`it performs some constant-time computation
`
`Unlike the closed-loop circulatory system of the body. a systolic computing system
`
`usually has ports into which inputs "0w, and ports where the results of the systolic
`
`computation are retrieved. Thus a systolic system can be a pipetined system - input
`
`I P
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 6
`
`etitioner Microsoft Corporation - EX. 1015 , p. 6
`
`
`
`'A-
`
`and output occur with every pulsation. This makes them attractive as peripheral
`processors attached to the data channel of a host cOmputer Figure 1-1 illustrates
`how a special purpose systolic device might
`lorm a part of a PUP-ll system. A
`systolic device may also process a rsal~lime data stream or he a component
`in a
`larger special purpose system.
`
`
`
`Figure 1-1: A systolic device connected to the UNIBUS of a POP-l I.
`
`This paper deals largely with systolic systems where the underlying network is
`array structured
`(See also Run; and Leiserson [1978].) An array network is
`attractive tor it enioys simple and regular communication paths.
`in Section 2, we
`describe the basic hardware requirements and interconnection schemes tor
`the
`
`systolic arrays propoeed and discuss the lsasibility ot building them in VLSt. Section
`3 deals with the matrix-vector multiplication problem. Multiplication of two matrices
`is considered in Section 4.
`in Section 5, we show that essentially the same systolic
`arrays
`tor matrix multiplication
`in Section
`it
`can be used
`to
`find the
`
`LU-decomposilion ot a matrix. Section 6 is concerned with solving triangular linear
`systems. We show that
`this problem can be solved by almost
`the same systolic
`array tor matrix-vector multiplication described in Section 3. Section 7 discusses
`applications and extensions ol' the results presented in the previous sections. The
`
`applications include the computations at finite impulse response litters. convolutions,
`and discrete Fourier transforms.
`Some concluding remarks are given in the last
`section.
`
`The size of each oi our systolic array networks is dependent only on the band
`width of the hand matrix to be processed, and is independent of the length of the
`hand. Thu. a fixed size systolic array can pipeline hand matrices with arbitrarily
`Ions bands. The pipelinin; aspect of our arrays is. of course. most ettective~tor
`bend matrices with ion; bends. Band matrices are interesting in their own right,
`since many important scientific computations involve band matrices.
`For
`these
`reasons, BIDS! of
`the results in this paper will he presented in terms of their
`
`applications to bend matrices. All the results apply to dense matrices since a dense
`
`a .
`
`
`
`.‘IF\.!'~"w'”9-"~vell-C"?
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 7
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 7
`
`
`
`--..."———.e.+— ._.._..
`
`nostril cen be viewed es I hand matrix having the masimumrpossibie bend width.
`
`2. The Basic Components and Systolic Array Structures
`
`2.] The Inner Product Step Processor
`
`I
`
`The single operation common to all the cornputations considered in this paper is
`the so—caned inner product step. C 0- C + A x B. We postulate a processor which
`hes three registers RA: Fla, and RC. Each register hes two connections. one tor input
`and one tor output. Figure 2-1 shows two types at geometries tor this processor.
`
`A
`
`C"
`
`3"
`
`"C
`
`*B
`
`t
`
`i
`
`A
`
`{at
`
`Figure 2-l: Geometries tor the inner product step processor.
`
`Typo (I) Ioomtry will be used for metrie-vector multiplication end solution of
`
`triengular linear systems (Sections 3 and 5}, whereas type lb) geometry will be used
`tor matrix multiplication and LU-decoinposition {Sections 4 end 5}. The processor is
`cepeble ot performing the inner product step end is called the trim; m m
`
`this
`in terms at the operation oi
`m We shall deline e basic time unit
`processor.
`In each unit‘tlnie interval. the processor shifts the date on its input lines
`
`denoted
`by A.
`6 end G into R“.
`“B and RC,
`respectively,
`cornputes
`RC 0- RC 9 “A 1 RB. and makes the input vetoes for RA and R3 together with the
`new value of RC sveitehte as outputs an the output lines denoted by A, 3 end C,
`respectively. All outputs ere latched end the logic is clocked so thet when one
`
`processor is connected to another, the changing output ot one during a unit time
`interuel will not inlertere with the input to another during this time interval. This is
`
`not the only processing element we shell make use oi. but it will be the work horse.
`
`A special processor tor performing division will be specified later when it is used.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 8
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 8
`
`
`
`-~mm-wrrsrw-n-
`
`2.2 Systolic Arrays
`
`A systolic device is typically composed of many interconnected inner product step
`processors. The basic network organization we shall adopt is the mesh-connected
`scheme in which all connections tram a processor are to neighboring processors.
`(See Figure 2-2.)
`
`CHE—CHI]
`
`(a) linearly connected
`{MIAC N)
`
`{ht orthogonally connected
`
`(ct hesegonetly connected
`
`Figure Z'ZIIMWIN systolic arrays.
`
`The most widely known system based on this organization is the ILLIAC IV (Barnes
`
`it diagonal connections are added in one direction only. are shell cell
`es e1. [1968]).
`the reuniting schemeWW orW tor short. We
`shall demonstrate that linearly connected and hes-connected arrays are natural for
`matrix problems.
`
`Processors lying on the boundary of
`the systolic array may have external
`connections to the host watery. Thus, an inputfoutput data path ot a boundary
`Processor my somtimes be designated as an external input/output connection for
`the device. A boundary processor may receive input from the host morory through
`such an salernal connection. or it may receive a tired value such as zero. On the
`
`other hand. a boundary processor can send data to the host nienlory through an
`esternal output connection. An output of a boundary processor may sometimes be
`ignored. This will be designated by omitting the corresponding output tine.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 9
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 9
`
`
`
`In this paper we assume that the processors in a systolic array are synchronous
`as described in Section 2.1. However. it is possible to view the processors being
`
`asynchronous, each computing its output values when all its inputs are available, as
`
`in a data flow model. For the results oi
`approach to be more direct and intuitive.
`
`this paper we believe the synchronous
`
`The hardware demands of the systolic arrays in this paper are readily seen to be
`modest. The processing elements are unilorm. interprocessor connections are simple
`
`and regular, and external connections are minimized.
`
`It
`
`is our belief
`
`that
`
`construction 01' these systolic arrays will prove to be cost-ettective using, for
`instance, the modern VLSl technology.
`
`3. Matrix-Vector Multiplication on e Linear Systolic Array
`
`the problem of multiplying a matria A -(a~"-) with a vector
`We consider
`at - “tr-“nif- The elements in the product y -(y1....yn)T can be computed by the
`lollowing recurrences.
`
`;
`
`vi" -
`
`o.
`
`lily!"
`
`Vi” * 'ik‘ko
`
`Vt
`
`_
`
`3{En-rt)-
`
`'
`‘
`
`Suppose A is an nvn band matrix with band width w - pus-t. {See Figure 3-1 for
`the case when p - 2 and o - 3.) Then the above recurrences can be evaluated by
`
`pipelining the "i and Vi through a systolic array consisting at is linearly connected
`inner product. step processors. We illustrate the operation of the systolic arrayI to'r
`the band matrix-vector multiplication problem in Figure 3-1.
`For
`this case the
`
`linearly connected systolic array has four inner product step processors.
`Figure 3-2.
`
`See
`
`The general scheme of the computation can be viewed as follows. The Vl- which
`are initially zero. are pumped to the left while the "i are pumped to the right and
`the 'ii are marching down.
`{For
`the general problem of computing And where
`e-(dlfianfl' is any given vector, yi new be initialized as cit an the moves are
`synchronised.
`It turns out that each Vi is able to accumulate all its terms, namely. 'i.
`has“? 'iJ-l'l-l' 'iJ'i and auditi”. betore it
`leaves the neter Figure 3-3
`illustrates the llrst seven pulsations ol the systolic array. little that when yl and ’2
`are output
`they have the correct values. Observe also that at any given time
`
`'.
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 10
`
`etitioner Microsoft Corporation - EX. 1015 , p. 10
`
`6 P
`
`i
`
`_
`
`.
`
`
`
`,
`
`i
`
`‘ J
`
`1
`i
`
`i
`1
`.
`
`1
`
`i
`
`'
`
`E
`I
`""
`
`i
`i
`.
`1
`_
`i
`‘
`I
`i
`
`i
`
`':
`
`.
`
`I"
`
`:
`I
`
`u
`I
`
`ll
`I
`
`'
`
`q
`
`o
`
` xt
`
`I:
`
`8a
`
`“a
`
`Y.
`
`Y:
`
`V:
`
`Ya
`
`:
`
`x
`
`y
`
`Hull 3-]:Wfipflulionolnvulwhylhmdmlfixwflhp-Zlndq-S.
`:
`.u
`‘
`'8
`:
`I
`I
`1
`I
`I
`:-
`E
`:
`:
`:
`i -.
`E
`
`'-
`
`-.
`
`'-
`
`'3
`
`i
`:
`I
`'
`l
`
`.u
`
`I
`
`n f
`
`I,
`
`'II J
`1'
`o
`i
`.
`:
`l
`'
`l
`
`-----—-> I
`
`Finn 3-2: 1']:- Iinurly com-clad ”midi: only [or Ila matrix-ruler
`“agitation problm in Finn's 3-1.
`
`7 P
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 11
`
`etitioner Microsoft Corporation - EX. 1015 , p. 11
`
`
`
`
`
`Comments
`
`, initialized as zero.
`3?.
`is pumped into tho fourth
`processor.
`
`is pumpsd into tho first
`2,
`processor whils y,
`is moved
`left one place.
`(From now
`on tho x. and y. keep moving
`right sud loft, rsspectivsly.)
`
`s" enters the second
`procsssor whorl y.
`is
`updstsd by y.
`*- yl + I" 1..
`Thus y. I s“ ‘1'
`
`snout an outer the first
`sod third processors,
`rsspoctivoly.
`y. - snx,+ aux:
`“4 h" 'lel '
`
`'
`
`is mod. out
`y,
`3': " saw ‘a‘:
`y: I 51'-
`
`h " lair" h‘z'" ‘n‘s‘
`Y5
`"' 531+ 'J‘L‘s'
`
`y1 is punpod out.
`31" 'sfl‘fl'
`‘32‘3-1' ‘Js‘s'
`h" 'qi‘r
`
`Figur- 3-3: Tho firs! sum Motions oi Ibo lino" systolic srrsy in F‘iguro 3-2..
`
`Ind-Id. by'cosloscin; osirs of sdiscsnt proosssors. it
`Illomslo processors srs idlo.
`Is possiblo to on v42 procsssors in lho network [or s gar-oral bond mslris with
`bond width or.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 12
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 12
`
`
`
`I“
`
`'
`
`-.
`
`
`
`w-v—wrm.-.
`
`.ystolic array more precisely. Assume that
`We now specify the operation of thc
`the processors are numbered by integers l, 2. . .
`.. w from the tett end processor to
`
`the right end processor. Each procesaor has three registers, RA- R: and Ry, which
`will hold entries in A. s and y, respectively.
`Initially, all registers contain zeros.
`
`Each pulsation oi the systolic array consists of the tollowing operations. but for odd
`numbered pulses only add numbered processors are activated and for even
`numbered pulses only e’ven numbered processors are activated.
`
`1. Shifl.
`
`-
`
`.
`
`- RA gets a new element in the band of matrix A.
`
`- it gets the contents at register Rx from the left neighboring
`node. {The R: in processor 1 gets a new component of it.)
`
`- P7 sets the contents of register Fl
`trorn the right neighboring
`node.
`(Processor
`1 outputs its $3, centents and the Ry in
`processor w gels zero.)
`
`2. Multiply and Add.
`
`Rye-RyfRAxR '.
`
`Using the type ta) inner product step processor postulated in section 2, we note
`
`that the three shitt operations in step 1 can be done simultaneously, and that each
`pulsation of the systolic array takes a unit of time. Suppose the bandwidth of A is
`w - pro-1.
`It
`is readity seen that after w units of time the components ol
`the
`product y - As are pumped out
`lrcm the left end processor at
`the rate at one
`
`output every two units at time. Thgretorg, M gm systolic network gfl m n
`
`ELMLMnt' arcane—£40mute EMMMEWH‘r d amalgam
`mmsmmrunway—mmMummies.
`'
`
`4. Matrix Multiplication on a Hexagonal Systolic Array
`
`This section considers the problem ot multiplying two It‘ll! matrices.
`
`it is easy to
`
`see that the matrix product C - (cij! of A - (aijt and B - lb”) can be computed by
`the (allowing re'currences.
`
`er
`G”
`
`+1 _
`I
`
`-
`
`‘ 'skbujc
`
`chit}
`GH" 1 }.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 13
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 13
`
`
`
`
`
`k
`
`r'
`
`Let A and B be mm band matrices of band width w; and wz, respectively- We show
`how the recurrences above can be evaluated by pipelining the ‘iiI hi] and :ii
`through a systolic array having “'th hex—connected inner product step processors.
`'Wo illustrate the general scheme by considering the matrix multiplication problem
`
`depicted in Figure 4-1. The diamond shaped systolic array tor this case is shown in
`
`Figure 4-2, where processors are hex-connected and data flows are indicated by
`arrows.
`
`' n
`'2:
`
`' is
`'rr
`
`'ra
`
`0
`
`his
`bl!
`"II
`bar ha ha
`
`br-
`
`0
`
`‘s:
`
`'sa
`
`‘3 'aa
`
`|’aa
`
`”as
`
`b:-
`
`has
`
`=
`
`a"
`
`.
`
`b.
`
`.
`
`
`
`
`
`Figure 4-1: Band matrix multiplication.
`
`The elements in the bands of A, B and C are pumped through the systolic network in
`
`Each ”ii
`three directions synchronously.
`network through the bottom boundaries.
`
`initialized to zero as it enters the
`is
`(For the general problem of computing
`
`is any given matrix. c”- should be initialized as air) One can
`ABoD where D-{diit
`eesily see that with the type (b) inner product step processors described in Section
`
`2. each cij is able to accumulate all its terms before it leaves the network through
`the upper boundaries.
`Figure 4-3 shows
`four consecutive pulsations of
`the
`
`this
`The reader is invited to study the data flow of
`hexagonal systolic array.
`problem more closely by making tranparencies ot the bend matrices shown in the
`
`figures. and moving them over the network picture as described above.
`
`mamammmmummummm Incas
`
`Mm mummmmmmm
`
`mtgglgligatign A_a§ i3; Qnomingw l! 9-31 gn_it_1 91 Liam Note that in any row or column ol
`the network, out of every three consecutive processors, only one is active at given
`
`I
`.
`
`It is possible to use about item”: processors in the network lor multiplying
`time.
`two band matrices with band widths in 1 and tea.
`
`'
`
`10
`
`
`
`2-14.21 ~.'.-
`
`
`'r.
`
`i
`-
`-r.*1'¥rflf-I'r‘.—.._ra
`.
`.-
`.-i.-_—I.:’a-.‘;gn,.,.,.
`
`._ .__;-';
`____._,_.
`-
`.
`_
`_
`-
`_
`.
`_
`.
`’. _a_..
`‘71"??? Twit-W" "
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 14
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 14
`
`
`
`_-.Ft...
`
`
`
`.iltiiir.‘.In!..rill-l!
`
`8—-..f
`
`All
`
`'I"II-".I.l'l"
`
`I
`
`.lol.'t'illl'I'I‘l"
`
`.”I
`IuIIBCI
`
`\
`
`CI
`
`\\\
`
`\.
`
`AIIIIIIIIIIIIIIIIII.
`‘||Ill".‘ul'-l'l8\““llhl'liU
`A----:_....-----............:--fie”
`
`L...oaIna-+9....
`:mto”...
`
`”.Alllllr’.flC.I
`.‘AI.I.I'.II--I'.-I|"In
`AlIIIII‘
`
`
`
`———--——I
`
`T
`
`Figure 4-2: The hos-comet“ lyslolie may for the matrix mullipfiulion problaln
`In Elm fi-I.
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 15
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 15
`
`
`
`
`
`
`
`
`
`
`
`....||.I.IL
`
`--——--—-—-—---o--¢--n——_———q—--oou--_—____-_----...----_--_--
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 16
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 16
`
`
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 17
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 17
`
`
`
`i
`
`'
`
`L.-
`
`5. The LU-Decomposition of 3 Matrix on a Hexagonal Systolic Array
`
`The problem at lactoring a matrix A into lower and upper triangular matrices L
`and U is called LU-decompOsition. Figure 5-1 illustrates the LU-decomposition of a
`bend matrix with p - 4 and o I it. Once the L and U factors are known,
`it
`is
`relatively easy to invert A or sotve the linear system Ax - b. We deal with the
`latter problem in section 6. This section describes a hexagonal systolic array for
`cmnpuiing LU-decomposiiions.
`
`O
`
`'I: ‘s 'M
`'1:
`'u 'n ‘rs .20 '19
`Isi‘u‘aa‘u'as
`.It
`'s: ‘ss
`.I'I .‘
`
`D
`
`e
`
`O
`
`.
`
`I
`'n I
`_ lulu!
`— 'II
`In '1:
`k B
`O_
`
`A
`
`0
`
`e
`
`.
`
`1 .
`
`L:
`
`0
`
`"It “Is “I: “is
`“so "as ”as "as
`"an-ya
`'
`
`0
`
`a
`
`.
`
`U
`
`Figure 5-1: The LU-dasomposition of a band matrix.
`
`We assume that matrix A has the property that its LU-deconiposilion Fan be done
`by Gaussian elimination without pivoting.
`(This is true. tor example. when A is a
`swirl: positive-definite, or an irreducible. diagonally dominant matrix.) The
`triangular matrices L - it“) and U - (”ij’ are evaluated according to the following
`rmr.flcn.
`
`'
`
`4P -
`nlr*"-
`
`-
`
`l
`
`it
`
`_
`..,.
`4:" + lint-viii
`0
`1
`
`{ alk'ufl
`
`'
`
`if i s a,
`it i . a,
`
`it i a k,
`
`{all},
`
`flkrb
`
`«list.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 18
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 18
`
`
`
`'I
`
`"
`
`-
`
`I'
`!
`
`this
`hes-connected systolic array of hes-connected processors. A global view of
`pipelined computation is shown in Figure 5-2 for
`the LU-decomposition problem
`depicted in Figure 5-1. The systolic array in Figure 5-2 is constructed as follows.
`
`The processors below the upper boundaries are the standard type to) inner product
`
`step processors and are hes-connected exactly same as the matrix multiplication
`
`network presented in Section 4. The processor at the top, denoted by a circle' is a
`
`special processor.
`
`It computes the reciprocal of its input and pumps the result
`
`southwest, and also pumps
`
`the same input northward unchanged.
`
`The other
`
`processors on the upper boundaries are again type (b)
`
`inner product
`
`step
`
`processorsI but their orientation is changed:
`
`the ones on the upper left boundary
`
`are rotated 120 degrees clockwise:
`
`the ones on the upper right boundary are
`
`rotated 120 degrees counterclockwise.
`
`The flow of data on the systolic array is indicated by arrows in the figure. As in
`
`the hexagonal systolic array for matrix multiplication . each processor only operates
`
`every third time pulse.
`
`Figure 5-3 illustrates tour conseCutive pulsations of
`
`the
`
`systolic array:. Note that in the figure. because A is I hand matrix with p - d and
`q - 4 we have that air: J- 'l-OSJ and affla- 'i,i+3 tor ls k st and t a 2. Thus e52,
`tor example, can be viewed as age when it enters the networlc
`
`There are several equivalent systolic arrays that reflect only minor changes to
`the network presented in this section. For example, the elements of L and U can be
`
`retrieved as output
`
`in a number ot ditlerent ways. Also. the '-t' input
`
`to the
`
`network can be changed to a 'el‘ it the special processor at the top of the network
`computes minus the reciprocal ol its input.
`
`usummmmmmmm* - eMmm—Mn on
`mummwmmmt
`the.
`m "muse
`W mi};
`9_l m It A is an nxn dense matrix.
`this means
`that n2
`hes-connected processors can compute the L and U matrices in 4.1 units at time
`
`which includes ”0 time.
`
`The remarkable tact that the matrix multiplication network forms a major part of
`
`the Lil-decomposition network is due to the similarity of their defining recurrences.
`
`In any row or column of the LU-dechmposition systolic array. only one out at every
`three consecutive processors is active at a given time. As we observed tor matrix
`multiplication. the number of processors can be reduced to about qua.
`
`l P
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 19
`
`etitioner Microsoft Corporation - EX. 1015 , p. 19
`
`'
`
`.
`
`
`
`
`
`I'l"I'I-"I"""
`
`Figaro 5-2: Tho bus-unmet“ systolic any tor plum-in; IboWW
`a! Hummus-1mm I-l.
`
`‘6
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 20
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 20
`
`
`
`
`.
`
`:J’LH.‘
`
`.
`
`' ‘1".
`
`..
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 21
`
`Petitioner Microsoft Corporation - EX. 1015, p. 21
`
`rill..-.in...
`
`
`
`
`
`..-..lI-.!l...ulil..1'
`
`
`
`
` llqul
`
`—---.-u———---——---——.—-—n.-uu--—-—-——uu-——----—-—n—¢-nun-n----
`
`Flgm 5-3; Four mum of Ib- hoauond midi: arr-y In Finn 5-2.
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 22
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 22
`
`
`
`
`
`6. Solving a Triangular Linear System on a Linear Systolic Array
`
`to solve a linear system Ax - h. Then after having done
`Suppose that we want
`the Lu-decomposiiion oi A n.3,. by methods described in Section 5), we still have to
`solve two triangular linear cysteine Ly - b and U3: - y. This section concerns itself
`
`with the solution of triangular linear systems. An upper triangular linear system can
`
`always be rewritten as a lower triangular linear system. Without loss of generality,
`this section deals exclusively with lower triangular linear cysteine.
`
`Let a . {3”} be a nonaingular nan band lower triangular matrix. Suppose that A
`and an n-vector b - iblmthT are given. The problem is to compute it - (11.....8")T
`such that A: - h. The vector x can be computed by forward substitution:
`
`xi” -
`
`0.
`
`Vin!"
`
`Vt” *’ fit!»
`
`at,
`
`-'
`
`(bi-Allin“.
`
`all
`
`a,' a,
`a,I a," a,
`
`q
`
`0
`
`a“ a.
`
`a... a.
`
`a.ll a. a“ a"
`
`i.
`
`.
`
`A
`
`a.
`
`x,
`I.
`
`1..
`
`l.
`
`.
`
`x
`
`=
`
`b.
`
`b,
`b:
`
`b.
`
`b,
`
`.
`
`b
`
`Figure l-l: The band {lower} triangular linear ayateun where e I 4.
`
`Suppose that A ie a band matrix with hand width at - a.
`
`(See Figure 6-1 for the
`
`one when 11 a OJ Then the above recurrences can be evaluated by e ayatolic array
`eimilar to that uaed tor hand matrix-vector multiplication in Section 3.
`(Observe 'the
`
`almilarity oi
`
`the datining rawrancea tor than two mm) We ilhaatrate our-
`
`.“genome-q“...
`
`“at...
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 23
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 23
`
`
`
`reeull by considering the linear eyslem problem in Figure 5-1. PM this cm. the
`systolic errey is described in Figure 6-2.
`
`.
`
`I
`
`'
`I
`I
`.I I
`I
`I
`I
`I
`I
`I
`
`I
`"
`
`I I
`
`I
`|
`I
`I
`I .
`I a
`I
`I
`
`I
`
`.*
`
`I I
`
`I
`
`I
`
`I
`" I
`’
`
`I
`
`’
`
`i
`I .
`I
`I 3
`I
`1‘
`I
`I I
`I
`I
`I
`:
`:
`I
`I”
`I
`l
`I
`I
`I .
`I
`I
`I u”
`I
`I
`I
`,
`
`
`: I
`I
`|
`
`:
`I
`
`..'
`"a
`” I
`I
`I
`I
`I
`|
`I
`-I
`I
`I
`I
`I
`e
`
`
`
`---...3. I:
`
`g'
`
`y. (“"-
`
`m.a.1...»-
`
`II
`
`4
`
`Figure O-ZI The “nearly mm systolic array Ier eelm
`the triennial- lineer In!“ In figure H.
`
`The 9;, which ere Iain-fly zero. ere lowed Ieflmrd through the systolic errey
`whilolhe “I- eueI-Idbierempodelmeiedinfi'weS-z 'l'helellend
`W ie special In IIuI II perIIII-IIII III-Ibrnulfl.
`(In Iect. IIIe specie! processor
`“main uclanIeIerelhe WIMMhelpuiel ceeeof
`
`1’
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 24
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 24
`
`
`
`this more general processor.) Each y; accumulates inner product terms in the rest of
`the processors as it moves to the leil. At the time yi reaches the Ielt end processor
`it has the value 'it‘t"iz‘zh-“iJ-HE-l- and. consequently,
`the li coinputed by
`air-(bi-yitfaii at the processor will have the correct value. Figure 6-3 demonstrates
`the tlrst seven pulsations ot the systolic array. From the figure one can check that
`
`tho tinal values of ‘1' x2, :3 and 14 are all correct. w m firstoiig m E: gm
`
`mummmmI r MmlmmMMLuinmmLsfl
`m As we observed for the matrix-vector multiplication problem. the number of
`processors required by the array can be reduced to wfz.
`
`'7. Applications and Comments
`
`7.1 Variants ot the Systolic Arrey
`
`'5!”
`
`It more lntormalion is available about the specific matrices involved, an optimized
`version of the systolic arrays presented above can be used.
`it is important that the
`
`reader understands the basic principles so that he can construct appropriate
`variants tor his specific problems. No attempt is made here to list all the possible
`variants.
`
`As pointed out
`
`in Section 1. although most oi our illustrations are of band
`
`matrices, all the systolic arrays vroriv. for regular nan dense matrices.
`
`In this case
`
`It the band width ot a matrix is so largo
`tho band width of the matrix is It - 2n-l.
`that
`it
`requires more. processors than a given array provides. then one should
`decompose the matrix and solve each subproblem on the network.
`Thus, for
`example. the matrimr unalliplication ot two nan matrices or the tU-decomposition of an
`nan metric can be done in Gina/k2) time on a lurk systolic array.
`
`One can often reduce the number ot processors required by a systolic array ii the
`
`metric is known to be sparse or symmetric. For example. the matrices arising from a
`eel ot finite dillerences or finite elements approximations to dillerenlial equations
`are usually “sparse bend matrices”. These are band matrices whose nonzero entries
`
`appear only in a tear of those lines in the band which are parallel to the diagonal.
`
`In
`
`this case by introducing proper delays to each processor tor shitting its data to its
`neighbors. the number ot processors required by the systolic array in Section 3 can
`
`be reduced to the number ot those diagonsis which contain nonzero entries. This
`variant is useful tor parlorming iterative methods involving sparse band matrices.
`Another example concerns the LU-decomposition probbm considered in Section 5.
`it
`
`matrix A is symmetric positive definite, then it is possible to use only the toll portion
`ot m. hes-contacted network. since in this case u a simply oLT vrhere o is the
`
`
`
`____4_.__
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 25
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 25
`
`
`
`.
`
`i
`
`L
`
`Poise
`anber
`
`O
`
`1 .
`
`Configuration
`
`Comments
`'
`
`__
`
`3': can“ processor a.
`
`3‘. novel lot: on position.
`
`3; onto:- proconor 4.
`
`x - al.- 1 3/
`.
`<='=. - hf-ul- :2“. y. - 0-)
`
`2
`
`a
`
`4.
`
`U
`
`'
`
`‘
`
`,
`
`.
`
`0
`
`'
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 26
`
`y; - on :l.
`
`3:, " (bu. - 5V9...
`73 ' lull.
`
`’3 " 'JI ‘Iflaflv
`N" .11 x] '
`
`x. 1: and out
`33 '0';- n)/‘u'
`,4 - ~l 11+ ‘ulsu
`
`-
`+
`+
`z: . :32. M w:
`
`o
`
`3:11 pupal out.
`I: " 0H - VqJ/Iw-
`.ys - fi‘xti’ .3335-
`
`J
`
`;
`
`‘
`
`‘
`
`'
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 26
`
`
`
`H's-«er
`
`diagonal matrix {akhm}.
`
`the systolic network to solve a particular
`the size of
`The optimal choice of
`problem depends upon not only the problem but also the memory bandwidth to the
`
`hoat computer.
`
`For achieving high perlormance, it
`
`is desirable to have as many
`
`processors as possible in the network, provided they can all be kept busy doing
`usetul computations.
`'
`
`is possible to use our. systolic arrays to solve some nonnumerical problems
`it
`when appropriate interpretations are given to the addition (+1 and multiplication (x)
`
`operations. For example. some pattern matching problems can be viewed as matrix
`problems with campariscln and Boolean operations.
`it
`is possible to store a
`
`dynamically changing data structure in a systolic array so that an order statistic can
`always be determined in constant
`time. We shall report these results in a future
`paper.
`it can be instructive to view the + and x operations as operations in an
`
`abstract algebraic structure such as a semiring and then to examine how our results
`hold in such an abstract setting.
`
`7.2 Convolution, Filter. and Discrete Fourier Transform
`
`There are a number of
`
`important problems which can be formulated as
`
`matrix-vector multiplication problems and thus can be solved rapidly by the systolic
`
`array in Section 3. The problems of computing convolutions, finite impulse response
`
`It a matrix has ,the
`(FIR) litters. and discrete Fourier transform are. such examples.
`property that the entries on any line parallel to the diagonal are all the same. then
`
`the matrix is s Toeplitz matrix. The convolution problem is simply the matrix-vector
`
`multiplication where the matrix is a triangular Toeplitz matrix (see Figure 7-1).
`4
`
`AI p-tap FIR tiller can be viewed as a matrix-vector multiplication where the
`matrix is an band upper triangular Toeplitz matrix with band width w - p. Figure .7-2
`
`represents the computation of a A-tap tiller.
`
`On the other hand, an n-point discrete Fourier tranetorm is "I metrix-voctOr
`multiplication, where the lid} entry at the matrix is ohm?