throbber
AD-AObb 060
`
`UNCLASSIFIEO
`
`NOOOII-‘rO-c-OSTO
`
`CARNEGIE-HELLO?! UNIV PITTSBLRSH Pl DEPT OF CMTEI “ETC FIG 9’2
`STSTOLIC ARRAYS FOR (VLSI).IU)
`MC 7!
`H 'I' KWII C E LEISEISON
`CMS-Wl.’
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 1
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 1
`
`

`

`J
`
`"Ills "”TFFE EEEE1 -IEr
`I--Mno'0
`E
`“NE Eml]: EM3l-
`
`NI
`
`lefiRl
`MILHUCOPY RISULLIHUN fiSI'
`Mllrmn'. 111mm] nI ulmmlau. IHI.I &
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 2
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 2
`
`

`

`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 3
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 3
`
`

`

`
`
`‘
`
`_
`
`I
`names-79 - 103
`
`.
`
`SYSTOLIC ARRAYS FOR (VLSI)
`
`H. T. Kong and Chart-a E. Lainraon
`
`Doparlmoni a! Complain Stine.
`Carnegie-Minn Univ-wily
`Pittsburgh. Pomyivania 15213
`
`April 1378
`
`Ill! “vi-Id Datum 19781
`
`
`
`
`
`_—.——_.,—.-.4
`
`i
`
`
`[in the iorthcomim book Iniroduclion Io VLSI System by C. A. Mud md L. A. Conway]
`
`Copyright -c- 1979 by HT. Rum and Charla: E. Loiuraon
`All Right: Rum
`
`'
`
`rauarch is support“ in part by the National Sci-ma Foundation under Grant
`This
`M38 75-22245! and the Oflico of Naval Ros-arch and-r Contract MOM-7904370.
`MOM-422. Charm ELelsarum is annual-tad in part by the Fun-Ii- and him Hort:
`Foundation.
`-
`
`Th!- document has been (gram-d
`for public release and Ialo; It!
`III-tribunal: la unlimihd.
`
`.Qla..ua.-.I'I'l
`
`
`
`._..__..-__.._...u-Hud'-I-.
`
`i
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 4
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 4
`
`

`

`*r-I—I.TWMW
`
`
`
`~1--_v.
`-so!"
`
`,
`.
`
`'
`
`I
`.
`
`.
`
`
`
`
`
`1
`
`r
`I
`
`And now I see with eye serene
`The very put-re of the machine.
`-'Wllllarn Wordsworth
`
`Abstract
`
`3)
`
`A systolic system is a network of processors which rhythmically compute and pass
`data through the system.
`Physiologists use the word ”systole“ to refer to the
`rhythmically recurrent contraction of
`the heart and arteries which pulses blood
`through the body.
`in e systolic computing system. the function ot a processor is
`analogous to that of the heart. Every processor regularly pumps data in and out,
`each time perlormlne some short computation, so that a regular Hour at date is kept
`Up In the network.
`
`Many bssic matrix computations can be pipelined elegantly and efficiently on
`systolic networks having an array structure. As an example, hesagonally connected
`processors can optimally perform matrix multiplication.
`Surprisingly,
`a similar
`systolic array can compute the LU-decomposilion of a matrix. These systolic arrays
`enjoy simple and regular cornmunication paths, and almost all processors used in the
`networks are identical. As a result. special purpose hardware devices based on
`systolic arrays can be built inexpensively using the VLSI technology.
`
`- 1. Introduction
`
`Developments in microelectronics have revolutionized computer design. Integrated
`circuit technology has increased the number and complexity of components that can
`ill on a chip or a printed circuit board. Companent density has been doubling every
`one-to-two years and already. a multiplier can ill on a very large scale integrator!
`M51) circuit chip. As a result.
`the new technology makes it feasible to build
`low-cost special purpose, peripheral devices to rapidly solve sophisticated problems.
`
`Reflecting the changing technology.
`
`this paper proposes new multiprocessor
`
`‘
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 5
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 5
`
`

`

`1
`r-
`
`
`
`structures for processing some basic matrix computations.
`
`that m a
`'slrugture;
`parallel
`interested in high-performance
`fl! 9;;
`implemented directly g lowiogt hardware M By performance. we are not
`
`refering to the traditional operation counts that characterize classical analyses of
`algorithms, but rather, the throughput obtainable when a special purpose peripheral
`device is attached to a general purpose host computer. This implies that time spent
`
`in U0. control, and data movement as well as arithmetic must all be considered. VLSl
`
`offers excellent opportunities for inexpensive implementation ot high performance
`
`devices (Mead and Conway [1978]}. Thus, in this paper the cost of a device will be
`
`determined by the expense of a VLSl implementation.
`
`'Fit the job to the bargain
`
`components“ -- Blakeslee [1975. p. 4].
`
`VLSI technology has made one thing clear. Simple and regular interconnections
`
`lead to cheap implementations and high densities. and high density implies both high
`
`perlormence and low overhead for support components.
`
`(Sutherland and Mead
`
`[1977] has a good discussion on the importance of having simple and regular
`
`geometries for data paths.) For
`
`these reasons. 3;; fig intgrgstgg in m
`
`WWMMMQMM—Mnit-lionm iris
`
`mmmrs: omit.“Iinin usmorlmmmmm
`
`W By pipelining, computation may proceed concurrently with input and
`
`Pipelining plus
`output. and consequently overall execution time is minimized.
`multiprocessing at each stage of
`a pipeline should lead to the best-possible
`porformence.
`
`W m provide a realistic model of computation which captures the
`concepts of pipelining. parallelism and interconnection structures. We do not want to
`
`give a formal definition ot systolic systems here. For the purpose at this paper, it
`suffices to view a systolic system as a network of processors which rhythmically
`
`to the rhythmic
`The analogy is
`compute and pass data through the system.
`contraction of the heart which pulses blood through the circulatory system of the
`body. Each processor in a systolic network can be thought of as a heart that pumps
`
`multiple streams of data through itself.
`
`The regular beating of
`
`these parallel
`
`processors keeps up a constant flow of data throughout the entire nelworIL As a
`
`processor pumps data items through.
`and may update some of the items.
`
`it performs some constant-time computation
`
`Unlike the closed-loop circulatory system of the body. a systolic computing system
`
`usually has ports into which inputs "0w, and ports where the results of the systolic
`
`computation are retrieved. Thus a systolic system can be a pipetined system - input
`
`I P
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 6
`
`etitioner Microsoft Corporation - EX. 1015 , p. 6
`
`

`

`'A-
`
`and output occur with every pulsation. This makes them attractive as peripheral
`processors attached to the data channel of a host cOmputer Figure 1-1 illustrates
`how a special purpose systolic device might
`lorm a part of a PUP-ll system. A
`systolic device may also process a rsal~lime data stream or he a component
`in a
`larger special purpose system.
`
`
`
`Figure 1-1: A systolic device connected to the UNIBUS of a POP-l I.
`
`This paper deals largely with systolic systems where the underlying network is
`array structured
`(See also Run; and Leiserson [1978].) An array network is
`attractive tor it enioys simple and regular communication paths.
`in Section 2, we
`describe the basic hardware requirements and interconnection schemes tor
`the
`
`systolic arrays propoeed and discuss the lsasibility ot building them in VLSt. Section
`3 deals with the matrix-vector multiplication problem. Multiplication of two matrices
`is considered in Section 4.
`in Section 5, we show that essentially the same systolic
`arrays
`tor matrix multiplication
`in Section
`it
`can be used
`to
`find the
`
`LU-decomposilion ot a matrix. Section 6 is concerned with solving triangular linear
`systems. We show that
`this problem can be solved by almost
`the same systolic
`array tor matrix-vector multiplication described in Section 3. Section 7 discusses
`applications and extensions ol' the results presented in the previous sections. The
`
`applications include the computations at finite impulse response litters. convolutions,
`and discrete Fourier transforms.
`Some concluding remarks are given in the last
`section.
`
`The size of each oi our systolic array networks is dependent only on the band
`width of the hand matrix to be processed, and is independent of the length of the
`hand. Thu. a fixed size systolic array can pipeline hand matrices with arbitrarily
`Ions bands. The pipelinin; aspect of our arrays is. of course. most ettective~tor
`bend matrices with ion; bends. Band matrices are interesting in their own right,
`since many important scientific computations involve band matrices.
`For
`these
`reasons, BIDS! of
`the results in this paper will he presented in terms of their
`
`applications to bend matrices. All the results apply to dense matrices since a dense
`
`a .
`
`
`
`.‘IF\.!'~"w'”9-"~vell-C"?
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 7
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 7
`
`

`

`--..."———.e.+— ._.._..
`
`nostril cen be viewed es I hand matrix having the masimumrpossibie bend width.
`
`2. The Basic Components and Systolic Array Structures
`
`2.] The Inner Product Step Processor
`
`I
`
`The single operation common to all the cornputations considered in this paper is
`the so—caned inner product step. C 0- C + A x B. We postulate a processor which
`hes three registers RA: Fla, and RC. Each register hes two connections. one tor input
`and one tor output. Figure 2-1 shows two types at geometries tor this processor.
`
`A
`
`C"
`
`3"
`
`"C
`
`*B
`
`t
`
`i
`
`A
`
`{at
`
`Figure 2-l: Geometries tor the inner product step processor.
`
`Typo (I) Ioomtry will be used for metrie-vector multiplication end solution of
`
`triengular linear systems (Sections 3 and 5}, whereas type lb) geometry will be used
`tor matrix multiplication and LU-decoinposition {Sections 4 end 5}. The processor is
`cepeble ot performing the inner product step end is called the trim; m m
`
`this
`in terms at the operation oi
`m We shall deline e basic time unit
`processor.
`In each unit‘tlnie interval. the processor shifts the date on its input lines
`
`denoted
`by A.
`6 end G into R“.
`“B and RC,
`respectively,
`cornputes
`RC 0- RC 9 “A 1 RB. and makes the input vetoes for RA and R3 together with the
`new value of RC sveitehte as outputs an the output lines denoted by A, 3 end C,
`respectively. All outputs ere latched end the logic is clocked so thet when one
`
`processor is connected to another, the changing output ot one during a unit time
`interuel will not inlertere with the input to another during this time interval. This is
`
`not the only processing element we shell make use oi. but it will be the work horse.
`
`A special processor tor performing division will be specified later when it is used.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 8
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 8
`
`

`

`-~mm-wrrsrw-n-
`
`2.2 Systolic Arrays
`
`A systolic device is typically composed of many interconnected inner product step
`processors. The basic network organization we shall adopt is the mesh-connected
`scheme in which all connections tram a processor are to neighboring processors.
`(See Figure 2-2.)
`
`CHE—CHI]
`
`(a) linearly connected
`{MIAC N)
`
`{ht orthogonally connected
`
`(ct hesegonetly connected
`
`Figure Z'ZIIMWIN systolic arrays.
`
`The most widely known system based on this organization is the ILLIAC IV (Barnes
`
`it diagonal connections are added in one direction only. are shell cell
`es e1. [1968]).
`the reuniting schemeWW orW tor short. We
`shall demonstrate that linearly connected and hes-connected arrays are natural for
`matrix problems.
`
`Processors lying on the boundary of
`the systolic array may have external
`connections to the host watery. Thus, an inputfoutput data path ot a boundary
`Processor my somtimes be designated as an external input/output connection for
`the device. A boundary processor may receive input from the host morory through
`such an salernal connection. or it may receive a tired value such as zero. On the
`
`other hand. a boundary processor can send data to the host nienlory through an
`esternal output connection. An output of a boundary processor may sometimes be
`ignored. This will be designated by omitting the corresponding output tine.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 9
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 9
`
`

`

`In this paper we assume that the processors in a systolic array are synchronous
`as described in Section 2.1. However. it is possible to view the processors being
`
`asynchronous, each computing its output values when all its inputs are available, as
`
`in a data flow model. For the results oi
`approach to be more direct and intuitive.
`
`this paper we believe the synchronous
`
`The hardware demands of the systolic arrays in this paper are readily seen to be
`modest. The processing elements are unilorm. interprocessor connections are simple
`
`and regular, and external connections are minimized.
`
`It
`
`is our belief
`
`that
`
`construction 01' these systolic arrays will prove to be cost-ettective using, for
`instance, the modern VLSl technology.
`
`3. Matrix-Vector Multiplication on e Linear Systolic Array
`
`the problem of multiplying a matria A -(a~"-) with a vector
`We consider
`at - “tr-“nif- The elements in the product y -(y1....yn)T can be computed by the
`lollowing recurrences.
`
`;
`
`vi" -
`
`o.
`
`lily!"
`
`Vi” * 'ik‘ko
`
`Vt
`
`_
`
`3{En-rt)-
`
`'
`‘
`
`Suppose A is an nvn band matrix with band width w - pus-t. {See Figure 3-1 for
`the case when p - 2 and o - 3.) Then the above recurrences can be evaluated by
`
`pipelining the "i and Vi through a systolic array consisting at is linearly connected
`inner product. step processors. We illustrate the operation of the systolic arrayI to'r
`the band matrix-vector multiplication problem in Figure 3-1.
`For
`this case the
`
`linearly connected systolic array has four inner product step processors.
`Figure 3-2.
`
`See
`
`The general scheme of the computation can be viewed as follows. The Vl- which
`are initially zero. are pumped to the left while the "i are pumped to the right and
`the 'ii are marching down.
`{For
`the general problem of computing And where
`e-(dlfianfl' is any given vector, yi new be initialized as cit an the moves are
`synchronised.
`It turns out that each Vi is able to accumulate all its terms, namely. 'i.
`has“? 'iJ-l'l-l' 'iJ'i and auditi”. betore it
`leaves the neter Figure 3-3
`illustrates the llrst seven pulsations ol the systolic array. little that when yl and ’2
`are output
`they have the correct values. Observe also that at any given time
`
`'.
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 10
`
`etitioner Microsoft Corporation - EX. 1015 , p. 10
`
`6 P
`
`i
`
`_
`
`.
`
`

`

`,
`
`i
`
`‘ J
`
`1
`i
`
`i
`1
`.
`
`1
`
`i
`
`'
`
`E
`I
`""
`
`i
`i
`.
`1
`_
`i
`‘
`I
`i
`
`i
`
`':
`
`.
`
`I"
`
`:
`I
`
`u
`I
`
`ll
`I
`
`'
`
`q
`
`o
`
` xt
`
`I:
`
`8a
`
`“a
`
`Y.
`
`Y:
`
`V:
`
`Ya
`
`:
`
`x
`
`y
`
`Hull 3-]:Wfipflulionolnvulwhylhmdmlfixwflhp-Zlndq-S.
`:
`.u
`‘
`'8
`:
`I
`I
`1
`I
`I
`:-
`E
`:
`:
`:
`i -.
`E
`
`'-
`
`-.
`
`'-
`
`'3
`
`i
`:
`I
`'
`l
`
`.u
`
`I
`
`n f
`
`I,
`
`'II J
`1'
`o
`i
`.
`:
`l
`'
`l
`
`-----—-> I
`
`Finn 3-2: 1']:- Iinurly com-clad ”midi: only [or Ila matrix-ruler
`“agitation problm in Finn's 3-1.
`
`7 P
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 11
`
`etitioner Microsoft Corporation - EX. 1015 , p. 11
`
`

`

`
`
`Comments
`
`, initialized as zero.
`3?.
`is pumped into tho fourth
`processor.
`
`is pumpsd into tho first
`2,
`processor whils y,
`is moved
`left one place.
`(From now
`on tho x. and y. keep moving
`right sud loft, rsspectivsly.)
`
`s" enters the second
`procsssor whorl y.
`is
`updstsd by y.
`*- yl + I" 1..
`Thus y. I s“ ‘1'
`
`snout an outer the first
`sod third processors,
`rsspoctivoly.
`y. - snx,+ aux:
`“4 h" 'lel '
`
`'
`
`is mod. out
`y,
`3': " saw ‘a‘:
`y: I 51'-
`
`h " lair" h‘z'" ‘n‘s‘
`Y5
`"' 531+ 'J‘L‘s'
`
`y1 is punpod out.
`31" 'sfl‘fl'
`‘32‘3-1' ‘Js‘s'
`h" 'qi‘r
`
`Figur- 3-3: Tho firs! sum Motions oi Ibo lino" systolic srrsy in F‘iguro 3-2..
`
`Ind-Id. by'cosloscin; osirs of sdiscsnt proosssors. it
`Illomslo processors srs idlo.
`Is possiblo to on v42 procsssors in lho network [or s gar-oral bond mslris with
`bond width or.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 12
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 12
`
`

`

`I“
`
`'
`
`-.
`
`
`
`w-v—wrm.-.
`
`.ystolic array more precisely. Assume that
`We now specify the operation of thc
`the processors are numbered by integers l, 2. . .
`.. w from the tett end processor to
`
`the right end processor. Each procesaor has three registers, RA- R: and Ry, which
`will hold entries in A. s and y, respectively.
`Initially, all registers contain zeros.
`
`Each pulsation oi the systolic array consists of the tollowing operations. but for odd
`numbered pulses only add numbered processors are activated and for even
`numbered pulses only e’ven numbered processors are activated.
`
`1. Shifl.
`
`-
`
`.
`
`- RA gets a new element in the band of matrix A.
`
`- it gets the contents at register Rx from the left neighboring
`node. {The R: in processor 1 gets a new component of it.)
`
`- P7 sets the contents of register Fl
`trorn the right neighboring
`node.
`(Processor
`1 outputs its $3, centents and the Ry in
`processor w gels zero.)
`
`2. Multiply and Add.
`
`Rye-RyfRAxR '.
`
`Using the type ta) inner product step processor postulated in section 2, we note
`
`that the three shitt operations in step 1 can be done simultaneously, and that each
`pulsation of the systolic array takes a unit of time. Suppose the bandwidth of A is
`w - pro-1.
`It
`is readity seen that after w units of time the components ol
`the
`product y - As are pumped out
`lrcm the left end processor at
`the rate at one
`
`output every two units at time. Thgretorg, M gm systolic network gfl m n
`
`ELMLMnt' arcane—£40mute EMMMEWH‘r d amalgam
`mmsmmrunway—mmMummies.
`'
`
`4. Matrix Multiplication on a Hexagonal Systolic Array
`
`This section considers the problem ot multiplying two It‘ll! matrices.
`
`it is easy to
`
`see that the matrix product C - (cij! of A - (aijt and B - lb”) can be computed by
`the (allowing re'currences.
`
`er
`G”
`
`+1 _
`I
`
`-
`
`‘ 'skbujc
`
`chit}
`GH" 1 }.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 13
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 13
`
`

`

`
`
`k
`
`r'
`
`Let A and B be mm band matrices of band width w; and wz, respectively- We show
`how the recurrences above can be evaluated by pipelining the ‘iiI hi] and :ii
`through a systolic array having “'th hex—connected inner product step processors.
`'Wo illustrate the general scheme by considering the matrix multiplication problem
`
`depicted in Figure 4-1. The diamond shaped systolic array tor this case is shown in
`
`Figure 4-2, where processors are hex-connected and data flows are indicated by
`arrows.
`
`' n
`'2:
`
`' is
`'rr
`
`'ra
`
`0
`
`his
`bl!
`"II
`bar ha ha
`
`br-
`
`0
`
`‘s:
`
`'sa
`
`‘3 'aa
`
`|’aa
`
`”as
`
`b:-
`
`has
`
`=
`
`a"
`
`.
`
`b.
`
`.
`
`
`
`
`
`Figure 4-1: Band matrix multiplication.
`
`The elements in the bands of A, B and C are pumped through the systolic network in
`
`Each ”ii
`three directions synchronously.
`network through the bottom boundaries.
`
`initialized to zero as it enters the
`is
`(For the general problem of computing
`
`is any given matrix. c”- should be initialized as air) One can
`ABoD where D-{diit
`eesily see that with the type (b) inner product step processors described in Section
`
`2. each cij is able to accumulate all its terms before it leaves the network through
`the upper boundaries.
`Figure 4-3 shows
`four consecutive pulsations of
`the
`
`this
`The reader is invited to study the data flow of
`hexagonal systolic array.
`problem more closely by making tranparencies ot the bend matrices shown in the
`
`figures. and moving them over the network picture as described above.
`
`mamammmmummummm Incas
`
`Mm mummmmmmm
`
`mtgglgligatign A_a§ i3; Qnomingw l! 9-31 gn_it_1 91 Liam Note that in any row or column ol
`the network, out of every three consecutive processors, only one is active at given
`
`I
`.
`
`It is possible to use about item”: processors in the network lor multiplying
`time.
`two band matrices with band widths in 1 and tea.
`
`'
`
`10
`
`
`
`2-14.21 ~.'.-
`
`
`'r.
`
`i
`-
`-r.*1'¥rflf-I'r‘.—.._ra
`.
`.-
`.-i.-_—I.:’a-.‘;gn,.,.,.
`
`._ .__;-';
`____._,_.
`-
`.
`_
`_
`-
`_
`.
`_
`.
`’. _a_..
`‘71"??? Twit-W" "
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 14
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 14
`
`

`

`_-.Ft...
`
`
`
`.iltiiir.‘.In!..rill-l!
`
`8—-..f
`
`All
`
`'I"II-".I.l'l"
`
`I
`
`.lol.'t'illl'I'I‘l"
`
`.”I
`IuIIBCI
`
`\
`
`CI
`
`\\\
`
`\.
`
`AIIIIIIIIIIIIIIIIII.
`‘||Ill".‘ul'-l'l8\““llhl'liU
`A----:_....-----............:--fie”
`
`L...oaIna-+9....
`:mto”...
`
`”.Alllllr’.flC.I
`.‘AI.I.I'.II--I'.-I|"In
`AlIIIII‘
`
`
`
`———--——I
`
`T
`
`Figure 4-2: The hos-comet“ lyslolie may for the matrix mullipfiulion problaln
`In Elm fi-I.
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 15
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 15
`
`
`
`
`
`
`
`

`

`
`
`....||.I.IL
`
`--——--—-—-—---o--¢--n——_———q—--oou--_—____-_----...----_--_--
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 16
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 16
`
`
`

`

`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 17
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 17
`
`

`

`i
`
`'
`
`L.-
`
`5. The LU-Decomposition of 3 Matrix on a Hexagonal Systolic Array
`
`The problem at lactoring a matrix A into lower and upper triangular matrices L
`and U is called LU-decompOsition. Figure 5-1 illustrates the LU-decomposition of a
`bend matrix with p - 4 and o I it. Once the L and U factors are known,
`it
`is
`relatively easy to invert A or sotve the linear system Ax - b. We deal with the
`latter problem in section 6. This section describes a hexagonal systolic array for
`cmnpuiing LU-decomposiiions.
`
`O
`
`'I: ‘s 'M
`'1:
`'u 'n ‘rs .20 '19
`Isi‘u‘aa‘u'as
`.It
`'s: ‘ss
`.I'I .‘
`
`D
`
`e
`
`O
`
`.
`
`I
`'n I
`_ lulu!
`— 'II
`In '1:
`k B
`O_
`
`A
`
`0
`
`e
`
`.
`
`1 .
`
`L:
`
`0
`
`"It “Is “I: “is
`“so "as ”as "as
`"an-ya
`'
`
`0
`
`a
`
`.
`
`U
`
`Figure 5-1: The LU-dasomposition of a band matrix.
`
`We assume that matrix A has the property that its LU-deconiposilion Fan be done
`by Gaussian elimination without pivoting.
`(This is true. tor example. when A is a
`swirl: positive-definite, or an irreducible. diagonally dominant matrix.) The
`triangular matrices L - it“) and U - (”ij’ are evaluated according to the following
`rmr.flcn.
`
`'
`
`4P -
`nlr*"-
`
`-
`
`l
`
`it
`
`_
`..,.
`4:" + lint-viii
`0
`1
`
`{ alk'ufl
`
`'
`
`if i s a,
`it i . a,
`
`it i a k,
`
`{all},
`
`flkrb
`
`«list.
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 18
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 18
`
`

`

`'I
`
`"
`
`-
`
`I'
`!
`
`this
`hes-connected systolic array of hes-connected processors. A global view of
`pipelined computation is shown in Figure 5-2 for
`the LU-decomposition problem
`depicted in Figure 5-1. The systolic array in Figure 5-2 is constructed as follows.
`
`The processors below the upper boundaries are the standard type to) inner product
`
`step processors and are hes-connected exactly same as the matrix multiplication
`
`network presented in Section 4. The processor at the top, denoted by a circle' is a
`
`special processor.
`
`It computes the reciprocal of its input and pumps the result
`
`southwest, and also pumps
`
`the same input northward unchanged.
`
`The other
`
`processors on the upper boundaries are again type (b)
`
`inner product
`
`step
`
`processorsI but their orientation is changed:
`
`the ones on the upper left boundary
`
`are rotated 120 degrees clockwise:
`
`the ones on the upper right boundary are
`
`rotated 120 degrees counterclockwise.
`
`The flow of data on the systolic array is indicated by arrows in the figure. As in
`
`the hexagonal systolic array for matrix multiplication . each processor only operates
`
`every third time pulse.
`
`Figure 5-3 illustrates tour conseCutive pulsations of
`
`the
`
`systolic array:. Note that in the figure. because A is I hand matrix with p - d and
`q - 4 we have that air: J- 'l-OSJ and affla- 'i,i+3 tor ls k st and t a 2. Thus e52,
`tor example, can be viewed as age when it enters the networlc
`
`There are several equivalent systolic arrays that reflect only minor changes to
`the network presented in this section. For example, the elements of L and U can be
`
`retrieved as output
`
`in a number ot ditlerent ways. Also. the '-t' input
`
`to the
`
`network can be changed to a 'el‘ it the special processor at the top of the network
`computes minus the reciprocal ol its input.
`
`usummmmmmmm* - eMmm—Mn on
`mummwmmmt
`the.
`m "muse
`W mi};
`9_l m It A is an nxn dense matrix.
`this means
`that n2
`hes-connected processors can compute the L and U matrices in 4.1 units at time
`
`which includes ”0 time.
`
`The remarkable tact that the matrix multiplication network forms a major part of
`
`the Lil-decomposition network is due to the similarity of their defining recurrences.
`
`In any row or column of the LU-dechmposition systolic array. only one out at every
`three consecutive processors is active at a given time. As we observed tor matrix
`multiplication. the number of processors can be reduced to about qua.
`
`l P
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 19
`
`etitioner Microsoft Corporation - EX. 1015 , p. 19
`
`'
`
`.
`
`

`

`
`
`I'l"I'I-"I"""
`
`Figaro 5-2: Tho bus-unmet“ systolic any tor plum-in; IboWW
`a! Hummus-1mm I-l.
`
`‘6
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 20
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 20
`
`
`

`

`.
`
`:J’LH.‘
`
`.
`
`' ‘1".
`
`..
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 21
`
`Petitioner Microsoft Corporation - EX. 1015, p. 21
`
`rill..-.in...
`
`
`
`
`
`..-..lI-.!l...ulil..1'
`
`
`

`

` llqul
`
`—---.-u———---——---——.—-—n.-uu--—-—-——uu-——----—-—n—¢-nun-n----
`
`Flgm 5-3; Four mum of Ib- hoauond midi: arr-y In Finn 5-2.
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 22
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 22
`
`
`
`

`

`6. Solving a Triangular Linear System on a Linear Systolic Array
`
`to solve a linear system Ax - h. Then after having done
`Suppose that we want
`the Lu-decomposiiion oi A n.3,. by methods described in Section 5), we still have to
`solve two triangular linear cysteine Ly - b and U3: - y. This section concerns itself
`
`with the solution of triangular linear systems. An upper triangular linear system can
`
`always be rewritten as a lower triangular linear system. Without loss of generality,
`this section deals exclusively with lower triangular linear cysteine.
`
`Let a . {3”} be a nonaingular nan band lower triangular matrix. Suppose that A
`and an n-vector b - iblmthT are given. The problem is to compute it - (11.....8")T
`such that A: - h. The vector x can be computed by forward substitution:
`
`xi” -
`
`0.
`
`Vin!"
`
`Vt” *’ fit!»
`
`at,
`
`-'
`
`(bi-Allin“.
`
`all
`
`a,' a,
`a,I a," a,
`
`q
`
`0
`
`a“ a.
`
`a... a.
`
`a.ll a. a“ a"
`
`i.
`
`.
`
`A
`
`a.
`
`x,
`I.
`
`1..
`
`l.
`
`.
`
`x
`
`=
`
`b.
`
`b,
`b:
`
`b.
`
`b,
`
`.
`
`b
`
`Figure l-l: The band {lower} triangular linear ayateun where e I 4.
`
`Suppose that A ie a band matrix with hand width at - a.
`
`(See Figure 6-1 for the
`
`one when 11 a OJ Then the above recurrences can be evaluated by e ayatolic array
`eimilar to that uaed tor hand matrix-vector multiplication in Section 3.
`(Observe 'the
`
`almilarity oi
`
`the datining rawrancea tor than two mm) We ilhaatrate our-
`
`.“genome-q“...
`
`“at...
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 23
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 23
`
`

`

`reeull by considering the linear eyslem problem in Figure 5-1. PM this cm. the
`systolic errey is described in Figure 6-2.
`
`.
`
`I
`
`'
`I
`I
`.I I
`I
`I
`I
`I
`I
`I
`
`I
`"
`
`I I
`
`I
`|
`I
`I
`I .
`I a
`I
`I
`
`I
`
`.*
`
`I I
`
`I
`
`I
`
`I
`" I
`’
`
`I
`
`’
`
`i
`I .
`I
`I 3
`I
`1‘
`I
`I I
`I
`I
`I
`:
`:
`I
`I”
`I
`l
`I
`I
`I .
`I
`I
`I u”
`I
`I
`I
`,
`
`
`: I
`I
`|
`
`:
`I
`
`..'
`"a
`” I
`I
`I
`I
`I
`|
`I
`-I
`I
`I
`I
`I
`e
`
`
`
`---...3. I:
`
`g'
`
`y. (“"-
`
`m.a.1...»-
`
`II
`
`4
`
`Figure O-ZI The “nearly mm systolic array Ier eelm
`the triennial- lineer In!“ In figure H.
`
`The 9;, which ere Iain-fly zero. ere lowed Ieflmrd through the systolic errey
`whilolhe “I- eueI-Idbierempodelmeiedinfi'weS-z 'l'helellend
`W ie special In IIuI II perIIII-IIII III-Ibrnulfl.
`(In Iect. IIIe specie! processor
`“main uclanIeIerelhe WIMMhelpuiel ceeeof
`
`1’
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 24
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 24
`
`

`

`this more general processor.) Each y; accumulates inner product terms in the rest of
`the processors as it moves to the leil. At the time yi reaches the Ielt end processor
`it has the value 'it‘t"iz‘zh-“iJ-HE-l- and. consequently,
`the li coinputed by
`air-(bi-yitfaii at the processor will have the correct value. Figure 6-3 demonstrates
`the tlrst seven pulsations ot the systolic array. From the figure one can check that
`
`tho tinal values of ‘1' x2, :3 and 14 are all correct. w m firstoiig m E: gm
`
`mummmmI r MmlmmMMLuinmmLsfl
`m As we observed for the matrix-vector multiplication problem. the number of
`processors required by the array can be reduced to wfz.
`
`'7. Applications and Comments
`
`7.1 Variants ot the Systolic Arrey
`
`'5!”
`
`It more lntormalion is available about the specific matrices involved, an optimized
`version of the systolic arrays presented above can be used.
`it is important that the
`
`reader understands the basic principles so that he can construct appropriate
`variants tor his specific problems. No attempt is made here to list all the possible
`variants.
`
`As pointed out
`
`in Section 1. although most oi our illustrations are of band
`
`matrices, all the systolic arrays vroriv. for regular nan dense matrices.
`
`In this case
`
`It the band width ot a matrix is so largo
`tho band width of the matrix is It - 2n-l.
`that
`it
`requires more. processors than a given array provides. then one should
`decompose the matrix and solve each subproblem on the network.
`Thus, for
`example. the matrimr unalliplication ot two nan matrices or the tU-decomposition of an
`nan metric can be done in Gina/k2) time on a lurk systolic array.
`
`One can often reduce the number ot processors required by a systolic array ii the
`
`metric is known to be sparse or symmetric. For example. the matrices arising from a
`eel ot finite dillerences or finite elements approximations to dillerenlial equations
`are usually “sparse bend matrices”. These are band matrices whose nonzero entries
`
`appear only in a tear of those lines in the band which are parallel to the diagonal.
`
`In
`
`this case by introducing proper delays to each processor tor shitting its data to its
`neighbors. the number ot processors required by the systolic array in Section 3 can
`
`be reduced to the number ot those diagonsis which contain nonzero entries. This
`variant is useful tor parlorming iterative methods involving sparse band matrices.
`Another example concerns the LU-decomposition probbm considered in Section 5.
`it
`
`matrix A is symmetric positive definite, then it is possible to use only the toll portion
`ot m. hes-contacted network. since in this case u a simply oLT vrhere o is the
`
`
`
`____4_.__
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 25
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 25
`
`

`

`.
`
`i
`
`L
`
`Poise
`anber
`
`O
`
`1 .
`
`Configuration
`
`Comments
`'
`
`__
`
`3': can“ processor a.
`
`3‘. novel lot: on position.
`
`3; onto:- proconor 4.
`
`x - al.- 1 3/
`.
`<='=. - hf-ul- :2“. y. - 0-)
`
`2
`
`a
`
`4.
`
`U
`
`'
`
`‘
`
`,
`
`.
`
`0
`
`'
`
`
`
`Petitioner Microsoft Corporation - Ex. 1015, p. 26
`
`y; - on :l.
`
`3:, " (bu. - 5V9...
`73 ' lull.
`
`’3 " 'JI ‘Iflaflv
`N" .11 x] '
`
`x. 1: and out
`33 '0';- n)/‘u'
`,4 - ~l 11+ ‘ulsu
`
`-
`+
`+
`z: . :32. M w:
`
`o
`
`3:11 pupal out.
`I: " 0H - VqJ/Iw-
`.ys - fi‘xti’ .3335-
`
`J
`
`;
`
`‘
`
`‘
`
`'
`
`Petitioner Microsoft Corporation - EX. 1015 , p. 26
`
`

`

`H's-«er
`
`diagonal matrix {akhm}.
`
`the systolic network to solve a particular
`the size of
`The optimal choice of
`problem depends upon not only the problem but also the memory bandwidth to the
`
`hoat computer.
`
`For achieving high perlormance, it
`
`is desirable to have as many
`
`processors as possible in the network, provided they can all be kept busy doing
`usetul computations.
`'
`
`is possible to use our. systolic arrays to solve some nonnumerical problems
`it
`when appropriate interpretations are given to the addition (+1 and multiplication (x)
`
`operations. For example. some pattern matching problems can be viewed as matrix
`problems with campariscln and Boolean operations.
`it
`is possible to store a
`
`dynamically changing data structure in a systolic array so that an order statistic can
`always be determined in constant
`time. We shall report these results in a future
`paper.
`it can be instructive to view the + and x operations as operations in an
`
`abstract algebraic structure such as a semiring and then to examine how our results
`hold in such an abstract setting.
`
`7.2 Convolution, Filter. and Discrete Fourier Transform
`
`There are a number of
`
`important problems which can be formulated as
`
`matrix-vector multiplication problems and thus can be solved rapidly by the systolic
`
`array in Section 3. The problems of computing convolutions, finite impulse response
`
`It a matrix has ,the
`(FIR) litters. and discrete Fourier transform are. such examples.
`property that the entries on any line parallel to the diagonal are all the same. then
`
`the matrix is s Toeplitz matrix. The convolution problem is simply the matrix-vector
`
`multiplication where the matrix is a triangular Toeplitz matrix (see Figure 7-1).
`4
`
`AI p-tap FIR tiller can be viewed as a matrix-vector multiplication where the
`matrix is an band upper triangular Toeplitz matrix with band width w - p. Figure .7-2
`
`represents the computation of a A-tap tiller.
`
`On the other hand, an n-point discrete Fourier tranetorm is "I metrix-voctOr
`multiplication, where the lid} entry at the matrix is ohm?

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket