AD-A066 060
UNCLASSIFIED
CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE
SYSTOLIC ARRAYS FOR (VLSI)
H. T. KUNG, C. E. LEISERSON
DEC 78
CMU-CS-79-103
Petitioner Microsoft Corporation - Ex. 1015, p. 1
`
CMU-CS-79-103

SYSTOLIC ARRAYS FOR (VLSI)

H. T. Kung and Charles E. Leiserson

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, Pennsylvania 15213

(Last revised December 1978)

April 1978

[In the forthcoming book Introduction to VLSI Systems by C. A. Mead and L. A. Conway]

Copyright (C) 1979 by H. T. Kung and Charles E. Leiserson

All Rights Reserved
`
`
This research is supported in part by the National Science Foundation under Grant MCS 75-222-55 and the Office of Naval Research under Contract N00014-76-C-0370, NR 044-422. Charles E. Leiserson is supported in part by the Fannie and John Hertz Foundation.

This document has been approved for public release and sale; its distribution is unlimited.
`
And now I see with eye serene
The very pulse of the machine.
--William Wordsworth

Abstract

A systolic system is a network of processors which rhythmically compute and pass data through the system. Physiologists use the word "systole" to refer to the rhythmically recurrent contraction of the heart and arteries which pulses blood through the body. In a systolic computing system, the function of a processor is analogous to that of the heart. Every processor regularly pumps data in and out, each time performing some short computation, so that a regular flow of data is kept up in the network.

Many basic matrix computations can be pipelined elegantly and efficiently on systolic networks having an array structure. As an example, hexagonally connected processors can optimally perform matrix multiplication. Surprisingly, a similar systolic array can compute the LU-decomposition of a matrix. These systolic arrays enjoy simple and regular communication paths, and almost all processors used in the networks are identical. As a result, special purpose hardware devices based on systolic arrays can be built inexpensively using the VLSI technology.
`
`
1. Introduction

Developments in microelectronics have revolutionized computer design. Integrated circuit technology has increased the number and complexity of components that can fit on a chip or a printed circuit board. Component density has been doubling every one to two years and already, a multiplier can fit on a very large scale integrated (VLSI) circuit chip. As a result, the new technology makes it feasible to build low-cost special purpose, peripheral devices to rapidly solve sophisticated problems. Reflecting the changing technology, this paper proposes new multiprocessor
`
`
structures for processing some basic matrix computations.
`
We are interested in high-performance parallel structures that can be implemented directly as low-cost hardware devices. By performance, we are not referring to the traditional operation counts that characterize classical analyses of algorithms, but rather the throughput obtainable when a special purpose peripheral device is attached to a general purpose host computer. This implies that time spent in I/O, control, and data movement as well as arithmetic must all be considered. VLSI offers excellent opportunities for inexpensive implementation of high performance devices (Mead and Conway [1978]). Thus, in this paper the cost of a device will be determined by the expense of a VLSI implementation. "Fit the job to the bargain components" -- Blakeslee [1975, p. 4].
`
VLSI technology has made one thing clear. Simple and regular interconnections lead to cheap implementations and high densities, and high density implies both high performance and low overhead for support components. (Sutherland and Mead [1977] has a good discussion on the importance of having simple and regular geometries for data paths.) For these reasons, we are interested in designing multiprocessor structures which have simple and regular communication paths. We are also interested in employing pipelining as a general method for using these structures. By pipelining, computation may proceed concurrently with input and output, and consequently overall execution time is minimized. Pipelining plus multiprocessing at each stage of a pipeline should lead to the best-possible performance.
`
Systolic systems provide a realistic model of computation which captures the concepts of pipelining, parallelism and interconnection structures. We do not want to give a formal definition of systolic systems here. For the purpose of this paper, it suffices to view a systolic system as a network of processors which rhythmically compute and pass data through the system. The analogy is to the rhythmic contraction of the heart which pulses blood through the circulatory system of the body. Each processor in a systolic network can be thought of as a heart that pumps multiple streams of data through itself. The regular beating of these parallel processors keeps up a constant flow of data throughout the entire network. As a processor pumps data items through, it performs some constant-time computation and may update some of the items.

Unlike the closed-loop circulatory system of the body, a systolic computing system usually has ports into which inputs flow, and ports where the results of the systolic computation are retrieved. Thus a systolic system can be a pipelined system -- input
`
`
and output occur with every pulsation. This makes them attractive as peripheral processors attached to the data channel of a host computer. Figure 1-1 illustrates how a special purpose systolic device might form a part of a PDP-11 system. A systolic device may also process a real-time data stream or be a component in a larger special purpose system.
`
`
`
Figure 1-1: A systolic device connected to the UNIBUS of a PDP-11.
`
This paper deals largely with systolic systems where the underlying network is array structured. (See also Kung and Leiserson [1978].) An array network is attractive for it enjoys simple and regular communication paths. In Section 2, we describe the basic hardware requirements and interconnection schemes for the systolic arrays proposed and discuss the feasibility of building them in VLSI. Section 3 deals with the matrix-vector multiplication problem. Multiplication of two matrices is considered in Section 4. In Section 5, we show that essentially the same systolic arrays for matrix multiplication in Section 4 can be used to find the LU-decomposition of a matrix. Section 6 is concerned with solving triangular linear systems. We show that this problem can be solved by almost the same systolic array for matrix-vector multiplication described in Section 3. Section 7 discusses applications and extensions of the results presented in the previous sections. The applications include the computations of finite impulse response filters, convolutions, and discrete Fourier transforms. Some concluding remarks are given in the last section.
`
The size of each of our systolic array networks is dependent only on the band width of the band matrix to be processed, and is independent of the length of the band. Thus, a fixed size systolic array can pipeline band matrices with arbitrarily long bands. The pipelining aspect of our arrays is, of course, most effective for band matrices with long bands. Band matrices are interesting in their own right, since many important scientific computations involve band matrices. For these reasons, most of the results in this paper will be presented in terms of their applications to band matrices. All the results apply to dense matrices since a dense
`
`
`
matrix can be viewed as a band matrix having the maximum-possible band width.
`
2. The Basic Components and Systolic Array Structures

2.1 The Inner Product Step Processor
`
The single operation common to all the computations considered in this paper is the so-called inner product step, C <- C + A x B. We postulate a processor which has three registers R_A, R_B, and R_C. Each register has two connections, one for input and one for output. Figure 2-1 shows two types of geometries for this processor.
`
Figure 2-1: Geometries for the inner product step processor.
`
Type (a) geometry will be used for matrix-vector multiplication and solution of triangular linear systems (Sections 3 and 6), whereas type (b) geometry will be used for matrix multiplication and LU-decomposition (Sections 4 and 5). The processor is capable of performing the inner product step and is called the inner product step processor. We shall define a basic time unit in terms of the operation of this processor. In each unit time interval, the processor shifts the data on its input lines denoted by A, B and C into R_A, R_B and R_C, respectively, computes R_C <- R_C + R_A x R_B, and makes the input values for R_A and R_B together with the new value of R_C available as outputs on the output lines denoted by A, B and C, respectively. All outputs are latched and the logic is clocked so that when one processor is connected to another, the changing output of one during a unit time interval will not interfere with the input to another during this time interval. This is not the only processing element we shall make use of, but it will be the work horse.
`
A special processor for performing division will be specified later when it is used.
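The behavior of the inner product step processor can be sketched in software. The following minimal Python model is ours, not part of the paper's specification (the class and method names are assumptions); it shows the shift of A, B and C into the registers, the update R_C <- R_C + R_A x R_B, and the latched outputs.

```python
class InnerProductStepProcessor:
    """Model of the inner product step processor: three registers
    R_A, R_B, R_C, each with one input and one output connection."""

    def __init__(self):
        self.ra = self.rb = self.rc = 0

    def pulse(self, a_in, b_in, c_in):
        """One unit time interval: shift the inputs into the registers,
        compute R_C <- R_C + R_A * R_B, and latch all outputs."""
        self.ra, self.rb, self.rc = a_in, b_in, c_in
        self.rc = self.rc + self.ra * self.rb
        # Outputs: A and B pass through unchanged, C carries the update.
        return self.ra, self.rb, self.rc


p = InnerProductStepProcessor()
print(p.pulse(2, 3, 10))  # (2, 3, 16): C accumulates 10 + 2*3
```

Because the returned (latched) values are functions only of this pulse's inputs, connecting such models in a chain mimics the clocking discipline described above: one processor's changing output cannot interfere with another's input within a time unit.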
`
2.2 Systolic Arrays
`
A systolic device is typically composed of many interconnected inner product step processors. The basic network organization we shall adopt is the mesh-connected scheme in which all connections from a processor are to neighboring processors. (See Figure 2-2.)
`
(a) linearly connected  (b) orthogonally connected (ILLIAC IV)  (c) hexagonally connected

Figure 2-2: Mesh-connected systolic arrays.
`
The most widely known system based on this organization is the ILLIAC IV (Barnes et al. [1968]). If diagonal connections are added in one direction only, we shall call the resulting scheme hexagonally mesh-connected or hex-connected for short. We shall demonstrate that linearly connected and hex-connected arrays are natural for matrix problems.
`
Processors lying on the boundary of the systolic array may have external connections to the host memory. Thus, an input/output data path of a boundary processor may sometimes be designated as an external input/output connection for the device. A boundary processor may receive input from the host memory through such an external connection, or it may receive a fixed value such as zero. On the other hand, a boundary processor can send data to the host memory through an external output connection. An output of a boundary processor may sometimes be
`
ignored. This will be designated by omitting the corresponding output line.
In this paper we assume that the processors in a systolic array are synchronous as described in Section 2.1. However, it is possible to view the processors as being asynchronous, each computing its output values when all its inputs are available, as in a data flow model. For the results of this paper we believe the synchronous approach to be more direct and intuitive.
`
The hardware demands of the systolic arrays in this paper are readily seen to be modest. The processing elements are uniform, interprocessor connections are simple and regular, and external connections are minimized. It is our belief that construction of these systolic arrays will prove to be cost-effective using, for instance, the modern VLSI technology.
`
3. Matrix-Vector Multiplication on a Linear Systolic Array
`
We consider the problem of multiplying a matrix A = (a_ij) with a vector x = (x_1, ..., x_n)^T. The elements in the product y = (y_1, ..., y_n)^T can be computed by the following recurrences:

    y_i^(1) = 0,
    y_i^(k+1) = y_i^(k) + a_ik x_k,     k = 1, ..., n,
    y_i = y_i^(n+1).
Suppose A is an nxn band matrix with band width w = p+q-1. (See Figure 3-1 for the case when p = 2 and q = 3.) Then the above recurrences can be evaluated by pipelining the x_i and y_i through a systolic array consisting of w linearly connected inner product step processors. We illustrate the operation of the systolic array for the band matrix-vector multiplication problem in Figure 3-1. For this case the linearly connected systolic array has four inner product step processors. See Figure 3-2.
`
The general scheme of the computation can be viewed as follows. The y_i, which are initially zero, are pumped to the left while the x_i are pumped to the right and the a_ij are marching down. (For the general problem of computing Ax + d where d = (d_1, ..., d_n)^T is any given vector, y_i should be initialized as d_i.) All the moves are synchronized. It turns out that each y_i is able to accumulate all its terms, namely, a_{i,i-2} x_{i-2}, a_{i,i-1} x_{i-1}, a_{i,i} x_i and a_{i,i+1} x_{i+1}, before it leaves the network. Figure 3-3 illustrates the first seven pulsations of the systolic array. Note that when y_1 and y_2 are output they have the correct values. Observe also that at any given time
`
Figure 3-1: Multiplication of a vector by a band matrix with p = 2 and q = 3.
Figure 3-2: The linearly connected systolic array for the matrix-vector multiplication problem in Figure 3-1.
`
`
Figure 3-3: The first seven pulsations of the linear systolic array in Figure 3-2. (The figure tabulates, pulse by pulse, the array configuration and comments: y_1, initialized as zero, is pumped into the fourth processor; x_1 is then pumped into the first processor while y_1 moves left one place, and from then on the x_i and y_i keep moving right and left, respectively; each y_i accumulates its terms as it meets the a_ij and x_i, and is pumped out of the left end processor with the correct value.)
`
alternate processors are idle. Indeed, by coalescing pairs of adjacent processors, it is possible to use w/2 processors in the network for a general band matrix with band width w.
`
`
`
We now specify the operation of the systolic array more precisely. Assume that the processors are numbered by integers 1, 2, ..., w from the left end processor to the right end processor. Each processor has three registers, R_A, R_x and R_y, which will hold entries in A, x and y, respectively. Initially, all registers contain zeros. Each pulsation of the systolic array consists of the following operations, but for odd numbered pulses only odd numbered processors are activated and for even numbered pulses only even numbered processors are activated.
`
1. Shift.

   - R_A gets a new element in the band of matrix A.

   - R_x gets the contents of register R_x from the left neighboring node. (The R_x in processor 1 gets a new component of x.)

   - R_y gets the contents of register R_y from the right neighboring node. (Processor 1 outputs its R_y contents and the R_y in processor w gets zero.)

2. Multiply and Add.

   R_y <- R_y + R_A x R_x
`
Using the type (a) inner product step processor postulated in Section 2, we note that the three shift operations in step 1 can be done simultaneously, and that each pulsation of the systolic array takes a unit of time. Suppose the band width of A is w = p+q-1. It is readily seen that after w units of time the components of the product y = Ax are pumped out from the left end processor at the rate of one output every two units of time. Therefore, using our systolic network all the n components of y can be computed in 2n+w time units, as compared to the Omega(wn) time needed for a sequential algorithm on a uniprocessor computer.
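The net arithmetic effect of the array -- each y_i accumulating its band terms a_ik x_k -- can be checked directly in software. The short Python sketch below is ours (the example entries are assumptions, not from the paper); it evaluates the recurrences of this section restricted to the band and agrees with the dense product.

```python
def band_matvec(a, x, p, q):
    """Evaluate y_i = sum_k a[i][k] * x[k] using the recurrences
    y_i^(1) = 0, y_i^(k+1) = y_i^(k) + a_ik x_k, restricted to the
    band i-p < k < i+q of width w = p+q-1 (all other a_ik are zero)."""
    n = len(x)
    y = [0] * n
    for i in range(n):
        # k ranges over the band: i-(p-1) <= k <= i+(q-1), clipped to 0..n-1.
        for k in range(max(0, i - p + 1), min(n, i + q)):
            y[i] += a[i][k] * x[k]
    return y

# Band matrix with p = 2, q = 3 as in Figure 3-1 (entries are ours).
a = [[1, 2, 3, 0],
     [4, 5, 6, 7],
     [0, 8, 9, 1],
     [0, 0, 2, 3]]
x = [1, 1, 2, 1]
print(band_matvec(a, x, 2, 3))  # [9, 28, 27, 7], same as the dense product A x
```

This checks only the arithmetic, not the 2n+w pulsation schedule; the timing argument is as given in the text.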
`
4. Matrix Multiplication on a Hexagonal Systolic Array
`
This section considers the problem of multiplying two nxn matrices. It is easy to see that the matrix product C = (c_ij) of A = (a_ij) and B = (b_ij) can be computed by the following recurrences:

    c_ij^(1) = 0,
    c_ij^(k+1) = c_ij^(k) + a_ik b_kj,     k = 1, ..., n,
    c_ij = c_ij^(n+1).
`
`
Let A and B be nxn band matrices of band width w_1 and w_2, respectively. We show how the recurrences above can be evaluated by pipelining the a_ij, b_ij and c_ij through a systolic array having w_1 w_2 hex-connected inner product step processors. We illustrate the general scheme by considering the matrix multiplication problem depicted in Figure 4-1. The diamond shaped systolic array for this case is shown in Figure 4-2, where processors are hex-connected and data flows are indicated by arrows.
`
Figure 4-1: Band matrix multiplication.
`
The elements in the bands of A, B and C are pumped through the systolic network in three directions synchronously. Each c_ij is initialized to zero as it enters the network through the bottom boundaries. (For the general problem of computing AB + D where D = (d_ij) is any given matrix, c_ij should be initialized as d_ij.) One can easily see that with the type (b) inner product step processors described in Section 2, each c_ij is able to accumulate all its terms before it leaves the network through the upper boundaries. Figure 4-3 shows four consecutive pulsations of the hexagonal systolic array. The reader is invited to study the data flow of this problem more closely by making transparencies of the band matrices shown in the figures, and moving them over the network picture as described above.
`
Let A and B be nxn band matrices of band width w_1 and w_2, respectively. Then a systolic array of w_1 w_2 hex-connected processors can pipeline the matrix multiplication A x B in 3n + min(w_1, w_2) units of time. Note that in any row or column of the network, out of every three consecutive processors, only one is active at any given time. It is possible to use about (w_1 w_2)/3 processors in the network for multiplying two band matrices with band widths w_1 and w_2.
`
`
Figure 4-2: The hex-connected systolic array for the matrix multiplication problem in Figure 4-1.
`
`
`
`
`
`
`
`
Figure 4-3: Four pulsations of the hexagonal systolic array in Figure 4-2.
`
`
`
5. The LU-Decomposition of a Matrix on a Hexagonal Systolic Array

The problem of factoring a matrix A into lower and upper triangular matrices L and U is called LU-decomposition. Figure 5-1 illustrates the LU-decomposition of a band matrix with p = 4 and q = 4. Once the L and U factors are known, it is relatively easy to invert A or solve the linear system Ax = b. We deal with the latter problem in Section 6. This section describes a hexagonal systolic array for computing LU-decompositions.
`
Figure 5-1: The LU-decomposition of a band matrix.
`
We assume that matrix A has the property that its LU-decomposition can be done by Gaussian elimination without pivoting. (This is true, for example, when A is a symmetric positive-definite, or an irreducible, diagonally dominant matrix.) The triangular matrices L = (l_ij) and U = (u_ij) are evaluated according to the following recurrences:
`
    a_ij^(1) = a_ij,
    a_ij^(k+1) = a_ij^(k) + l_ik (-u_kj),

    l_ik = 1,                 if i = k,
    l_ik = a_ik^(k) / u_kk,   if i > k,
    l_ik = 0,                 if i < k,

    u_kj = a_kj^(k),   if k <= j,
    u_kj = 0,          if k > j.

We show that the evaluation of these recurrences can be pipelined on a systolic array of hex-connected processors. A global view of this pipelined computation is shown in Figure 5-2 for the LU-decomposition problem depicted in Figure 5-1. The systolic array in Figure 5-2 is constructed as follows.
The processors below the upper boundaries are the standard type (b) inner product step processors and are hex-connected exactly the same as the matrix multiplication network presented in Section 4. The processor at the top, denoted by a circle, is a special processor. It computes the reciprocal of its input and pumps the result southwest, and also pumps the same input northward unchanged. The other processors on the upper boundaries are again type (b) inner product step processors, but their orientation is changed: the ones on the upper left boundary are rotated 120 degrees clockwise; the ones on the upper right boundary are rotated 120 degrees counterclockwise.
`
The flow of data on the systolic array is indicated by arrows in the figure. As in the hexagonal systolic array for matrix multiplication, each processor only operates every third time pulse. Figure 5-3 illustrates four consecutive pulsations of the systolic array. Note that in the figure, because A is a band matrix with p = 4 and q = 4, we have that a_{i+3,i}^(k) = a_{i+3,i} and a_{i,i+3}^(k) = a_{i,i+3} for 1 <= k <= i and i >= 2. Thus a_52, for example, can be viewed as a_52^(2) when it enters the network.
`
There are several equivalent systolic arrays that reflect only minor changes to the network presented in this section. For example, the elements of L and U can be retrieved as output in a number of different ways. Also, the "-1" input to the network can be changed to a "+1" if the special processor at the top of the network computes minus the reciprocal of its input.
`
If A is an nxn band matrix with band width w = p+q-1, a systolic array having no more than pq hex-connected processors can compute the LU-decomposition of A in 3n + min(p,q) units of time. If A is an nxn dense matrix, this means that n^2 hex-connected processors can compute the L and U matrices in 4n units of time which includes I/O time.
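The recurrences of this section can be exercised directly in software. The Python sketch below is ours (the example matrix is a made-up diagonally dominant one, so no pivoting is needed); it evaluates the recurrences sequentially and recovers L and U.

```python
def lu_recurrences(a):
    """LU-decomposition without pivoting via the recurrences:
    a_ij^(k+1) = a_ij^(k) + l_ik * (-u_kj), with l_kk = 1,
    l_ik = a_ik^(k) / u_kk for i > k, and u_kj = a_kj^(k) for k <= j."""
    n = len(a)
    a = [row[:] for row in a]  # working copy holds the a_ij^(k)
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(k, n):
            u[k][j] = a[k][j]            # u_kj = a_kj^(k)
        l[k][k] = 1.0
        for i in range(k + 1, n):
            l[i][k] = a[i][k] / u[k][k]  # uses the reciprocal of u_kk
        for i in range(k + 1, n):
            for j in range(n):           # a_ij^(k+1) = a_ij^(k) + l_ik (-u_kj)
                a[i][j] += l[i][k] * (-u[k][j])
    return l, u

# Diagonally dominant example (ours), so Gaussian elimination needs no pivoting.
l, u = lu_recurrences([[4.0, 1.0, 0.0],
                       [1.0, 4.0, 1.0],
                       [0.0, 1.0, 4.0]])
```

Multiplying the returned factors reproduces A, which is the property the systolic array computes on the fly; the reciprocal computed by the special top processor corresponds to the division by u_kk above.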
`
The remarkable fact that the matrix multiplication network forms a major part of the LU-decomposition network is due to the similarity of their defining recurrences. In any row or column of the LU-decomposition systolic array, only one out of every three consecutive processors is active at a given time. As we observed for matrix multiplication, the number of processors can be reduced to about pq/3.
`
`
`
`—sseeeeeeeeeeeee=
`
Figure 5-2: The hex-connected systolic array for pipelining the LU-decomposition of the band matrix in Figure 5-1.
`
`
`
`
`
Figure 5-3: Four pulsations of the hexagonal systolic array in Figure 5-2.
`
`
6. Solving a Triangular Linear System on a Linear Systolic Array

Suppose that we want to solve a linear system Ax = b. Then after having done the LU-decomposition of A (e.g., by methods described in Section 5), we still have to solve two triangular linear systems Ly = b and Ux = y. This section concerns itself with the solution of triangular linear systems. An upper triangular linear system can always be rewritten as a lower triangular linear system. Without loss of generality, this section deals exclusively with lower triangular linear systems.
`
Let A = (a_ij) be a nonsingular nxn band lower triangular matrix. Suppose that A and an n-vector b = (b_1, ..., b_n)^T are given. The problem is to compute x = (x_1, ..., x_n)^T such that Ax = b. The vector x can be computed by forward substitution:

    y_i^(1) = 0,
    y_i^(k+1) = y_i^(k) + a_ik x_k,     k = 1, ..., i-1,
    x_i = (b_i - y_i^(i)) / a_ii.
`
Figure 6-1: The band (lower) triangular linear system where q = 4.
`
Suppose that A is a band matrix with band width w = q. (See Figure 6-1 for the case when q = 4.) Then the above recurrences can be evaluated by a systolic array similar to that used for band matrix-vector multiplication in Section 3. (Observe the similarity of the defining recurrences for these two problems.) We illustrate our
`
`
result by considering the linear system problem in Figure 6-1. For this case, the systolic array is described in Figure 6-2.
`
Figure 6-2: The linearly connected systolic array for solving the triangular linear system in Figure 6-1.
`
`
The y_i, which are initially zero, are forced leftward through the systolic array while the x_i, a_ij and b_i are pumped as indicated in Figure 6-2. The left end processor is special in that it performs x_i <- (b_i - y_i)/a_ii. (In fact, the special processor introduced in Section 5 to solve the LU-decomposition problem is a special case of
`
this more general processor.) Each y_i accumulates inner product terms in the rest of the processors as it moves to the left. At the time y_i reaches the left end processor it has the value a_{i1} x_1 + a_{i2} x_2 + ... + a_{i,i-1} x_{i-1}, and, consequently, the x_i computed by x_i <- (b_i - y_i)/a_ii at the processor will have the correct value. Figure 6-3 demonstrates the first seven pulsations of the systolic array. From the figure one can check that the final values of x_1, x_2, x_3 and x_4 are all correct. With this systolic array we can solve an nxn band triangular linear system with band width w = q in 2n+q units of time. As we observed for the matrix-vector multiplication problem, the number of processors required by the array can be reduced to w/2.
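The forward substitution recurrences of this section can be checked with a few lines of sequential code; the Python sketch below is ours (the band lower triangular example is an assumption, not from the paper).

```python
def forward_substitution(a, b):
    """Solve the lower triangular system Ax = b by the recurrences
    y_i^(k+1) = y_i^(k) + a_ik x_k and x_i = (b_i - y_i^(i)) / a_ii."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        y = 0.0
        for k in range(i):  # accumulate a_i1 x_1 + ... + a_{i,i-1} x_{i-1}
            y += a[i][k] * x[k]
        x[i] = (b[i] - y) / a[i][i]
    return x

# Band lower triangular example with q = 2 (entries are ours).
a = [[2.0, 0.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 1.0, 2.0]]
print(forward_substitution(a, [2.0, 5.0, 6.0]))  # [1.0, 2.0, 2.0]
```

Note how the division by a_ii appears once per component, mirroring the single special processor at the left end of the array.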
`
7. Applications and Comments

7.1 Variants of the Systolic Array
`
If more information is available about the specific matrices involved, an optimized version of the systolic arrays presented above can be used. It is important that the reader understands the basic principles so that he can construct appropriate variants for his specific problems. No attempt is made here to list all the possible variants.
`
As pointed out in Section 1, although most of our illustrations are of band matrices, all the systolic arrays work for regular nxn dense matrices. In this case the band width of the matrix is w = 2n-1. If the band width of a matrix is so large that it requires more processors than a given array provides, then one should decompose the matrix and solve each subproblem on the network. Thus, for example, the matrix multiplication of two nxn matrices or the LU-decomposition of an nxn matrix can be done in O(n^3/k^2) time on a kxk systolic array.
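The decomposition argument admits a simple software analogue. In the sketch below (ours; the example matrix is made up), each kxk block product stands in for one matrix product pass through a kxk systolic array; there are (n/k)^3 such block products, each taking O(k) array time, giving the O(n^3/k^2) total.

```python
def blocked_matmul(a, b, k):
    """Compute C = AB by decomposing into k-by-k subproblems, each of
    which would be one product pipelined on a k x k systolic array."""
    n = len(a)
    assert n % k == 0
    c = [[0] * n for _ in range(n)]
    for bi in range(0, n, k):
        for bj in range(0, n, k):
            for bk in range(0, n, k):
                # One k x k block product:
                # C[bi:bi+k, bj:bj+k] += A[bi:bi+k, bk:bk+k] * B[bk:bk+k, bj:bj+k]
                for i in range(bi, bi + k):
                    for j in range(bj, bj + k):
                        c[i][j] += sum(a[i][t] * b[t][j] for t in range(bk, bk + k))
    return c

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(blocked_matmul(a, a, 2))  # equals the dense product A * A
```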
`
One can often reduce the number of processors required by a systolic array if the matrix is known to be sparse or symmetric. For example, the matrices arising from a set of finite differences or finite elements approximations to differential equations are usually "sparse band matrices". These are band matrices whose nonzero entries appear only in a few of those lines in the band which are parallel to the diagonal. In this case by introducing proper delays to each processor for shifting its data to its neighbors, the number of processors required by the systolic array in Section 3 can be reduced to the number of those diagonals which contain nonzero entries. This variant is useful for performing iterative methods involving sparse band matrices. Another example concerns the LU-decomposition problem considered in Section 5. If matrix A is symmetric positive definite, then it is possible to use only the left portion of the hex-connected network, since in this case U is simply DL^T where D is the
`
`
Figure 6-3: The first seven pulsations of the linear systolic array in Figure 6-2. (The figure tabulates, pulse by pulse, the array configuration and comments: y_1 enters processor 4 and moves left one position each pulse; the left end processor computes each x_i as (b_i - y_i)/a_ii; each subsequent y_i accumulates its terms and each x_i is pumped out with the correct value.)
`
`
`
diagonal matrix diag(u_11, ..., u_nn).
`
The optimal choice of the size of the systolic network to solve a particular problem depends upon not only the problem but also the memory bandwidth to the host computer. For achieving high performance, it is desirable to have as many processors as possible in the network, provided they can all be kept busy doing useful computations.
`
It is possible to use our systolic arrays to solve some nonnumerical problems when appropriate interpretations are given to the addition (+) and multiplication (x) operations. For example, some pattern matching problems can be viewed as matrix problems with comparison and Boolean operations. It is possible to store a dynamically changing data structure in a systolic array so that an order statistic can always be determined in constant time. We shall report these results in a future paper. It can be instructive to view the + and x operations as operations in an abstract algebraic structure such as a semiring and then to examine how our results hold in such an abstract setting.
`
7.2 Convolution, Filter, and Discrete Fourier Transform
`
There are a number of important problems which can be formulated as matrix-vector multiplication problems and thus can be