[Report documentation card: AD-A066 060, UNCLASSIFIED. "Systolic Arrays for (VLSI)," H. T. Kung and C. E. Leiserson, Carnegie-Mellon University, Pittsburgh, PA, Dept. of Computer Science, report CMU-CS-79-103, December 1978.]

CMU-CS-79-103

SYSTOLIC ARRAYS FOR (VLSI)

H. T. Kung and Charles E. Leiserson

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, Pennsylvania 15213

April 1978
(Last revised December 1978)

[In the forthcoming book Introduction to VLSI Systems by C. A. Mead and L. A. Conway]

Copyright © 1979 by H. T. Kung and Charles E. Leiserson

All Rights Reserved

This research is supported in part by the National Science Foundation under Grant MCS 75-222-55 and the Office of Naval Research under Contract N00014-76-C-0370, NR 044-422. Charles E. Leiserson is supported in part by the Fannie and John Hertz Foundation.

This document has been approved for public release and sale; its distribution is unlimited.

And now I see with eye serene
The very pulse of the machine.
    --William Wordsworth

Abstract

A systolic system is a network of processors which rhythmically compute and pass data through the system. Physiologists use the word "systole" to refer to the rhythmically recurrent contraction of the heart and arteries which pulses blood through the body. In a systolic computing system, the function of a processor is analogous to that of the heart. Every processor regularly pumps data in and out, each time performing some short computation, so that a regular flow of data is kept up in the network.

Many basic matrix computations can be pipelined elegantly and efficiently on systolic networks having an array structure. As an example, hexagonally connected processors can optimally perform matrix multiplication. Surprisingly, a similar systolic array can compute the LU-decomposition of a matrix. These systolic arrays enjoy simple and regular communication paths, and almost all processors used in the networks are identical. As a result, special purpose hardware devices based on systolic arrays can be built inexpensively using the VLSI technology.

1. Introduction

Developments in microelectronics have revolutionized computer design. Integrated circuit technology has increased the number and complexity of components that can fit on a chip or a printed circuit board. Component density has been doubling every one to two years and already, a multiplier can fit on a very large scale integrated (VLSI) circuit chip. As a result, the new technology makes it feasible to build low-cost special purpose, peripheral devices to rapidly solve sophisticated problems. Reflecting the changing technology, this paper proposes new multiprocessor structures for processing some basic matrix computations.

We are interested in high-performance parallel structures that can be implemented directly as low-cost hardware devices. By performance, we are not referring to the traditional operation counts that characterize classical analyses of algorithms, but rather, the throughput obtainable when a special purpose peripheral device is attached to a general purpose host computer. This implies that time spent in I/O, control, and data movement as well as arithmetic must all be considered. VLSI offers excellent opportunities for inexpensive implementation of high performance devices (Mead and Conway [1978]). Thus, in this paper the cost of a device will be determined by the expense of a VLSI implementation. "Fit the job to the bargain components" -- Blakeslee [1975, p. 4].

VLSI technology has made one thing clear. Simple and regular interconnections lead to cheap implementations and high densities, and high density implies both high performance and low overhead for support components. (Sutherland and Mead [1977] has a good discussion on the importance of having simple and regular geometries for data paths.) For these reasons, we are interested in designing multiprocessor structures which have simple and regular communication paths. We are also interested in employing pipelining as a general method for using these structures. By pipelining, computation may proceed concurrently with input and output, and consequently overall execution time is minimized. Pipelining plus multiprocessing at each stage of a pipeline should lead to the best-possible performance.

Systolic systems provide a realistic model of computation which captures the concepts of pipelining, parallelism and interconnection structures. We do not want to give a formal definition of systolic systems here. For the purpose of this paper, it suffices to view a systolic system as a network of processors which rhythmically compute and pass data through the system. The analogy is to the rhythmic contraction of the heart which pulses blood through the circulatory system of the body. Each processor in a systolic network can be thought of as a heart that pumps multiple streams of data through itself. The regular beating of these parallel processors keeps up a constant flow of data throughout the entire network. As a processor pumps data items through, it performs some constant-time computation and may update some of the items.

Unlike the closed-loop circulatory system of the body, a systolic computing system usually has ports into which inputs flow, and ports where the results of the systolic computation are retrieved. Thus a systolic system can be a pipelined system -- input
and output occur with every pulsation. This makes them attractive as peripheral processors attached to the data channel of a host computer. Figure 1-1 illustrates how a special purpose systolic device might form a part of a PDP-11 system. A systolic device may also process a real-time data stream or be a component in a larger special purpose system.

Figure 1-1: A systolic device connected to the UNIBUS of a PDP-11.

This paper deals largely with systolic systems where the underlying network is array structured. (See also Kung and Leiserson [1978].) An array network is attractive for it enjoys simple and regular communication paths. In Section 2, we describe the basic hardware requirements and interconnection schemes for the systolic arrays proposed and discuss the feasibility of building them in VLSI. Section 3 deals with the matrix-vector multiplication problem. Multiplication of two matrices is considered in Section 4. In Section 5, we show that essentially the same systolic arrays for matrix multiplication in Section 4 can be used to find the LU-decomposition of a matrix. Section 6 is concerned with solving triangular linear systems. We show that this problem can be solved by almost the same systolic array for matrix-vector multiplication described in Section 3. Section 7 discusses applications and extensions of the results presented in the previous sections. The applications include the computations of finite impulse response filters, convolutions, and discrete Fourier transforms. Some concluding remarks are given in the last section.

The size of each of our systolic array networks is dependent only on the band width of the band matrix to be processed, and is independent of the length of the band. Thus, a fixed size systolic array can pipeline band matrices with arbitrarily long bands. The pipelining aspect of our arrays is, of course, most effective for band matrices with long bands. Band matrices are interesting in their own right, since many important scientific computations involve band matrices. For these reasons, most of the results in this paper will be presented in terms of their applications to band matrices. All the results apply to dense matrices since a dense
matrix can be viewed as a band matrix having the maximum-possible band width.

2. The Basic Components and Systolic Array Structures

2.1 The Inner Product Step Processor

The single operation common to all the computations considered in this paper is the so-called inner product step, C ← C + A × B. We postulate a processor which has three registers R_A, R_B, and R_C. Each register has two connections, one for input and one for output. Figure 2-1 shows two types of geometries for this processor.

Figure 2-1: Geometries for the inner product step processor.

Type (a) geometry will be used for matrix-vector multiplication and solution of triangular linear systems (Sections 3 and 6), whereas type (b) geometry will be used for matrix multiplication and LU-decomposition (Sections 4 and 5). The processor is capable of performing the inner product step and is called the inner product step processor. We shall define a basic time unit in terms of the operation of this processor. In each unit time interval, the processor shifts the data on its input lines denoted by A, B and C into R_A, R_B and R_C, respectively, computes R_C ← R_C + R_A × R_B, and makes the input values for R_A and R_B together with the new value of R_C available as outputs on the output lines denoted by A, B and C, respectively. All outputs are latched and the logic is clocked so that when one processor is connected to another, the changing output of one during a unit time interval will not interfere with the input to another during this time interval. This is not the only processing element we shall make use of, but it will be the work horse.

A special processor for performing division will be specified later when it is used.
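
To make the unit time interval concrete, here is a small behavioral sketch of the inner product step processor in Python. It is ours, not code from the report, and the class and method names are illustrative: each call to pulse latches the values on the input lines into R_A, R_B and R_C, performs R_C ← R_C + R_A × R_B, and presents the latched A and B together with the updated C on the output lines.

# A minimal sketch (not from the report) of the inner product step processor
# of Section 2.1.  Names are illustrative.

class InnerProductStepProcessor:
    """Latches A, B, C on each pulse and computes R_C <- R_C + R_A * R_B."""

    def __init__(self):
        self.r_a = 0.0   # register R_A
        self.r_b = 0.0   # register R_B
        self.r_c = 0.0   # register R_C

    def pulse(self, a_in, b_in, c_in):
        # Shift: latch the values on the input lines into the registers.
        self.r_a, self.r_b, self.r_c = a_in, b_in, c_in
        # Multiply and add: update R_C.
        self.r_c = self.r_c + self.r_a * self.r_b
        # Outputs: A and B pass through unchanged; C carries the new sum.
        return self.r_a, self.r_b, self.r_c


if __name__ == "__main__":
    p = InnerProductStepProcessor()
    print(p.pulse(2.0, 3.0, 1.0))   # -> (2.0, 3.0, 7.0), i.e. C = 1 + 2 * 3

In hardware the latching is what allows one processor's changing output to coexist with its neighbor's input during the same interval; the sketch only mimics the functional behavior, not the timing discipline.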

2.2 Systolic Arrays

A systolic device is typically composed of many interconnected inner product step processors. The basic network organization we shall adopt is the mesh-connected scheme in which all connections from a processor are to neighboring processors. (See Figure 2-2.)

Figure 2-2: Mesh-connected systolic arrays.
(a) linearly connected; (b) orthogonally connected (ILLIAC IV); (c) hexagonally connected.

The most widely known system based on this organization is the ILLIAC IV (Barnes et al. [1968]). If diagonal connections are added in one direction only, we shall call the resulting scheme hexagonally mesh-connected or hex-connected for short. We shall demonstrate that linearly connected and hex-connected arrays are natural for matrix problems.

Processors lying on the boundary of the systolic array may have external connections to the host memory. Thus, an input/output data path of a boundary processor may sometimes be designated as an external input/output connection for the device. A boundary processor may receive input from the host memory through such an external connection, or it may receive a fixed value such as zero. On the other hand, a boundary processor can send data to the host memory through an external output connection. An output of a boundary processor may sometimes be ignored. This will be designated by omitting the corresponding output line.

In this paper we assume that the processors in a systolic array are synchronous as described in Section 2.1. However, it is possible to view the processors as being asynchronous, each computing its output values when all its inputs are available, as in a data flow model. For the results of this paper we believe the synchronous approach to be more direct and intuitive.

The hardware demands of the systolic arrays in this paper are readily seen to be modest. The processing elements are uniform, interprocessor connections are simple and regular, and external connections are minimized. It is our belief that construction of these systolic arrays will prove to be cost-effective using, for instance, the modern VLSI technology.

3. Matrix-Vector Multiplication on a Linear Systolic Array

We consider the problem of multiplying a matrix A = (a_ij) with a vector x = (x_1, ..., x_n)^T. The elements in the product y = (y_1, ..., y_n)^T can be computed by the following recurrences.

    y_i^(1) = 0,
    y_i^(k+1) = y_i^(k) + a_ik x_k,    k = 1, ..., n,
    y_i = y_i^(n+1).

Suppose A is an n x n band matrix with band width w = p + q - 1. (See Figure 3-1 for the case when p = 2 and q = 3.) Then the above recurrences can be evaluated by pipelining the x_i and y_i through a systolic array consisting of w linearly connected inner product step processors. We illustrate the operation of the systolic array for the band matrix-vector multiplication problem in Figure 3-1. For this case the linearly connected systolic array has four inner product step processors. See Figure 3-2.

The general scheme of the computation can be viewed as follows. The y_i, which are initially zero, are pumped to the left while the x_i are pumped to the right and the a_ij are marching down. (For the general problem of computing Ax + d where d = (d_1, ..., d_n)^T is any given vector, y_i should be initialized as d_i.) All the moves are synchronized. It turns out that each y_i is able to accumulate all its terms, namely, a_i,i-2 x_i-2, a_i,i-1 x_i-1, a_ii x_i and a_i,i+1 x_i+1, before it leaves the network. Figure 3-3 illustrates the first seven pulsations of the systolic array. Note that when y_1 and y_2 are output they have the correct values.

`pa
`
`Vig
`
`
`
`0
`
`ihe
`
`a, a3
`
`ty ty Se
`
`, te 06 Se
`
`a,
`
`:
`
`x,
`
`x?
`
`x,
`
`X
`
`;
`
`y,
`
`Y:
`
`Ys
`
`Ys
`
`.
`
`_-
`
`q
`
`Figure 3-1: Multiplication of a vector by a band matrix with p = 2 and q = 3.

Figure 3-2: The linearly connected systolic array for the matrix-vector multiplication problem in Figure 3-1.

Figure 3-3: The first seven pulsations of the linear systolic array in Figure 3-2.
  Pulse 1: y_1, initialized as zero, is pumped into the fourth processor.
  Pulse 2: x_1 is pumped into the first processor while y_1 is moved left one place. (From now on the x_i and y_i keep moving right and left, respectively.)
  Pulse 3: a_11 enters the second processor, where y_1 is updated by y_1 ← y_1 + a_11 x_1. Thus y_1 = a_11 x_1.
  Pulse 4: a_12 and a_21 enter the first and third processors, respectively. y_1 = a_11 x_1 + a_12 x_2 and y_2 = a_21 x_1.
  Pulse 5: y_1 is pumped out. y_2 = a_21 x_1 + a_22 x_2 and y_3 = a_31 x_1.
  Pulse 6: y_2 = a_21 x_1 + a_22 x_2 + a_23 x_3 and y_3 = a_31 x_1 + a_32 x_2.
  Pulse 7: y_2 is pumped out. y_3 = a_31 x_1 + a_32 x_2 + a_33 x_3 and y_4 = a_42 x_2.

Observe also that at any given time alternate processors are idle. Indeed, by coalescing pairs of adjacent processors, it is possible to use w/2 processors in the network for a general band matrix with band width w.

We now specify the operation of the systolic array more precisely. Assume that the processors are numbered by integers 1, 2, ..., w from the left end processor to the right end processor. Each processor has three registers, R_A, R_x and R_y, which will hold entries in A, x and y, respectively. Initially, all registers contain zeros. Each pulsation of the systolic array consists of the following operations, but for odd numbered pulses only odd numbered processors are activated and for even numbered pulses only even numbered processors are activated.

1. Shift.
   - R_A gets a new element in the band of matrix A.
   - R_x gets the contents of register R_x from the left neighboring node. (The R_x in processor 1 gets a new component of x.)
   - R_y gets the contents of register R_y from the right neighboring node. (Processor 1 outputs its R_y contents and the R_y in processor w gets zero.)

2. Multiply and Add.
   R_y ← R_y + R_A × R_x

Using the type (a) inner product step processor postulated in Section 2, we note that the three shift operations in step 1 can be done simultaneously, and that each pulsation of the systolic array takes a unit of time. Suppose the band width of A is w = p + q - 1. It is readily seen that after w units of time the components of the product y = Ax are pumped out from the left end processor at the rate of one output every two units of time. Therefore, using our systolic network all the n components of y can be computed in 2n + w time units, as compared to the O(wn) time needed for a sequential algorithm on a uniprocessor computer.
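
The data flow above can be checked with a short behavioral simulation. The sketch below is ours, not code from the report: the function name, the representation of A as a full n x n array of numbers, and the particular entry-timing formulas are assumptions chosen to be consistent with the trace in Figure 3-3; the report itself specifies only the processor behavior and the qualitative flow. Each loop iteration is one pulsation: a shift of the two streams followed by the multiply-and-add step in every processor that currently holds a matrix entry together with matching x and y items.

# Behavioral simulation (not from the report) of the linearly connected
# systolic array of Section 3.  A is an n x n band matrix with q - 1
# subdiagonals and p - 1 superdiagonals, so the band width is w = p + q - 1
# and the array has w inner product step processors.  The entry schedule is
# one consistent choice; Figure 3-2 may stagger the streams differently.

def band_matvec_systolic(A, x, n, p, q):
    w = p + q - 1
    x_reg = [None] * w          # rightward-moving stream: (j, x_j) or None
    y_reg = [None] * w          # leftward-moving stream: (i, partial sum) or None
    y_out = {}

    def scheduled_a(k, t):
        """Matrix entry fed to processor k (1..w) at pulse t, if any."""
        d = p - k               # diagonal offset j - i served by processor k
        num = t - q + 2 - d
        if num % 2:
            return None
        i = num // 2
        j = i + d
        if 1 <= i <= n and 1 <= j <= n:
            return (i, j, A[i - 1][j - 1])
        return None

    for t in range(1, 2 * n + 2 * w + 1):
        # Shift: processor 1 outputs its y item, y moves left, x moves right,
        # and new items enter at the boundary processors.
        if y_reg[0] is not None:
            i, val = y_reg[0]
            y_out[i] = val
        new_y = ((t + 1) // 2, 0.0) if t % 2 == 1 and (t + 1) // 2 <= n else None
        y_reg = y_reg[1:] + [new_y]
        jx = t - q + p + 1      # x_j enters processor 1 at pulse 2j + q - p - 1
        new_x = (jx // 2, x[jx // 2 - 1]) if jx % 2 == 0 and 1 <= jx // 2 <= n else None
        x_reg = [new_x] + x_reg[:-1]

        # Multiply and add: y <- y + a * x wherever all three items meet.
        for k in range(1, w + 1):
            a = scheduled_a(k, t)
            if a and x_reg[k - 1] and y_reg[k - 1]:
                i, j, a_ij = a
                y_reg[k - 1] = (i, y_reg[k - 1][1] + a_ij * x_reg[k - 1][1])

    return [y_out[i] for i in range(1, n + 1)]


if __name__ == "__main__":
    # The p = 2, q = 3 shape of Figure 3-1 (entries chosen arbitrarily here).
    n, p, q = 5, 2, 3
    A = [[float(10 * (i + 1) + j + 1) if -(q - 1) <= j - i <= p - 1 else 0.0
          for j in range(n)] for i in range(n)]
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = band_matvec_systolic(A, x, n, p, q)
    direct = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    assert all(abs(u - v) < 1e-9 for u, v in zip(y, direct))
    print(y)

In this sketch the completed components emerge from the left end one every two pulses, so the run finishes within the 2n + w pulses claimed above; the loop bound is simply a generous cap.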

4. Matrix Multiplication on a Hexagonal Systolic Array

This section considers the problem of multiplying two n x n matrices. It is easy to see that the matrix product C = (c_ij) of A = (a_ij) and B = (b_ij) can be computed by the following recurrences.

    c_ij^(1) = 0,
    c_ij^(k+1) = c_ij^(k) + a_ik b_kj,    k = 1, ..., n,
    c_ij = c_ij^(n+1).

Let A and B be n x n band matrices of band width w_1 and w_2, respectively. We show how the recurrences above can be evaluated by pipelining the a_ij, b_ij and c_ij through a systolic array having w_1 w_2 hex-connected inner product step processors. We illustrate the general scheme by considering the matrix multiplication problem depicted in Figure 4-1. The diamond shaped systolic array for this case is shown in Figure 4-2, where processors are hex-connected and data flows are indicated by arrows.

Figure 4-1: Band matrix multiplication.

The elements in the bands of A, B and C are pumped through the systolic network in three directions synchronously. Each c_ij is initialized to zero as it enters the network through the bottom boundaries. (For the general problem of computing AB + D where D = (d_ij) is any given matrix, c_ij should be initialized as d_ij.) One can easily see that with the type (b) inner product step processors described in Section 2, each c_ij is able to accumulate all its terms before it leaves the network through the upper boundaries. Figure 4-3 shows four consecutive pulsations of the hexagonal systolic array. The reader is invited to study the data flow of this problem more closely by making transparencies of the band matrices shown in the figures, and moving them over the network picture as described above.

Let A and B be n x n band matrices of band width w_1 and w_2, respectively. Then a systolic array of w_1 w_2 hex-connected processors can pipeline the matrix multiplication A × B in 3n + min(w_1, w_2) units of time. Note that in any row or column of the network, out of every three consecutive processors, only one is active at a given time. It is possible to use about (w_1 w_2)/3 processors in the network for multiplying two band matrices with band widths w_1 and w_2.
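
The hexagonal data flow is best absorbed from the figures, but the arithmetic the array performs is just the recurrences above. The sketch below is ours, not the report's, and it does not model the data flow; it simply evaluates the recurrences for two band matrices, skipping products a_ik b_kj that fall outside the bands, which is the work the w_1 w_2 processors share. The function name and the full-matrix band representation are illustrative assumptions.

# Sequential evaluation (not from the report) of the recurrences
# c_ij^(k+1) = c_ij^(k) + a_ik * b_kj for two band matrices.
# A has p1 - 1 superdiagonals and q1 - 1 subdiagonals; similarly for B.

def band_matmul_recurrences(A, B, n, p1, q1, p2, q2):
    C = [[0.0] * n for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                # a_ik is in A's band iff -(q1 - 1) <= k - i <= p1 - 1, and
                # b_kj is in B's band iff -(q2 - 1) <= j - k <= p2 - 1.
                if -(q1 - 1) <= k - i <= p1 - 1 and -(q2 - 1) <= j - k <= p2 - 1:
                    C[i - 1][j - 1] += A[i - 1][k - 1] * B[k - 1][j - 1]
    return C

The result agrees entry by entry with an ordinary matrix product of the same matrices, and C is itself a band matrix of band width w_1 + w_2 - 1.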

Figure 4-2: The hex-connected systolic array for the matrix multiplication problem in Figure 4-1.

Figure 4-3: Four pulsations of the hexagonal systolic array in Figure 4-2.

5. The LU-Decomposition of a Matrix on a Hexagonal Systolic Array

The problem of factoring a matrix A into lower and upper triangular matrices L and U is called LU-decomposition. Figure 5-1 illustrates the LU-decomposition of a band matrix with p = 4 and q = 4. Once the L and U factors are known, it is relatively easy to invert A or solve the linear system Ax = b. We deal with the latter problem in Section 6. This section describes a hexagonal systolic array for computing LU-decompositions.

Figure 5-1: The LU-decomposition of a band matrix.

We assume that matrix A has the property that its LU-decomposition can be done by Gaussian elimination without pivoting. (This is true, for example, when A is a symmetric positive-definite, or an irreducible, diagonally dominant matrix.) The triangular matrices L = (l_ij) and U = (u_ij) are evaluated according to the following recurrences.

    a_ij^(1) = a_ij,
    a_ij^(k+1) = a_ij^(k) + l_ik (-u_kj),

    l_ik = 0                  if i < k,
           1                  if i = k,
           a_ik^(k) (1/u_kk)  if i > k,

    u_kj = 0         if k > j,
           a_kj^(k)  if k <= j.

We show that the evaluation of these recurrences can be pipelined on a systolic array of hex-connected processors. A global view of this pipelined computation is shown in Figure 5-2 for the LU-decomposition problem depicted in Figure 5-1. The systolic array in Figure 5-2 is constructed as follows. The processors below the upper boundaries are the standard type (b) inner product step processors and are hex-connected exactly the same as the matrix multiplication network presented in Section 4. The processor at the top, denoted by a circle, is a special processor. It computes the reciprocal of its input and pumps the result southwest, and also pumps the same input northward unchanged. The other processors on the upper boundaries are again type (b) inner product step processors, but their orientation is changed: the ones on the upper left boundary are rotated 120 degrees clockwise; the ones on the upper right boundary are rotated 120 degrees counterclockwise.

The flow of data on the systolic array is indicated by arrows in the figure. As in the hexagonal systolic array for matrix multiplication, each processor only operates every third time pulse. Figure 5-3 illustrates four consecutive pulsations of the systolic array. Note that in the figure, because A is a band matrix with p = 4 and q = 4 we have that a_(i+3,i)^(k) = a_(i+3,i) and a_(i,i+3)^(k) = a_(i,i+3) for 1 <= k <= i and i >= 2. Thus a_52, for example, can be viewed as a_52^(2) when it enters the network.

There are several equivalent systolic arrays that reflect only minor changes to the network presented in this section. For example, the elements of L and U can be retrieved as output in a number of different ways. Also, the "-1" input to the network can be changed to a "+1" if the special processor at the top of the network computes minus the reciprocal of its input.

If A is an n x n band matrix with band width w = p + q - 1, a systolic array having no more than pq hex-connected processors can compute the LU-decomposition of A in 3n + min(p, q) units of time. If A is an n x n dense matrix, this means that n^2 hex-connected processors can compute the L and U matrices in 4n units of time which includes I/O time.

The remarkable fact that the matrix multiplication network forms a major part of the LU-decomposition network is due to the similarity of their defining recurrences. In any row or column of the LU-decomposition systolic array, only one out of every three consecutive processors is active at a given time. As we observed for matrix multiplication, the number of processors can be reduced to about pq/3.
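
The recurrences can also be checked numerically without any array machinery. The sketch below is ours, not the report's, and the names are illustrative; it evaluates the recurrences sequentially, assuming as the report does that A admits LU-decomposition without pivoting, and then verifies that the product of the computed L and U reproduces A.

# Sequential evaluation (not from the report) of the LU recurrences of
# Section 5: a_ij^(k+1) = a_ij^(k) + l_ik * (-u_kj), with l and u as above.
# Assumes A can be decomposed by Gaussian elimination without pivoting.

def lu_recurrences(A, n):
    a = [row[:] for row in A]                 # a holds a^(k), updated in place
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        U[k][k:] = a[k][k:]                   # u_kj = a_kj^(k) for k <= j
        L[k][k] = 1.0                         # l_kk = 1
        for i in range(k + 1, n):
            L[i][k] = a[i][k] * (1.0 / U[k][k])    # l_ik = a_ik^(k) * (1/u_kk)
            for j in range(k, n):
                a[i][j] += L[i][k] * (-U[k][j])    # a_ij^(k+1)
    return L, U


if __name__ == "__main__":
    # A small diagonally dominant example, so no pivoting is needed.
    A = [[4.0, 1.0, 0.0],
         [1.0, 4.0, 1.0],
         [0.0, 1.0, 4.0]]
    L, U = lu_recurrences(A, 3)
    prod = [[sum(L[i][k] * U[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]
    assert all(abs(prod[i][j] - A[i][j]) < 1e-12 for i in range(3) for j in range(3))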

Figure 5-2: The hex-connected systolic array for pipelining the LU-decomposition of the band matrix in Figure 5-1.

Figure 5-3: Four pulsations of the hexagonal systolic array in Figure 5-2.

6. Solving a Triangular Linear System on a Linear Systolic Array

Suppose that we want to solve a linear system Ax = b. Then after having done the LU-decomposition of A (e.g., by methods described in Section 5), we still have to solve two triangular linear systems Ly = b and Ux = y. This section concerns itself with the solution of triangular linear systems. An upper triangular linear system can always be rewritten as a lower triangular linear system. Without loss of generality, this section deals exclusively with lower triangular linear systems.

Let A = (a_ij) be a nonsingular n x n band lower triangular matrix. Suppose that A and an n-vector b = (b_1, ..., b_n)^T are given. The problem is to compute x = (x_1, ..., x_n)^T such that Ax = b. The vector x can be computed by forward substitution:

    y_i^(1) = 0,
    y_i^(k+1) = y_i^(k) + a_ik x_k,    k = 1, ..., i - 1,
    x_i = (b_i - y_i^(i)) / a_ii.

Figure 6-1: The band (lower) triangular linear system where q = 4.

Suppose that A is a band matrix with band width w = q. (See Figure 6-1 for the case when q = 4.) Then the above recurrences can be evaluated by a systolic array similar to that used for band matrix-vector multiplication in Section 3. (Observe the similarity of the defining recurrences for these two problems.) We illustrate our result by considering the linear system problem in Figure 6-1. For this case, the systolic array is described in Figure 6-2.

Figure 6-2: The linearly connected systolic array for solving the triangular linear system in Figure 6-1.

The y_i, which are initially zero, are forced leftward through the systolic array while the x_i, a_ij and b_i are pumped as indicated in Figure 6-2. The left end processor is special in that it performs x_i ← (b_i - y_i)/a_ii. (In fact, the special processor introduced in Section 5 to solve the LU-decomposition problem is a special case of this more general processor.) Each y_i accumulates inner product terms in the rest of the processors as it moves to the left. At the time y_i reaches the left end processor it has the value a_i1 x_1 + a_i2 x_2 + ... + a_i,i-1 x_i-1, and, consequently, the x_i computed by x_i ← (b_i - y_i)/a_ii at the processor will have the correct value. Figure 6-3 demonstrates the first seven pulsations of the systolic array. From the figure one can check that the final values of x_1, x_2, x_3 and x_4 are all correct. With this systolic array we can solve an n x n band triangular linear system with band width w = q in 2n + q units of time. As we observed for the matrix-vector multiplication problem, the number of processors required by the array can be reduced to w/2.
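
As in Section 3, the arithmetic can be checked independently of the array. The sketch below is ours, with illustrative names; it is a direct sequential rendering of the forward substitution recurrences for a band lower triangular matrix whose band width is w = q, i.e. a diagonal plus q - 1 subdiagonals.

# Sequential forward substitution (not from the report) following the
# recurrences of Section 6 for a band lower triangular matrix of band
# width w = q.

def band_forward_substitution(A, b, n, q):
    x = [0.0] * n
    for i in range(1, n + 1):
        y = 0.0                                    # y_i^(1) = 0
        for k in range(max(1, i - q + 1), i):      # only in-band terms contribute
            y += A[i - 1][k - 1] * x[k - 1]        # y_i^(k+1) = y_i^(k) + a_ik x_k
        x[i - 1] = (b[i - 1] - y) / A[i - 1][i - 1]   # x_i = (b_i - y_i^(i)) / a_ii
    return x


if __name__ == "__main__":
    # The q = 4 shape of Figure 6-1 (entries chosen arbitrarily here).
    A = [[2.0, 0.0, 0.0, 0.0, 0.0],
         [1.0, 3.0, 0.0, 0.0, 0.0],
         [1.0, 1.0, 4.0, 0.0, 0.0],
         [1.0, 1.0, 1.0, 5.0, 0.0],
         [0.0, 1.0, 1.0, 1.0, 6.0]]
    b = [2.0, 4.0, 6.0, 8.0, 9.0]
    print(band_forward_substitution(A, b, 5, 4))   # -> [1.0, 1.0, 1.0, 1.0, 1.0]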

7. Applications and Comments

7.1 Variants of the Systolic Array

If more information is available about the specific matrices involved, an optimized version of the systolic arrays presented above can be used. It is important that the reader understands the basic principles so that he can construct appropriate variants for his specific problems. No attempt is made here to list all the possible variants.

As pointed out in Section 1, although most of our illustrations are of band matrices, all the systolic arrays work for regular n x n dense matrices. In this case the band width of the matrix is w = 2n - 1. If the band width of a matrix is so large that it requires more processors than a given array provides, then one should decompose the matrix and solve each subproblem on the network. Thus, for example, the matrix multiplication of two n x n matrices or the LU-decomposition of an n x n matrix can be done in O(n^3/k^2) time on a k x k systolic array.

One can often reduce the number of processors required by a systolic array if the matrix is known to be sparse or symmetric. For example, the matrices arising from a set of finite differences or finite elements approximations to differential equations are usually "sparse band matrices". These are band matrices whose nonzero entries appear only in a few of those lines in the band which are parallel to the diagonal. In this case by introducing proper delays to each processor for shifting its data to its neighbors, the number of processors required by the systolic array in Section 3 can be reduced to the number of those diagonals which contain nonzero entries. This variant is useful for performing iterative methods involving sparse band matrices. Another example concerns the LU-decomposition problem considered in Section 5. If matrix A is symmetric positive definite, then it is possible to use only the left portion of the hex-connected network, since in this case U is simply DL^T where D is the diagonal matrix (a_kk^(k)).

Figure 6-3: The first seven pulsations of the linear systolic array in Figure 6-2.

The optimal choice of the size of the systolic network to solve a particular problem depends upon not only the problem but also the memory bandwidth to the host computer. For achieving high performance, it is desirable to have as many processors as possible in the network, provided they can all be kept busy doing useful computations.

It is possible to use our systolic arrays to solve some nonnumerical problems when appropriate interpretations are given to the addition (+) and multiplication (x) operations. For example, some pattern matching problems can be viewed as matrix problems with comparison and Boolean operations. It is possible to store a dynamically changing data structure in a systolic array so that an order statistic can always be determined in constant time. We shall report these results in a future paper. It can be instructive to view the + and x operations as operations in an abstract algebraic structure such as a semiring and then to examine how our results hold in such an abstract setting.
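
As a small illustration of that last remark (the example is ours, not the report's), the inner product step survives the change of algebra: replacing (+, x) by (min, +) turns the matrix-vector recurrence of Section 3 into a shortest-path relaxation. In the sketch below, W[i][k] is the length of the edge from vertex k to vertex i (INF meaning no edge), and n - 1 repetitions of the min-plus "matrix-vector multiplication" yield single-source shortest path distances.

# Illustrative only: the inner product step C <- C + A * B reinterpreted
# over the (min, +) semiring, turning the Section 3 recurrence into a
# shortest-path relaxation.

INF = float("inf")

def min_plus_step(c, a, b):
    """Inner product step over the (min, +) semiring."""
    return min(c, a + b)

def min_plus_matvec(W, d):
    """One min-plus 'matrix-vector multiplication' of distances d by weights W."""
    n = len(W)
    out = []
    for i in range(n):
        y = d[i]                        # the zero diagonal keeps d[i] a candidate anyway
        for k in range(n):
            y = min_plus_step(y, W[i][k], d[k])
        out.append(y)
    return out


if __name__ == "__main__":
    # W[i][k] = length of the edge from vertex k to vertex i; 0 on the diagonal.
    W = [[0,   3,   INF, 7],
         [8,   0,   2,   INF],
         [5,   INF, 0,   1],
         [2,   INF, INF, 0]]
    d = [0, INF, INF, INF]              # distances from vertex 0
    for _ in range(len(W) - 1):         # n - 1 relaxation sweeps
        d = min_plus_matvec(W, d)
    print(d)                            # -> [0, 5, 3, 2]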

7.2 Convolution, Filter, and Discrete Fourier Transform

There are a number of important problems which can be formulated as matrix-vector multiplication problems and thus can be