
Systolic architectures, which permit multiple computations for each memory access, can speed execution of compute-bound problems without increasing I/O requirements.

Why Systolic Architectures?

H. T. Kung
Carnegie-Mellon University

High-performance, special-purpose computer systems are typically used to meet specific application requirements or to off-load computations that are especially taxing to general-purpose computers. As hardware cost and size continue to drop and processing requirements become well-understood in areas such as signal and image processing, more special-purpose systems are being constructed. However, since most of these systems are built on an ad hoc basis for specific tasks, methodological work in this area is rare. Because the knowledge gained from individual experiences is neither accumulated nor properly organized, the same errors are repeated. I/O and computation imbalance is a notable example—often, the fact that I/O interfaces cannot keep up with device speed is discovered only after constructing a high-speed, special-purpose device.

We intend to help correct this ad hoc approach by providing a general guideline—specifically, the concept of systolic architecture, a general methodology for mapping high-level computations into hardware structures. In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart. The system works like an automobile assembly line where different people work on the same car at different times and many cars are assembled simultaneously. An assembly line is always linear, however, and systolic systems are sometimes two-dimensional. They can be rectangular, triangular, or hexagonal to make use of higher degrees of parallelism. Moreover, to implement a variety of computations, data flow in a systolic system may be at multiple speeds in multiple directions—both inputs and (partial) results flow, whereas only results flow in classical pipelined systems. Generally speaking, a systolic system is easy to implement because of its regularity and easy to reconfigure (to meet various outside constraints) because of its modularity.

The systolic architectural concept was developed at Carnegie-Mellon University [1,2], and versions of systolic processors are being designed and built by several industrial and governmental organizations [3-10]. This article reviews the basic principle of systolic architectures and explains why they should result in cost-effective, high-performance special-purpose systems for a wide range of problems.

Key architectural issues in designing special-purpose systems

Roughly, the cycle for developing a special-purpose system can be divided into three phases—task definition, design, and implementation. During task definition, some system performance bottleneck is identified, and a decision on whether or not to resolve it with special-purpose hardware is made. The evaluation required for task definition is most fundamental, but since it is often application-dependent, we will concentrate only on architectural issues related to the design phase and will assume routine implementation.

Simple and regular design. Cost-effectiveness has always been a chief concern in designing special-purpose systems; their cost must be low enough to justify their limited applicability. Costs can be classified as nonrecurring (design) and recurring (parts) costs. Part costs are dropping rapidly due to advances in integrated-circuit technology, but this advantage applies equally to both special-purpose and general-purpose systems. Furthermore, since special-purpose systems are seldom produced in large quantities, part costs are less important than design costs. Hence, the design cost of a special-purpose system must be relatively small for it to be more attractive than a general-purpose approach.

Fortunately, special-purpose design costs can be reduced by the use of appropriate architectures. If a structure can truly be decomposed into a few types of simple substructures or building blocks, which are used repetitively with simple interfaces, great savings can be achieved. This is especially true for VLSI designs, where a single chip comprises hundreds of thousands of components. To cope with that complexity, simple and regular designs, similar to some of the techniques used in constructing large software systems, are essential [11]. In addition, special-purpose systems based on simple, regular designs are likely to be modular and therefore adjustable to various performance goals—that is, system cost can be made proportional to the performance required. This suggests that meeting the architectural challenge for simple, regular designs yields cost-effective special-purpose systems.

Concurrency and communication. There are essentially two ways to build a fast computer system. One is to use fast components, and the other is to use concurrency. The last decade has seen an order of magnitude decrease in the cost and size of computer components but only an incremental increase in component speed [12]. With current technology, tens of thousands of gates can be put in a single chip, but no gate is much faster than its TTL counterpart of 10 years ago. Since the technological trend clearly indicates a diminishing growth rate for component speed, any major improvement in computation speed must come from the concurrent use of many processing elements. The degree of concurrency in a special-purpose system is largely determined by the underlying algorithm. Massive parallelism can be achieved if the algorithm is designed to introduce high degrees of pipelining and multiprocessing. When a large number of processing elements work simultaneously, coordination and communication become significant—especially with VLSI technology, where routing costs dominate the power, time, and area required to implement a computation [13]. The issue here is to design algorithms that support high degrees of concurrency, and in the meantime to employ only simple, regular communication and control to enable efficient implementation.

Balancing computation with I/O. Since a special-purpose system typically receives data and outputs results through an attached host, I/O considerations influence overall performance. (The host in this context can mean a computer, a memory, a real-time device, etc. In practice, the special-purpose system may actually input from one "physical" host and output to another.) The ultimate performance goal of a special-purpose system is—and should be no more than—a computation rate that balances the available I/O bandwidth with the host. Since an accurate a priori estimate of available I/O bandwidth in a complex system is usually impossible, the design of a special-purpose system should be modular so that its structure can be easily adjusted to match a variety of I/O bandwidths.

Figure 1. Basic principle of a systolic system. Instead of: a memory (100-ns cycle) feeding a single processing element, giving at most 5 million operations per second; we have: the same memory feeding a systolic array of PEs, making 30 MOPS possible.

Suppose that the I/O bandwidth between the host and a special-purpose system is 10 million bytes per second, a rather high bandwidth for present technology. Assuming that at least two bytes are read from or written to the host for each operation, the maximum rate will be only 5 million operations per second, no matter how fast the special-purpose system can operate (see Figure 1). Orders of magnitude improvements on this throughput are possible only if multiple computations are performed per I/O access. However, the repetitive use of a data item requires it to be stored inside the system for a sufficient length of time. Thus, the I/O problem is related not only to the available I/O bandwidth, but also to the available memory internal to the system. The question then is how to arrange a computation together with an appropriate memory structure so that computation time is balanced with I/O time.
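
To make the arithmetic concrete, here is a minimal sketch in Python (ours, not part of the article; the six-cell reuse factor is an assumption chosen to reproduce the 30 MOPS of Figure 1):

    def max_op_rate(io_bytes_per_sec, bytes_per_op, ops_per_item=1):
        """Ceiling on operation rate imposed by host I/O bandwidth.
        ops_per_item is the number of operations performed on each data
        item brought across the I/O interface: 1 for a single
        conventional PE, greater than 1 when an array reuses each item."""
        return io_bytes_per_sec / bytes_per_op * ops_per_item

    # 10 million bytes/s, two bytes per operation: at most 5 million ops/s.
    print(max_op_rate(10e6, 2))                    # 5000000.0
    # Reusing each fetched item at six cells (a hypothetical reuse factor
    # chosen to match Figure 1) lifts the ceiling six-fold, to 30 MOPS.
    print(max_op_rate(10e6, 2, ops_per_item=6))    # 30000000.0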

The I/O problem becomes especially severe when a large computation is performed on a small special-purpose system. In this case, the computation must be decomposed. Executing subcomputations one at a time may require a substantial amount of I/O to store or retrieve intermediate results. Consider, for example, performing the n-point fast Fourier transform using an S-point device when n is large and S is small. Figure 2 depicts the n-point FFT computation and a decomposition scheme for n = 16 and S = 4. Note that each subcomputation block is sufficiently small so that it can be handled by the 4-point device. During execution, results of a block must be temporarily sent to the host and later retrieved to be combined with results of other blocks as they become available. With the decomposition scheme shown in Figure 2b, the total number of I/O operations is O(n log n/log S). In fact, it has been shown that, to perform the n-point FFT with a device of O(S) memory, at least this many I/O operations are needed for any decomposition scheme [14]. Thus, for the n-point FFT problem, an S-point device cannot achieve more than an O(log S) speed-up ratio over the conventional O(n log n) software implementation time, and since it is a consequence of the I/O consideration, this upper bound holds independently of device speed. Similar upper bounds have been established for speed-up ratios achievable by devices for other computations such as sorting and matrix multiplication [14,15]. Knowing the I/O-imposed performance limit helps prevent overkill in the design of a special-purpose device.
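
Restated in symbols (a sketch of the argument, following the I/O lower bound of [14]):

    \begin{aligned}
    T_{\mathrm{device}} \;&\geq\; \Omega\!\left(\frac{n \log n}{\log S}\right)
      && \text{(I/O operations alone, with } O(S) \text{ device memory)} \\
    T_{\mathrm{software}} \;&=\; O(n \log n)
      && \text{(conventional FFT time)} \\
    \text{speed-up} \;&=\; \frac{T_{\mathrm{software}}}{T_{\mathrm{device}}}
      \;=\; O(\log S), \quad \text{independently of device speed.}
    \end{aligned}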

In practice, problems are typically "larger" than special-purpose devices. Therefore, questions such as how a computation can be decomposed to minimize I/O, how the I/O requirement is related to the size of a special-purpose system and its memory, and how the I/O bandwidth limits the speed-up ratio achievable by a special-purpose system present another set of challenges to the system architect.

Systolic architectures: the basic principle

As a solution to the above challenges, we introduce systolic architectures, an architectural concept originally proposed for VLSI implementation of some matrix operations. Examples of systolic architectures follow in the next section, which contains a walk-through of a family of designs for the convolution computation.

A systolic system consists of a set of interconnected cells, each capable of performing some simple operation. Because simple, regular communication and control structures have substantial advantages over complicated ones in design and implementation, cells in a systolic system are typically interconnected to form a systolic array or a systolic tree. Information in a systolic system flows between cells in a pipelined fashion, and communication with the outside world occurs only at the "boundary cells." For example, in a systolic array, only those cells on the array boundaries may be I/O ports for the system.

Computational tasks can be conceptually classified into two families—compute-bound computations and I/O-bound computations. In a computation, if the total number of operations is larger than the total number of input and output elements, then the computation is compute-bound; otherwise it is I/O-bound. For example, the ordinary matrix-matrix multiplication algorithm represents a compute-bound task, since every entry in a matrix is multiplied by all entries in some row or column of the other matrix. Adding two matrices, on the other hand, is I/O-bound, since the total number of adds is not larger than the total number of entries in the two matrices. It should be clear that any attempt to speed up an I/O-bound computation must rely on an increase in memory bandwidth. Memory bandwidth can be increased by the use of either fast components (which could be expensive) or interleaved memories (which could create complicated memory management problems). Speeding up a compute-bound computation, however, may often be accomplished in a relatively simple and inexpensive manner, that is, by the systolic approach.
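
The two examples can be made quantitative. For n x n matrices, the operation-to-I/O ratios work out as follows (a back-of-the-envelope count, ours rather than the article's):

    \begin{aligned}
    \text{matrix multiplication:}\quad
      &\frac{\#\text{ops}}{\#\text{I/O}}
       \;=\; \frac{O(n^{3})}{3n^{2}} \;=\; O(n)
      && \Rightarrow\ \text{compute-bound}, \\
    \text{matrix addition:}\quad
      &\frac{\#\text{ops}}{\#\text{I/O}}
       \;=\; \frac{n^{2}}{3n^{2}} \;=\; O(1)
      && \Rightarrow\ \text{I/O-bound}.
    \end{aligned}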

The basic principle of a systolic architecture, a systolic array in particular, is illustrated in Figure 1. By replacing a single processing element with an array of PEs, or cells in the terminology of this article, a higher computation throughput can be achieved without increasing memory bandwidth. The function of the memory in the diagram is analogous to that of the heart; it "pulses" data (instead of blood) through the array of cells. The crux of this approach is to ensure that once a data item is brought out from the memory it can be used effectively at each cell it passes while being "pumped" from cell to cell along the array. This is possible for a wide class of compute-bound computations where multiple operations are performed on each data item in a repetitive manner.

Being able to use each input data item a number of times (and thus achieving high computation throughput with only modest memory bandwidth) is just one of the many advantages of the systolic approach. Other advantages, such as modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting and fan-in, and (possibly) fast response time, will be illustrated in various systolic designs in the next section.

A family of systolic designs for the convolution computation

To provide concrete examples of various systolic structures, this section presents a family of systolic designs for the convolution problem, which is defined as follows:

Given the sequence of weights {w_1, w_2, ..., w_k} and the input sequence {x_1, x_2, ..., x_n}, compute the result sequence {y_1, y_2, ..., y_{n+1-k}} defined by

    y_i = w_1 x_i + w_2 x_{i+1} + ... + w_k x_{i+k-1}.
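
As a functional specification for the designs that follow, a direct evaluation of this definition might look as follows (a sketch; the function name and the 0-based indexing are ours, not the article's):

    def convolve(w, x):
        """Direct evaluation of y_i = w_1 x_i + ... + w_k x_{i+k-1}
        (with 0-based indices), producing the n+1-k results. Every
        systolic design below must reproduce exactly this output."""
        k, n = len(w), len(x)
        return [sum(w[j] * x[i + j] for j in range(k))
                for i in range(n + 1 - k)]

    assert convolve([1, 2, 3], [1, 2, 3, 4]) == [14, 20]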

Figure 2. (a) 16-point fast-Fourier-transform graph; (b) decomposing the FFT computation with n = 16 and S = 4, using a 4-point device.

Figure 3. Design B1: systolic convolution array (a) and cell (b) where x_i's are broadcast, w_i's stay, and y_i's move systolically. Cell operation: y_out <- y_in + w * x_in.

Figure 4. Design B2: systolic convolution array (a) and cell (b) where x_i's are broadcast, y_i's stay, and w_i's move systolically.

We consider the convolution problem because it is a simple problem with a variety of enlightening systolic solutions, because it is an important problem in its own right, and more importantly, because it is representative of a wide class of computations suited to systolic designs. The convolution problem can be viewed as a problem of combining two data streams, w_i's and x_i's, in a certain manner (for example, as in the above equation) to form a resultant data stream of y_i's. This type of computation is common to a number of computation routines, such as filtering, pattern matching, correlation, interpolation, polynomial evaluation (including discrete Fourier transforms), and polynomial multiplication and division. For example, if multiplication and addition are interpreted as comparison and boolean AND, respectively, then the convolution problem becomes the pattern matching problem [1]. Architectural concepts for the convolution problem can thus be applied to these other problems as well.
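
To see the reinterpretation concretely, one can swap the (multiply, add) pair for (compare, AND) in the reference routine above; a small sketch (names ours):

    def matches(pattern, text):
        """Convolution over the (==, AND) 'semiring': position i yields
        True exactly when pattern occurs at text[i:i+k]."""
        k = len(pattern)
        return [all(pattern[j] == text[i + j] for j in range(k))
                for i in range(len(text) + 1 - k)]

    assert matches("aba", "cabab") == [False, True, False]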

The convolution problem is compute-bound, since each input x_i is to be multiplied by each of the k weights. If the x_i is input separately from memory for each multiplication, then when k is large, memory bandwidth becomes a bottleneck, precluding a high-performance solution. As indicated earlier, a systolic architecture resolves this I/O bottleneck by making multiple use of each x_i fetched from the memory. Based on this principle, several systolic designs for solving the convolution problem are described below. For simplicity, all illustrations assume that k = 3.

(Semi-) systolic convolution arrays with global data communication. If an x_i, once brought out from the memory, is broadcast to a number of cells, then the same x_i can be used by all the cells. This broadcasting technique is probably one of the most obvious ways to make multiple use of each input element. The opposite of broadcasting is fan-in, through which data items from a number of cells can be collected. The fan-in technique can also be used in a straightforward manner to resolve the I/O bottleneck problem. In the following, we describe systolic designs that utilize broadcasting and fan-in.

Design B1—broadcast inputs, move results, weights stay. The systolic array and its cell definition are depicted in Figure 3. Weights are preloaded to the cells, one at each cell, and stay at the cells throughout the computation. Partial results y_i move systolically from cell to cell in the left-to-right direction, that is, each of them moves over the cell to its right during each cycle. At the beginning of a cycle, one x_i is broadcast to all the cells and one y_i, initialized as zero, enters the left-most cell. During cycle one, w_1 x_1 is accumulated to y_1 at the left-most cell, and during cycle two, w_1 x_2 and w_2 x_2 are accumulated to y_2 and y_1 at the left-most and middle cells, respectively. Starting from cycle three, the final (and correct) values of y_1, y_2, ... are output from the right-most cell at the rate of one y_i per cycle. The basic principle of this design was previously proposed for circuits to implement a pattern matching processor [16] and for circuits to implement polynomial multiplication [17-20].
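
A cycle-level simulation of design B1 can serve as a check on the timing just described (a sketch with 0-based indexing; register-transfer details such as latching are abstracted away):

    def convolve_B1(w, x):
        """Design B1: broadcast inputs, move results, weights stay.
        Cell j (left to right) holds weight w[j]; each cycle the
        broadcast x reaches every cell while partial y's shift right."""
        k = len(w)
        cells = [0] * k                      # the y register of each cell
        out = []
        for t, xt in enumerate(x):           # one iteration = one cycle
            cells = [0] + cells[:-1]         # y's move one cell right;
                                             # a fresh y = 0 enters on the left
            cells = [y + wj * xt for y, wj in zip(cells, w)]  # broadcast MAC
            if t >= k - 1:                   # from cycle k on, the right-most
                out.append(cells[-1])        # cell emits one finished y
        return out

    assert convolve_B1([1, 2, 3], [1, 1, 1, 1]) == [6, 6]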

Design B2—broadcast inputs, move weights, results stay. In design B2 (see Figure 4), each y_i stays at a cell to accumulate its terms, allowing efficient use of available multiplier-accumulator hardware. (Indeed, this design is described in an application booklet for the TRW multiplier-accumulator chips [21].) The weights circulate around the array of cells, and the first weight w_1 is associated with a tag bit that signals the accumulator to output and reset its contents.* In design B1 (Figure 3), the systolic path for moving the y_i's may be considerably wider than that for moving the w_i's in design B2, because for numerical accuracy y_i's typically carry more bits than w_i's. The use of multiplier-accumulators in design B2 may also help increase precision of the results, since extra bits can be kept in these accumulators with modest cost. Design B1, however, does have the advantage of not requiring a separate bus (or other global network), denoted by a dashed line in Figure 4, for collecting outputs from individual cells.

*To avoid complicated pictures, control structures such as the use of tag bits to gate outputs from cells are omitted from the diagrams of this article.
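
Design B2's circulating weights can be sketched the same way. The initial weight layout below (w_1 at the left-most cell, then w_k, ..., w_2) is our reconstruction of a schedule that makes the timing work out; we dump each accumulator when w_k completes its sum, which is equivalent to the tagged w_1 triggering output on the following cycle:

    def convolve_B2(w, x):
        """Design B2: x broadcast, y stays, w circulates. ring[p] is the
        index of the weight currently at cell p."""
        k, n = len(w), len(x)
        ring = [0] + list(range(k - 1, 0, -1))   # layout w1, wk, ..., w2
        acc = [0] * k                            # stationary accumulators
        out = [0] * (n + 1 - k)
        for t, xt in enumerate(x):               # one iteration = one cycle
            for p in range(k):
                acc[p] += w[ring[p]] * xt        # multiply-accumulate in place
                if ring[p] == k - 1:             # wk just used: a y finished
                    i = t - k + 1                # it is y_{t-k+1} (0-based)
                    if 0 <= i < len(out):
                        out[i] = acc[p]          # collect on the output bus
                    acc[p] = 0                   # reset the accumulator
            ring = [ring[-1]] + ring[:-1]        # weights move one cell right
        return out

    assert convolve_B2([1, 2, 3], [1, 1, 1, 1]) == [6, 6]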

Design F—fan-in results, move inputs, weights stay. If we consider the vector of weights (w_k, w_{k-1}, ..., w_1) as being fixed in space and the input vector (x_n, x_{n-1}, ..., x_1) as sliding over the weights in the left-to-right direction, then the convolution problem is one that computes the inner product of the weight vector and the section of input vector it overlaps. This view suggests the systolic array shown in Figure 5. Weights are preloaded to the cells and stay there throughout the computation. During a cycle, all x_i's move one cell to the right, multiplications are performed at all cells simultaneously, and their results are fanned-in and summed using an adder to form a new y_i. When the number of cells, k, is large, the adder can be implemented as a pipelined adder tree to avoid large delays in each cycle. Designs of this type using unbounded fan-in have been known for quite a long time, for example, in the context of signal processing and in the context of pattern matching.
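
Design F's dataflow reduces, per cycle, to one inner product between the stationary weights and the window of x's currently in the array; a sketch (the adder tree is modeled as a plain sum):

    def convolve_F(w, x):
        """Design F: fan-in results, move inputs, weights stay. Weights
        are laid out as (wk, ..., w1) left to right; the x stream slides
        one cell right per cycle, and the fan-in adder emits one finished
        y per cycle once the array is full."""
        k = len(w)
        stay = w[::-1]                       # stationary weights per cell
        window = [0] * k                     # x value currently at each cell
        out = []
        for t, xt in enumerate(x):
            window = [xt] + window[:-1]      # all x's move one cell right
            if t >= k - 1:                   # array full: fan-in k products
                out.append(sum(s * v for s, v in zip(stay, window)))
        return out

    assert convolve_F([1, 2, 3], [1, 2, 3, 4]) == [14, 20]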

Figure 5. Design F: systolic convolution array (a) and cell (b) where w_i's stay, x_i's move systolically, and y_i's are formed through the fan-in of results from all the cells.

Figure 6. Design R1: systolic convolution array (a) and cell (b) where y_i's stay and x_i's and w_i's move in opposite directions systolically.

(Pure-) systolic convolution arrays without global data communication. Although global broadcasting or fan-in solves the I/O bottleneck problem, implementing it in a modular, expandable way presents another problem. Providing (or collecting) a data item to (or from) all the cells of a systolic array, during each cycle, requires the use of a bus or some sort of tree-like network. As the number of cells increases, wires become long for either a bus or tree structure; expanding these non-local communication paths to meet the increasing load is difficult without slowing down the system clock. This engineering difficulty of extending global networks is significant at chip, board, and higher levels of a computer system. Fortunately, as will be demonstrated below, systolic convolution arrays without global data communication do exist. Potentially, these arrays can be extended to include an arbitrarily large number of cells without encountering engineering difficulties (the problem of synchronizing a large systolic array is discussed later).

Design R1—results stay, inputs and weights move in opposite directions. In design R1 (see Figure 6) each partial result y_i stays at a cell to accumulate its terms. The x_i's and w_i's move systolically in opposite directions such that when an x meets a w at a cell, they are multiplied and the resulting product is accumulated to the y staying at that cell. To ensure that each x_i is able to meet every w_i, consecutive x_i's on the x data stream are separated by two cycle times, and so are the w_i's on the w data stream.

Like design B2, design R1 can make efficient use of available multiplier-accumulator hardware; it can also use a tag bit associated with the first weight, w_1, to trigger the output and reset the accumulator contents of a cell.

Figure 7. Design R2: systolic convolution array (a) and cell (b) where y_i's stay and x_i's and w_i's both move in the same direction but at different speeds.

Design R1 has the advantage that it does not require a bus, or any other global network, for collecting output from cells; a systolic output path (indicated by broken arrows in Figure 6) is sufficient. Because consecutive w_i's are well separated by two cycle times, a potential conflict—that more than one y_i may reach a single latch on the systolic output path simultaneously—cannot occur. It can also be easily checked that the y_i's will output from the systolic output path in the natural ordering y_1, y_2, .... The basic idea of this design, including that of the systolic output path, has been used to implement a pattern matching chip [1].

Notice that in Figure 6 only about one-half the cells are doing useful work at any time. To fully utilize the potential throughput, two independent convolution computations can be interleaved in the same systolic array, but cells in the array would have to be modified slightly to support the interleaved computation. For example, an additional accumulator would be required at each cell to hold a temporary result for the other convolution computation.
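
The two-cycle spacing of design R1 can be checked with a small simulation. For clarity we give every result its own cell and omit the systolic output path; the index arithmetic below (which x and which w occupy cell p at cycle t) is our reconstruction of the schedule:

    def convolve_R1(w, x):
        """Design R1: results stay; inputs and weights move in opposite
        directions, two cycles apart on each stream, so every needed x
        meets every needed w. Cell p ends up holding y[m-1-p]."""
        k, n = len(w), len(x)
        m = n + 1 - k                       # number of results
        acc = [0] * m                       # stationary accumulator per cell
        for t in range(2 * (m + n)):        # enough cycles for all meetings
            for p in range(m):
                if (t - p) % 2:             # streams occupy a cell only on
                    continue                # alternating cycles
                i = (t - p) // 2            # index of the x at cell p, if any
                j = (t + p) // 2 - (m - 1)  # index of the w at cell p, if any
                if 0 <= i < n and 0 <= j < k:
                    acc[p] += w[j] * x[i]   # x meets w: multiply-accumulate
        return acc[::-1]                    # cell p holds y[m-1-p]

    assert convolve_R1([1, 2, 3], [1, 2, 3, 4]) == [14, 20]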

Design R2—results stay, inputs and weights move in the same direction but at different speeds. One version of design R2 is illustrated in Figure 7. In this case both the x and w data streams move from left to right systolically, but the x_i's move twice as fast as the w_i's. More precisely, each w_i stays inside every cell it passes for one extra cycle, thus taking twice as long to move through the array as any x_i. In this design, multiplier-accumulator hardware can be used effectively, and so can the tag bit method to signal the output of the accumulator contents at each cell. Compared to design R1, this design has the advantage that all cells work all the time when performing a single convolution, but it requires an additional register in each cell to hold a w value. This algorithm has been used for implementing a pipeline multiplier [22].

There is a dual version of design R2; we can have the w_i's move twice as fast as the x_i's. To create delays for the x data stream, this dual design requires a register in each cell for storing an x rather than a w value. For circumstances where the w_i's carry more bits than the x_i's, the dual design becomes attractive.
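
A sketch of design R2's timing, with the half-speed w stream modeled by the index arithmetic j = t - 2p (the weight in cell p's extra register at cycle t); as before, each result gets its own cell, and weight recirculation is omitted:

    def convolve_R2(w, x):
        """Design R2: results stay; inputs and weights both move right,
        x's at full speed, w's at half speed through a pair of registers
        per cell. Cell p then accumulates w[j] * x[j + p] on successive
        cycles, which is exactly y_p; every cell does one useful
        multiply-accumulate per cycle once the pipeline fills."""
        k, n = len(w), len(x)
        m = n + 1 - k
        acc = [0] * m
        for t in range(n + 2 * m):          # enough cycles for the last cell
            for p in range(m):
                j = t - 2 * p               # weight currently usable at cell p
                i = t - p                   # x passing through cell p now
                if 0 <= j < k and 0 <= i < n:
                    acc[p] += w[j] * x[i]
        return acc

    assert convolve_R2([1, 2, 3], [1, 2, 3, 4]) == [14, 20]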

Figure 8. Design W1: systolic convolution array (a) and cell (b) where w_i's stay and x_i's and y_i's move systolically in opposite directions. Cell operation: y_out <- y_in + w * x_in; x_out <- x_in.

Figure 9. Design W2: systolic convolution array (a) and cell (b) where w_i's stay and x_i's and y_i's both move systolically in the same direction but at different speeds.

Design W1—weights stay, inputs and results move in opposite directions. In design W1 (and design W2, below), weights stay, one at each cell, but results and inputs move systolically. These designs are not geared to the most effective use of available multiplier-accumulator hardware, but for some other circumstances they are potentially more efficient than the other designs. Because the same set of weights is used for computing all the y_i's and different sets of the x_i's are used for computing different y_i's, it is natural to have the w_i's preloaded to the cells and stay there, and let the x_i's and the y_i's move along the array. We will see some advantages of this arrangement in the systolic array depicted in Figure 8, which is a special case of a proposed systolic filtering array [2]. This design is fundamental in the sense that it can be naturally extended to perform recursive filtering [2,23] and polynomial division [24].

In design W1, the w_i's stay and the x_i's and y_i's move systolically in opposite directions. Similar to design R1, consecutive x_i's and y_i's are separated by two cycle times. Note that because the systolic path for moving the y_i's already exists, there is no need for another systolic output path as in designs R1 and R2. Furthermore, for each i, y_i outputs from the left-most cell during the same cycle as its last input, x_{i+k-1} (or x_{i+2} for k = 3), enters that cell. Thus, this systolic array is capable of outputting a y_i every two cycle times with constant response time. Design W1, however, suffers from the same drawback as design R1, namely, only approximately one-half the cells work at any given time unless two independent convolution computations are interleaved in the same array. The next design, like design R2, overcomes this shortcoming by having both the x_i's and y_i's move in the same direction but at different speeds.
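
Design W1's timing can be checked by following each y through the array (a sketch; the schedule, with x_i entering the left-most cell at cycle 2i and y_i entering the right-most cell at cycle 2i + k - 1, is our reconstruction):

    def convolve_W1(w, x):
        """Design W1: weights stay; x's move right and y's move left,
        two cycles apart on each stream. Cell p (left to right) holds
        w[k-1-p]; y_i leaves the left-most cell with its last term, and
        one result emerges every two cycles with constant response time."""
        k, n = len(w), len(x)
        out = []
        for i in range(n + 1 - k):           # follow each y through the array
            yi = 0
            for p in range(k - 1, -1, -1):   # cells visited right to left
                t = 2 * k - 2 + 2 * i - p    # cycle at which y_i is at cell p
                xi = (t - p) // 2            # index of the x at cell p then
                yi += w[k - 1 - p] * x[xi]   # stationary weight at cell p
            out.append(yi)
        return out

    assert convolve_W1([1, 2, 3], [1, 2, 3, 4]) == [14, 20]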

Design W2—weights stay, inputs and results move in the same direction but at different speeds. With design W2 (Figure 9) all the cells work all the time, but it loses one advantage of design W1, the constant response time. The output of y_i now takes place k cycles after the last of its inputs starts entering the left-most cell of the systolic array. This design has been extended to implement 2-D convolutions [8,24], where high throughputs rather than fast responses are of concern. Similar to design R2, design W2 has a dual version for which the x_i's move twice as fast as the y_i's.
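
A matching sketch of design W2, where the y's move at full speed and the x's at half speed through two registers per cell; every cell performs one useful multiply-accumulate per cycle, and each result emerges k - 1 cycles after its last input enters the array (our index arithmetic, reconstructed from the description):

    def convolve_W2(w, x):
        """Design W2: weights stay; x's and y's both move right, y's at
        full speed, x's at half speed. Cell p holds w[k-1-p]; the y that
        enters the left-most cell at cycle c+k-1 leaves the right-most
        cell as y_c."""
        k, n = len(w), len(x)
        out = []
        for c in range(n + 1 - k):          # follow each result left to right
            yc = 0
            for p in range(k):
                t = c + k - 1 + p           # cycle at which this y is at cell p
                xi = t - 2 * p              # x in cell p's first register then
                yc += w[k - 1 - p] * x[xi]  # stationary weight at cell p
            out.append(yc)
        return out

    assert convolve_W2([1, 2, 3], [1, 2, 3, 4]) == [14, 20]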

Remarks. The designs presented above by no means exhaust all the possible systolic designs for the convolution problem. For example, it is possible to have systolic designs where results, weights, and inputs all move during each cycle. It could also be advantageous to include inside each cell a "cell memory" capable of storing a set of weights. With this feature, using a systolic control (or address) path, weights can be selected on-the-fly to implement interpolation or adaptive filtering [24]. Moreover, the flexibility introduced by the cell memories and systolic control can make the same systolic array implement different functions. Indeed, the ESL systolic processor utilizes cell memories to implement multiple functions including convolution and matrix multiplication.

Once one systolic design is obtained for a problem, it is likely that a set of other systolic designs can be derived similarly. The challenge is to understand precisely the strengths and drawbacks of each design so that an appropriate design can be selected for a given environment. For example, if there are more weights than cells, it's useful to know that a scheme where partial results stay generally requires less I/O than one where partial results move, since the latter scheme requires partial results to be input and output many times.

Figure 10. Overlapping the executions of multiply and add in design W1.

A single multiplier-accumulator hardware component often represents a cost-effective implementation of the multiplier and adder needed by each cell of a systolic convolution array. However, for improving throughput, sometimes it may be worthwhile to implement the multiplier and adder separately to allow overlapping of their executions. Figure 10 depicts such a modification to design W1. Similar modifications can be made to other systolic convolution arrays. Another interesting scenario is the following one. Suppose that one or several cells are to be implemented directly with a single chip and the chip pin bandwidth is the implementation bottleneck. Then, since the basic cell of some semi-systolic convolution arrays such as designs B1 and F requires only three I/O ports, while that of a pure-systolic convolution array always requires four, a semi-systolic array may be preferable for saving pins, despite the fact that it requires global communication.

Criteria and advantages

Having described a family of systolic convolution arrays, we can now be more precise in suggesting and evaluating criteria for the design of systolic structures.

(1) The design makes multiple use of each input data item. Because of this property, systolic systems can achieve high throughputs with modest I/O bandwidths for outside communication. To meet this criterion, one can either use global data communications, such as broadcast and unbounded fan-in, or have each input travel through an array of cells so that it is used at each cell. For modular expansibility of the resulting system, the second approach is preferable.

(2) The design uses extensive concurrency. The processing power of a systolic architecture comes from concurrent use of many simple cells rather than sequential use of a few powerful processors as in many conventional architectures. Concurrency can be obtained by pipelining the stages involved in the computation of each single result (for example, design B1), by multiprocessing many results in parallel (designs R1 and R2), or by both. For some designs, such as W1, it is possible to completely overlap I/O and computation times to further increase concurrency and provide constant-time responses.

To a given problem there could be both one- and two-dimensional systolic array solutions. For example, two-dimensional convolution can be performed by a one-dimensional systolic array [24,25] or a two-dimensional systolic array [6]. When the memory speed is higher than the cell speed, two-dimensional systolic arrays such as those depicted in Figure 11 should be used. At each cell cycle, all the I/O ports on the array boundaries can input or output data items to or from the memory; as a result, the available memory bandwidth can be fully utilized. Thus, the choice of a one- or two-dimensional scheme is very dependent on how cells and memories will be implemented.

As in one-dimensional systolic arrays, data in two-dimensional arrays may flow in multiple directions and at multiple speeds. For examples of two-dimensional systolic arrays, see Guibas et al. [26] and Kung and Lehman (type R), Kung and Leiserson and Weiser and Davis [27] (type H), and Bojanczyk et al. [28] and Gentleman and Kung [29] (type T). In practice, systolic arrays can be chained together to form powerful systems such as the one depicted in Figure 12, which is capable of producing on-the-fly the least-squares fit to all the data that have arrived up to any given moment.
