`
`Systolic architectures, which permit multiple computations
`for each memory access, can speed execution of
`compute-bound problems without increasing I/O requirements.
`
`
`
`
`
`
`Why Systolic Architectures?
`
`H. T. Kung
`Carnegie-Mellon University
`
High-performance, special-purpose computer systems are typically used to meet specific application requirements or to off-load computations that are especially taxing to general-purpose computers. As hardware cost and size continue to drop and processing requirements become well-understood in areas such as signal and image processing, more special-purpose systems are being constructed. However, since most of these systems are built on an ad hoc basis for specific tasks, methodological work in this area is rare. Because the knowledge gained from individual experiences is neither accumulated nor properly organized, the same errors are repeated. I/O and computation imbalance is a notable example—often, the fact that I/O interfaces cannot keep up with device speed is discovered only after constructing a high-speed, special-purpose device.
We intend to help correct this ad hoc approach by providing a general guideline—specifically, the concept of systolic architecture, a general methodology for mapping high-level computations into hardware structures. In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart. The system works like an automobile assembly line where different people work on the same car at different times and many cars are assembled simultaneously. An assembly line is always linear, however, and systolic systems are sometimes two-dimensional. They can be rectangular, triangular, or hexagonal to make use of higher degrees of parallelism. Moreover, to implement a variety of computations, data flow in a systolic system may be at multiple speeds in multiple directions—both inputs and (partial) results flow, whereas only results flow in classical pipelined systems. Generally speaking, a systolic system is easy to implement because of its regularity and easy to reconfigure (to meet various outside constraints) because of its modularity.

The systolic architectural concept was developed at Carnegie-Mellon University,1,2 and versions of systolic processors are being designed and built by several industrial and governmental organizations.8-10 This article reviews the basic principle of systolic architectures and explains why they should result in cost-effective, high-performance special-purpose systems for a wide range of problems.
`
`Keyarchitectural issues in designing
`special-purpose systems
`
Roughly, the cycle for developing a special-purpose system can be divided into three phases—task definition, design, and implementation. During task definition, some system performance bottleneck is identified, and a decision on whether or not to resolve it with special-purpose hardware is made. The evaluation required for task definition is most fundamental, but since it is often application-dependent, we will concentrate only on architectural issues related to the design phase and will assume routine implementation.
`
Simple and regular design. Cost-effectiveness has always been a chief concern in designing special-purpose systems; their cost must be low enough to justify their limited applicability. Costs can be classified as nonrecurring (design) and recurring (parts) costs. Part costs are dropping rapidly due to advances in integrated-circuit technology, but this advantage applies equally to both special-purpose and general-purpose systems. Furthermore, since special-purpose systems are seldom produced in large quantities, part costs are less important than design costs. Hence, the design cost of a special-purpose system must be relatively small for it to be more attractive than a general-purpose approach.
Fortunately, special-purpose design costs can be reduced by the use of appropriate architectures. If a structure can truly be decomposed into a few types of simple substructures or building blocks, which are used repetitively with simple interfaces, great savings can be achieved. This is especially true for VLSI designs where a single chip comprises hundreds of thousands of components. To cope with that complexity, simple and regular designs, similar
`
`January 1982
`
`0918-9162/82/0100-0037$00.75 © 1982 IEEE
`
`Petitioner Microsoft Corporation - Ex. 1016, p. 37
`
`
`
`
to some of the techniques used in constructing large software systems, are essential.11 In addition, special-purpose systems based on simple, regular designs are likely to be modular and therefore adjustable to various performance goals—that is, system cost can be made proportional to the performance required. This suggests that meeting the architectural challenge for simple, regular designs yields cost-effective special-purpose systems.
`
Concurrency and communication. There are essentially two ways to build a fast computer system. One is to use fast components, and the other is to use concurrency. The last decade has seen an order of magnitude decrease in the cost and size of computer components but only an incremental increase in component speed.12 With current technology, tens of thousands of gates can be put in a single chip, but no gate is much faster than its TTL counterpart of 10 years ago. Since the technological trend clearly indicates a diminishing growth rate for component speed, any major improvement in computation speed must come from the concurrent use of many processing elements. The degree of concurrency in a special-purpose system is largely determined by the underlying algorithm. Massive parallelism can be achieved if the algorithm is designed to introduce high degrees of pipelining and multiprocessing. When a large number of processing elements work simultaneously, coordination and communication become significant—especially with VLSI technology where routing costs dominate the power, time, and area required to implement a computation.13 The issue here is to design algorithms that support high degrees of concurrency, and in the meantime to employ only simple, regular communication and control to enable efficient implementation.

Balancing computation with I/O. Since a special-purpose system typically receives data and outputs results through an attached host, I/O considerations influence overall performance. (The host in this context can mean a computer, a memory, a real-time device, etc. In practice, the special-purpose system may actually input from one "physical" host and output to another.) The ultimate performance goal of a special-purpose system is—and should be no more than—a computation rate that balances the available I/O bandwidth with the host. Since an accurate a priori estimate of available I/O bandwidth in a complex system is usually impossible, the design of a special-purpose system should be modular so that its structure can be easily adjusted to match a variety of I/O bandwidths.
`
[Figure: instead of a single processing element fed by a 100-ns memory, giving at most 5 million operations per second, a systolic array of PEs fed by the same memory makes 30 MOPS possible.]

Figure 1. Basic principle of a systolic system.
`
`38
`
Suppose that the I/O bandwidth between the host and a special-purpose system is 10 million bytes per second, a rather high bandwidth for present technology. Assuming that at least two bytes are read from or written to the host for each operation, the maximum rate will be only 5 million operations per second, no matter how fast the special-purpose system can operate (see Figure 1). Orders of magnitude improvements on this throughput are possible only if multiple computations are performed per I/O access. However, the repetitive use of a data item requires it to be stored inside the system for a sufficient length of time. Thus, the I/O problem is related not only to the available I/O bandwidth, but also to the available memory internal to the system. The question then is how to arrange a computation together with an appropriate memory structure so that computation time is balanced with I/O time.
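The ceiling described here is simple arithmetic; the following sketch (the function name is ours, not from the article) makes it explicit:

```python
def max_ops_per_second(io_bytes_per_s, bytes_per_op, reuse=1):
    """I/O-imposed ceiling on compute rate: every operation consumes
    bytes_per_op of host traffic, unless each fetched item is reused
    at `reuse` cells inside the system (the systolic idea)."""
    return reuse * io_bytes_per_s / bytes_per_op

# The text's figures: a 10-million-byte/s host and two bytes per
# operation give at most 5 million operations per second.
print(max_ops_per_second(10e6, 2))            # 5000000.0
# Reusing each fetched item at, say, six cells lifts the ceiling sixfold.
print(max_ops_per_second(10e6, 2, reuse=6))   # 30000000.0
```

The six-cell reuse factor is only illustrative; any number of cells that touch a data item while it stays inside the system multiplies the cap by that count.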
`
The I/O problem becomes especially severe when a large computation is performed on a small special-purpose system. In this case, the computation must be decomposed. Executing subcomputations one at a time may require a substantial amount of I/O to store or retrieve intermediate results. Consider, for example, performing the n-point fast Fourier transform using an S-point device when n is large and S is small. Figure 2 depicts the n-point FFT computation and a decomposition scheme for n = 16 and S = 4. Note that each subcomputation block is sufficiently small so that it can be handled by the 4-point device. During execution, results of a block must be temporarily sent to the host and later retrieved to be combined with results of other blocks as they become available. With the decomposition scheme shown in Figure 2b, the total number of I/O operations is O(n log n/log S). In fact, it has been shown that, to perform the n-point FFT with a device of O(S) memory, at least this many I/O operations are needed for any decomposition scheme.14 Thus, for the n-point FFT problem, an S-point device cannot achieve more than an O(log S) speed-up ratio over the conventional O(n log n) software implementation time, and since it is a consequence of the I/O consideration, this upper bound holds independently of device speed. Similar upper bounds have been established for speed-up ratios achievable by devices for other computations such as sorting and matrix multiplication.14,15 Knowing the I/O-imposed performance limit helps prevent overkill in the design of a special-purpose device.
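These bounds can be played with numerically. The sketch below uses our own helper names, drops the constant factors hidden in the O-bounds, and assumes base-2 logarithms:

```python
import math

def fft_io_count(n, s):
    """Order-of-magnitude I/O operation count, O(n log n / log S), for an
    n-point FFT on a device with O(S) memory (constant factors dropped)."""
    return n * math.log2(n) / math.log2(s)

def max_fft_speedup(s):
    """The consequent cap on speed-up over O(n log n) software: O(log S)."""
    return math.log2(s)

# The example of Figure 2: n = 16, S = 4.
print(fft_io_count(16, 4))    # 32.0 I/O operations, up to a constant
print(max_fft_speedup(4))     # 2.0: no more than about a 2x speed-up,
                              # no matter how fast the 4-point device is
```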
`
In practice, problems are typically "larger" than special-purpose devices. Therefore, questions such as how a computation can be decomposed to minimize I/O, how the I/O requirement is related to the size of a special-purpose system and its memory, and how the I/O bandwidth limits the speed-up ratio achievable by a special-purpose system present another set of challenges to the system architect.
`
`COMPUTER
`
`
`
`
`Systolic architectures: the basic principle
`
As a solution to the above challenges, we introduce systolic architectures, an architectural concept originally proposed for VLSI implementation of some matrix operations.2 Examples of systolic architectures follow in the next section, which contains a walk-through of a family of designs for the convolution computation.
A systolic system consists of a set of interconnected cells, each capable of performing some simple operation. Because simple, regular communication and control structures have substantial advantages over complicated ones in design and implementation, cells in a systolic system are typically interconnected to form a systolic array or a systolic tree. Information in a systolic system flows between cells in a pipelined fashion, and communication with the outside world occurs only at the "boundary cells." For example, in a systolic array, only those cells on the array boundaries may be I/O ports for the system.
Computational tasks can be conceptually classified into two families—compute-bound computations and I/O-bound computations. In a computation, if the total number of operations is larger than the total number of input and output elements, then the computation is compute-bound; otherwise it is I/O-bound. For example, the ordinary matrix-matrix multiplication algorithm represents a compute-bound task, since every entry in a matrix is multiplied by all entries in some row or column of the other matrix. Adding two matrices, on the other hand, is I/O-bound, since the total number of adds is not larger than the total number of entries in the two matrices.
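The classification reduces to comparing two counts; a minimal sketch (the names are ours):

```python
def is_compute_bound(num_ops, num_io_elements):
    """Compute-bound iff the total operation count exceeds the total
    number of input and output elements; otherwise I/O-bound."""
    return num_ops > num_io_elements

n = 100
# n-by-n matrix multiplication: about n^3 multiply-adds, 3n^2 entries moved.
print(is_compute_bound(n**3, 3 * n**2))   # True: compute-bound
# n-by-n matrix addition: n^2 adds, still 3n^2 entries moved.
print(is_compute_bound(n**2, 3 * n**2))   # False: I/O-bound
```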
It should be clear that any attempt to speed up an I/O-bound computation must rely on an increase in memory bandwidth. Memory bandwidth can be increased by the use of either fast components (which could be expensive) or interleaved memories (which could create complicated memory management problems). Speeding up a compute-bound computation, however, may often be accomplished in a relatively simple and inexpensive manner, that is, by the systolic approach.
The basic principle of a systolic architecture, a systolic array in particular, is illustrated in Figure 1. By replacing a single processing element with an array of PEs, or cells in the terminology of this article, a higher computation throughput can be achieved without increasing memory bandwidth. The function of the memory in the diagram is analogous to that of the heart; it "pulses" data (instead of blood) through the array of cells. The crux of this approach is to ensure that once a data item is brought out from the memory it can be used effectively at each cell it passes while being "pumped" from cell to cell along the array. This is possible for a wide class of compute-bound computations where multiple operations are performed on each data item in a repetitive manner.
Being able to use each input data item a number of times (and thus achieving high computation throughput with only modest memory bandwidth) is just one of the many advantages of the systolic approach. Other advantages, such as modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting and fan-in, and (possibly) fast response time, will be illustrated in various systolic designs in the next section.
`
`A family of systolic designs
`for the convolution computation
`
`To provide concrete examples of various systolic struc-
`tures, this section presents a family of systolic designs for
`the convolution problem, which is defined as follows:
`
Given the sequence of weights {w_1, w_2, . . . , w_k}
and the input sequence {x_1, x_2, . . . , x_n},

compute the result sequence {y_1, y_2, . . . , y_{n+1-k}}
defined by

y_i = w_1 x_i + w_2 x_{i+1} + . . . + w_k x_{i+k-1}.
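As a reference point for the systolic designs that follow, the definition can be evaluated directly (our sketch uses 0-based indices):

```python
def convolve(w, x):
    """Direct evaluation of y_i = w_1 x_i + w_2 x_{i+1} + ... + w_k x_{i+k-1},
    written with 0-based indices: y[i] = sum over j of w[j] * x[i + j]."""
    k, n = len(w), len(x)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(n + 1 - k)]

print(convolve([1, 2, 3], [1, 2, 3, 4]))   # [14, 20]
```

Each systolic design below computes exactly this sequence; only the movement of data through the cells differs.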
`
`
`
Figure 2. (a) 16-point fast-Fourier-transform graph; (b) decomposing the FFT computation with n = 16 and S = 4.
`
`January 1982
`
`Petitioner Microsoft Corporation - Ex. 1016, p. 39
`
`
`
`
`
[Cell function: y_out = y_in + w · x_in]

Figure 3. Design B1: systolic convolution array (a) and cell (b) where x_i's are broadcast, w_i's stay, and y_i's move systolically.
`
Figure 4. Design B2: systolic convolution array (a) and cell (b) where x_i's are broadcast, y_i's stay, and w_i's move systolically.
`
We consider the convolution problem because it is a simple problem with a variety of enlightening systolic solutions, because it is an important problem in its own right, and more importantly, because it is representative of a wide class of computations suited to systolic designs. The convolution problem can be viewed as a problem of combining two data streams, w_i's and x_i's, in a certain manner (for example, as in the above equation) to form a resultant data stream of y_i's. This type of computation is common to a number of computation routines, such as filtering, pattern matching, correlation, interpolation, polynomial evaluation (including discrete Fourier transforms), and polynomial multiplication and division. For example, if multiplication and addition are interpreted as comparison and boolean AND, respectively, then the convolution problem becomes the pattern matching problem.1 Architectural concepts for the convolution problem can thus be applied to these other problems as well.
`
The convolution problem is compute-bound, since each input x_i is to be multiplied by each of the k weights. If the x_i is input separately from memory for each multiplication, then when k is large, memory bandwidth becomes a bottleneck, precluding a high-performance solution. As indicated earlier, a systolic architecture resolves this I/O bottleneck by making multiple use of each x_i fetched from the memory. Based on this principle, several systolic designs for solving the convolution problem are described below. For simplicity, all illustrations assume that k = 3.
`
(Semi-) systolic convolution arrays with global data communication. If an x_i, once brought out from the memory, is broadcast to a number of cells, then the same x_i can be used by all the cells. This broadcasting technique is probably one of the most obvious ways to make multiple use of each input element. The opposite of broadcasting is fan-in, through which data items from a number of cells can be collected. The fan-in technique can also be used in a straightforward manner to resolve the I/O bottleneck problem. In the following, we describe systolic designs that utilize broadcasting and fan-in.
Design B1—broadcast inputs, move results, weights stay. The systolic array and its cell definition are depicted
`
`40
`
in Figure 3. Weights are preloaded to the cells, one at each cell, and stay at the cells throughout the computation. Partial results y_i move systolically from cell to cell in the left-to-right direction, that is, each of them moves over to the cell to its right during each cycle. At the beginning of a cycle, one x_i is broadcast to all the cells and one y_i, initialized as zero, enters the left-most cell. During cycle one, w_1 x_1 is accumulated to y_1 at the left-most cell, and during cycle two, w_1 x_2 and w_2 x_2 are accumulated to y_2 and y_1 at the left-most and middle cells, respectively. Starting from cycle three, the final (and correct) values of y_1, y_2, . . . are output from the right-most cell at the rate of one y_i per cycle. The basic principle of this design was previously proposed for circuits to implement a pattern matching processor18 and for circuits to implement polynomial multiplication.17-20
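The schedule just described can be mimicked cycle by cycle in a few lines of Python. This is our own sketch of design B1 (broadcast x, stationary weights, y's shifting right), not code from the article:

```python
def convolve_b1(w, x):
    """Design B1 sketch: weights stay one per cell, x[t] is broadcast to all
    cells in cycle t, and partial y's move one cell to the right per cycle."""
    k, n = len(w), len(x)
    cells = [0] * k                    # partial y held at each cell (0 = left-most)
    out = []
    for t in range(n):
        cells = [0] + cells[:-1]       # y's shift right; a fresh y enters at left
        for i in range(k):
            cells[i] += w[i] * x[t]    # every cell uses the same broadcast x
        if t >= k - 1:
            out.append(cells[-1])      # right-most cell emits a finished y
    return out

print(convolve_b1([1, 2, 3], [1, 2, 3, 4]))   # [14, 20], one y per cycle
```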
Design B2—broadcast inputs, move weights, results stay. In design B2 (see Figure 4), each y_i stays at a cell to accumulate its terms, allowing efficient use of available multiplier-accumulator hardware. (Indeed, this design is described in an application booklet for the TRW multiplier-accumulator chips.21) The weights circulate around the array of cells, and the first weight w_1 is associated with a tag bit that signals the accumulator to output and reset its contents.* In design B1 (Figure 3), the systolic path for moving y_i's may be considerably wider than that for moving w_i's in design B2 because for numerical accuracy y_i's typically carry more bits than w_i's. The use of multiplier-accumulators in design B2 may also help increase precision of the results, since extra bits can be kept in these accumulators with modest cost. Design B1, however, does have the advantage of not requiring a separate bus (or other global network), denoted by a dashed line in Figure 4, for collecting outputs from individual cells.
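A similar sketch captures design B2: the y's now stay put, the circulating weights visit each cell in turn, and the tag on w_1 both emits a finished y and clears the accumulator. The zero-padding used to flush the last results and the term-counting that discards start-up partial sums are our own devices, standing in for the gating the article omits from its diagrams:

```python
def convolve_b2(w, x):
    """Design B2 sketch: x broadcast, y's stay, weights circulate; cell c
    holds weight w[(t - c) mod k] in cycle t, and the tag on w[0] makes a
    cell output and reset."""
    k, n = len(w), len(x)
    xp = list(x) + [0] * k             # pad so the final y's are flushed out
    acc = [0] * k                      # the y accumulating at each cell
    cnt = [0] * k                      # terms gathered since the last reset
    out = []
    for t in range(len(xp)):
        for c in range(k):
            j = (t - c) % k            # which weight sits at cell c this cycle
            if j == 0:                 # tagged first weight arrives
                if cnt[c] == k:        # a complete y: emit it
                    out.append(acc[c])
                acc[c], cnt[c] = 0, 0  # reset (discards start-up partials too)
            acc[c] += w[j] * xp[t]
            cnt[c] += 1
    return out[:n + 1 - k]             # drop tail blocks fed only padding

print(convolve_b2([1, 2, 3], [1, 2, 3, 4]))   # [14, 20]
```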
Design F—fan-in results, move inputs, weights stay. If we consider the vector of weights (w_k, w_{k-1}, . . . , w_1) as being fixed in space and the input vector (x_i, x_{i-1}, . . . , x_{i-k+1}) as sliding over the weights in the left-to-right direction, then the convolution problem is one that computes the inner product of the weight vector and the section of input vector it overlaps. This view suggests the systolic array
`
*To avoid complicated pictures, control structures such as the use of tag bits to gate outputs from cells are omitted from the diagrams of this article.
`
`
`
`
`
`
`
shown in Figure 5. Weights are preloaded to the cells and stay there throughout the computation. During a cycle, all x_i's move one cell to the right, multiplications are performed at all cells simultaneously, and their results are fanned-in and summed using an adder to form a new y_i. When the number of cells, k, is large, the adder can be implemented as a pipelined adder tree to avoid large delays in each cycle. Designs of this type using unbounded fan-in have been known for quite a long time, for example, in the context of signal processing22 and in the context of pattern matching.23
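Design F is the easiest of the three to simulate, since nothing needs a tag: the x's march right past the stationary weights (stored w_k, . . . , w_1 from left to right), and one global fan-in sum is formed per cycle. In our sketch the adder tree is modeled as a plain sum:

```python
def convolve_f(w, x):
    """Design F sketch: weights fixed as (w_k, ..., w_1) left to right,
    x's move one cell right per cycle, and each cycle the k products are
    fanned in to a single adder that emits one finished y."""
    k, n = len(w), len(x)
    taps = [None] * k                  # the x at each cell (newest at the left)
    out = []
    for t in range(n):
        taps = [x[t]] + taps[:-1]      # all x's move one cell to the right
        if t >= k - 1:                 # window full: fan in the k products
            out.append(sum(w[k - 1 - i] * taps[i] for i in range(k)))
    return out

print(convolve_f([1, 2, 3], [1, 2, 3, 4]))   # [14, 20]
```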
`
Figure 5. Design F: systolic convolution array (a) and cell (b) where w_i's stay, x_i's move systolically, and y_i's are formed through the fan-in of results from all the cells.
`
Figure 6. Design R1: systolic convolution array (a) and cell (b) where y_i's stay and x_i's and w_i's move in opposite directions systolically.
`
(Pure-) systolic convolution arrays without global data communication. Although global broadcasting or fan-in solves the I/O bottleneck problem, implementing it in a modular, expandable way presents another problem. Providing (or collecting) a data item to (or from) all the cells of a systolic array, during each cycle, requires the use of a bus or some sort of tree-like network. As the number of cells increases, wires become long for either a bus or tree structure; expanding these non-local communication paths to meet the increasing load is difficult without slowing down the system clock. This engineering difficulty of extending global networks is significant at chip, board, and higher levels of a computer system. Fortunately, as will be demonstrated below, systolic convolution arrays without global data communication do exist. Potentially, these arrays can be extended to include an arbitrarily large number of cells without encountering engineering difficulties (the problem of synchronizing a large systolic array is discussed later).
Design R1—results stay, inputs and weights move in opposite directions. In design R1 (see Figure 6) each partial result y_i stays at a cell to accumulate its terms. The x_i's and w_i's move systolically in opposite directions such that when an x meets a w at a cell, they are multiplied and the resulting product is accumulated to the y staying at that cell. To ensure that each x_i is able to meet every w_j, consecutive x_i's on the x data stream are separated by two cycle times and so are the w_i's on the w data stream.
Like design B2, design R1 can make efficient use of available multiplier-accumulator hardware; it can also use a tag bit associated with the first weight, w_1, to trigger the output and reset the accumulator contents of a cell.
`
`
Figure 7. Design R2: systolic convolution array (a) and cell (b) where y_i's stay and x_i's and w_i's both move in the same direction but at different speeds.
`
Design R1 has the advantage that it does not require a bus, or any other global network, for collecting output from cells; a systolic output path (indicated by broken arrows in Figure 6) is sufficient. Because consecutive w_i's are well separated by two cycle times, a potential conflict—that more than one y_i may reach a single latch on the systolic output path simultaneously—cannot occur. It can also be easily checked that the y_i's will output from the systolic output path in the natural ordering y_1, y_2, . . . . The basic idea of this design, including that of the systolic output path, has been used to implement a pattern matching chip.1
Notice that in Figure 6 only about one-half the cells are doing useful work at any time. To fully utilize the potential throughput, two independent convolution computations can be interleaved in the same systolic array, but cells in the array would have to be modified slightly to support the interleaved computation. For example, an additional accumulator would be required at each cell to hold a temporary result for the other convolution computation.
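The two-cycle spacing and the tag on w_1 make design R1 the most intricate schedule so far. The token-level sketch below is our reconstruction of it; the exact stream timing, the zero-padding that flushes the final results, and the term-counting that discards start-up junk are our choices, standing in for the control the article leaves out of its figures:

```python
def convolve_r1(w, x):
    """Design R1 sketch: y's stay at cells, x's move right and w's move left,
    consecutive items on each stream two cycles apart; the circulating weight
    stream carries a tag on w[0] that makes a cell emit and reset its y."""
    k, n = len(w), len(x)
    xp = list(x) + [0] * k              # zero-pad so the last y's get flushed
    acc = [0] * k                       # the y accumulating at each cell
    cnt = [0] * k                       # terms gathered since the last reset
    xcell = [None] * k                  # x-stream slot at each cell (left to right)
    wcell = [None] * k                  # (weight, tagged) slot at each cell
    out = []
    for t in range(2 * len(xp) + 4 * k):
        # w's enter at the right from cycle 0; x's enter at the left k-1
        # cycles later; both streams leave a one-cycle gap between items
        m = t // 2
        win = (w[m % k], m % k == 0) if t % 2 == 0 else None
        i = (t - k + 1) // 2
        xin = xp[i] if t >= k - 1 and (t - k + 1) % 2 == 0 and i < len(xp) else None
        xcell = [xin] + xcell[:-1]      # x's move one cell right
        wcell = wcell[1:] + [win]       # w's move one cell left
        for c in range(k):
            if xcell[c] is None or wcell[c] is None:
                continue                # nothing meets at this cell this cycle
            wv, tagged = wcell[c]
            if tagged:                  # first weight of a block arrives
                if cnt[c] == k:         # a complete y: emit it
                    out.append(acc[c])
                acc[c], cnt[c] = 0, 0   # reset (discards start-up junk)
            acc[c] += wv * xcell[c]
            cnt[c] += 1
    return out[:n + 1 - k]              # drop blocks completed only by padding

print(convolve_r1([1, 2, 3], [1, 2, 3, 4]))   # [14, 20], in natural order
```

Running it confirms the text's claims: each x meets every w, and the finished y's emerge in the natural order.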
Design R2—results stay, inputs and weights move in the same direction but at different speeds. One version of design R2 is illustrated in Figure 7. In this case both the x and w data streams move from left to right systolically, but the x_i's move twice as fast as the w_i's. More precisely,
`
`
`
`
`
[Cell function: y_out = y_in + w · x_in; x_out = x_in]
`
Figure 8. Design W1: systolic convolution array (a) and cell (b) where w_i's stay and x_i's and y_i's move systolically in opposite directions.

Figure 9. Design W2: systolic convolution array (a) and cell (b) where w_i's stay and x_i's and y_i's both move systolically in the same direction but at different speeds.
`
each w_i stays inside every cell it passes for one extra cycle, thus taking twice as long to move through the array as any x_i. In this design, multiplier-accumulator hardware can be used effectively and so can the tag bit method to signal the output of the accumulator contents at each cell. Compared to design R1, this design has the advantage that all cells work all the time when performing a single convolution, but it requires an additional register in each cell to hold a w value. This algorithm has been used for implementing a pipeline multiplier.
There is a dual version of design R2; we can have the w_i's move twice as fast as the x_i's. To create delays for the x data stream, this dual design requires a register in each cell for storing an x rather than a w value. For circumstances where the w_i's carry more bits than the x_i's, the dual design becomes attractive.
Design W1—weights stay, inputs and results move in opposite directions. In design W1 (and design W2, below), weights stay, one at each cell, but results and inputs move systolically. These designs are not geared to the most effective use of available multiplier-accumulator hardware, but for some other circumstances they are potentially more efficient than the other designs. Because the same set of weights is used for computing all the y_i's and different sets of the x_i's are used for computing different y_i's, it is natural to have the w_i's preloaded to the cells and stay there, and let the x_i's and the y_i's move along the array. We will see some advantages of this arrangement in the systolic array depicted in Figure 8, which is a special case of a proposed systolic filtering array.2 This design is fundamental in the sense that it can be naturally extended to perform recursive filtering and polynomial division.
In design W1, the w_i's stay and the x_i's and y_i's move systolically in opposite directions. Similar to design R1, consecutive x_i's and y_i's are separated by two cycle times. Note that because the systolic path for moving the y_i's already exists, there is no need for another systolic output path as in designs R1 and R2. Furthermore, for each i, y_i outputs from the left-most cell during the same cycle as its last input, x_{i+k-1} (or x_{i+2} for k = 3), enters that cell. Thus, this systolic array is capable of outputting a y_i every two cycle times with constant response time. Design W1, however, suffers from the same drawback as design R1, namely, only approximately one-half the cells work at any given time unless two independent convolution computations are interleaved in the same array. The next design, like design R2, overcomes this shortcoming by having both the x_i's and y_i's move in the same direction but at different speeds.
Design W2—weights stay, inputs and results move in the same direction but at different speeds. With design W2 (Figure 9) all the cells work all the time, but it loses one advantage of design W1, the constant response time. The output of y_i now takes place k cycles after the last of its inputs starts entering the left-most cell of the systolic array. This design has been extended to implement 2-D convolutions,8,24 where high throughputs rather than fast responses are of concern. Similar to design R2, design W2 has a dual version for which the x_i's move twice as fast as the y_i's.
`
Remarks. The designs presented above by no means exhaust all the possible systolic designs for the convolution problem. For example, it is possible to have systolic designs where results, weights, and inputs all move during each cycle. It could also be advantageous to include inside each cell a "cell memory" capable of storing a set of weights. With this feature, using a systolic control (or address) path, weights can be selected on-the-fly to implement interpolation or adaptive filtering.24 Moreover, the flexibility introduced by the cell memories and systolic control can make the same systolic array implement different functions. Indeed, the ESL systolic processor utilizes cell memories to implement multiple functions including convolution and matrix multiplication.
Once one systolic design is obtained for a problem, it is likely that a set of other systolic designs can be derived similarly. The challenge is to understand precisely the strengths and drawbacks of each design so that an appropriate design can be selected for a given environment. For example, if there are more weights than cells, it's useful to know that a scheme where partial results stay generally requires less I/O than one where partial results move, since the latter scheme requires partial results to be input and output many times. A single multiplier-accumulator hardware component often represents a cost-effective implementation of the multiplier and adder
`
`
`
`
Figure 10. Overlapping the executions of multiply and add in design W1.
`
needed by each cell of a systolic convolution array. However, for improving throughput, sometimes it may be worthwhile to implement the multiplier and adder separately to allow overlapping of their executions. Figure 10 depicts such a modification to design W1. Similar modifications can be made to other systolic convolution arrays. Another interesting scenario is the following one. Suppose that one or several cells are to be implemented directly with a single chip and the chip pin bandwidth is the implementation bottleneck. Then, since the basic cell of some semi-systolic convolution arrays such as designs B1 and F requires only three I/O ports, while that of a pure-systolic convolution array always requires four, a semi-systolic array may be preferable for saving pins, despite the fact that it requires global communication.
`
`Criteria and advantages
`
Having described a family of systolic convolution arrays, we can now be more precise in suggesting and evaluating criteria for the design of systolic structures.

(1) The design makes multiple use of each input data item. Because of this property, systolic systems can achieve high throughputs with modest I/O bandwidths for outside communication. To meet this criterion, one can either use global data communications, such as broadcast and unbounded fan-in, or have each input travel through an array of cells so that it is used at each cell. For modular expansibility of the resulting system, the second approach is preferable.
(2) The design uses extensive concurrency. The processing power of a systolic architecture comes from concurrent use of many simple cells rather than sequential use of a few powerful processors as in many conventional architectures. Concurrency can be obtained by pipelining the stages involved in the computation of each single result (for example, design B1), by multiprocessing many results in parallel (designs R1 and R2), or by both. For some designs, such as W1, it is possible to completely overlap I/O and computation times to further increase concurrency and provide constant-time responses.
`
`
To a given problem there could be both one- and two-dimensional systolic array solutions. For example, two-dimensional convolution can be performed by a one-dimensional systolic array24,25 or a two-dimensional systolic array.6 When the memory speed is higher than the cell speed, two-dimensional systolic arrays such as those depicted in Figure 11 should be used. At each cell cycle, all the I/O ports on the array boundaries can input or output data items to or from the memory; as a result, the available memory bandwidth can be fully utilized. Thus, the choice of a one- or two-dimensional scheme is very dependent on how cells and memories will be implemented.
As in one-dimensional systolic arrays, data in two-dimensional arrays may flow in multiple directions and at multiple speeds. For examples of two-dimensional systolic arrays, see Guibas et al.26 and Kung and Lehman4 (type R), Kung and Leiserson2 and Weiser and Davis27 (type H), and Bojanczyk et al.28 and Gentleman and Kung29 (type T). In practice, systolic arrays can be chained together to form powerful systems such as the one depicted in Figure 12, which is capable of producing on-the-fly the least-squares fit to