`
Systolic architectures, which permit multiple computations for each memory access, can speed execution of compute-bound problems without increasing I/O requirements.
`
`
Why Systolic Architectures?

H. T. Kung
Carnegie-Mellon University
`
High-performance, special-purpose computer systems are typically used to meet specific application requirements or to off-load computations that are especially taxing to general-purpose computers. As hardware cost and size continue to drop and processing requirements become well-understood in areas such as signal and image processing, more special-purpose systems are being constructed. However, since most of these systems are built on an ad hoc basis for specific tasks, methodological work in this area is rare. Because the knowledge gained from individual experiences is neither accumulated nor properly organized, the same errors are repeated. I/O and computation imbalance is a notable example: often, the fact that I/O interfaces cannot keep up with device speed is discovered only after constructing a high-speed, special-purpose device.

We intend to help correct this ad hoc approach by providing a general guideline: specifically, the concept of systolic architecture, a general methodology for mapping high-level computations into hardware structures. In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart. The system works like an automobile assembly line where different people work on the same car at different times and many cars are assembled simultaneously. An assembly line is always linear, however, and systolic systems are sometimes two-dimensional. They can be rectangular, triangular, or hexagonal to make use of higher degrees of parallelism. Moreover, to implement a variety of computations, data flow in a systolic system may be at multiple speeds in multiple directions; both inputs and (partial) results flow, whereas only results flow in classical pipelined systems. Generally speaking, a systolic system is easy to implement because of its regularity and easy to reconfigure (to meet various outside constraints) because of its modularity.

The systolic architectural concept was developed at Carnegie-Mellon University,1-7 and versions of systolic processors are being designed and built by several industrial and governmental organizations.8-10 This article reviews the basic principle of systolic architectures and explains why they should result in cost-effective, high-performance special-purpose systems for a wide range of problems.
`
Key architectural issues in designing special-purpose systems
`
Roughly, the cycle for developing a special-purpose system can be divided into three phases: task definition, design, and implementation. During task definition, some system performance bottleneck is identified, and a decision on whether or not to resolve it with special-purpose hardware is made. The evaluation required for task definition is most fundamental, but since it is often application-dependent, we will concentrate only on architectural issues related to the design phase and will assume routine implementation.
`
Simple and regular design. Cost-effectiveness has always been a chief concern in designing special-purpose systems; their cost must be low enough to justify their limited applicability. Costs can be classified as nonrecurring (design) and recurring (parts) costs. Part costs are dropping rapidly due to advances in integrated-circuit technology, but this advantage applies equally to both special-purpose and general-purpose systems. Furthermore, since special-purpose systems are seldom produced in large quantities, part costs are less important than design costs. Hence, the design cost of a special-purpose system must be relatively small for it to be more attractive than a general-purpose approach.

Fortunately, special-purpose design costs can be reduced by the use of appropriate architectures. If a structure can truly be decomposed into a few types of simple substructures or building blocks, which are used repetitively with simple interfaces, great savings can be achieved. This is especially true for VLSI designs where a single chip comprises hundreds of thousands of components. To cope with that complexity, simple and regular designs, similar
to some of the techniques used in constructing large software systems, are essential.11 In addition, special-purpose systems based on simple, regular designs are likely to be modular and therefore adjustable to various performance goals; that is, system cost can be made proportional to the performance required. This suggests that meeting the architectural challenge for simple, regular designs yields cost-effective special-purpose systems.
`
`
Concurrency and communication. There are essentially two ways to build a fast computer system. One is to use fast components, and the other is to use concurrency. The last decade has seen an order of magnitude decrease in the cost and size of computer components but only an incremental increase in component speed.12 With current technology, tens of thousands of gates can be put in a single chip, but no gate is much faster than its TTL counterpart of 10 years ago. Since the technological trend clearly indicates a diminishing growth rate for component speed, any major improvement in computation speed must come from the concurrent use of many processing elements. The degree of concurrency in a special-purpose system is largely determined by the underlying algorithm. Massive parallelism can be achieved if the algorithm is designed to introduce high degrees of pipelining and multiprocessing. When a large number of processing elements work simultaneously, coordination and communication become significant, especially with VLSI technology where routing costs dominate the power, time, and area required to implement a computation.13 The issue here is to design algorithms that support high degrees of concurrency, and in the meantime to employ only simple, regular communication and control to enable efficient implementation.
`
Balancing computation with I/O. Since a special-purpose system typically receives data and outputs results through an attached host, I/O considerations influence overall performance. (The host in this context can mean a computer, a memory, a real-time device, etc. In practice, the special-purpose system may actually input from one "physical" host and output to another.) The ultimate performance goal of a special-purpose system is, and should be no more than, a computation rate that balances the available I/O bandwidth with the host. Since an accurate a priori estimate of available I/O bandwidth in a complex system is usually impossible, the design of a special-purpose system should be modular so that its structure can be easily adjusted to match a variety of I/O bandwidths.
`
Suppose that the I/O bandwidth between the host and a special-purpose system is 10 million bytes per second, a rather high bandwidth for present technology. Assuming that at least two bytes are read from or written to the host for each operation, the maximum rate will be only 5 million operations per second, no matter how fast the special-purpose system can operate (see Figure 1). Orders of magnitude improvements on this throughput are possible only if multiple computations are performed per I/O access. However, the repetitive use of a data item requires it to be stored inside the system for a sufficient length of time. Thus, the I/O problem is related not only to the available I/O bandwidth, but also to the available memory internal to the system. The question then is how to arrange a computation together with an appropriate memory structure so that computation time is balanced with I/O time.

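As a back-of-the-envelope check (an illustration added here; the function and its reuse factor are ours, not the article's), the ceiling imposed by host bandwidth, with and without data reuse, can be computed directly:

def max_ops_per_second(io_bytes_per_sec, bytes_per_op, reuse=1):
    # Host I/O caps throughput at bandwidth / bytes-per-operation;
    # reusing each fetched item at `reuse` cells raises the cap by that factor.
    return io_bytes_per_sec / bytes_per_op * reuse

print(max_ops_per_second(10e6, 2))            # 5.0e6: at most 5 million ops/s
print(max_ops_per_second(10e6, 2, reuse=6))   # 3.0e7: 30 MOPS with 6-fold reuse
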
`
The I/O problem becomes especially severe when a large computation is performed on a small special-purpose system. In this case, the computation must be decomposed. Executing subcomputations one at a time may require a substantial amount of I/O to store or retrieve intermediate results. Consider, for example, performing the n-point fast Fourier transform using an S-point device when n is large and S is small. Figure 2 depicts the n-point FFT computation and a decomposition scheme for n = 16 and S = 4. Note that each subcomputation block is sufficiently small so that it can be handled by the 4-point device. During execution, results of a block must be temporarily sent to the host and later retrieved to be combined with results of other blocks as they become available. With the decomposition scheme shown in Figure 2b, the total number of I/O operations is O(n log n/log S). In fact, it has been shown that, to perform the n-point FFT with a device of O(S) memory, at least this many I/O operations are needed for any decomposition scheme.14 Thus, for the n-point FFT problem, an S-point device cannot achieve more than an O(log S) speed-up ratio over the conventional O(n log n) software implementation time, and since it is a consequence of the I/O consideration, this upper bound holds independently of device speed. Similar upper bounds have been established for speed-up ratios achievable by devices for other computations such as sorting and matrix multiplication.14,15 Knowing the I/O-imposed performance limit helps prevent overkill in the design of a special-purpose device.

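To get a feel for these bounds, the following sketch (ours, not the article's) tabulates the minimum I/O count n log n / log S and the O(log S) speed-up ceiling for a few device sizes:

import math

def fft_min_io(n, s):
    # Any decomposition of an n-point FFT onto a device with O(S)
    # memory needs on the order of n log n / log S I/O operations.
    return n * math.log2(n) / math.log2(s)

def fft_max_speedup(s):
    # Hence at most an O(log S) speed-up over O(n log n) software time.
    return math.log2(s)

for s in (4, 16, 256):
    print(s, round(fft_min_io(2**16, s)), fft_max_speedup(s))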
`
In practice, problems are typically "larger" than special-purpose devices. Therefore, questions such as how a computation can be decomposed to minimize I/O, how the I/O requirement is related to the size of a special-purpose system and its memory, and how the I/O bandwidth limits the speed-up ratio achievable by a special-purpose system present another set of challenges to the system architect.
`
`
[Figure 1. Basic principle of a systolic system. Instead of a single processing element drawing operands from a 100-ns memory (5 million operations per second at most), the systolic array passes each item through a chain of processing elements, making 30 MOPS possible with the same memory.]
`
`
`
`
Systolic architectures: the basic principle
`
As a solution to the above challenges, we introduce systolic architectures, an architectural concept originally proposed for VLSI implementation of some matrix operations.5 Examples of systolic architectures follow in the next section, which contains a walk-through of a family of designs for the convolution computation.

A systolic system consists of a set of interconnected cells, each capable of performing some simple operation. Because simple, regular communication and control structures have substantial advantages over complicated ones in design and implementation, cells in a systolic system are typically interconnected to form a systolic array or a systolic tree. Information in a systolic system flows between cells in a pipelined fashion, and communication with the outside world occurs only at the "boundary cells." For example, in a systolic array, only those cells on the array boundaries may be I/O ports for the system.

Computational tasks can be conceptually classified into two families: compute-bound computations and I/O-bound computations. In a computation, if the total number of operations is larger than the total number of input and output elements, then the computation is compute-bound; otherwise it is I/O-bound. For example, the ordinary matrix-matrix multiplication algorithm represents a compute-bound task, since every entry in a matrix is multiplied by all entries in some row or column of the other matrix. Adding two matrices, on the other hand, is I/O-bound, since the total number of adds is not larger than the total number of entries in the two matrices.

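As a quick check of the two examples (an illustration added here; the operation counts are those of the straightforward algorithms), compare operations against input and output elements:

def classify(ops, io_elements):
    # Compute-bound if operations outnumber I/O elements; else I/O-bound.
    return "compute-bound" if ops > io_elements else "I/O-bound"

n = 100
print(classify(n**3, 3 * n**2))  # n x n matrix multiply: compute-bound
print(classify(n**2, 3 * n**2))  # n x n matrix addition: I/O-bound
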
It should be clear that any attempt to speed up an I/O-bound computation must rely on an increase in memory bandwidth. Memory bandwidth can be increased by the use of either fast components (which could be expensive) or interleaved memories (which could create complicated memory management problems). Speeding up a compute-bound computation, however, may often be accomplished in a relatively simple and inexpensive manner, that is, by the systolic approach.

The basic principle of a systolic architecture, a systolic array in particular, is illustrated in Figure 1. By replacing a single processing element with an array of PEs, or cells in the terminology of this article, a higher computation throughput can be achieved without increasing memory bandwidth. The function of the memory in the diagram is analogous to that of the heart; it "pulses" data (instead of blood) through the array of cells. The crux of this approach is to ensure that once a data item is brought out from the memory it can be used effectively at each cell it passes while being "pumped" from cell to cell along the array. This is possible for a wide class of compute-bound computations where multiple operations are performed on each data item in a repetitive manner.

Being able to use each input data item a number of times (and thus achieving high computation throughput with only modest memory bandwidth) is just one of the many advantages of the systolic approach. Other advantages, such as modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting and fan-in, and (possibly) fast response time, will be illustrated in various systolic designs in the next section.

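The saving in memory traffic is easy to quantify. For the k-weight convolution used as the running example in the next section, a rough accounting sketch (ours; output traffic is ignored) compares fetches of input items with and without reuse:

def x_fetches(n, k):
    naive = n * k   # a lone PE refetches x_i for each of the k operations
    systolic = n    # each x_i enters the array once and visits all k cells
    return naive, systolic

print(x_fetches(1024, 3))  # (3072, 1024): same work, one-third the fetches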
`
A family of systolic designs for the convolution computation
`
To provide concrete examples of various systolic structures, this section presents a family of systolic designs for the convolution problem, which is defined as follows:

Given the sequence of weights {w_1, w_2, . . . , w_k}
and the input sequence {x_1, x_2, . . . , x_n},
compute the result sequence {y_1, y_2, . . . , y_{n+1-k}}
defined by

    y_i = w_1 x_i + w_2 x_{i+1} + . . . + w_k x_{i+k-1}.

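A direct, non-systolic rendering of this definition (a plain reference implementation added here for later comparison with the systolic designs) is:

def convolve(w, x):
    # y_i = w_1*x_i + w_2*x_{i+1} + ... + w_k*x_{i+k-1}
    k, n = len(w), len(x)
    return [sum(w[j] * x[i + j] for j in range(k))
            for i in range(n + 1 - k)]

print(convolve([1, 2, 3], [1, 0, 0, 1]))  # [1, 3]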
`
[Figure 2. (a) 16-point fast-Fourier-transform graph; (b) decomposing the FFT computation with n = 16 and S = 4 onto a 4-point device.]
`
`
`
`
`
`
Figure 3. Design B1: systolic convolution array (a) and cell (b) where x_i's are broadcast, w_i's stay, and y_i's move systolically.
`
Figure 4. Design B2: systolic convolution array (a) and cell (b) where x_i's are broadcast, y_i's stay, and w_i's move systolically.
`
We consider the convolution problem because it is a simple problem with a variety of enlightening systolic solutions, because it is an important problem in its own right, and more importantly, because it is representative of a wide class of computations suited to systolic designs. The convolution problem can be viewed as a problem of combining two data streams, w_i's and x_j's, in a certain manner (for example, as in the above equation) to form a resultant data stream of y_i's. This type of computation is common to a number of computation routines, such as filtering, pattern matching, correlation, interpolation, polynomial evaluation (including discrete Fourier transforms), and polynomial multiplication and division. For example, if multiplication and addition are interpreted as comparison and boolean AND, respectively, then the convolution problem becomes the pattern matching problem.1 Architectural concepts for the convolution problem can thus be applied to these other problems as well.
`
The convolution problem is compute-bound, since each input x_i is to be multiplied by each of the k weights. If the x_i is input separately from memory for each multiplication, then when k is large, memory bandwidth becomes a bottleneck, precluding a high-performance solution. As indicated earlier, a systolic architecture resolves this I/O bottleneck by making multiple use of each x_i fetched from the memory. Based on this principle, several systolic designs for solving the convolution problem are described below. For simplicity, all illustrations assume that k = 3.
`
(Semi-) systolic convolution arrays with global data communication. If an x_i, once brought out from the memory, is broadcast to a number of cells, then the same x_i can be used by all the cells. This broadcasting technique is probably one of the most obvious ways to make multiple use of each input element. The opposite of broadcasting is fan-in, through which data items from a number of cells can be collected. The fan-in technique can also be used in a straightforward manner to resolve the I/O bottleneck problem. In the following, we describe systolic designs that utilize broadcasting and fan-in.

Design B1: broadcast inputs, move results, weights stay. The systolic array and its cell definition are depicted
in Figure 3. Weights are preloaded to the cells, one at each cell, and stay at the cells throughout the computation. Partial results y_i move systolically from cell to cell in the left-to-right direction, that is, each of them moves over the cell to its right during each cycle. At the beginning of a cycle, one x_i is broadcast to all the cells and one y_i, initialized as zero, enters the left-most cell. During cycle one, w_1 x_1 is accumulated to y_1 at the left-most cell, and during cycle two, w_1 x_2 and w_2 x_2 are accumulated to y_2 and y_1 at the left-most and middle cells, respectively. Starting from cycle three, the final (and correct) values of y_1, y_2, . . . are output from the right-most cell at the rate of one y_i per cycle. The basic principle of this design was previously proposed for circuits to implement a pattern matching processor16 and for circuits to implement polynomial multiplication.17-20

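A cycle-by-cycle simulation of design B1 (our sketch; the article specifies the design only through Figure 3) makes the data movement concrete:

def design_b1(w, x):
    k = len(w)
    cells = [None] * k            # partial y's; cells[-1] is the right-most
    out = []
    for xt in x:
        finished = cells[-1]      # a y leaving the right-most cell is done
        cells = [0] + cells[:-1]  # y's shift right; a fresh y enters at left
        if finished is not None:
            out.append(finished)
        for j in range(k):        # broadcast x_t; every cell accumulates
            if cells[j] is not None:
                cells[j] += w[j] * xt
    out.append(cells[-1])         # the last result is still in the array
    return out

print(design_b1([1, 2, 3], [1, 0, 0, 1]))  # [1, 3]
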
Design B2: broadcast inputs, move weights, results stay. In design B2 (see Figure 4), each y_i stays at a cell to accumulate its terms, allowing efficient use of available multiplier-accumulator hardware. (Indeed, this design is described in an application booklet for the TRW multiplier-accumulator chips.21) The weights circulate around the array of cells, and the first weight w_1 is associated with a tag bit that signals the accumulator to output and reset its contents.* In design B1 (Figure 3), the systolic path for moving y_i's may be considerably wider than that for moving w_i's in design B2, because for numerical accuracy y_i's typically carry more bits than w_i's. The use of multiplier-accumulators in design B2 may also help increase precision of the results, since extra bits can be kept in these accumulators with modest cost. Design B1, however, does have the advantage of not requiring a separate bus (or other global network), denoted by a dashed line in Figure 4, for collecting outputs from individual cells.

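Design B2 admits a similar simulation (again our sketch): the y's stay put, the x's are broadcast, and the weights visit the cells cyclically, with the tag on w_1 dumping and resetting each accumulator:

def design_b2(w, x):
    k, n = len(w), len(x)
    acc = [None] * k                 # accumulator at each cell (None: idle)
    done = {}
    for t, xt in enumerate(x):
        for c in range(k):
            j = (t - c) % k          # weight w_{j+1} visits cell c this cycle
            if j == 0:               # tag bit on w_1: dump result, reset
                if acc[c] is not None:
                    done[t - k] = acc[c]
                acc[c] = 0           # the cell starts a new result
            if acc[c] is not None:
                acc[c] += w[j] * xt
    for c in range(k):               # completed results still in the array
        i = (n - 1) - ((n - 1 - c) % k)
        if acc[c] is not None and i <= n - k:
            done[i] = acc[c]
    return [done[i] for i in range(n - k + 1)]

print(design_b2([1, 2, 3], [1, 0, 0, 1]))  # [1, 3]
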
Design F: fan-in results, move inputs, weights stay. If we consider the vector of weights (w_k, w_{k-1}, . . . , w_1) as being fixed in space and the input vector (x_n, x_{n-1}, . . . , x_1) as sliding over the weights in the left-to-right direction, then the convolution problem is one that computes the inner product of the weight vector and the section of the input vector it overlaps. This view suggests the systolic array
`
*To avoid complicated pictures, control structures such as the use of tag bits to gate outputs from cells are omitted from the diagrams of this article.
`
`
`
`
`
`
shown in Figure 5. Weights are preloaded to the cells and stay there throughout the computation. During a cycle, all x_i's move one cell to the right, multiplications are performed at all cells simultaneously, and their results are fanned in and summed using an adder to form a new y_i. When the number of cells, k, is large, the adder can be implemented as a pipelined adder tree to avoid large delays in each cycle. Designs of this type using unbounded fan-in have been known for quite a long time, for example, in the context of signal processing23 and in the context of pattern matching.18
`
Figure 5. Design F: systolic convolution array (a) and cell (b) where w_i's stay, x_i's move systolically, and y_i's are formed through the fan-in of results from all the cells.

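Because design F forms each y_i in one shot, as the inner product of the weight vector with the input section it overlaps, its behavior reduces to a sliding dot product (our sketch of that view; the pipelined adder tree is abstracted into sum()):

def design_f(w, x):
    # Each cycle the x's slide one cell right; all k products are formed
    # simultaneously and fanned in through an adder to yield one y.
    k = len(w)
    return [sum(wj * xj for wj, xj in zip(w, x[i:i + k]))
            for i in range(len(x) - k + 1)]

print(design_f([1, 2, 3], [1, 0, 0, 1]))  # [1, 3]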
`
Figure 6. Design R1: systolic convolution array (a) and cell (b) where y_i's stay and x_i's and w_i's move in opposite directions systolically.
`
`
(Pure-) systolic convolution arrays without global data communication. Although global broadcasting or fan-in solves the I/O bottleneck problem, implementing it in a modular, expandable way presents another problem. Providing (or collecting) a data item to (or from) all the cells of a systolic array during each cycle requires the use of a bus or some sort of tree-like network. As the number of cells increases, wires become long for either a bus or tree structure; expanding these non-local communication paths to meet the increasing load is difficult without slowing down the system clock. This engineering difficulty of extending global networks is significant at chip, board, and higher levels of a computer system. Fortunately, as will be demonstrated below, systolic convolution arrays without global data communication do exist. Potentially, these arrays can be extended to include an arbitrarily large number of cells without encountering engineering difficulties (the problem of synchronizing a large systolic array is discussed later).

Design R1: results stay, inputs and weights move in opposite directions. In design R1 (see Figure 6) each partial result y_i stays at a cell to accumulate its terms. The x_i's and w_i's move systolically in opposite directions such that when an x meets a w at a cell, they are multiplied and the resulting product is accumulated to the y staying at that cell. To ensure that each x_i is able to meet every w_i, consecutive x_i's on the x data stream are separated by two cycle times, and so are the w_i's on the w data stream.

Like design B2, design R1 can make efficient use of available multiplier-accumulator hardware; it can also use a tag bit associated with the first weight, w_1, to trigger the output and reset the accumulator contents of a cell.
`
`
Figure 7. Design R2: systolic convolution array (a) and cell (b) where y_i's stay and x_i's and w_i's both move in the same direction but at different speeds.
`
Design R1 has the advantage that it does not require a bus, or any other global network, for collecting output from cells; a systolic output path (indicated by broken arrows in Figure 6) is sufficient. Because consecutive w_i's are well separated by two cycle times, a potential conflict, namely that more than one y_i may reach a single latch on the systolic output path simultaneously, cannot occur. It can also be easily checked that the y_i's will output from the systolic output path in the natural ordering y_1, y_2, . . . The basic idea of this design, including that of the systolic output path, has been used to implement a pattern matching chip.1

Notice that in Figure 6 only about one-half the cells are doing useful work at any time. To fully utilize the potential throughput, two independent convolution computations can be interleaved in the same systolic array, but cells in the array would have to be modified slightly to support the interleaved computation. For example, an additional accumulator would be required at each cell to hold a temporary result for the other convolution computation.

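The two-cycle spacing in design R1 can be verified with a small simulation (our sketch; the w stream is injected m - 1 cycles late so that every x_i meets every w_j inside the array of m cells):

def design_r1(w, x):
    k, n = len(w), len(x)
    m = n - k + 1                        # one cell per result
    spaced = lambda s: [v for item in s for v in (item, None)]
    xs, ws = spaced(x), spaced(w)        # two-cycle spacing between items
    xpipe, wpipe = [None] * m, [None] * m
    y = [0] * m
    for t in range(2 * (n + k) + m):
        xin = xs[t] if t < len(xs) else None
        wt = t - (m - 1)                 # delayed start for the w stream
        win = ws[wt] if 0 <= wt < len(ws) else None
        xpipe = [xin] + xpipe[:-1]       # x's move right
        wpipe = wpipe[1:] + [win]        # w's move left
        for c in range(m):               # an x meeting a w updates that y
            if xpipe[c] is not None and wpipe[c] is not None:
                y[c] += xpipe[c] * wpipe[c]
    return y[::-1]                       # the right-most cell holds y_1

print(design_r1([1, 2, 3], [1, 0, 0, 1]))  # [1, 3]
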
Design R2: results stay, inputs and weights move in the same direction but at different speeds. One version of design R2 is illustrated in Figure 7. In this case both the x and w data streams move from left to right systolically, but the x_i's move twice as fast as the w_i's. More precisely, each w_i stays inside every cell it passes for one extra cycle, thus taking twice as long to move through the array as any x_i. In this design, multiplier-accumulator hardware can be used effectively, and so can the tag bit method to signal the output of the accumulator contents at each cell. Compared to design R1, this design has the advantage that all cells work all the time when performing a single convolution, but it requires an additional register in each cell to hold a w value. This algorithm has been used for implementing a pipeline multiplier.22

There is a dual version of design R2; we can have the w_i's move twice as fast as the x_i's. To create delays for the x data stream, this dual design requires a register in each cell for storing an x rather than a w value. For circumstances where the w_i's carry more bits than the x_i's, the dual design becomes attractive.
`
`
`
`
`
Figure 8. Design W1: systolic convolution array (a) and cell (b) where w_i's stay and x_i's and y_i's move systolically in opposite directions. Cell operation: y_out <- y_in + w * x_in; x <- x_in; x_out <- x.
`
Figure 9. Design W2: systolic convolution array (a) and cell (b) where w_i's stay and x_i's and y_i's both move systolically in the same direction but at different speeds.
`
`
Design W1: weights stay, inputs and results move in opposite directions. In design W1 (and design W2, below), weights stay, one at each cell, but results and inputs move systolically. These designs are not geared to the most effective use of available multiplier-accumulator hardware, but for some other circumstances they are potentially more efficient than the other designs. Because the same set of weights is used for computing all the y_i's and different sets of the x_i's are used for computing different y_i's, it is natural to have the w_i's preloaded to the cells and stay there, and let the x_i's and the y_i's move along the array. We will see some advantages of this arrangement in the systolic array depicted in Figure 8, which is a special case of a proposed systolic filtering array.7 This design is fundamental in the sense that it can be naturally extended to perform recursive filtering1,7 and polynomial division.17

In design W1, the w_i's stay and the x_i's and y_i's move systolically in opposite directions. Similar to design R1, consecutive x_i's and y_i's are separated by two cycle times. Note that because the systolic path for moving the y_i's already exists, there is no need for another systolic output path as in designs R1 and R2. Furthermore, for each i, y_i outputs from the left-most cell during the same cycle as its last input, x_{i+k-1} (or x_{i+2} for k = 3), enters that cell. Thus, this systolic array is capable of outputting a y_i every two cycle times with constant response time. Design W1, however, suffers from the same drawback as design R1, namely, only approximately one-half the cells work at any given time unless two independent convolution computations are interleaved in the same array. The next design, like design R2, overcomes this shortcoming by having both the x_i's and y_i's move in the same direction but at different speeds.

Design W2: weights stay, inputs and results move in the same direction but at different speeds. With design W2 (Figure 9) all the cells work all the time, but it loses one advantage of design W1, the constant response time. The output of y_i now takes place k cycles after the last of its inputs starts entering the left-most cell of the systolic array. This design has been extended to implement 2-D convolutions6,24 where high throughputs rather than fast responses are of concern. Similar to design R2, design W2 has a dual version for which the x_i's move twice as fast as the y_i's.

`
Remarks. The designs presented above by no means exhaust all the possible systolic designs for the convolution problem. For example, it is possible to have systolic designs where results, weights, and inputs all move during each cycle. It could also be advantageous to include inside each cell a "cell memory" capable of storing a set of weights. With this feature, using a systolic control (or address) path, weights can be selected on-the-fly to implement interpolation or adaptive filtering.24 Moreover, the flexibility introduced by the cell memories and systolic control can make the same systolic array implement different functions. Indeed, the ESL systolic processor9,10 utilizes cell memories to implement multiple functions including convolution and matrix multiplication.

Once one systolic design is obtained for a problem, it is likely that a set of other systolic designs can be derived similarly. The challenge is to understand precisely the strengths and drawbacks of each design so that an appropriate design can be selected for a given environment. For example, if there are more weights than cells, it's useful to know that a scheme where partial results stay generally requires less I/O than one where partial results move, since the latter scheme requires partial results to be input and output many times. A single multiplier-accumulator hardware component often represents a cost-effective implementation of the multiplier and adder
`
`
`
[Figure 10. Overlapping the executions of multiply and add in design W1.]
`
needed by each cell of a systolic convolution array. However, for improving throughput, sometimes it may be worthwhile to implement the multiplier and adder separately to allow overlapping of their executions. Figure 10 depicts such a modification to design W1. Similar modifications can be made to other systolic convolution arrays. Another interesting scenario is the following one. Suppose that one or several cells are to be implemented directly with a single chip and the chip pin bandwidth is the implementation bottleneck. Then, since the basic cell of some semi-systolic convolution arrays such as designs B1 and F requires only three I/O ports, while that of a pure-systolic convolution array always requires four, a semi-systolic array may be preferable for saving pins, despite the fact that it requires global communication.
`
Criteria and advantages
`
Having described a family of systolic convolution arrays, we can now be more precise in suggesting and evaluating criteria for the design of systolic structures.
`
(1) The design makes multiple use of each input data item. Because of this property, systolic systems can achieve high throughputs with modest I/O bandwidths for outside communication. To meet this criterion, one can either use global data communications, such as broadcast and unbounded fan-in, or have each input travel through an array of cells so that it is used at each cell. For modular expansibility of the resulting system, the second approach is preferable.

(2) The design uses extensive concurrency. The processing power of a systolic architecture comes from concurrent use of many simple cells rather than sequential use of a few powerful processors as in many conventional architectures. Concurrency can be obtained by pipelining the stages involved in the computation of each single result (for example, design B1), by multiprocessing many results in parallel (designs R1 and R2), or by both. For some designs, such as W1, it is possible to completely overlap I/O and computation times to further increase concurrency and provide constant-time responses.
`
`
To a given problem there could be both one- and two-dimensional systolic array solutions. For example, two-dimensional convolution can be performed by a one-dimensional systolic array24,25 or a two-dimensional systolic array.6 When the memory speed is more than cell speed, two-dimensional systolic arrays such as those depicted in Figure 11 should be used. At each cell cycle, all the I/O ports on the array boundaries can input or output data items to or from the memory; as a result, the available memory bandwidth can be fully utilized. Thus, the choice of a one- or two-dimensional scheme is very dependent on how cells and memories will be implemented.

As in one-dimensional systolic arrays, data in two-dimensional arrays may flow in multiple