
Systolic architectures, which permit multiple computations for each memory access, can speed execution of compute-bound problems without increasing I/O requirements.

Why Systolic Architectures?

H. T. Kung
Carnegie-Mellon University

High-performance, special-purpose computer systems are typically used to meet specific application requirements or to off-load computations that are especially taxing to general-purpose computers. As hardware cost and size continue to drop and processing requirements become well-understood in areas such as signal and image processing, more special-purpose systems are being constructed. However, since most of these systems are built on an ad hoc basis for specific tasks, methodological work in this area is rare. Because the knowledge gained from individual experiences is neither accumulated nor properly organized, the same errors are repeated. I/O and computation imbalance is a notable example: often, the fact that I/O interfaces cannot keep up with device speed is discovered only after constructing a high-speed, special-purpose device.

We intend to help correct this ad hoc approach by providing a general guideline: specifically, the concept of systolic architecture, a general methodology for mapping high-level computations into hardware structures. In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart. The system works like an automobile assembly line where different people work on the same car at different times and many cars are assembled simultaneously. An assembly line is always linear, however, and systolic systems are sometimes two-dimensional. They can be rectangular, triangular, or hexagonal to make use of higher degrees of parallelism. Moreover, to implement a variety of computations, data flow in a systolic system may be at multiple speeds in multiple directions; both inputs and (partial) results flow, whereas only results flow in classical pipelined systems. Generally speaking, a systolic system is easy to implement because of its regularity and easy to reconfigure (to meet various outside constraints) because of its modularity.

The systolic architectural concept was developed at Carnegie-Mellon University,1-7 and versions of systolic processors are being designed and built by several industrial and governmental organizations.8-10 This article reviews the basic principle of systolic architectures and explains why they should result in cost-effective, high-performance special-purpose systems for a wide range of problems.

Key architectural issues in designing special-purpose systems

Roughly, the cycle for developing a special-purpose system can be divided into three phases: task definition, design, and implementation. During task definition, some system performance bottleneck is identified, and a decision on whether or not to resolve it with special-purpose hardware is made. The evaluation required for task definition is most fundamental, but since it is often application-dependent, we will concentrate only on architectural issues related to the design phase and will assume routine implementation.

Simple and regular design. Cost-effectiveness has always been a chief concern in designing special-purpose systems; their cost must be low enough to justify their limited applicability. Costs can be classified as nonrecurring (design) and recurring (parts) costs. Part costs are dropping rapidly due to advances in integrated-circuit technology, but this advantage applies equally to both special-purpose and general-purpose systems. Furthermore, since special-purpose systems are seldom produced in large quantities, part costs are less important than design costs. Hence, the design cost of a special-purpose system must be relatively small for it to be more attractive than a general-purpose approach.

Fortunately, special-purpose design costs can be reduced by the use of appropriate architectures. If a structure can truly be decomposed into a few types of simple substructures or building blocks, which are used repetitively with simple interfaces, great savings can be achieved. This is especially true for VLSI designs, where a single chip comprises hundreds of thousands of components. To cope with that complexity, simple and regular designs, similar to some of the techniques used in constructing large software systems, are essential.11 In addition, special-purpose systems based on simple, regular designs are likely to be modular and therefore adjustable to various performance goals; that is, system cost can be made proportional to the performance required. This suggests that meeting the architectural challenge for simple, regular designs yields cost-effective special-purpose systems.

Concurrency and communication. There are essentially two ways to build a fast computer system. One is to use fast components, and the other is to use concurrency. The last decade has seen an order of magnitude decrease in the cost and size of computer components but only an incremental increase in component speed.12 With current technology, tens of thousands of gates can be put in a single chip, but no gate is much faster than its TTL counterpart of 10 years ago. Since the technological trend clearly indicates a diminishing growth rate for component speed, any major improvement in computation speed must come from the concurrent use of many processing elements. The degree of concurrency in a special-purpose system is largely determined by the underlying algorithm. Massive parallelism can be achieved if the algorithm is designed to introduce high degrees of pipelining and multiprocessing. When a large number of processing elements work simultaneously, coordination and communication become significant, especially with VLSI technology, where routing costs dominate the power, time, and area required to implement a computation.13 The issue here is to design algorithms that support high degrees of concurrency, and in the meantime to employ only simple, regular communication and control to enable efficient implementation.

Balancing computation with I/O. Since a special-purpose system typically receives data and outputs results through an attached host, I/O considerations influence overall performance. (The host in this context can mean a computer, a memory, a real-time device, etc. In practice, the special-purpose system may actually input from one "physical" host and output to another.) The ultimate performance goal of a special-purpose system is, and should be no more than, a computation rate that balances the available I/O bandwidth with the host. Since an accurate a priori estimate of available I/O bandwidth in a complex system is usually impossible, the design of a special-purpose system should be modular so that its structure can be easily adjusted to match a variety of I/O bandwidths.

Suppose that the I/O bandwidth between the host and a special-purpose system is 10 million bytes per second, a rather high bandwidth for present technology. Assuming that at least two bytes are read from or written to the host for each operation, the maximum rate will be only 5 million operations per second, no matter how fast the special-purpose system can operate (see Figure 1). Orders-of-magnitude improvements on this throughput are possible only if multiple computations are performed per I/O access. However, the repetitive use of a data item requires it to be stored inside the system for a sufficient length of time. Thus, the I/O problem is related not only to the available I/O bandwidth, but also to the available memory internal to the system. The question then is how to arrange a computation together with an appropriate memory structure so that computation time is balanced with I/O time.

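The arithmetic behind Figure 1 is worth making explicit. A minimal sketch in Python (the reuse factor of six is our assumption, chosen only to reproduce the figure's 30-MOPS number; the article does not fix a cell count here):

    # Host-device I/O bandwidth and per-operation traffic from the example above.
    io_bandwidth = 10_000_000          # bytes per second between host and device
    bytes_per_op = 2                   # at least two bytes moved per operation

    # Ceiling on throughput when every operation touches the host once:
    ops_per_second = io_bandwidth / bytes_per_op
    print(ops_per_second)              # 5 million operations per second at most

    # With systolic reuse, each fetched item serves several cells, and the
    # ceiling scales accordingly; a reuse factor of 6 yields Figure 1's 30 MOPS.
    reuse_factor = 6                   # assumed value, for illustration only
    print(reuse_factor * ops_per_second)   # 30,000,000
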
The I/O problem becomes especially severe when a large computation is performed on a small special-purpose system. In this case, the computation must be decomposed. Executing subcomputations one at a time may require a substantial amount of I/O to store or retrieve intermediate results. Consider, for example, performing the n-point fast Fourier transform using an S-point device when n is large and S is small. Figure 2 depicts the n-point FFT computation and a decomposition scheme for n = 16 and S = 4. Note that each subcomputation block is sufficiently small so that it can be handled by the 4-point device. During execution, results of a block must be temporarily sent to the host and later retrieved to be combined with results of other blocks as they become available. With the decomposition scheme shown in Figure 2b, the total number of I/O operations is O(n log n/log S). In fact, it has been shown that, to perform the n-point FFT with a device of O(S) memory, at least this many I/O operations are needed for any decomposition scheme.14 Thus, for the n-point FFT problem, an S-point device cannot achieve more than an O(log S) speed-up ratio over the conventional O(n log n) software implementation time, and since it is a consequence of the I/O consideration, this upper bound holds independently of device speed. Similar upper bounds have been established for speed-up ratios achievable by devices for other computations such as sorting and matrix multiplication.14,15 Knowing the I/O-imposed performance limit helps prevent overkill in the design of a special-purpose device.

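The count behind Figure 2b can be sketched as follows; the helper is ours and its constants are only indicative, since the article states just the O(n log n/log S) bound:

    import math

    def fft_host_io_ops(n, s):
        """Rough host I/O count for an n-point FFT on an S-point device.

        The n-point FFT has log2(n) butterfly stages; an S-point device
        absorbs log2(s) stages per pass, and each pass streams all n
        points out to the host and back, giving O(n log n / log S).
        """
        passes = math.ceil(math.log2(n) / math.log2(s))
        return 2 * n * passes          # one read plus one write per point per pass

    print(fft_host_io_ops(16, 4))      # the n = 16, S = 4 case of Figure 2: 64
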
In practice, problems are typically “larger” than special-purpose devices. Therefore, questions such as how a computation can be decomposed to minimize I/O, how the I/O requirement is related to the size of a special-purpose system and its memory, and how the I/O bandwidth limits the speed-up ratio achievable by a special-purpose system present another set of challenges to the system architect.

[Figure 1 artwork. Instead of a single processing element attached to a 100-ns memory (5 million operations per second at most), a systolic array of cells driven by the same memory (30 MOPS possible).]
Figure 1. Basic principle of a systolic system.


Systolic architectures: the basic principle

As a solution to the above challenges, we introduce systolic architectures, an architectural concept originally proposed for VLSI implementation of some matrix operations.5 Examples of systolic architectures follow in the next section, which contains a walk-through of a family of designs for the convolution computation.

A systolic system consists of a set of interconnected cells, each capable of performing some simple operation. Because simple, regular communication and control structures have substantial advantages over complicated ones in design and implementation, cells in a systolic system are typically interconnected to form a systolic array or a systolic tree. Information in a systolic system flows between cells in a pipelined fashion, and communication with the outside world occurs only at the “boundary cells.” For example, in a systolic array, only those cells on the array boundaries may be I/O ports for the system.

Computational tasks can be conceptually classified into two families: compute-bound computations and I/O-bound computations. In a computation, if the total number of operations is larger than the total number of input and output elements, then the computation is compute-bound; otherwise it is I/O-bound. For example, the ordinary matrix-matrix multiplication algorithm represents a compute-bound task, since every entry in a matrix is multiplied by all entries in some row or column of the other matrix. Adding two matrices, on the other hand, is I/O-bound, since the total number of adds is not larger than the total number of entries in the two matrices.

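The definition reduces to a one-line test. A small illustrative sketch (the helper and the n-by-n examples are ours):

    def classify(total_ops, total_io_elements):
        """Compute-bound if operations outnumber input and output elements."""
        return "compute-bound" if total_ops > total_io_elements else "I/O-bound"

    n = 1000
    # Matrix-matrix multiplication: about n**3 multiply-adds over 3*n**2 entries.
    print(classify(n**3, 3 * n**2))    # compute-bound
    # Matrix addition: n**2 adds over the same 3*n**2 entries.
    print(classify(n**2, 3 * n**2))    # I/O-bound
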
It should be clear that any attempt to speed up an I/O-bound computation must rely on an increase in memory bandwidth. Memory bandwidth can be increased by the use of either fast components (which could be expensive) or interleaved memories (which could create complicated memory management problems). Speeding up a compute-bound computation, however, may often be accomplished in a relatively simple and inexpensive manner, that is, by the systolic approach.

The basic principle of a systolic architecture, a systolic array in particular, is illustrated in Figure 1. By replacing a single processing element with an array of PEs, or cells in the terminology of this article, a higher computation throughput can be achieved without increasing memory bandwidth. The function of the memory in the diagram is analogous to that of the heart; it “pulses” data (instead of blood) through the array of cells. The crux of this approach is to ensure that once a data item is brought out from the memory it can be used effectively at each cell it passes while being “pumped” from cell to cell along the array. This is possible for a wide class of compute-bound computations where multiple operations are performed on each data item in a repetitive manner.

Being able to use each input data item a number of times (and thus achieving high computation throughput with only modest memory bandwidth) is just one of the many advantages of the systolic approach. Other advantages, such as modular expansibility, simple and regular data and control flows, use of simple and uniform cells, elimination of global broadcasting and fan-in, and (possibly) fast response time, will be illustrated in various systolic designs in the next section.

A family of systolic designs for the convolution computation

To provide concrete examples of various systolic structures, this section presents a family of systolic designs for the convolution problem, which is defined as follows:

    Given the sequence of weights {w_1, w_2, . . . , w_k}
    and the input sequence {x_1, x_2, . . . , x_n},
    compute the result sequence {y_1, y_2, . . . , y_{n+1-k}}
    defined by

        y_i = w_1 x_i + w_2 x_{i+1} + . . . + w_k x_{i+k-1}

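Before walking through the designs, the definition translates directly into a baseline implementation that the systolic schedules below can be checked against. A plain reference version (ours, 0-indexed):

    def convolve_reference(w, x):
        """y_i = w_1*x_i + w_2*x_{i+1} + ... + w_k*x_{i+k-1} (0-indexed here)."""
        k, n = len(w), len(x)
        return [sum(w[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]

    # Example with k = 3, as in the illustrations that follow:
    print(convolve_reference([1, 2, 3], [1, 1, 1, 1, 1]))   # [6, 6, 6]
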
Figure 2. (a) 16-point fast-Fourier-transform graph; (b) decomposing the FFT computation with n = 16 and S = 4.

Figure 3. Design B1: systolic convolution array (a) and cell (b) where x_i's are broadcast, w_i's stay, and y_i's move systolically.

Figure 4. Design B2: systolic convolution array (a) and cell (b) where x_i's are broadcast, y_i's stay, and w_i's move systolically.

We consider the convolution problem because it is a simple problem with a variety of enlightening systolic solutions, because it is an important problem in its own right, and more importantly, because it is representative of a wide class of computations suited to systolic designs. The convolution problem can be viewed as a problem of combining two data streams, the w_i's and the x_j's, in a certain manner (for example, as in the above equation) to form a resultant data stream of y_i's. This type of computation is common to a number of computation routines, such as filtering, pattern matching, correlation, interpolation, polynomial evaluation (including discrete Fourier transforms), and polynomial multiplication and division. For example, if multiplication and addition are interpreted as comparison and boolean AND, respectively, then the convolution problem becomes the pattern matching problem.1 Architectural concepts for the convolution problem can thus be applied to these other problems as well.

The convolution problem is compute-bound, since each input x_i is to be multiplied by each of the k weights. If the x_i is input separately from memory for each multiplication, then when k is large, memory bandwidth becomes a bottleneck, precluding a high-performance solution. As indicated earlier, a systolic architecture resolves this I/O bottleneck by making multiple use of each x_i fetched from the memory. Based on this principle, several systolic designs for solving the convolution problem are described below. For simplicity, all illustrations assume that k = 3.

(Semi-) systolic convolution arrays with global data communication. If an x_i, once brought out from the memory, is broadcast to a number of cells, then the same x_i can be used by all the cells. This broadcasting technique is probably one of the most obvious ways to make multiple use of each input element. The opposite of broadcasting is fan-in, through which data items from a number of cells can be collected. The fan-in technique can also be used in a straightforward manner to resolve the I/O bottleneck problem. In the following, we describe systolic designs that utilize broadcasting and fan-in.

Design B1: broadcast inputs, move results, weights stay. The systolic array and its cell definition are depicted in Figure 3. Weights are preloaded to the cells, one at each cell, and stay at the cells throughout the computation. Partial results y_i move systolically from cell to cell in the left-to-right direction, that is, each of them moves over the cell to its right during each cycle. At the beginning of a cycle, one x_i is broadcast to all the cells and one y_i, initialized as zero, enters the left-most cell. During cycle one, w_1 x_1 is accumulated to y_1 at the left-most cell, and during cycle two, w_1 x_2 and w_2 x_2 are accumulated to y_2 and y_1 at the left-most and middle cells, respectively. Starting from cycle three, the final (and correct) values of y_1, y_2, . . . are output from the right-most cell at the rate of one y_i per cycle. The basic principle of this design was previously proposed for circuits to implement a pattern matching processor16 and for circuits to implement polynomial multiplication.17-20

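Design B1's cycle-by-cycle behavior can be mimicked in software. A sketch of one consistent schedule (ours; the zero partials that occupy cells before cycle k stand in for the garbage values a real array would simply not output):

    def convolve_b1(w, x):
        """Design B1 sketch: w's stay, x's are broadcast, partial y's move right."""
        k = len(w)
        cells = [0.0] * k                 # the y resident in each cell, left to right
        ys = []
        for t, xt in enumerate(x):
            for j in range(k):            # x_t is broadcast to every cell
                cells[j] += w[j] * xt
            if t >= k - 1:                # right-most cell emits one finished y
                ys.append(cells[-1])
            cells = [0.0] + cells[:-1]    # y's shift one cell to the right
        return ys

    print(convolve_b1([1, 2, 3], [1, 1, 1, 1, 1]))   # [6, 6, 6]
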
Design B2: broadcast inputs, move weights, results stay. In design B2 (see Figure 4), each y_i stays at a cell to accumulate its terms, allowing efficient use of available multiplier-accumulator hardware. (Indeed, this design is described in an application booklet for the TRW multiplier-accumulator chips.21) The weights circulate around the array of cells, and the first weight w_1 is associated with a tag bit that signals the accumulator to output and reset its contents.* In design B1 (Figure 3), the systolic path for moving the y_i's may be considerably wider than that for moving the w_i's in design B2, because for numerical accuracy the y_i's typically carry more bits than the w_i's. The use of multiplier-accumulators in design B2 may also help increase precision of the results, since extra bits can be kept in these accumulators with modest cost. Design B1, however, does have the advantage of not requiring a separate bus (or other global network), denoted by a dashed line in Figure 4, for collecting outputs from individual cells.

*To avoid complicated pictures, control structures such as the use of tag bits to gate outputs from cells are omitted from the diagrams of this article.

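The same broadcast schedule with stationary results can be simulated too. In the sketch below (ours, 0-indexed), note that cell i only ever contributes to results y_i, y_{i+k}, . . . , which is exactly the sequencing the tag bit on w_1 provides in hardware:

    def convolve_b2(w, x):
        """Design B2 sketch: x's are broadcast, y's stay, w's circulate."""
        k, n = len(w), len(x)
        ys = [0.0] * (n - k + 1)
        for t, xt in enumerate(x):
            for i in range(k):
                j = (t - i) % k           # circulating weight at cell i this cycle
                r = t - j                 # the result this product belongs to
                if 0 <= r < len(ys):      # boundary cycles do discarded work
                    ys[r] += w[j] * xt
        return ys

    print(convolve_b2([1, 2, 3], [1, 1, 1, 1, 1]))   # [6, 6, 6]
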
Design F: fan-in results, move inputs, weights stay. If we consider the vector of weights (w_k, w_{k-1}, . . . , w_1) as being fixed in space and the input vector (x_n, x_{n-1}, . . . , x_1) as sliding over the weights in the left-to-right direction, then the convolution problem is one that computes the inner product of the weight vector and the section of the input vector it overlaps. This view suggests the systolic array shown in Figure 5.

Weights are preloaded to the cells and stay there throughout the computation. During a cycle, all x_i's move one cell to the right, multiplications are performed at all cells simultaneously, and their results are fanned in and summed using an adder to form a new y_i. When the number of cells, k, is large, the adder can be implemented as a pipelined adder tree to avoid large delays in each cycle. Designs of this type using unbounded fan-in have been known for quite a long time, for example, in the context of signal processing23 and in the context of pattern matching.16

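In software, design F is a tapped delay line feeding one big adder. A sketch (ours), with the weights laid out to match the sliding-window view above:

    def convolve_f(w, x):
        """Design F sketch: w's stay, x's move right, y's formed by fan-in."""
        k = len(w)
        taps = [0.0] * k                  # the x values resident in the cells
        ys = []
        for t, xt in enumerate(x):
            taps = [xt] + taps[:-1]       # all x's move one cell to the right
            if t >= k - 1:                # window full: the adder fans in
                ys.append(sum(wj * xj for wj, xj in zip(w, reversed(taps))))
        return ys

    print(convolve_f([1, 2, 3], [1, 1, 1, 1, 1]))   # [6, 6, 6]
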
Figure 5. Design F: systolic convolution array (a) and cell (b) where w_i's stay, x_i's move systolically, and y_i's are formed through the fan-in of results from all the cells.

Figure 6. Design R1: systolic convolution array (a) and cell (b) where y_i's stay and x_i's and w_i's move in opposite directions systolically.

(Pure-) systolic convolution arrays without global data communication. Although global broadcasting or fan-in solves the I/O bottleneck problem, implementing it in a modular, expandable way presents another problem. Providing (or collecting) a data item to (or from) all the cells of a systolic array, during each cycle, requires the use of a bus or some sort of tree-like network. As the number of cells increases, wires become long for either a bus or tree structure; expanding these non-local communication paths to meet the increasing load is difficult without slowing down the system clock. This engineering difficulty of extending global networks is significant at chip, board, and higher levels of a computer system. Fortunately, as will be demonstrated below, systolic convolution arrays without global data communication do exist. Potentially, these arrays can be extended to include an arbitrarily large number of cells without encountering engineering difficulties (the problem of synchronizing a large systolic array is discussed later).

Design R1: results stay, inputs and weights move in opposite directions. In design R1 (see Figure 6) each partial result y_i stays at a cell to accumulate its terms. The x_i's and w_i's move systolically in opposite directions such that when an x meets a w at a cell, they are multiplied and the resulting product is accumulated to the y staying at that cell. To ensure that each x_i is able to meet every w_i, consecutive x_i's on the x data stream are separated by two cycle times, and so are the w_i's on the w data stream.

Like design B2, design R1 can make efficient use of available multiplier-accumulator hardware; it can also use a tag bit associated with the first weight, w_1, to trigger the output and reset the accumulator contents of a cell.

Figure 7. Design R2: systolic convolution array (a) and cell (b) where y_i's stay and x_i's and w_i's both move in the same direction but at different speeds.

Design R1 has the advantage that it does not require a bus, or any other global network, for collecting output from cells; a systolic output path (indicated by broken arrows in Figure 6) is sufficient. Because consecutive w_i's are well separated by two cycle times, a potential conflict, namely that more than one y_i might reach a single latch on the systolic output path simultaneously, cannot occur. It can also be easily checked that the y_i's will output from the systolic output path in the natural ordering y_1, y_2, . . . The basic idea of this design, including that of the systolic output path, has been used to implement a pattern matching chip.1

Notice that in Figure 6 only about one-half the cells are doing useful work at any time. To fully utilize the potential throughput, two independent convolution computations can be interleaved in the same systolic array, but cells in the array would have to be modified slightly to support the interleaved computation. For example, an additional accumulator would be required at each cell to hold a temporary result for the other convolution computation.

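The two-cycle spacing makes R1's schedule easy to state in closed form. One consistent choice (ours, 0-indexed): x_i occupies cell c at cycle t = 2i + c, and the recirculating w_j occupies cell c whenever t ≡ 2j + 2k - 2 - c (mod 2k); whenever an x and a w share a cell, their product joins the y stationed there. A sketch:

    def convolve_r1(w, x):
        """Design R1 sketch: y's stay; x's and recirculating w's counterflow,
        consecutive items two cycles apart (one assumed schedule, 0-indexed)."""
        k, n = len(w), len(x)
        ys = [0.0] * (n - k + 1)
        for t in range(2 * n + 2 * k):            # enough cycles to drain the array
            for c in range(k):
                if (t - c) % 2:                   # no x at this cell this cycle
                    continue
                i = (t - c) // 2                  # the x here is x_i
                j = ((t + c) // 2 - (k - 1)) % k  # the w here is w_j
                r = i - j                         # their product feeds y_r
                if 0 <= i < n and 0 <= r < len(ys):
                    ys[r] += w[j] * x[i]          # out-of-range r is boundary junk
        return ys

    print(convolve_r1([1, 2, 3], [1, 1, 1, 1, 1]))   # [6, 6, 6]
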
Design R2: results stay, inputs and weights move in the same direction but at different speeds. One version of design R2 is illustrated in Figure 7. In this case both the x and w data streams move from left to right systolically, but the x_i's move twice as fast as the w_i's. More precisely, each w_i stays inside every cell it passes for one extra cycle, thus taking twice as long to move through the array as any x_i. In this design, multiplier-accumulator hardware can be used effectively and so can the tag bit method to signal the output of the accumulator contents at each cell. Compared to design R1, this design has the advantage that all cells work all the time when performing a single convolution, but it requires an additional register in each cell to hold a w value. This algorithm has been used for implementing a pipeline multiplier.22

There is a dual version of design R2; we can have the w_i's move twice as fast as the x_i's. To create delays for the x data stream, this dual design requires a register in each cell for storing an x rather than a w value. For circumstances where the w_i's carry more bits than the x_i's, the dual design becomes attractive.

[Figure 8 cell definition: x ← x_in; x_out ← x; y_out ← y_in + w · x_in]

Figure 8. Design W1: systolic convolution array (a) and cell (b) where w_i's stay and x_i's and y_i's move systolically in opposite directions.

Figure 9. Design W2: systolic convolution array (a) and cell (b) where w_i's stay and x_i's and y_i's both move systolically in the same direction but at different speeds.

Design W1: weights stay, inputs and results move in opposite directions. In design W1 (and design W2, below), weights stay, one at each cell, but results and inputs move systolically. These designs are not geared to the most effective use of available multiplier-accumulator hardware, but for some other circumstances they are potentially more efficient than the other designs. Because the same set of weights is used for computing all the y_i's and different sets of the x_i's are used for computing different y_i's, it is natural to have the w_i's preloaded to the cells and stay there, and let the x_i's and the y_i's move along the array. We will see some advantages of this arrangement in the systolic array depicted in Figure 8, which is a special case of a proposed systolic filtering array.1 This design is fundamental in the sense that it can be naturally extended to perform recursive filtering1,7 and polynomial division.21

In design W1, the w_i's stay and the x_i's and y_i's move systolically in opposite directions. Similar to design R1, consecutive x_i's and y_i's are separated by two cycle times. Note that because the systolic path for moving the x_i's already exists, there is no need for another systolic output path as in designs R1 and R2. Furthermore, for each i, y_i outputs from the left-most cell during the same cycle as its last input, x_{i+k-1} (or x_{i+2} for k = 3), enters that cell. Thus, this systolic array is capable of outputting a y_i every two cycle times with constant response time. Design W1, however, suffers from the same drawback as design R1, namely, only approximately one-half the cells work at any given time unless two independent convolution computations are interleaved in the same array. The next design, like design R2, overcomes this shortcoming by having both the x_i's and y_i's move in the same direction but at different speeds.

Design W2: weights stay, inputs and results move in the same direction but at different speeds. With design W2 (Figure 9) all the cells work all the time, but it loses one advantage of design W1, the constant response time. The output of y_i now takes place k cycles after the last of its inputs starts entering the left-most cell of the systolic array. This design has been extended to implement 2-D convolutions,6,24 where high throughputs rather than fast responses are of concern. Similar to design R2, design W2 has a dual version for which the x_i's move twice as fast as the y_i's.

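Design W1's counterflow of inputs and results can be checked the same way. One consistent schedule (ours, 0-indexed) puts x_i at cell c on cycle t = 2i + c, the left-moving y_m at cell c on cycle t = 2m + 2k - 2 - c, and stores w_{k-1-c} at cell c; it reproduces the property that y_i leaves the left-most cell on the very cycle its last input enters it. A sketch:

    def convolve_w1(w, x):
        """Design W1 sketch: w's stay, x's move right, y's move left,
        consecutive x's and y's two cycles apart (assumed schedule, 0-indexed)."""
        k, n = len(w), len(x)
        ys = [0.0] * (n - k + 1)
        for t in range(2 * n + 2 * k):
            for c in range(k):
                if (t - c) % 2:                    # empty slot this cycle
                    continue
                i = (t - c) // 2                   # the x at cell c
                m = (t + c) // 2 - (k - 1)         # the y at cell c, if any
                if 0 <= i < n and 0 <= m < len(ys):
                    ys[m] += w[k - 1 - c] * x[i]   # cell c stores w_{k-1-c}
        return ys

    print(convolve_w1([1, 2, 3], [1, 1, 1, 1, 1]))   # [6, 6, 6]
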
Remarks. The designs presented above by no means exhaust all the possible systolic designs for the convolution problem. For example, it is possible to have systolic designs where results, weights, and inputs all move during each cycle. It could also be advantageous to include inside each cell a “cell memory” capable of storing a set of weights. With this feature, using a systolic control (or address) path, weights can be selected on-the-fly to implement interpolation or adaptive filtering.24 Moreover, the flexibility introduced by the cell memories and systolic control can make the same systolic array implement different functions. Indeed, the ESL systolic processor9,10 utilizes cell memories to implement multiple functions including convolution and matrix multiplication.

Once one systolic design is obtained for a problem, it is likely that a set of other systolic designs can be derived similarly. The challenge is to understand precisely the strengths and drawbacks of each design so that an appropriate design can be selected for a given environment. For example, if there are more weights than cells, it's useful to know that a scheme where partial results stay generally requires less I/O than one where partial results move, since the latter scheme requires partial results to be input and output many times. A single multiplier-accumulator hardware component often represents a cost-effective implementation of the multiplier and adder
needed by each cell of a systolic convolution array. However, for improving throughput, sometimes it may be worthwhile to implement the multiplier and adder separately to allow overlapping of their executions. Figure 10 depicts such a modification to design W1. Similar modifications can be made to other systolic convolution arrays. Another interesting scenario is the following one. Suppose that one or several cells are to be implemented directly with a single chip and the chip pin bandwidth is the implementation bottleneck. Then, since the basic cell of some semi-systolic convolution arrays such as designs B1 and F requires only three I/O ports, while that of a pure-systolic convolution array always requires four, a semi-systolic array may be preferable for saving pins, despite the fact that it requires global communication.

[Figure 10 artwork: a W1 cell datapath with separate multiplier and adder stages.]
Figure 10. Overlapping the executions of multiply and add in design W1.

Criteria and advantages

Having described a family of systolic convolution arrays, we can now be more precise in suggesting and evaluating criteria for the design of systolic structures.

(1) The design makes multiple use of each input data item. Because of this property, systolic systems can achieve high throughputs with modest I/O bandwidths for outside communication. To meet this criterion, one can either use global data communications, such as broadcast and unbounded fan-in, or have each input travel through an array of cells so that it is used at each cell. For modular expansibility of the resulting system, the second approach is preferable.

(2) The design uses extensive concurrency. The processing power of a systolic architecture comes from concurrent use of many simple cells rather than sequential use of a few powerful processors as in many conventional architectures. Concurrency can be obtained by pipelining the stages involved in the computation of each single result (for example, design B1), by multiprocessing many results in parallel (designs R1 and R2), or by both. For some designs, such as W1, it is possible to completely overlap I/O and computation times to further increase concurrency and provide constant-time responses.

To a given problem there could be both one- and two-dimensional systolic array solutions. For example, two-dimensional convolution can be performed by a one-dimensional systolic array24,25 or a two-dimensional systolic array.6 When the memory speed is higher than the cell speed, two-dimensional systolic arrays such as those depicted in Figure 11 should be used. At each cell cycle, all the I/O ports on the array boundaries can input or output data items to or from the memory; as a result, the available memory bandwidth can be fully utilized. Thus, the choice of a one- or two-dimensional scheme is very dependent on how cells and memories will be implemented. As in one-dimensional systolic arrays, data in two-dimensional arrays may flow in multiple directions and at multiple speeds.
