IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 11, NOVEMBER 1982

Wavefront Array Processor: Language, Architecture, and Applications

SUN-YUAN KUNG, MEMBER, IEEE, RON J. GAL-EZER, STUDENT MEMBER, IEEE, K. S. ARUN, STUDENT MEMBER, IEEE, AND D. V. BHASKAR RAO, STUDENT MEMBER, IEEE
Abstract—This paper describes the development of a wavefront-based language and architecture for a programmable special-purpose multiprocessor array. Based on the notion of computational wavefront, the hardware of the processor array is designed to provide a computing medium that preserves the key properties of the wavefront. In conjunction, a wavefront language (MDFL) is introduced that drastically reduces the complexity of the description of parallel algorithms and simulates the wavefront propagation across the computing network. Together, the hardware and the language lead to a programmable wavefront array processor (WAP). The WAP blends the advantages of the dedicated systolic array and the general-purpose data-flow machine, and provides a powerful tool for the high-speed execution of a large class of matrix operations and related algorithms which have widespread applications.
Index Terms—Asynchrony, computational wavefront, concurrency, data-flow computing, matrix data-flow language, signal processing, systolic array, VLSI array processor, wavefront architecture.
I. INTRODUCTION

A. VLSI Signal Processing
WITH the rapidly growing microelectronics technology leading the way, modern signal processing is undergoing a major revolution. The past two decades have witnessed a steep increase in the complexity of computations, processing speed requirements, and the volume of data handled in various signal processing applications. The availability of low cost, high density, fast VLSI devices has opened a new avenue for implementing these increasingly sophisticated algorithms and systems [1], [2]. While a major improvement in device speed is foreseen, it is in no way comparable to the rate of increase of throughput required by modern real-time signal processing. In order to achieve such increases in throughput rate, the only effective solution appears to be highly concurrent processing. Consequently, it has become a major challenge to update the current signal processing techniques so as to maximally exploit their potential for concurrent execution.

In a broad sense, the answer to this challenge lies in cross-disciplinary research encompassing the areas of algorithm analysis, parallel computer design, and system applications.
Manuscript received January 11, 1982; revised June 18, 1982. This work was supported in part by the Office of Naval Research under Contracts N00014-80-C-0457 and N00014-81-K-0191, by the National Science Foundation under Grant ECS-8046531, and by the Defense Advanced Research Projects Agency under Contract MDA903-79-C-0630.
The authors are with the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90089.
In this paper, we therefore introduce an integrated research approach aimed at incorporating the vast VLSI computational capability into modern signal processing applications.

The traditional design of parallel computers and languages is deemed unsuitable for our purposes. It usually suffers from heavy supervisory overhead incurred by synchronization, communication, and scheduling tasks, which severely hampers the throughput rate that is critical to real-time signal processing. In fact, these are the key barriers inherent in very large scale computing structure design. Moreover, although VLSI provides the capability of implementing a large array of processors on one chip, it imposes its own constraints on the system. Large design and layout costs [2] suggest the utilization of a repetitive modular structure. In addition, communication, which costs the most in VLSI chips in terms of area, time, and energy, has to be restricted (to localized communication) [1]. In general, highly concurrent systems require this locality property in order to reduce interdependence and the ensuing waiting delays that result from excessive communication [1]. This locality constraint prevents the utilization of centralized control and global synchronization. The resulting use of asynchronous distributed control and localized data flow is an effective approach to the design of very large scale, highly concurrent computing structures.
B. A Special-Purpose VLSI Array Processor

The above restrictions imposed by VLSI will render the general-purpose array processor rather inefficient. We therefore restrict ourselves to a special class of applications, i.e., recursive¹ and local data-dependent algorithms, to conform with the constraints imposed by VLSI. This restriction, however, incurs little loss of generality, as a great majority of signal processing algorithms possess these properties. One typical example is a class of matrix algorithms. It has recently been indicated that a major portion of the computational needs for signal processing and applied mathematical problems can, in fact, be reduced to a basic set of matrix operations and other related algorithms [3], [4]. Therefore, a special-purpose parallel machine for processing these typical computational algorithms will be cost effective and attractive in VLSI system design.

¹ In a recursive algorithm, all processors do nearly identical tasks, and each processor repeats a fixed set of tasks on sequentially available data.
Very significantly, these algorithms involve repeated application of relatively simple operations with regular, localized data flow in a homogeneous computing network. This leads to an important notion of computational wavefront, which portrays the computation activities in a manner resembling a wave propagation phenomenon. More precisely, the recursive nature of the algorithm, in conjunction with the localized data dependency, points to a continuously advancing wave of data and computational activity. The computational sequence starts with one element and propagates through the processor array, closely resembling a physical wave phenomenon (cf. Fig. 1). In fact, all algorithms that have locality and recursivity (and are thus implementable on our VLSI array processor) will exhibit this wave phenomenon. Therefore, the notion of a computational wavefront has attracted the attention of many researchers [2], [5]-[13].

The wavefront concept provides a firm theoretical foundation for the design of highly parallel array processors and concurrent languages. In addition, this concept appears to have some distinct advantages.

First, the wavefront notion drastically reduces the complexity in the description of parallel algorithms. The mechanism provided for this description is a special-purpose, wavefront-oriented language [7], [9], [10]. Rather than requiring a program for each processor in the array, this language allows the programmer to address an entire front of processors.

Second, the wavefront notion leads to a wavefront-based architecture which preserves Huygens' principle [14] and ensures that wavefronts never intersect. Therefore, a wavefront architecture can provide asynchronous waiting capability and, consequently, can cope with timing uncertainties such as local clocking, random delay in communications, and fluctuations of computing times. In short, the notion lends itself to an (asynchronous) data-flow computing structure that conforms well with the constraints of VLSI.

The integration of the wavefront concept, the wavefront language, and the wavefront architecture leads to a programmable computing network, which we will call the wavefront array processor (WAP). The WAP is, in a sense, an optimal tradeoff between the globally synchronized and dedicated systolic array [1], [15], [16] (that works on a similar set of algorithms) and the general-purpose data-flow multiprocessors [17]-[23]. It provides a powerful tool for the high-speed execution of a large class of algorithms which have widespread applications.

C. Organization

The organization of the rest of the paper is as follows. Section II elaborates on the computational wavefront. Section III proposes a wavefront-oriented language (MDFL) for programming the WAP, and provides a programming methodology. Section IV explains a possible hardware organization and architecture for the WAP. Section V illustrates some applications of the WAP by means of MDFL programs. Section VI compares the WAP to other array processors, such as the systolic array, data-flow multiprocessors, and conventional SIMD arrays like the Illiac IV.
II. CONCEPT OF COMPUTATIONAL WAVEFRONT

The wavefront array processor is configured in a square array of N x N processing elements with regular and local interconnections (cf. Fig. 1).

The computing network serves as a (data) wave-propagating medium. The notion of computational wavefront can be best explained by a simple example. To this end, we shall consider matrix multiplication as being representative. Let A = {a_ij}, B = {b_ij}, and C = A x B = {c_ij} all be N x N matrices. The matrix A can be decomposed into columns A_i and matrix B into rows B_j, and therefore

    C = A_1 * B_1 + A_2 * B_2 + ... + A_N * B_N.    (1)

The matrix multiplication can then be carried out in N recursions, executing

    C^(k) = C^(k-1) + A_k * B_k    (2)

recursively for k = 1, 2, ..., N.
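For reference, the recursion (2) can be stated in a few lines of ordinary sequential code. The sketch below is purely illustrative Python (not MDFL, and not the parallel WAP implementation); the function name and loop structure are ours. It accumulates C as a sum of N outer products A_k * B_k, one outer product per recursion.

    # Illustrative sketch (not MDFL): C = A x B built from N outer-product
    # recursions, C^(k) = C^(k-1) + A_k * B_k, as in (1) and (2).
    def outer_product_recursions(A, B):
        N = len(A)
        C = [[0.0] * N for _ in range(N)]
        for k in range(N):                  # kth recursion (kth wavefront)
            for i in range(N):
                for j in range(N):
                    # column A_k and row B_k contribute a_ik * b_kj to c_ij
                    C[i][j] += A[i][k] * B[k][j]
        return C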
The exploitation of parallelism is now evident with the availability of N x N processors. A parallel algorithm for matrix multiplication is fairly simple. However, most existing parallel programs would need global interconnections in the computing network, while localized interconnections and data flow are much more desirable in VLSI systems.

The topology of the matrix multiplication algorithm can be mapped naturally onto the square, orthogonal N x N array of the WAP (cf. Fig. 1). To create a smooth data movement in a localized communication network, we make use of the computational wavefront concept. For the purpose of this example, a wavefront in the processing array will correspond to a mathematical recursion in the algorithm. Successive pipelining of the wavefronts will accomplish the computation of all recursions.

As an example, the computational wavefront for the first recursion in matrix multiplication will now be examined.
Fig. 1. The WAP configuration.
Suppose that the registers of all the processing elements (PE's) are initially set to zero:

    c_ij^(0) = 0    for all (i, j);

the entries of A are stored in the memory modules to the left (in columns), and those of B in the memory modules on the top (in rows). The process starts with PE (1, 1), where

    c_11^(1) = c_11^(0) + a_11 * b_11    (3)

is computed. The computational activity then propagates to the neighboring PE's (1, 2) and (2, 1), which will execute

    c_12^(1) = c_12^(0) + a_11 * b_12    (4)

and

    c_21^(1) = c_21^(0) + a_21 * b_11.    (5)
The next front of activity will be at PE's (3, 1), (2, 2), and (1, 3), thus creating a computational wavefront traveling down the processor array. This computational wavefront is similar to optical wavefronts (they both obey Huygens' principle), since each processor acts as a secondary source and is responsible for the propagation of the wavefront. It may be noted that wave propagation implies localized data flow. Once the wavefront sweeps through all the cells, the first recursion is over. As the first wave propagates, we can execute an identical second recursion in parallel by pipelining a second wavefront immediately after the first one. For example, the (1, 1) processor will execute

    c_11^(2) = c_11^(1) + a_12 * b_21    (6)

and so on. In general, the (i, j)th processor will execute the kth recursion,

    c_ij^(k) = a_i1 * b_1j + a_i2 * b_2j + ... + a_ik * b_kj.    (7)

The pipelining is feasible because the wavefronts of two successive recursions will never intersect (Huygens' wavefront principle), as the processors executing the recursions at any given instant will be different, thus avoiding any contention problems.
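The timing of this pipelining can be made concrete with a small scheduling sketch. Assuming, for simplicity, that every PE spends one time unit per recursion step, wavefront k reaches PE (i, j) at step t = k + (i - 1) + (j - 1). The Python fragment below (ours, for illustration only) enumerates the active processors at each step; since k is uniquely determined by t and (i, j), no processor is ever claimed by two recursions at once, which is the non-intersection property just described.

    # Sketch: active PEs per time step for N pipelined wavefronts on an N x N array.
    # Wavefront k reaches PE (i, j) (1-indexed) at step t = k + (i - 1) + (j - 1).
    def wavefront_schedule(N):
        for t in range(1, 3 * N - 1):           # step 3N - 2 completes wavefront N at PE (N, N)
            active = {}                          # maps PE (i, j) -> recursion index k
            for k in range(1, N + 1):
                for i in range(1, N + 1):
                    j = t - k - i + 2
                    if 1 <= j <= N:
                        active[(i, j)] = k       # at most one k per PE at any step
            yield t, active

    for t, active in wavefront_schedule(3):      # small 3 x 3 example
        print(t, sorted(active.items()))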
Locality, regularity, recursivity, and concurrency lead to the wavefront phenomenon. Thus, all algorithms which possess these properties will exhibit computational wavefronts. The notion of computational wavefronts leads to a wavefront-based language (a modified data-flow language) to program the processor array. This language (called the matrix data-flow language) is especially powerful for matrix and other related algorithms. It will be developed in the next section.
III. A WAVEFRONT LANGUAGE: MATRIX DATA-FLOW LANGUAGE

In contrast to the heavy burden of scheduling, resource sharing, and the control of processor interactions encountered in programming a general-purpose multiprocessor, the computational wavefront notion can facilitate the description of parallel algorithms and drastically reduce the complexity of parallel programming. In this section, we introduce a wavefront language that is tailored towards the description of computational wavefronts and the corresponding data flow in a large class of algorithms (which exhibit the recursivity and locality mentioned earlier).

Since matrix algorithms are typical of this class, we call the language the matrix data-flow language (MDFL). This wavefront language permits the array processor to be programmable, broadens the range of applications, and also makes possible the simulation and verification of parallel algorithms.

A. Matrix Data-Flow Language
There exist two approaches to programming the WAP: a local approach, describing the actions of each processing element, and a global approach, describing the actions of each wavefront. To allow the user to program the WAP in both of these fashions, two versions of MDFL are proposed: global and local MDFL.

A global MDFL program describes the algorithm from the viewpoint of a wavefront, while a local MDFL program describes the operations of an individual processor. More precisely, the perspective of a global MDFL programmer is of one wavefront passing across all the processors, while the perspective of a local MDFL programmer is that of one processor encountering a series of wavefronts.

From the macroscopic point of view, a higher level language, closer to the algorithm, is desired for reducing the heavy burden on the programmer. Global MDFL provides such a tool, as it is easier to view the algorithm as a series of wavefronts. At the microscopic level, each PE executes its own set of instructions and performs localized interprocessor transactions. Implementing a program on such a system requires transforming the high level description (in global MDFL) into a set of lower level programs (in local MDFL) for the individual processors.² We have developed a compiler to do such a mapping of global wavefront-oriented MDFL programs into local programs for individual processors.
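The position-dependent part of this mapping is small. As a hypothetical illustration (the function below is ours, not the actual compiler), each PE can be classified by its coordinates into the four kinds named in footnote 2, and the compiler then retains, for each kind, only the matching branches of the case construct that selects among these kinds (cf. Table I).

    # Sketch: classify PE (i, j) of the array into the four local-program kinds.
    def pe_kind(i, j):
        if i == 1 and j == 1:
            return "(1,1)"    # Corner PE
        if i == 1:
            return "(1,*)"    # FirstRow PE
        if j == 1:
            return "(*,1)"    # FirstColumn PE
        return "INT"          # Interior PE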
B. MDFL Instruction Set
At each front of activity, the computational wavefront performs similar tasks in all the processors involved (cf. the matrix multiplication example in Section II and the programming methodology in Section III-C). Hence, global (wavefront) instructions and local (processor) instructions are nearly identical. Most of the MDFL instruction set is, therefore, common to both global and local MDFL.

Table I is a complete list of the MDFL instruction repertoire. For the complete semantics and more detailed syntax, the reader is referred to a recent publication [10]. Unless otherwise specified, the following instructions belong to both global and local MDFL. The proposed instruction set is functionally complete, but improvements and modifications are still underway to incorporate additional features such as double precision, etc. There also appears to be room for the development of a higher level language suitable for the nonspecialist programmer.

² It suffices, in general, to describe a wavefront algorithm at four locations of the array: Corner, FirstRow, FirstColumn, and Interior PE's. Consequently, every global MDFL program gets compiled into four slightly different local programs.
TABLE I
MDFL INSTRUCTION SET

Data Transfer Instructions

    FLOW  (SOURCE REGISTER), (DIRECTION);
    FETCH (DESTINATION REGISTER), (DIRECTION);
    READ;

Recursion Oriented Instructions

    REPEAT ... UNTIL TERMINATED;
    WHILE WAVEFRONT IN ARRAY DO BEGIN ... END;
    SET COUNT (NUMBER OF WAVEFRONTS);
    DECREMENT COUNT;
    ENDPROGRAM.

Conditional Instructions

    IF EQUAL THEN (STATEMENT);
    IF NOT-EQUAL THEN (STATEMENT);
    IF GREATER THEN (STATEMENT);
    IF LESS-THAN THEN (STATEMENT);
    IF (DIRECTION) DISABLED THEN (STATEMENT);
    CASE KIND =
        (1,1) : (STATEMENT);
        (1,*) : (STATEMENT);
        (*,1) : (STATEMENT);
        INT   : (STATEMENT);
    ENDCASE;

Internal Processor Instructions

    TSR   (SOURCE), (DESTINATION);
    ADD   (SOURCE #1), (SOURCE #2), (DESTINATION);
    SUB   (SOURCE #1), (SOURCE #2), (DESTINATION);
    MULT  (SOURCE #1), (SOURCE #2), (DESTINATION);
    DIV   (SOURCE #1), (SOURCE #2), (DESTINATION);
    SQRT  (SOURCE), (DESTINATION);
    CMP   (SOURCE #1), (SOURCE #2);
    TST   (SOURCE);
    STORE;
    NOP;
    RESET;
    BEGIN ... END;
    DISABLE-SELF;
C. Programming Methodology

In this subsection, we provide guidelines for programming in global MDFL. The most straightforward method of programming the WAP would be to explicitly spell out the actions of each wavefront at each of its (2n - 1) positions (fronts) (cf. Fig. 1). Nevertheless, the regularity and recursivity in almost all matrix algorithms allow us to assume the following.

1) Space Invariance: The tasks performed by a wavefront in a particular kind of processor must be identical at all (2n - 1) fronts.

2) Time Invariance: Recursions are identical.

Accordingly, global MDFL provides two repetitive constructs, the space repetitive construct

    WHILE WAVEFRONT IN ARRAY DO
      BEGIN (TASK T) END

(so that T is repeated at all fronts), and the time repetitive construct

    REPEAT (ONE RECURSION) UNTIL TERMINATED

(so that the same recursion is repeated).
The REPEAT construct is inherently concurrent in that successive wavefronts are pipelined through the array. As soon as the kth wavefront is propagated, the (1, 1) processor initiates the (k + 1)st wavefront.

To allow for more than one wavefront per recursion, the complete global MDFL program will have the syntax

    BEGIN
      SET COUNT (...);
      REPEAT
        (TASKS A);
        WHILE WAVEFRONT IN ARRAY DO
          BEGIN
            (TASKS B);
          END;
        WHILE WAVEFRONT IN ARRAY DO
          BEGIN
            (TASKS C);
          END;
        (TASKS D);
        DECREMENT COUNT;
      UNTIL TERMINATED;
    ENDPROGRAM.
Each recursion will execute the instructions within the REPEAT ... UNTIL construct. The number of recursions is set by SET COUNT. In this example, a recursion consists of two wavefronts. At the start, tasks A are performed only at the (1, 1) processor. The first wavefront of each recursion will perform tasks B at each of its (2n - 1) fronts. The second wave will execute tasks C in each of these fronts immediately after tasks B have been concluded. It should be noted that the number of different wavefronts within a recursion may vary from one application to another. At the end of each recursion, COUNT is decremented. When it becomes zero, TERMINATED is set and a "phase" of identical recursions is over.
The corresponding local MDFL program for interior processors (cf. Section III-A) will be

    REPEAT
      (TASKS B')
      (TASKS C')
    UNTIL TERMINATED;

where B' and C' are the compiled versions of B and C, with only the relevant portions of the CASE statement extracted. The conversion of a global program into its local versions is thus fairly straightforward.
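For readers who prefer conventional control flow, the skeleton above can be paraphrased sequentially as follows. This is an informal sketch (Python, not MDFL, with placeholder task names); unlike the real array, it executes the fronts of each wavefront one after another instead of pipelining them.

    # Sketch: sequential paraphrase of the global MDFL program skeleton.
    def run_phase(count, tasks_a, tasks_b, tasks_c, tasks_d, fronts):
        terminated = False
        while not terminated:              # REPEAT ... UNTIL TERMINATED
            tasks_a()                      # performed only at the (1, 1) processor
            for front in fronts:           # first wavefront: TASKS B at all 2n - 1 fronts
                tasks_b(front)
            for front in fronts:           # second wavefront: TASKS C at all fronts
                tasks_c(front)
            tasks_d()
            count -= 1                     # DECREMENT COUNT
            terminated = (count == 0)      # TERMINATED is set when COUNT reaches zero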
Certain syntax rules [10] are needed to ensure that there is no circular waiting for data between adjacent PE's. They also ensure that neighbors will not contend for the interprocessor bus at the same time. Concisely, the rules dictate that FETCHes in one direction precede (and equal in number) the FLOWs in the opposite direction, and that the data FETCHing precede any computation (cf. data-driven computation in Section IV-A). The complete proof of the claim that these rules will prevent deadlock and bus contention is omitted here [24].

Based on the above guidelines, a complete global MDFL program for matrix multiplication follows. More programming examples will be provided in Section V.
Program 1. Matrix multiplication.

    Array Size:    N x N
    Computation:   C = A x B
                   kth wavefront:  c_ij^(k) = c_ij^(k-1) + a_ik * b_kj,   k = 1, 2, ..., N
    Initial:       Matrix A is stored in the Memory Modules (MM) on the left (stored row by row).
                   Matrix B is in the MM on the top and is stored column by column.
    Final:         The result will be in the C registers.

    BEGIN
      SET COUNT N;
      REPEAT
        WHILE WAVEFRONT IN ARRAY DO
          BEGIN
            FETCH B, UP;
            FETCH A, LEFT;
            FLOW A, RIGHT;
            FLOW B, DOWN;
            % C <- C + A*B;
            MULT A, B, D;
            ADD C, D, C;
          END;
        DECREMENT COUNT;
      UNTIL TERMINATED;
    ENDPROGRAM.

    * Comments are enclosed between '%' and ';'.
D. Summary of MDFL Features

Future VLSI multiprocessors must support massive concurrency to achieve a significant increase in performance; consequently, a base language for parallel computers must allow expression of concurrency of program execution on a large scale [20]. However, few languages support the idea of large scale concurrency, and only weak notions of locality exist in most of them [25].

It must be noted that a parallel program is not just an ensemble of separate programs for individual processors. More importantly, it should also define coordination between PE's, including their interdependence for data and the sequencing of their tasks. As such, it is a tall order.

At present, data-flow languages appear to be the best candidate as the base language of parallel computers. Data-flow languages are asynchronous, data driven, and algorithmic. They avoid centralized control and shared memory to achieve asynchrony and maximal concurrency [20], [26]-[29]. MDFL basically possesses all these properties of data-flow languages. It shares the principle of data-activated computation and its consequent advantages. In addition, MDFL has a rather distinctive feature of regularity, and is built around the notion of locality. It is very close to the algorithms; hence, MDFL programs are modular and easy to understand. MDFL permits the programmer to address a front of processors at the same time, instead of programming individual processors separately. Being wavefront oriented, it permits viewing the algorithm as the repetition of a computational sequence (i.e., the wavefront) progressing through the array. In short, the wavefront notion makes it possible to program an array of asynchronous processors in a simple fashion, and leads to simple, readable MDFL programs.
IV. WAVEFRONT ARCHITECTURE

The hardware of the processing array is designed to support MDFL. The main architectural considerations include four general aspects: 1) interprocessor communications, based on wavefront requirements; 2) the basic PE configuration; 3) interfacing with the host computer; and, finally, 4) the extendability and reconfigurability issues.
A. Interprocessor Communication

The configuration of the processor array is as shown in the schematic diagram of Fig. 1. It provides for data and control communications between orthogonally adjacent processors and links to memory modules through the first row and first column of processors.

To simulate the phenomenon of wavefront propagation, the processors in the array must wait for a primary wavefront (of data), then perform their computation and, finally, act as a secondary source of new wavefronts. To implement this wait, processors are provided with data transfer buffers. Hence, a FETCHing of data involves an inherent WAITing for the buffer to be filled (DATA READY) by the adjacent data sourcing processor. Thus, if the software ensures that the processor always performs a data FETCH before the computation (cf. syntax rules of Section III-C), the processing will not be initiated until the arrival of the data wavefront (this is similar to the concept of data-flow machines [17]-[23], [26]-[29]). Each processor can FLOW data to the input buffers of the neighboring PE's, thus acting as a secondary source of data wavefronts (Huygens' principle). To avoid the overrunning of data wavefronts (in conformity with Huygens' principle), the processor hardware ensures that a processor cannot send new data to the buffer unless the old data have been used by the neighbor. Thus, the wavefront concept suggests that interprocessor communication employ buffers and "DATA READY"/"DATA USED" flags between adjacent processors.

The WAITs for wavefronts of data allow for globally asynchronous operation of processors, i.e., there is no need for global synchronization. Synchronization is a very critical issue in parallel processing, especially when one considers large scale systems. Two opposite timing schemes come to mind, namely, the synchronous and the asynchronous timing approaches. In the synchronous scheme, there is a global clock network which distributes the clocking signals over the entire chip. The global clock beats out the rhythm to which all the PE's in the array execute their tasks. In the basic synchronous configuration, all the PE's operate in unison, all performing the same identical operation. In contrast, the asynchronous scheme involves no global clock, and information transfer is by mutual convenience between each PE and its immediate neighbors. Whenever the data are available, the transmitting PE informs the receiver of the fact, and the receiver accepts the data when it needs them. It then conveys to the sender the information that the data have been used. This scheme can be implemented by means of a simple handshaking protocol [10], [30] (cf. Fig. 2).

Fig. 2. The handshaking scheme.
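The behavior of these flags can be mimicked in software by a one-slot buffer between each pair of adjacent processors. The sketch below is an illustration in Python threads, not the hardware mechanism of Fig. 2: flow() blocks while the previous datum is still unused, and fetch() blocks until a datum is present (DATA READY), which is exactly the mutual-convenience transfer just described.

    # Sketch: one-slot buffer emulating the DATA READY / DATA USED handshake.
    import threading
    import queue

    class Channel:
        def __init__(self):
            self._slot = queue.Queue(maxsize=1)   # at most one datum in transit

        def flow(self, value):        # sender blocks until the old datum has been used
            self._slot.put(value)

        def fetch(self):              # receiver blocks until a datum is ready
            return self._slot.get()

    # Two neighboring "PEs" connected by a single channel.
    ch = Channel()

    def producer():                   # acts as a (secondary) source of data wavefronts
        for k in range(3):
            ch.flow(k)

    def consumer():                   # waits for the arrival of the data wavefront
        for _ in range(3):
            print("fetched", ch.fetch())

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start(); t1.join(); t2.join()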
B. PE Configuration

The proposed hardware for the processing elements of the WAP involves a simple modular structure.
A basic architecture is schematically shown in Fig. 3. Each processor is a hardware interpreter of local MDFL, and reflects a relatively conventional architecture consisting of an internal program memory, a control unit, an ALU, and a set of registers. The only exception is the PE's interfacing with its neighbors. Every processing element has a bidirectional buffer and independent status and control flags to each of its four adjacent elements (cf. Fig. 3). These buffers are supported by an appropriate multiplexing subsystem, under the control unit's supervision. Except for the extended capabilities of the (1, 1) processor and the memory links of the first row and first column of processors, all of the processing elements are essentially identical. Hence, from both the software (cf. footnote 2, Section III-A) and hardware viewpoints, it suffices to group the PE's into four types: Corner, FirstRow, FirstCol, and Interior processors.

Fig. 3. Architecture of an INTerior PE.

For signal processing applications, the most important requirements are the accuracy and speed of the arithmetic units. These are determined by extensive numerical error analysis of the algorithms under consideration. Fortunately, most matrix algorithms have a rather complete error analysis documentation [31], [32]. Another important factor in the selection of the ALU is the speed/hardware tradeoff. The decision between serial versus parallel and fixed- versus floating-point arithmetic will be affected by this consideration. In addition, the potential for concurrent execution and internal pipelining within the processing elements may be exploited for further speedup of the processing rate.

C. Interfacing with the Host Computer and Extendability

The interaction between the host computer and the WAP includes loading of programs and I/O of data before and after processing. All these transactions may be carried out by means of a smart interface, the nature of which is currently under investigation. This interfacing will minimize the host computer's involvement in the transaction, permitting efficient transfer of large blocks of data.

The word "extendability" refers to the capability of handling matrix operations of larger order with a smaller sized array processor. Software solutions to the extendability problem are often adopted via efficient partitioning (decomposition) of the algorithms' tasks (e.g., the divide-and-conquer method for matrix inversion [16], [33]) and intensive utilization of the host computer's memory.

In order to handle a larger matrix without incurring the great delay involved in accessing the host computer frequently, we propose a globally parallel, locally sequential (GPLS) scheme, which is explained below.
Suppose that the array size is N x N, and the size of the operand matrix A is N' x N', where N' = mN and m is an integer. Then a partitioning of A into submatrices of size m x m will be

    A = [A_ij],    where    A_ij = [a_((i-1)m+k, (j-1)m+p)],    k, p = 1, ..., m.

Each processor then works on an m x m submatrix of data stored in its local memory. While the processors can work in parallel, an individual processor works on its m x m data sequentially, one after the other; hence, the term GPLS. We assert that the inclusion of local memory [two-dimensional array M(i, j) and linear array G(i)] will not affect the global interconnection. More curiously, global MDFL also needs only a minimal modification for the extension. An extended MDFL syntax and semantics have been developed to effectively (i.e., transparently) handle larger size matrix operations. A sample program for multiplication of matrices of larger size (mN x mN) is listed here.
Program 2. Extended matrix multiplication.

    BEGIN
      EQUIVALENCE (C, M(I,J));
      EQUIVALENCE (B, G(J));
      SET COUNT mN;
      REPEAT
        WHILE WAVEFRONT IN ARRAY DO
          SCAN BY ROW 1 TO m DO
            BEGIN
              FETCH B, UP;
              FETCH A, LEFT;
              MULT A, B, D;
              ADD C, D, C;
              FLOW A, RIGHT;
              FLOW B, DOWN;
              MOVE RIGHT, DOWN;
            END;
        DECREMENT COUNT;
      UNTIL TERMINATED;
    ENDPROGRAM.

Here, each PE scans its submatrix in local memory M(I, J) row by row, and C refers to M(i, j), where i and j are the current values of the scan counters I and J. Similarly, B refers to G(J).
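The partitioning behind the GPLS scheme can also be written out directly in conventional code. The following Python sketch is ours (it replaces the extended MDFL constructs with ordinary loops and is meant only to show the data decomposition): an mN x mN matrix is cut into an N x N grid of m x m blocks, exactly as in the definition of A_ij above, and each "processor" (i, j) accumulates its own m x m block of C while scanning it sequentially.

    # Sketch: globally parallel, locally sequential (GPLS) block multiplication.
    def block(M, i, j, m):
        # m x m submatrix M_ij: rows (i-1)m+1 .. im, columns (j-1)m+1 .. jm (1-indexed)
        return [row[(j - 1) * m:j * m] for row in M[(i - 1) * m:i * m]]

    def gpls_multiply(A, B, N, m):
        size = N * m
        C = [[0.0] * size for _ in range(size)]
        for i in range(1, N + 1):             # globally parallel: one block per PE (i, j)
            for j in range(1, N + 1):
                for k in range(1, N + 1):     # successive block wavefronts
                    Aik = block(A, i, k, m)
                    Bkj = block(B, k, j, m)
                    for p in range(m):        # locally sequential scan of the m x m block
                        for q in range(m):
                            s = sum(Aik[p][r] * Bkj[r][q] for r in range(m))
                            C[(i - 1) * m + p][(j - 1) * m + q] += s
        return C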
Each of the above three categories will be separately discussed in Sections V-A, V-B, and V-C, respectively. Ultimately, real system applications will be the only objective criterion for justifying and reconfirming the usefulness of the WAP. Investigations are, therefore, being conducted into its use in such diverse fields as image analysis, speech recognition, Kalman filtering, nonlinear filtering, adaptive filtering, smoothing, prediction, optimization, detection, spectrum analysis, and other advanced signal processing applications [12].

In fact, some recent reports have indicated a promising and encouraging trend. For instance, a computing structure similar to the WAP has been applied for parallel processing (for dynamic time warping [34]) in speech recognition. It has also been recently reported [13] that the wavefront concept can be effectively utilized to solve the singular value decomposition (SVD) problem, which is crucial for image processing and beam-forming applications. We are currently