IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 11, NOVEMBER 1982
Wavefront Array Processor: Language, Architecture, and Applications

SUN-YUAN KUNG, MEMBER, IEEE, K. S. ARUN, STUDENT MEMBER, IEEE, RON J. GAL-EZER, STUDENT MEMBER, IEEE, AND D. V. BHASKAR RAO, STUDENT MEMBER, IEEE
Abstract—This paper describes the development of a wavefront-based language and architecture for a programmable special-purpose multiprocessor array. Based on the notion of computational wavefront, the hardware of the processor array is designed to provide a computing medium that preserves the key properties of the wavefront. In conjunction, a wavefront language (MDFL) is introduced that drastically reduces the complexity of the description of parallel algorithms and simulates the wavefront propagation across the computing network. Together, the hardware and the language lead to a programmable wavefront array processor (WAP). The WAP blends the advantages of the dedicated systolic array and the general-purpose data-flow machine, and provides a powerful tool for the high-speed execution of a large class of matrix operations and related algorithms which have widespread applications.
Index Terms—Asynchrony, computational wavefront, concurrency, data-flow computing, matrix data-flow language, signal processing, systolic array, VLSI array processor, wavefront architecture.
I. INTRODUCTION

A. VLSI Signal Processing
WITH the rapidly growing microelectronics technology leading the way, modern signal processing is undergoing a major revolution. The past two decades have witnessed a steep increase in the complexity of computations, processing speed requirements, and the volume of data handled in various signal processing applications. The availability of low cost, high density, fast VLSI devices has opened a new avenue for implementing these increasingly sophisticated algorithms and systems [1], [2]. While a major improvement in device speed is foreseen, it is in no way comparable to the rate of increase of throughput required by modern real-time signal processing. In order to achieve such increases in throughput rate, the only effective solution appears to be highly concurrent processing. Consequently, it has become a major challenge to update the current signal processing techniques so as to maximally exploit their potential for concurrent execution.

In a broad sense, the answer to this challenge lies in cross-disciplinary research encompassing the areas of algorithm analysis, parallel computer design, and system applications. In this paper, we therefore introduce an integrated research approach aimed at incorporating the vast VLSI computational capability into modern signal processing applications.

Manuscript received January 11, 1982; revised June 18, 1982. This work was supported in part by the Office of Naval Research under Contracts N00014-80-C-0457 and N00014-81-K-0191, by the National Science Foundation under Grant ECS-80-16581, and by the Defense Advanced Research Projects Agency under Contract MDA903-79-C-0680.

The authors are with the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90089.
The traditional design of parallel computers and languages is deemed unsuitable for our purposes. It usually suffers from heavy supervisory overhead incurred by synchronization, communication, and scheduling tasks, which severely hampers the throughput rate that is critical to real-time signal processing. In fact, these are the key barriers inherent in very large scale computing structure design. Moreover, although VLSI provides the capability of implementing a large array of processors on one chip, it imposes its own constraints on the system. Large design and layout costs [2] suggest the utilization of a repetitive modular structure. In addition, communication, which costs the most in VLSI chips in terms of area, time, and energy, has to be restricted (to localized communication) [1]. In general, highly concurrent systems require this locality property in order to reduce interdependence and the ensuing waiting delays that result from excessive communication [1]. This locality constraint prevents the utilization of centralized control and global synchronization. The resulting use of asynchronous distributed control and localized data flow is an effective approach to the design of very large scale, highly concurrent computing structures.
B. A Special-Purpose VLSI Array Processor

The above restrictions imposed by VLSI render the general-purpose array processor rather inefficient. We therefore restrict ourselves to a special class of applications, i.e., recursive¹ and local data-dependent algorithms, to conform with the constraints imposed by VLSI. This restriction, however, incurs little loss of generality, as a great majority of signal processing algorithms possess these properties. One typical example is the class of matrix algorithms. It has recently been indicated that a major portion of the computational needs for signal processing and applied mathematical problems can, in fact, be reduced to a basic set of matrix operations and other related algorithms [3], [4]. Therefore, a special-purpose parallel machine for processing these typical computational algorithms will be cost effective and attractive in VLSI system design.

¹ In a recursive algorithm, all processors do nearly identical tasks, and each processor repeats a fixed set of tasks on sequentially available data.
0018-9340/82/1100-1054$00.75 © 1982 IEEE
Petitioner Microsoft Corporation - Ex. 1021, p. 1054
Fig. 1. The WAP configuration. (Figure: a square processor array with memory modules along its left and top edges; dashed lines mark the first wave and the second wave of computation propagating through the array.)
Very significantly, these algorithms involve repeated application of relatively simple operations with regular localized data flow in a homogeneous computing network. This leads to an important notion of computational wavefront, which portrays the computational activity in a manner resembling a wave propagation phenomenon. More precisely, the recursive nature of the algorithm, in conjunction with the localized data dependency, points to a continuously advancing wave of data and computational activity. The computational sequence starts with one element and propagates through the processor array, closely resembling a physical wave phenomenon (cf. Fig. 1). In fact, all algorithms that have locality and recursivity (and are thus implementable on our VLSI array processor) will exhibit this wave phenomenon. Therefore, the notion of a computational wavefront has attracted the attention of many researchers [2], [5]-[13].
The wavefront concept provides a firm theoretical foundation for the design of highly parallel array processors and concurrent languages. In addition, this concept has some distinct advantages.

First, the wavefront notion drastically reduces the complexity in the description of parallel algorithms. The mechanism provided for this description is a special-purpose, wavefront-oriented language [7], [9], [10]. Rather than requiring a program for each processor in the array, this language allows the programmer to address an entire front of processors.

Second, the wavefront notion leads to a wavefront-based architecture which preserves Huygens' principle [14] and ensures that wavefronts never intersect. Therefore, a wavefront architecture can provide asynchronous waiting capability and, consequently, can cope with timing uncertainties such as local clocking, random delay in communications, and fluctuations of computing times. In short, the notion lends itself to an (asynchronous) data-flow computing structure that conforms well with the constraints of VLSI.

The integration of the wavefront concept, the wavefront language, and the wavefront architecture leads to a programmable computing network, which we will call the wavefront array processor (WAP). The WAP is, in a sense, an optimal tradeoff between the globally synchronized and dedicated systolic array [1], [15], [16] (which works on a similar set of algorithms) and the general-purpose data-flow multiprocessors [17]-[23]. It provides a powerful tool for the high-speed execution of a large class of algorithms which have widespread applications.
C. Organization

The organization of the rest of the paper is as follows. Section II elaborates on the computational wavefront. Section III proposes a wavefront-oriented language (MDFL) for programming the WAP, and provides a programming methodology. Section IV explains a possible hardware organization and architecture for the WAP. Section V illustrates some applications of the WAP by means of MDFL programs. Section VI compares the WAP to other array processors, such as the systolic array, data-flow multiprocessors, and conventional SIMD arrays like the Illiac IV.
II. CONCEPT OF COMPUTATIONAL WAVEFRONT

The wavefront array processor is configured in a square array of N × N processing elements with regular and local interconnections (cf. Fig. 1).

The computing network serves as a (data) wave-propagating medium. The notion of computational wavefront can best be explained by a simple example. To this end, we shall consider matrix multiplication as being representative. Let A = {a_ij}, B = {b_ij}, and C = A × B = {c_ij} all be N × N matrices. The matrix A can be decomposed into columns A_i and the matrix B into rows B_j, and therefore,

C = A_1 * B_1 + A_2 * B_2 + ··· + A_N * B_N.   (1)
The matrix multiplication can then be carried out in N recursions, executing

C^(k) = C^(k-1) + A_k * B_k   (2)

recursively for k = 1, 2, ···, N.
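Recursion (2) is simply a sequence of N rank-1 (outer-product) updates. As a quick sanity check, it can be sketched in plain Python (the function name is ours, not from the paper):

```python
# Sketch of recursion (2): C(k) = C(k-1) + A_k * B_k, where A_k is the
# k-th column of A and B_k is the k-th row of B (an outer product).
def matmul_by_wavefronts(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]           # C(0) = 0
    for k in range(n):                        # one recursion per wavefront
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]  # rank-1 update of entry (i, j)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_by_wavefronts(A, B))  # [[19, 22], [43, 50]]
```

After the N-th recursion, C holds the full product A × B, matching (1).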
The exploitation of parallelism is now evident with the availability of N × N processors. A parallel algorithm for matrix multiplication is fairly simple. However, most existing parallel programs would need global interconnections in the computing network, while localized interconnections and data flow are much more desirable in VLSI systems.

The topology of the matrix multiplication algorithm can be mapped naturally onto the square, orthogonal N × N array of the WAP (cf. Fig. 1). To create a smooth data movement in a localized communication network, we make use of the computational wavefront concept. For the purpose of this example, a wavefront in the processing array will correspond to a mathematical recursion in the algorithm. Successive pipelining of the wavefronts will accomplish the computation of all recursions.

As an example, the computational wavefront for the first recursion in matrix multiplication will now be examined.
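For reference in the example that follows: in an N × N array, front t of a single wavefront (t = 1, ···, 2N − 1) consists of the PE's (i, j) with i + j − 1 = t. A small sketch enumerating the fronts (our own illustration, not from the paper):

```python
# Enumerate the fronts of one computational wavefront in an N x N array.
# Front t (t = 1 .. 2N-1) contains the PEs (i, j) with i + j - 1 == t.
def fronts(N):
    return [[(i, j) for i in range(1, N + 1) for j in range(1, N + 1)
             if i + j - 1 == t]
            for t in range(1, 2 * N)]

for t, front in enumerate(fronts(3), start=1):
    print(t, front)
# front 1: [(1, 1)]; front 2: [(1, 2), (2, 1)]; ...; front 5: [(3, 3)]
```

This anti-diagonal sweep is exactly the propagation pattern described below.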
Suppose that the registers of all the processing elements (PE's) are initially set to zero:

c_ij^(0) = 0   for all (i, j);

the entries of A are stored in the memory modules to the left (in columns), and those of B in the memory modules on the top (in rows). The process starts with PE (1, 1), where

c_11^(1) = c_11^(0) + a_11 * b_11   (3)

is computed. The computational activity then propagates to the neighboring PE's (1, 2) and (2, 1), which will execute
c_12^(1) = c_12^(0) + a_11 * b_12   (4)

and

c_21^(1) = c_21^(0) + a_21 * b_11.   (5)

The next front of activity will be at PE's (3, 1), (2, 2), and (1, 3), thus creating a computational wavefront traveling down the processor array. This computational wavefront is similar to optical wavefronts (they both obey Huygens' principle), since each processor acts as a secondary source and is responsible for the propagation of the wavefront. It may be noted that wave propagation implies localized data flow. Once the wavefront sweeps through all the cells, the first recursion is over. As the first wave propagates, we can execute an identical second recursion in parallel by pipelining a second wavefront immediately after the first one. For example, the (1, 1) processor will execute

c_11^(2) = c_11^(1) + a_12 * b_21   (6)

and so on. In general, the (i, j)th processor will execute the kth recursion,

c_ij^(k) = a_i1 * b_1j + a_i2 * b_2j + ··· + a_ik * b_kj.   (7)

The pipelining is feasible because the wavefronts of two successive recursions will never intersect (Huygens' wavefront principle), as the processors executing the recursions at any given instant will be different, thus avoiding any contention problems.

Locality, regularity, recursivity, and concurrency lead to the wavefront phenomenon. Thus, all algorithms which possess these properties will exhibit computational wavefronts. The notion of computational wavefronts leads to a wavefront-based language (a modified data-flow language) to program the processor array. This language (called the matrix data-flow language) is especially powerful for matrix and other related algorithms. It will be developed in the next section.

III. A WAVEFRONT LANGUAGE: MATRIX DATA-FLOW LANGUAGE

In contrast to the heavy burden of scheduling, resource sharing, and the control of processor interactions encountered in programming a general-purpose multiprocessor, the computational wavefront notion can facilitate the description of parallel algorithms and drastically reduce the complexity of parallel programming. In this section, we introduce a wavefront language that is tailored towards the description of computational wavefronts and the corresponding data flow in a large class of algorithms (which exhibit the recursivity and locality mentioned earlier).

Since matrix algorithms are typical of this class, we call the language the matrix data-flow language (MDFL). This wavefront language permits the array processor to be programmable, broadens the range of applications, and also makes possible the simulation and verification of parallel algorithms.

A. Matrix Data-Flow Language

There exist two approaches to programming the WAP: a local approach describing the actions of each processing element, and a global approach describing the actions of each wavefront. To allow the user to program the WAP in both of these fashions, two versions of MDFL are proposed: global and local MDFL.

A global MDFL program describes the algorithm from the viewpoint of a wavefront, while a local MDFL program describes the operations of an individual processor. More precisely, the perspective of a global MDFL programmer is of one wavefront passing across all the processors, while the perspective of a local MDFL programmer is that of one processor encountering a series of wavefronts.

From the macroscopic point of view, a higher level language, closer to the algorithm, is desired for reducing the heavy burden on the programmer. Global MDFL provides such a tool, as it is easier to view the algorithm as a series of wavefronts. At the microscopic level, each PE executes its own set of instructions and performs localized interprocessor transactions. Implementing a program on such a system requires transforming the high level description (in global MDFL) into a set of lower level programs (in local MDFL) for the individual processors.² We have developed a compiler to do such a mapping of global wavefront-oriented MDFL programs into local programs for individual processors.
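The global-to-local mapping can be sketched as follows: a PE's position selects one of four kinds — (1,1), (1,*), (*,1), or INT — and local compilation keeps only the matching branch of the CASE KIND statement. (The dictionary-based "program" and the function names below are our own illustration, not the actual compiler.)

```python
# Classify a PE by its position in the array, following the paper's
# Corner / FirstRow / FirstColumn / Interior grouping (footnote 2).
def pe_kind(i, j):
    if i == 1 and j == 1:
        return "(1,1)"      # corner PE
    if i == 1:
        return "(1,*)"      # first-row PE
    if j == 1:
        return "(*,1)"      # first-column PE
    return "INT"            # interior PE

def compile_local(global_case, i, j):
    # "Compile" the global program for PE (i, j) by extracting only the
    # CASE KIND branch relevant to that processor.
    return global_case[pe_kind(i, j)]

# Hypothetical CASE KIND body of a global MDFL program.
global_case = {
    "(1,1)": "READ; FETCH A, LEFT;",
    "(1,*)": "FETCH A, LEFT;",
    "(*,1)": "FETCH B, UP;",
    "INT":   "FETCH A, LEFT; FETCH B, UP;",
}
print(compile_local(global_case, 2, 3))  # FETCH A, LEFT; FETCH B, UP;
```

Since all PE's of one kind receive the same branch, only four local program variants ever need to be generated, regardless of array size.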
B. MDFL Instruction Set

At each front of activity, the computational wavefront performs similar tasks in all the processors involved (cf. the matrix multiplication example of Section II and the programming methodology of Section III-C). Hence, global (wavefront) instructions and local (processor) instructions are nearly identical. Most of the MDFL instruction set is, therefore, common to both global and local MDFL.

Table I is a complete list of the MDFL instruction repertoire. For the complete semantics and more detailed syntax, the reader is referred to a recent publication [10]. Unless otherwise specified, the following instructions belong to both global and local MDFL. The proposed instruction set is functionally complete, but improvements and modifications are still underway to incorporate additional features such as double precision, etc. There also appears to be room for the development of a higher level language suitable for the nonspecialist programmer.

² It suffices, in general, to describe a wavefront algorithm at four locations of the array: Corner, FirstRow, FirstColumn, and Interior PE's. Consequently, every global MDFL program gets compiled into four slightly different local programs.

TABLE I
MDFL INSTRUCTION SET

Data Transfer Instructions

FLOW <SOURCE REGISTER>, <DIRECTION>;
FETCH <DESTINATION REGISTER>, <DIRECTION>;
READ;

Recursion Oriented Instructions

REPEAT ... UNTIL TERMINATED;
WHILE WAVEFRONT IN ARRAY DO BEGIN ... END;
SET COUNT <NUMBER OF WAVEFRONTS>;
DECREMENT COUNT;
ENDPROGRAM;

Conditional Instructions

IF EQUAL THEN <STATEMENT>;
IF NOT-EQUAL THEN <STATEMENT>;
IF GREATER THEN <STATEMENT>;
IF LESS-THAN THEN <STATEMENT>;
IF <DIRECTION> DISABLED THEN <STATEMENT>;
CASE KIND =
    (1,1) : <STATEMENT>;
    (1,*) : <STATEMENT>;
    (*,1) : <STATEMENT>;
    INT   : <STATEMENT>;
ENDCASE;

Internal Processor Instructions

TSR  <SOURCE>, <DESTINATION>;
ADD  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
SUB  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
MULT <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
DIV  <SOURCE #1>, <SOURCE #2>, <DESTINATION>;
SQRT <SOURCE>, <DESTINATION>;
CMP  <SOURCE #1>, <SOURCE #2>;
TST  <SOURCE>;
STORE;
NOP;
RESET;
BEGIN ... END;
DISABLE-SELF;

C. Programming Methodology

In this subsection, we provide guidelines for programming in global MDFL. The most straightforward method of programming the WAP would be to explicitly spell out the actions of each wavefront at each of its (2n − 1) positions (fronts) (cf. Fig. 1). Nevertheless, the regularity and recursivity in almost all matrix algorithms allow us to assume the following.

1) Space Invariance: The tasks performed by a wavefront in a particular kind of processor must be identical at all (2n − 1) fronts.

2) Time Invariance: Recursions are identical.

Accordingly, global MDFL provides two repetitive constructs, the space repetitive construct

WHILE WAVEFRONT IN ARRAY DO BEGIN (TASK T) END

(so that T is repeated at all fronts), and the time repetitive construct

REPEAT (ONE RECURSION) UNTIL TERMINATED

(so that the same recursion is repeated).

The REPEAT construct is inherently concurrent in that successive wavefronts are pipelined through the array. As soon as the kth wavefront is propagated, the (1, 1) processor initiates the (k + 1)st wavefront. To allow for more than one wavefront per recursion, the complete global MDFL program will have the syntax

BEGIN
SET COUNT ( );
REPEAT
    (TASKS A);
    WHILE WAVEFRONT IN ARRAY DO
    BEGIN
        (TASKS B);
    END;
    WHILE WAVEFRONT IN ARRAY DO
    BEGIN
        (TASKS C);
    END;
    (TASKS D);
    DECREMENT COUNT;
UNTIL TERMINATED;
ENDPROGRAM.

Each recursion will execute the instructions within the REPEAT ··· UNTIL construct. The number of recursions is set by SET COUNT. In this example, a recursion consists of two wavefronts. At the start, tasks A are performed only at the (1, 1) processor. The first wavefront of each recursion will perform tasks B at each of its (2n − 1) fronts. The second wave will execute tasks C in each of these fronts immediately after tasks B have been concluded. It should be noted that the number of different wavefronts within a recursion may vary from one application to another. At the end of each recursion, COUNT is decremented. When it becomes zero, TERMINATED is set and a "phase" of identical recursions is over.

The corresponding local MDFL program for interior processors (cf. Section III-A) will be

REPEAT
    (TASKS B')
    (TASKS C')
UNTIL TERMINATED;

where B' and C' are the compiled versions of B and C, with only the relevant portions of the CASE statement extracted. The conversion of a global program into its local versions is thus fairly straightforward.

Certain syntax rules [10] are needed to ensure that there is no circular waiting for data between adjacent PE's. They also ensure that neighbors will not contend for the interprocessor bus at the same time. Concisely, the rules dictate that FETCHes in one direction precede (and equal in number) the FLOWs in the opposite direction, and that the data FETCHing precede any computation (cf. data-driven computation in Section IV-A). The complete proof of the claim that these rules will prevent deadlock and bus contention is omitted here [24].

Based on the above guidelines, a complete global MDFL program for matrix multiplication follows. More programming examples will be provided in Section V.
Program 1. Matrix multiplication.

Array Size:   N × N
Computation:  C = A × B
              kth wavefront: c_ij^(k) = c_ij^(k-1) + a_ik * b_kj
              k = 1, 2, ···, N
Initial:      Matrix A is stored in the memory modules (MM) on the left (stored row by row). Matrix B is in the MM's on the top and is stored column by column.
Final:        The result will be in the C registers.

BEGIN
SET COUNT N;
REPEAT;
WHILE WAVEFRONT IN ARRAY DO
BEGIN
    FETCH B, UP;
    FETCH A, LEFT;
    FLOW A, RIGHT;
    FLOW B, DOWN;
    ! C <- C + A*B ;
    MULT A, B, D;
    ADD C, D, C;
END;
DECREMENT COUNT;
UNTIL TERMINATED;
ENDPROGRAM.

* Comments are enclosed between "!" and ";".

D. Summary of MDFL Features

Future VLSI multiprocessors must support massive concurrency to achieve a significant increase in performance; consequently, a base language for parallel computers must allow expression of concurrency of program execution on a large scale [20]. However, few languages support the idea of large scale concurrency, and only weak notions of locality exist in most of them [25].

It must be noted that a parallel program is not just an ensemble of separate programs for individual processors. More importantly, it should also define the coordination between PE's, including their interdependence for data and the sequencing of their tasks. As such, it is a tall order.

At present, data-flow languages appear to be the best candidate for the base language of parallel computers. Data-flow languages are asynchronous, data driven, and algorithmic. They avoid centralized control and shared memory to achieve asynchrony and maximal concurrency [20], [26]-[29].

MDFL basically possesses all these properties of data-flow languages. It shares the principle of data-activated computation and its consequent advantages. In addition, MDFL has the rather distinctive feature of regularity, and is built around the notion of locality. It is very close to the algorithms; hence, MDFL programs are modular and easy to understand. MDFL permits the programmer to address a front of processors at the same time, instead of programming individual processors separately. Being wavefront oriented, it permits viewing the algorithm as the repetition of a computational sequence (i.e., the wavefront) progressing through the array. In short, the wavefront notion makes it possible to program an array of asynchronous processors in a simple fashion, and leads to simple, readable MDFL programs.

IV. WAVEFRONT ARCHITECTURE

The hardware of the processing array is designed to support MDFL. The main architectural considerations include four general aspects: 1) interprocessor communications, based on wavefront requirements; 2) the basic PE configuration; 3) interfacing with the host computer; and finally, 4) the extendability and reconfigurability issues.
A. Interprocessor Communication

The configuration of the processor array is shown in the schematic diagram of Fig. 1. It provides for data and control communications between orthogonally adjacent processors, and links to the memory modules through the first row and first column of processors.

To simulate the phenomenon of wavefront propagation, each processor in the array must wait for a primary wavefront (of data), then perform its computation and, finally, act as a secondary source of new wavefronts. To implement this wait, processors are provided with data transfer buffers. Hence, a FETCHing of data involves an inherent WAITing for the buffer to be filled (DATA READY) by the adjacent data sourcing processor. Thus, if the software ensures that the processor always performs a data FETCH before the computation (cf. the syntax rules of Section III-C), the processing will not be initiated until the arrival of the data wavefront (this is similar to the concept of data-flow machines [17]-[23], [26]-[29]). Each processor can FLOW data to the input buffers of the neighboring PE's, thus acting as a secondary source of data wavefronts (Huygens' principle). To avoid the overrunning of data wavefronts (in conformance with Huygens' principle), the processor hardware ensures that a processor cannot send new data to the buffer unless the old data have been used by the neighbor. Thus, the wavefront concept suggests that interprocessor communication employ buffers and DATA READY/DATA USED flags between adjacent processors.

The WAITs for wavefronts of data allow for globally asynchronous operation of the processors, i.e., there is no need for global synchronization. Synchronization is a very critical issue in parallel processing, especially when one considers large scale systems. Two opposite timing schemes come to mind, namely, the synchronous and the asynchronous timing approaches. In the synchronous scheme, there is a global clock network which distributes the clocking signals over the entire chip. The global clock beats out the rhythm to which all the PE's in the array execute their tasks. In the basic synchronous configuration, all the PE's operate in unison, performing the same operation. In contrast, the asynchronous scheme involves no global clock, and information transfer is by mutual convenience between each PE and its immediate neighbors. Whenever the data are available, the transmitting PE informs the receiver of that fact, and the receiver accepts the data when it needs them. It then conveys to the sender the information that the data have been used. This scheme can be implemented by means of a simple handshaking protocol [10], [30] (cf. Fig. 2).
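The DATA READY/DATA USED exchange behaves like a one-slot buffer that blocks on both sides: FLOW waits until the previous datum has been used, and FETCH waits until a datum is ready. A minimal sketch, using a bounded queue to stand in for the hardware transfer buffer and flags (an illustration only; the WAP implements this with flip-flops, not threads):

```python
# One-slot buffer between two adjacent PEs. put() blocks while the old
# datum is unused (DATA USED not yet signaled); get() blocks until a
# datum is present (DATA READY).
import queue
import threading

buf = queue.Queue(maxsize=1)   # the single transfer buffer

def sender():                  # plays the role of the FLOWing PE
    for x in [1, 2, 3]:
        buf.put(x)             # FLOW: waits if neighbor hasn't used old data

received = []
def receiver():                # plays the role of the FETCHing PE
    for _ in range(3):
        received.append(buf.get())  # FETCH: waits for DATA READY

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # [1, 2, 3]
```

Because both sides block, no datum is ever overwritten or read twice, which is exactly the no-overrunning property required of the wavefronts.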
`B. PE Configuration
`The proposed hardwarefor the processing elements of the
`
`Petitioner Microsoft Corporation - Ex. 1021, p. 1058
`Petitioner Microsoft Corporation - Ex. 1021, p. 1058
`
`

`

`
`
`CONTROL
`UNIT
`
`MEMORY
`BLOCK:
`ABSOLUTE
`LOADER
`
`1059
`
`Busses
`BUSSES
`
`4
`#ARECTIONA
`DATA BUSSES L
`
`
`
`
`
`REGISTER
`BLOCK
`
`IR
`
`= INSTRUCTION
`REGISTER
`— STATUS FLAGS
`SF
`PMB — PROGRAM
`MEMORY BUFFER
`FOR LOADING
`
`DATA 1/0
`MULTIPLEXERS
`AND
`BUFFERS
`
`Fig. 3. Architecture of an INTerior PE.
`
`great delay involved in accessing the host computer frequently,
`we propose a globally parallel, locally sequential (GPLS)
`scheme,which is explained below.
`Suppose that the array size is N * N, andthesize of the
`operand matrix A is N’ X N’ where N’ = mNand m isanin-
`teger. Then a partitioning of A into submatrices ofsize m will
`be
`
`A = [A],
`
`“SN
`
`and
`
`Ai = [@G-1)m+k,G-1)m+p])
`
`k,p=1,:::,m.
`
`Each processor then works on an m X m submatrix of data
`stored in its local memory. While the processors can work in
`parallel, an individual processor works on its m X m data se-
`quentially, one after the other; hence, the term GPLS. We
`assert that the inclusion of local memory [two-dimensional
`array M(i, j) and linear array G(/)] will not affect the global
`interconnection. More curiously, global MDFLalso needs only
`a minimal modification for the extension. An extended MDFL
`syntax and semantics have been developed to effectively (i.e.,
`transparently) handle largersize matrix operations. A sample
`program for multiplication of matrices of larger size (mN X
`mN) is listed here.
`
`Program 2. Extended matrix multiplication.
`BEGIN
`EQUIVALENCE (8, GCI);
`EQUIVALENCE (C,M);
`SET COUNT mn;
`REPEAT
`WHILE WAVEFRONT IN ARRAY DO
`SCAN BY ROW 1 TO m OO
`BEGIN
`FETCH B, UP;
`FETCH A, LEFT;
`MULT A, B, R;
`apD
`6c, R, C;
`FLOW A, RIGHT;
`FLOW 6B, OWN;
`MOVE RIGHT, DOWN;
`DECREMENT COUNT;
`UNTIL TERMINATED;
`ENDPROGRAM.
`
`Here, each PE scansits submatrix in local memory M(i, j)
`row by row, and C refers to M(i, 7) where i and j are the cur-
`rent values of the scan counters / and J. Similarly, B refers to
`G(i).
`
`Petitioner Microsoft Corporation - Ex. 1021, p. 1059
`Petitioner Microsoft Corporation - Ex. 1021, p. 1059
`
`KUNG er al; WAVEFRONT ARRAY PROCESSOR
`
`k—PEt))
`OATA
`
`hy DATA
`DATA SENT
`TRANSFER
`CONTROL
`FLIP FLOP
`
`DATA READY
`
`
`
`DATA PROCESSED
`
`—f
`
`||
`
`|
`|
`
`PE(i)—>4
`
`| j
`
`DATA
`
`JuseD
`|
`
`Fig. 2. The handshaking scheme.
`
`WAPinvolves a simple modularstructure. A basic architecture
`is schematically shown in Fig. 3. Each processor is a hardware
`interpreter of local MDFL,andreflects a relatively conven-
`tional architecture consisting of an internal program memory,
`a control unit, an ALU, anda set of registers. The only ex-
`ception is the PE’s interfacing with its neighbors. Every pro-
`cessing element has a bidirectional buffer and independent
`status and control flags to eachof its four adjacent elements
`(cf. Fig. 3). These buffers are supported by an appropriate
`multiplexing subsystem, under the control unit’s supervision.
`Except for the extended capabilities of the (1, 1) processor and
`the memorylinks ofthe first row and first column of proces-
`sors, all of the processing elements are essentially identical.
`Hence, from both the software (cf. footnote 2, Section III-A)
`and hardwareviewpoints,it suffices to group the PE’s into four
`types: Corner, FirstRow, FirstCol, and Interior processors.
`For signal processing applications, the most importantre-
`quirements are the accuracy and speed ofthe arithmetic units.
`These are determined by extensive numerical error analysis
`of the algorithms under consideration. Fortunately, most
`matrix algorithms have a rather complete error analysis doc-
`umentation [31], [32]. Another important factor in the se-
`lection of the ALUis the speed/hardwaretradeoff. The de-
`cision betweenserial versus parallel and fixed- versus float-
`ing-point arithmetic will be affected by this consideration. In
`addition, the potential for concurrent execution andinternal
`pipelining within the processing elements may be exploited for
`further speedupof the processing rate.
`
`C. Interfacing with the Host Computer and Extendability
`
The interaction between the host computer and the WAP includes the loading of programs and the I/O of data before and after processing. All these transactions may be carried out by means of a smart interface, the nature of which is currently under investigation. This interfacing will minimize the host computer's involvement in the transaction, permitting efficient transfer of large blocks of data.

The word "extendability" refers to the capability of handling matrix operations of larger order with a smaller sized array processor. Software solutions to the extendability problem are often adopted via efficient partitioning (decomposition) of the algorithms' tasks (e.g., the divide-and-conquer method for matrix inversion [16], [33]) and intensive utilization of the host computer's memory.
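A minimal sketch of this partitioning idea, using matrix multiplication as the example operation (our choice; the paper's cited decompositions target matrix inversion): an n x n product is decomposed into p x p block products, each small enough for a p x p array, with the host memory holding the full operands between passes.

```python
def block_matmul(A, B, p):
    """Multiply n x n matrices by streaming p x p blocks, emulating a
    small array that performs one block product per pass (n % p == 0)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, p):
        for bj in range(0, n, p):
            for bk in range(0, n, p):
                # one p x p block product: the work of a single pass
                for i in range(bi, bi + p):
                    for j in range(bj, bj + p):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(bk, bk + p))
    return C
```

Only the block size p is constrained by the physical array; n is limited only by host memory, which is the sense in which the software solution "extends" a small array.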
In order to handle a larger matrix without incurring the
spectively. Each of the above three categories will be separately discussed in Sections V-A, V-B, and V-C, respectively. Ultimately, real system applications will be the only objective criterion for justifying and reconfirming the usefulness of the WAP. Investigations are, therefore, being conducted into its use in such diverse fields as image analysis, speech recognition, Kalman filtering, nonlinear filtering, adaptive filtering, smoothing, prediction, optimization, detection, spectrum analysis, and other advanced signal processing applications [12].

In fact, some recent reports have indicated a promising and encouraging trend. For instance, a computing structure similar to the WAP has been applied for parallel processing (for dynamic time warping [34]) in speech recognition. It has also been recently reported [13] that the wavefront concept can be effectively utilized to solve the singular value decomposition (SVD) problem, which is crucial for image processing and beam-forming applications. We are currently looking into an MDFL program for such an SVD algorithm. Finally, for a system application in real-time signal processing, a case of adaptive array processing systems for high-resolution directivity pattern analysis is studied in an earlier publication [10]. There, a combined application of the WAP, MDFL, and modern spectrum analysis is presented.
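The dynamic time warping recurrence cited above is itself a natural wavefront computation: each cell (i, j) depends only on its left, upper, and upper-left neighbors, so all cells on an anti-diagonal can be updated concurrently, just as a wavefront sweeps the array. A serial sketch of that schedule (the local distance measure and boundary conventions are our assumptions, not taken from [34]):

```python
def dtw(x, y):
    """DTW cost between two sequences, filled in wavefront order:
    cell (i, j) needs only (i-1, j), (i, j-1), and (i-1, j-1)."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    # sweep anti-diagonals: all cells with i + j == d are independent
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            cost = abs(x[i - 1] - y[j - 1])  # assumed local distance
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1],
                                 D[i - 1][j - 1])
    return D[n][m]
```

On a processor array, each anti-diagonal sweep corresponds to one wavefront; the serial inner loop over i is the set of PE's that would fire simultaneously.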

A. Matrix-Based Algorithms

Exam