`
LEE AND BIER

software controlled paging rather than dynamic caches [37, 38].
`
2.4. Statically Ordered Data Structure Access

The lack of side effects in dataflow actors makes it particularly difficult to support large, shared data structures [20]. Arvind et al. have therefore extended the dataflow model by introducing I-structures, a controlled form of global data structures [2]. I-structures are write-once data structures with nonstrict semantics, which in practice means that reads may be issued before data are available in the data structure. Support for I-structures requires an ability to queue read requests until they can be satisfied. This mechanism is the most promising available, but it does not come cheaply. One extra memory location is required for each read instruction that can be simultaneously pending. In addition, the I-structure memory needs a processor of a sort to tend to pending reads when a write finally occurs. A much simpler mechanism may be used when scheduling is static.
Consider for example an actor that emits an array. This array might be carried by a single token. Suppose that there are two actors that take this array as an argument. A pure dataflow model requires that the array be copied, or at least that an implementation behave as if the array had been copied. Using an I-structure avoids this copying. However, with ordered-memory accesses the copying is not necessary, and neither is the I-structure memory. Since the scheduler is aware of all precedences, it will avoid scheduling reads before the data become available. If this cannot be avoided (a processor has nothing to do until the data become available), then the read is attempted before the bus is granted to the processor, so the processor halts. The bus will not be granted to the processor until the data are ready. There is no need to queue accesses.
When data pass through the shared memory from one actor to another actor, the scheduler can reclaim the token storage after scheduling the read by the destination. Write-once shared-data structures are only slightly more complicated, because there may be more than one destination actor. The scheduler can simply use reference counts (RCs) [26, 16] to determine when the memory can be reclaimed. For the above example, the RC associated with the array storage would be initialized to 2, the number of destinations, when the array is scheduled to be written. Each time a read is scheduled, the RC is decremented. When it reaches 0, the memory can be reclaimed. This works without any run-time overhead because the order of these transactions will be enforced at run time.
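The schedule-time reference-counting discipline described above can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation; all names are hypothetical. The key point is that the decrement happens when a read is scheduled, not when it executes, because the run-time transaction order is already fixed.

```python
# Sketch of schedule-time reference counting (illustrative names).
# The scheduler, not the running program, decrements the count as it
# schedules each read; when the count reaches zero the token's storage
# can be reused, because the run-time transaction order is enforced.

class TokenStorage:
    def __init__(self, destinations):
        self.rc = destinations      # number of destination actors

free_list = []                      # storage reclaimed so far

def schedule_write(destinations):
    return TokenStorage(destinations)

def schedule_read(storage):
    storage.rc -= 1
    if storage.rc == 0:             # last scheduled read: reclaim
        free_list.append(storage)

# The array example from the text: one writer, two reader actors.
arr = schedule_write(destinations=2)
schedule_read(arr)                  # first destination scheduled
schedule_read(arr)                  # second destination: rc hits 0
assert arr in free_list             # reclaimed entirely at compile time
```

No run-time bookkeeping survives into the generated code; the reclaimed locations are simply reused by the compiler for later tokens.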
`
Many variations of this idea immediately come to mind. For example, reference counts could be used for each element of the array, instead of the whole array, thereby obtaining some of the advantages of the nonstrictness of I-structures. Specifically, the array does not have to be completely filled before some of its elements can be read. Also, if the RC of a data structure is identically one, then an actor using it may modify it, instead of simply reading it, something not permitted in the write-once I-structures. An intelligent code generator can get considerable mileage out of this.
The reference count technique has been criticized for a number of reasons [3], most of which break down when the scheduling is static. Primarily, for ordered-memory architectures, the overhead of managing RCs is incurred at scheduling time, not at run time.
`
`3. STATICALLY SCHEDULED CONTROL
`
The ordered-memory architecture and the static shared data structures seem to provide a very clean solution to some vexing problems. However, they are only applicable when fully static or self-timed scheduling is possible. Although this imposes some serious constraints, the constraints are less serious than they may appear at first.
The programming environment called Gabriel [39], designed for signal processing applications, is based on graphical dataflow representations of algorithms. Although specialized to signal processing, this environment has permitted extensive experimentation with scheduling algorithms and target architectures and with a style of programming that matches the need for static scheduling. As mentioned before, we have implemented a software simulation of a four-processor ordered-memory architecture [5] using the Frigg hardware simulation environment [4] and have retargeted Gabriel to this architecture. Hence, we have been able to gain some experience compiling and running real programs on this architecture.
`
The granularity of the actors in Gabriel is arbitrary, varying from simple arithmetic operators up to high-level signal processing functions such as FFTs. Gabriel translates dataflow graphs into sequential assembly code for programmable DSPs, performing the scheduling statically for multiple processors. A typical signal processing application contains at most 100s of actors, so we can experiment with rather complex scheduling algorithms without getting bogged down.

To be able to schedule computations statically, Gabriel restricts the dataflow model to a subclass called synchronous dataflow (SDF) [35]. We begin this section with a review of the properties of this subclass and then continue by showing that it is not as limited as it might at first appear. In particular, we show that it supports recurrences, manifest iteration, and conditional assignment, but does not support true recursion, data-dependent iteration, or conditional evaluation.
`
3.1. Synchronous Dataflow
`
Petitioner Microsoft Corporation - Ex. 1066, p. 227

ARCHITECTURES FOR STATICALLY SCHEDULED DATAFLOW

A subclass of dataflow graphs lacking data dependency is well suited to static scheduling. Precisely, the term "synchronous dataflow" has been coined to describe graphs that have the following property [35]:
`
`
SDF Property. A synchronous actor produces and consumes a fixed number of tokens on each of a fixed number of input and output paths. An SDF graph consists only of synchronous actors.
`
The basic constraint is that the number of tokens produced or consumed cannot depend on the data. An immediate consequence is that SDF graphs cannot have data-dependent firing of actors, as one might find, for example, in an if-then-else construct. In exchange for this limitation, we gain some powerful analytical and practical properties [35, 36]:

(1) For SDF graphs, the number of firings of each actor can be easily determined at compile time. If the program is nonterminating, as, for example, in real-time DSP, then a periodic schedule is always possible, and the number of firings of actors within each cycle can be determined at compile time. In either case, knowing these numbers makes it possible to construct a deterministic acyclic precedence graph. If the execution time of each actor is deterministic and known, then the acyclic precedence graph can be used to construct optimal or near-optimal schedules.

(2) For nonterminating programs it is important to verify that memory requirements are bounded. This can be done at compile time for SDF graphs.

(3) Starvation conditions, in which a program halts due to deadlock, may not be intentional. For any SDF graph, it can be analytically determined whether deadlock conditions exist.
`
(4) If the execution time of each actor is known, then the maximum execution speed of an SDF graph can be determined at compile time. For terminating programs, this means finding the minimum makespan of a schedule. For nonterminating programs, this means finding the minimum period of a periodic schedule.

(5) For any nonterminating SDF graph executing according to a periodic schedule, it is possible to buffer data between actors statically. Static buffering means loosely that neither FIFO queues nor dynamically allocated memory are required. More specifically, it means that the compiler can statically associate memory locations with actor firings. These memory locations contain the input data and provide a repository for the output data.
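Property (1), determining firing counts at compile time, reduces to solving the balance equations of the graph: for every arc, the tokens produced per source firing times the source's firing count must equal the tokens consumed per sink firing times the sink's firing count. The following is a minimal sketch, not Gabriel's solver; the graph encoding is invented, and the rates are those of the nested-iteration example that appears later in Fig. 6.

```python
from fractions import Fraction
from math import lcm

# Solve the SDF balance equations p(a)*q[a] == c(b)*q[b] for each arc
# (a, p, b, c), yielding the smallest integer firing counts per periodic
# schedule.  A full compiler would also reject inconsistent graphs.

def repetitions(arcs, actors):
    q = {actors[0]: Fraction(1)}        # seed one actor, propagate
    changed = True
    while changed:
        changed = False
        for src, produced, dst, consumed in arcs:
            if src in q and dst not in q:
                q[dst] = q[src] * produced / consumed
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * consumed / produced
                changed = True
            elif src in q and dst in q:
                # consistency check: the balance equation must hold
                assert q[src] * produced == q[dst] * consumed
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

# Chain with rates as in Fig. 6: A emits 10 per firing, B consumes 1 and
# emits 10, C is 1-in/1-out, D consumes 10 and emits 1, E consumes 10.
arcs = [("A", 10, "B", 1), ("B", 10, "C", 1),
        ("C", 1, "D", 10), ("D", 1, "E", 10)]
print(repetitions(arcs, ["A", "B", "C", "D", "E"]))
# {'A': 1, 'B': 10, 'C': 100, 'D': 10, 'E': 1}
```

These integer counts are exactly what is needed to unroll one period of the schedule into the deterministic acyclic precedence graph mentioned in property (1).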
`
These properties are extremely useful for constructing parallelizing compilers, but they apply only to SDF graphs, and optimal schedules can be constructed only when the execution times of the actors are known. We have been developing techniques that weaken the SDF constraint, thus supporting more general dataflow graphs without resorting to fully dynamic control [23]. However, these techniques require modification of the MOMA controller of the ordered-memory architecture. There is still much work to be done to find the best design parameters, so in this paper we retain the SDF constraint.
`
Optimal compile-time scheduling of precedence graphs derived from SDF graphs is one of the classic NP-complete scheduling problems. Many simple heuristics have been developed over time, with some very effective ones having complexity n², where n is the number of actors (see for example [25]). However, even n² complexity can bog down a compiler. Fortunately, the granularity of dataflow actors in Gabriel and the small size of many signal processing applications mean that we can ignore this problem for now. To generalize these methods beyond signal processing applications, strategies will probably be needed to cluster sets of actors into macro actors, thus reducing the number of actors to be considered in constructing a schedule. For example, the clustering method proposed in [31] seems suitable.
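The flavor of such heuristics can be illustrated with a minimal level-based list scheduler. This is a generic sketch, not the specific algorithm of [25]; the graph and run times are invented. Actors are prioritized by their longest path to a sink and greedily placed on the earliest-available processor.

```python
# A minimal list scheduler for an acyclic precedence graph.
# succ maps actor -> successors; exec_time maps actor -> run time.

def list_schedule(succ, exec_time, n_procs):
    level = {}
    def lvl(a):  # longest path from a to any sink, inclusive of a
        if a not in level:
            level[a] = exec_time[a] + max((lvl(s) for s in succ.get(a, [])),
                                          default=0)
        return level[a]
    preds = {a: [] for a in exec_time}
    for a, ss in succ.items():
        for s in ss:
            preds[s].append(a)
    finish, proc_free, schedule = {}, [0] * n_procs, []
    # Descending level is a topological order when run times are positive.
    for a in sorted(exec_time, key=lambda a: -lvl(a)):
        ready = max((finish[p] for p in preds[a]), default=0)
        p = min(range(n_procs), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[p], ready)
        finish[a] = start + exec_time[a]
        proc_free[p] = finish[a]
        schedule.append((a, p, start))
    return schedule, max(finish.values())

# A diamond-shaped precedence graph on two processors.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
times = {"A": 1, "B": 1, "C": 1, "D": 1}
sched, makespan = list_schedule(succ, times, 2)
print(makespan)   # 3: B and C run in parallel after A, then D
```

Each actor is examined against each processor once, which is the source of the roughly n² cost the text mentions.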
`
Static scheduling promises low-cost architectures, at the expense of compile-time complexity. For many applications, this is a very attractive tradeoff. However, only some applications can be statically scheduled. The SDF model, which can be statically scheduled, may appear to lack control constructs because it does not permit data-dependent firing of actors. However, this is not entirely true. Some control structures are possible within SDF, notably recurrences, manifest iteration, and conditional assignment.
`
`3.2. Recurrences
`
The dataflow community has recognized the importance of supporting recursion, or self-referential function calls. To some extent, this ability has become a litmus test for the utility of a dataflow model. The most common implementation, however, dynamically creates and destroys instances of actors. This is clearly going to be problematic for a static scheduler.

In imperative languages, recursion is used to implement recurrences and iteration, usually in combination. If we avoid the notion of "function calls," at least some recurrences can be simply represented as feedback paths in a dataflow program graph. This section studies the representation of recurrences using feedback. This representation poses no difficulty for static scheduling, although to some it lacks the elegance of recursion.
`
Recurrences depend on the notion of "delays." Once understood, this notion can be used to explain fundamental limits on the concurrency in SDF graphs. It can also be used to relate SDF to static dataflow [14]. This is done below.
`
3.2.1. Delays
`
A dataflow graph with recurrence is represented schematically in Fig. 5. This graph is assumed to fire repeatedly. In terminology borrowed from the signal processing community, the feedback path has a delay, indicated with a diamond, which can be implemented simply as an initial token on the arc. A set of delays in a dataflow graph corresponds to a marking in Petri nets [45] or to the "D" tag manipulation operator in the U-interpreter [1]. In fact, the symbol D was selected to suggest delay [3]. A necessary (but not sufficient) condition for avoiding deadlock in an SDF graph is that any directed loop in the graph must have at least one delay.

FIG. 5. A dataflow graph with a recurrence. Recurrences are expressed using directed loops and delays.
A delay does not correspond to unit time delay, but rather to a single token offset. Such delays are sometimes called logical delays or separators to distinguish them from time delays [29]. For SDF graphs, a logical delay need not be a run-time operation. Consider for example the feedback arc in Fig. 5, which has a unit delay. The numbers adjacent to the arcs indicate the number of tokens produced or consumed when the corresponding actor fires. The initial token on the arc means that the corresponding input of actor A has sufficient data, so when a token arrives on its other input, it can fire. The second time it fires, it will consume data from the feedback arc that is produced by the first firing of actor B. In steady state, the nth firing of actor B will produce a token that will be consumed by actor A on its (n + 1)th firing; hence the arc has unit token offset. The value of the initial token can be set by the programmer, so a delay can be used to initialize a recurrence. When the initial value is other than zero, we indicate it using the notation D(value). Since delays are simply initial conditions on the buffers, they require no run-time overhead. In Gabriel, a delay is a property of an arc in the dataflow graph, rather than an actor.
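The role of the initial token can be seen in a toy simulation of the recurrence of Fig. 5, assuming homogeneous actors A and B; the arc names are invented. The delay token lets A fire first, after which A and B alternate exactly as described, with B's nth output consumed by A's (n + 1)th firing.

```python
# Minimal token-count simulation of the Fig. 5 recurrence.  The unit
# delay on the feedback arc B->A is nothing but one initial token.

tokens = {"in->A": 1, "A->B": 0, "B->A": 1}   # delay = initial token

trace = []
for _ in range(3):                            # three iterations
    if tokens["in->A"] and tokens["B->A"]:    # A needs both inputs
        tokens["in->A"] -= 1; tokens["B->A"] -= 1
        tokens["A->B"] += 1
        trace.append("A")
    if tokens["A->B"]:                        # B consumes A's output
        tokens["A->B"] -= 1
        tokens["B->A"] += 1                   # feeds A's next firing
        trace.append("B")
    tokens["in->A"] += 1                      # next external token arrives

print(trace)                                  # ['A', 'B', 'A', 'B', 'A', 'B']
```

Note that nothing here is a run-time mechanism in the ordered-memory architecture; the simulation only makes visible the precedence pattern that the scheduler analyzes at compile time.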
3.2.2. Bounds on Concurrency
`
Consider nonterminating algorithms, or algorithms that operate on a large data set. For these, directed loops are the only fundamental limitation on the parallelizability of the algorithm. This is intuitive because any algorithm without recurrences can be pipelined. A special case of SDF, called homogeneous SDF, is where every actor produces and consumes a single token on each input and output. For homogeneous SDF graphs, it is easy to compute the minimum period at which an actor can be fired. This is called the iteration period bound and is the reciprocal of the maximum computation rate. The iteration period bound may be much smaller than the time required to compute one pass through the dataflow graph; it is a limit on the time per pass if an infinite number of passes are computed.

Let R(L) be the sum of the execution times of the actors in a directed loop L. The iteration period bound is the maximum over all directed loops L of R(L)/D(L), where D(L) is the number of delays in L [47, 10]. The directed loop L that yields this maximum is called the critical loop. General SDF graphs can be systematically converted to homogeneous SDF graphs for the purpose of computing the iteration period bound [34]. If there are no directed loops in the graph, then we define the iteration period bound to be zero, since in principle all firings of each node could occur simultaneously. It is important to realize that there is nothing fundamental in the following discussion that prevents this. Implementation considerations may make it impractical, however.

Another limitation on concurrency is the notion of state. Particularly in large- or medium-grain dataflow graphs, it is convenient to permit an actor to remember data from one invocation to the next. This is simply modeled as a self-loop with a unit delay. Such a self-loop precludes multiple simultaneous invocations of the actor; hence this self-loop may become the critical loop.

Once the iteration period bound is known, we can derive a bound on the performance of an ordered-memory architecture, on the basis of a set of (admittedly) unrealistic assumptions. First, assume that we have a completely deterministic dataflow graph, and assume that there are enough processors that a hypothetically optimal scheduler can meet the iteration period bound. The iteration period bound does not reflect bandwidth or latency limitations on interprocessor communication, however. For the ordered-memory architecture of Fig. 3, a memory transaction can occur in one cycle of the shared memory. If we assume that the shared-memory cycle time is the same as local-memory cycle time,¹ then latency adds nothing to the iteration period. Bandwidth limitations, however, may add to the iteration period. Each time the ideal scheduler schedules two simultaneous memory transactions, one of them must be delayed. If one of them is not in the critical path, then that one should be delayed, and there may again be no effect on the iteration period. If both are in the critical path, then the iteration period will be extended by one cycle. If three transactions are scheduled simultaneously, then one of the transactions has to be delayed two cycles, increasing the iteration period by at most two cycles. If M simultaneous transactions are scheduled, then the iteration period increases by at most M − 1 cycles. If the total number of transactions is T, then the very worst situation increases the iteration bound by at most T − 1 cycles.

Suppose now that 10 processors are each running a program that accesses shared memory 10% of the time. Then
`
¹ This is actually not a bad assumption for the architecture in Fig. 3, if the number of processors is modest. The main limitation on shared-memory cycle time is likely to be capacitive loading on the shared bus, and the price of the memory, of course.
`
`
this bound tells us that the ordered shared-memory architecture can run this program in at most twice the time of the theoretical minimum. However, this result should not be taken very seriously because the performance will depend much more heavily on the scheduling heuristics used. The performance can be better (since simultaneous transactions can be studiously avoided [48]) but can also be worse (if, as is likely, a suboptimal scheduling algorithm is used). Furthermore, the use of gateways completely undermines this analysis. Thus, this is not a very useful bound.
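For a homogeneous SDF graph, the iteration period bound of Section 3.2.2 can be computed directly from its definition, the maximum over directed loops L of R(L)/D(L). The following is a small sketch with an invented graph encoding; the DFS loop enumeration is adequate only for small graphs, and zero-delay loops (which would deadlock) are skipped.

```python
def iteration_period_bound(edges, run_time):
    # edges: (src, dst, delays) arcs of a homogeneous SDF graph
    out = {}
    for s, d, delays in edges:
        out.setdefault(s, []).append((d, delays))
    best = 0.0
    def dfs(start, node, r, d, seen):
        nonlocal best
        for nxt, delays in out.get(node, []):
            if nxt == start and d + delays > 0:
                best = max(best, r / (d + delays))   # R(L)/D(L) for this loop
            elif nxt not in seen and nxt > start:    # canonical start avoids dups
                dfs(start, nxt, r + run_time[nxt], d + delays, seen | {nxt})
    for n in run_time:
        dfs(n, n, run_time[n], 0, {n})
    return best

# The recurrence of Fig. 5 with invented run times: A takes 2, B takes 3,
# and one delay sits on the feedback arc, so the bound is (2 + 3)/1 = 5.
print(iteration_period_bound([("A", "B", 0), ("B", "A", 1)],
                             {"A": 2, "B": 3}))   # 5.0
```

A state-bearing actor's self-loop with a unit delay enters this computation like any other loop, which is how it can become the critical loop.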
`
3.2.3. Bounded Buffer Sizes
`
Although SDF actors cannot be created at run time, SDF is not the same as static dataflow [14]. For instance, in SDF, there is no impediment to having multiple instances of an actor fire simultaneously, as long as the actor does not have state. A particular implementation, however, may impose such a constraint. Consider for example an implementation that permits no more than one memory location to be associated with each arc. This is the key limitation in static dataflow [14]. It can be modeled with the recurrence in Fig. 5. The feedback arc begins with an initial token. This token represents a "space" in the output buffer of actor A. After A fires and consumes that token, it cannot fire again until after B has fired. Any memory limitation on any arc in an SDF graph can be modeled as a feedback path with a fixed number of delays. To avoid unnecessarily sacrificing concurrency, enough memory should be allocated to each arc that the corresponding feedback path does not become the critical loop.
`
Suppose that, in Fig. 5, actors A and B are scheduled onto different processors. In a conventional shared-memory architecture, any buffer size limitation implies handshaking at run time. In effect, the feedback path in Fig. 5 has to be implemented at run time, just to carry semaphores that indicate when it is safe to write to the feedforward buffer. For the ordered-memory architecture, however, buffer size limitations can be statically modeled by the scheduler. They imply no additional run-time overhead. In Fig. 5, the scheduler knows that the write from A and the read by B must alternate. Since the order of the transactions is enforced at run time without semaphores, no additional overhead is incurred.
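The buffer-sizing rule implied above reduces to simple arithmetic in the two-actor case: a buffer of b locations between A and B behaves like a feedback arc carrying b delays, creating a loop whose bound is (R_A + R_B)/b. Keeping that artificial loop off the critical path means choosing the smallest b whose contribution does not exceed the graph's existing iteration period bound. A sketch with invented numbers:

```python
from math import ceil

# Smallest buffer size b such that the modeled feedback loop's bound,
# (r_a + r_b) / b, does not exceed the graph's iteration period bound.

def min_buffer_size(r_a, r_b, period_bound):
    return ceil((r_a + r_b) / period_bound)

print(min_buffer_size(2, 3, 5))   # 1: one location suffices (static dataflow)
print(min_buffer_size(2, 3, 1))   # 5 locations keep the loop off the critical path
```

Because the allocation happens at compile time, a generous b costs only memory, never synchronization.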
`
3.3. Manifest Iteration
`
FIG. 6. An SDF graph that contains nested iteration.

FIG. 7. A modification of Fig. 6 to model the effect of a buffer of length 10 between actors B and C.

In manifest iteration, the number of repetitions of a computation is known at compile time and hence is independent
`
of the data. It can be statically scheduled. Furthermore, it can be expressed in dataflow graphs by specifying the number of tokens produced and consumed each time an actor fires. For example, actor A in Fig. 6 produces 10 tokens each time it fires, as indicated by the "10" adjacent to its output. Actor B consumes 1 token each time it fires, so it will fire 10 times for every firing of actor A. In conventional programming languages, this would be expressed with a for loop. Nested for loops are easily conceived as shown in Fig. 6. If actors A and E fire once each, then B and D will fire 10 times, and C will fire 100 times. Techniques for automatically constructing static parallel schedules for such graphs are given in [35] and [43].

There is no fundamental limitation on the parallelism in Fig. 6 (there are no directed loops). Hence, this scheme solves the first open problem listed by Dennis in [13], providing the semantics of a "parallel-for" in dataflow.
`
3.3.1. Bounded Buffers
`
Although there is no fundamental limitation on the parallelism in Fig. 6 (there are no directed loops), there may be practical limitations. In Fig. 7, we model a buffer of length 10 between actors B and C. Again, the tokens on the feedback path represent empty locations in the buffer. Actor B must have 10 tokens on the feedback path (i.e., 10 empty locations in the buffer) before it fires. Whenever actor C fires, it consumes 1 token from the forward path, freeing a buffer location, and indicating the free buffer location by putting a token on the feedback path. The minimum buffer size that avoids deadlock is 10.
`
This nonhomogeneous SDF graph could be converted to a homogeneous SDF graph and the iteration period bound computed, but in this simple example the iteration period bound is easily seen by inspection. It is clear that after each firing of B, C must fire 10 times before B can fire again. The 10 firings can occur in parallel, so the minimum period of a periodic schedule is R_B + R_C, where R_X is the run time of actor X. In other words, successive firings of B cannot occur in parallel because of the buffer space limitations. By contrast, if the buffer had length 100, then 10 invocations of B could fire simultaneously, assuming there are no other practical difficulties. Just as with unit-length buffers, no additional synchronization overhead is required in the ordered-memory architecture to support these bounded buffers.
`
`
`
`
3.3.2. Static Buffers

A second limitation on the parallelism can arise from the addressing mechanism of the buffers. Each buffer can be implemented as a FIFO queue, as was done in Davis' DDM [12]. Delays are correctly handled, but then access to the buffer becomes a critical section of the parallel code. FIFO queues are economically implemented as circular buffers with pointers to the read and write locations. However, parallel access to the pointers becomes a problem. If successive invocations of an actor are to fire simultaneously on several processors, then great care must be taken to ensure the integrity of the pointers. A typical approach would be to lock the pointers while one processor has control of the FIFO queue, but this partially serializes the implementation. Furthermore, this requires that the hardware support an indivisible test-and-set operation.

In the ordered-memory architecture, the FIFO implementation can be made simpler than in a general shared-memory architecture, but a less expensive alternative is static buffering [36]. Static buffering is based on the observation that there is a periodicity in the buffer access that a compiler can exploit. It preserves the behavior of FIFO queues (namely, it correctly handles delays and ordering of tokens), but avoids read and write pointers. Specifically, suppose that all buffers are implemented with fixed-length circular buffers, implementing FIFO queues, where each length has been predetermined to be long enough to sustain the run without causing a deadlock. Then consider an input of any actor in an SDF graph. Every N firings, where N is to be determined, the actor will get its input token(s) from the same memory locations. The compiler can hard-code these memory locations into the implementation, bypassing the need for pointers to the buffer. Systematic methods for doing this, developed in [36], can be illustrated by example. Consider the graph in Fig. 7, which is a representation of Fig. 6 with the buffer between B and C assigned the length 10. A parallel implementation of this can be represented as follows:

FIRE A
DO ten times {
    FIRE B
    DO in parallel ten times {
        FIRE C
    }
    FIRE D
}
FIRE E

For each parallel firing of C, the compiler supplies a specific memory location for it to get its input tokens. Note that this would not be possible if the FIFO buffer had length 11, for example, because the second time the inner DO loop is executed the memory locations accessed by C would not be the same as the first time. But with a FIFO buffer of length 10, invocations of C need not access the buffer through pointers, so there is no contention for access to the pointers. The buffer data can be supplied to all 10 firings in parallel, assuming the hardware has a mechanism for doing this. In the ordered-memory architecture, the 10 firings cannot be initiated simultaneously, because of bus bandwidth limitations. However, they can be initiated at intervals of one shared-bus cycle. If this cycle time is small compared to the execution time of the actors, then the concurrency in the parallel-for is adequately exploited.

An alternative to static buffering that also permits parallel firings of successive instances of the same actor is token matching [1]. However, even the relatively low cost of some implementations of token matching [44] would be hard to justify for SDF graphs, where static buffering can be used.

In Fig. 6 we use actors that produce more tokens than they consume, or consume more tokens than they produce. Proper design of these actors can lead to iteration constructs semantically similar to those encountered in conventional programming languages. In Fig. 8 we show three such actors that have proved useful in DSP applications. The first, Fig. 8a, simply emits the last of N tokens, where N is a parameter of the actor. The second, Fig. 8b, takes one input token and repeats it on the output. The third, Fig. 8c, takes one input token each time it fires and emits the last N tokens that arrived. It has a self-loop used to remember the past tokens (and initialize them). This can be viewed as the state of the actor; it effectively prevents multiple simultaneous invocations of the actor.

FIG. 8. Three SDF actors useful for iteration.

A complete iteration model must include the ability to nest recurrences within the iteration and to initialize the recurrences when the iteration begins. The SDF model can handle this. We illustrate this with a finite impulse response (FIR) digital filter because it is a simple example. An FIR filter computes the inner product of a vector of coefficients and a vector with the last N input tokens, where N is the order of the filter. It is usually assumed to repeat forever, firing each time a new input token arrives. Consider the possible implementations using a dataflow graph. A large-grain approach is to define an actor with the implementation details hidden inside. This is the preferred approach in Gabriel. An alternative is a fine-grain implementation with multiple adders and multipliers and a delay line. A third possibility is to use iteration and a single adder and multiplier. The first
`
`
and last possibilities have the advantage that the complexity of the dataflow graph is independent of the order of the filter. A good compiler should be able to do as well with any of the three structures. One implementation of the last possibility is shown in Fig. 9. The iteration actors are drawn from Fig. 8. The coefficients actor simply outputs a stream of N coefficients; it produces one coefficient each time it fires and reverts to the beginning of the coefficient list after reaching the end. It could be implemented with a directed loop with N delays, or a number of other ways. The product of the input data and the coefficients is accumulated by the adder with a feedback loop. The output of the filter is selected by the "last of N" actor.
`
The FIR filter in Fig. 9 has the advantage of exploitable concurrency combined with a graph complexity that is independent of the order of the filter. Note, however, that there is a difficulty with the feedback loop at the adder. Recall from the above that a delay is simply an initial token on the arc. If this initial token has value zero, then the first output of the FIR filter will be correct. However, after every N firings of the adder, we wish to reset the token on that arc to zero. This could be done with some extra actors, but a fundamental difficulty would remain. The presence of that feedback loop implies a limitation on the parallelism of the FIR filter, and that limitation would be an artifact of our implementation. Our solution is to introduce the notion of a resetting delay, indicated with a diamond containing an R.
`
3.3.3. Resetting Delays
`
A resetting delay is associated with a subgraph, which in Fig. 9 is surrounded by a dashed line. For each invocation of the subgraph, the delay token is reinitialized to zero. Furthermore, the scheduler knows that the precedence is broken when this occurs, and consequently it can schedule successive FIR output computations simultaneously on separate processors.

The resetting delay can be used in any SDF graph where we have nested iterations where the inner iterations involve recurrences that must be initialized. In other words, anything of the form
`
DO some number of times {
    initialize X
    DO some number of times {
        new X = f(X)
    }
}

FIG. 9. An FIR filter implemented using a single multiplier and adder.
`
The implementation of a resetting delay is simple and general. For the purposes of implementation, the scheduler first treats the delay as if it were an actor that consumes one token and produces one token each time it fires. Recall that in practice no run-time operation is required to implement a delay, so there actually is no such actor. However, by inserting this mythical actor, the scheduler can determine how many times it would fire (if it did exist) for each firing of the associated subgraph. The method for doing this is given in [35] and consists of solving a simple system of equations. For each resetting delay, the scheduler obtains a number N of invocations between resets; this number is used to break the precedence of the arc for every Nth token and to insert object code that reinitializes the delay value. The method works even if the subgraph is not invoked as a unit and even if it is scattered among the available processors. It is particularly simple when inline code is generated. However, when the iteration is implemented by the compiler using loops, then a small amount of run-time overhead may have to be associated with some delays in order to count invocations.

So far we have shown that neither manifest iteration nor recurrences present a fundamental problem for the SDF model. Resetting delays can be used to initialize recurrences within nested iterations. Hence corresponding programming constructs can be efficiently and automatically implemented on an ordered-memory architecture. Conditionals are a bit more problematic.
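The behavior a resetting delay provides can be mirrored in ordinary code: the accumulator recurrence is reinitialized once per output sample, that is, once every N firings of the adder. The following is a hypothetical Python sketch of the Fig. 9 computation, not generated code, with invented function and variable names.

```python
# Sketch of the FIR filter of Fig. 9.  The inner recurrence (the
# adder's accumulator feedback) is reset to zero once per outer
# invocation, which is exactly what the resetting delay expresses.

def fir(coefficients, samples):
    n = len(coefficients)
    history = [0.0] * n              # the "last N tokens" actor's state
    out = []
    for x in samples:
        history = [x] + history[:-1]
        acc = 0.0                    # resetting delay: D(0) every N firings
        for c, h in zip(coefficients, history):
            acc = acc + c * h        # the adder's feedback recurrence
        out.append(acc)              # "last of N" actor selects the sum
    return out

print(fir([0.5, 0.5], [1.0, 2.0, 3.0]))   # [0.5, 1.5, 2.5]
```

Because the reset breaks the precedence between successive output computations, a scheduler is free to place consecutive iterations of the outer loop on different processors.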
`
`3.4. Conditional Assignment
`
Conditionals in dataflow graphs are harder to describe and schedule statically. One attractive solution is a mixed-mode programming environment, where the programmer can use dataflow at the highest level and conventional languages such as C at a lower level. Gabriel is precisely such an environment. Conditionals would be expressed in the conventional language. This is only a partial solution, however, because conditionals would be restricted to lie entirely within one large-grain actor, and concurrency within such actors is difficult to exploit. If the complexity of the operations that ar