… software-controlled paging rather than dynamic caches [37, 38].
2.4. Statically Ordered Data Structure Access
The lack of side effects in dataflow actors makes it particularly difficult to support large, shared data structures [20]. Arvind et al. have therefore extended the dataflow model by introducing I-structures, a controlled form of global data structures [2]. I-structures are write-once data structures with nonstrict semantics, which in practice means that reads may be issued before data are available in the data structure. Support for I-structures requires an ability to queue read requests until they can be satisfied. This mechanism is the most promising available, but it does not come cheaply. One extra memory location is required for each read instruction that can be simultaneously pending. In addition, the I-structure memory needs a processor of a sort to tend to pending reads when a write finally occurs. A much simpler mechanism may be used when scheduling is static.
Consider for example an actor that emits an array. This array might be carried by a single token. Suppose that there are two actors that take this array as an argument. A pure dataflow model requires that the array be copied, or at least that an implementation behave as if the array had been copied. Using an I-structure avoids this copying. However, with ordered memory accesses the copying is not necessary, and neither is the I-structure memory. Since the scheduler is aware of all precedences, it will avoid scheduling reads before the data become available. If this cannot be avoided (a processor has nothing to do until the data become available), then the read is attempted before the bus is granted to the processor, so the processor halts. The bus will not be granted to the processor until the data are ready. There is no need to queue accesses.
When data pass through the shared memory from one actor to another actor, the scheduler can reclaim the token storage after scheduling the read by the destination. Write-once shared data structures are only slightly more complicated, because there may be more than one destination actor. The scheduler can simply use reference counts (RCs) [26, 16] to determine when the memory can be reclaimed. For the above example, the RC associated with the array storage would be initialized to 2, the number of destinations, when the array is scheduled to be written. Each time a read is scheduled, the RC is decremented. When it reaches 0, the memory can be reclaimed. This works without any run-time overhead because the order of these transactions will be enforced at run time.
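To make the mechanism concrete, here is a minimal sketch of reference counting performed entirely by the scheduler; the classes and method names are our own illustration, not part of Gabriel or the ordered-memory architecture.

    # Compile-time reference counting: the *scheduler* decrements the RC as
    # it schedules each read; nothing is counted at run time. All names here
    # are illustrative assumptions.

    class Block:
        def __init__(self, name, destinations):
            self.name = name
            self.rc = destinations   # number of destination actors, known statically

    class Scheduler:
        def __init__(self):
            self.reclaimable = []    # storage that later tokens may reuse

        def schedule_write(self, name, destinations):
            return Block(name, destinations)

        def schedule_read(self, block):
            block.rc -= 1            # happens while building the schedule
            if block.rc == 0:
                self.reclaimable.append(block)

    sched = Scheduler()
    arr = sched.schedule_write("array", destinations=2)  # the example above
    sched.schedule_read(arr)   # first destination's read is scheduled
    sched.schedule_read(arr)   # second read scheduled; storage reclaimed here
    assert arr in sched.reclaimable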
Many variations of this idea immediately come to mind. For example, reference counts could be used for each element of the array instead of for the whole array, thereby obtaining some of the advantages of the nonstrictness of I-structures. Specifically, the array does not have to be completely filled before some of its elements can be read. Also, if the RC of a data structure is identically one, then an actor using it may modify it, instead of simply reading it, something not permitted in the write-once I-structures. An intelligent code generator can get considerable mileage out of this.
The reference count technique has been criticized for a number of reasons [3], most of which break down when the scheduling is static. Primarily, for ordered-memory architectures, the overhead of managing RCs is incurred at scheduling time, not at run time.
3. STATICALLY SCHEDULED CONTROL
The ordered-memory architecture and the static shared data structures seem to provide a very clean solution to some vexing problems. However, they are only applicable when fully static or self-timed scheduling is possible. Although this imposes some serious constraints, the constraints are less serious than they may appear at first.

The programming environment called Gabriel [39], designed for signal processing applications, is based on graphical dataflow representations of algorithms. Although specialized to signal processing, this environment has permitted extensive experimentation with scheduling algorithms and target architectures and with a style of programming that matches the need for static scheduling. As mentioned before, we have implemented a software simulation of a four-processor ordered-memory architecture [5] using the Frigg hardware simulation environment [4] and have retargeted Gabriel to this architecture. Hence, we have been able to gain some experience compiling and running real programs on this architecture.
The granularity of the actors in Gabriel is arbitrary, varying from simple arithmetic operators up to high-level signal processing functions such as FFTs. Gabriel translates dataflow graphs into sequential assembly code for programmable DSPs, performing the scheduling statically for multiple processors. A typical signal processing application contains at most 100s of actors, so we can experiment with rather complex scheduling algorithms without getting bogged down.

To be able to schedule computations statically, Gabriel restricts the dataflow model to a subclass called synchronous dataflow (SDF) [35]. We begin this section with a review of the properties of this subclass and then continue by showing that it is not as limited as it might at first appear. In particular, we show that it supports recurrences, manifest iteration, and conditional assignment, but does not support true recursion, data-dependent iteration, or conditional evaluation.
3.1. Synchronous Dataflow
A subclass of dataflow graphs lacking data dependency is well suited to static scheduling. Precisely, the term "synchronous dataflow" has been coined to describe graphs that have the following property [35]:
SDF Property. A synchronous actor produces and consumes a fixed number of tokens on each of a fixed number of input and output paths. An SDF graph consists only of synchronous actors.
The basic constraint is that the number of tokens produced or consumed cannot depend on the data. An immediate consequence is that SDF graphs cannot have data-dependent firing of actors, as one might find, for example, in an if-then-else construct. In exchange for this limitation, we gain some powerful analytical and practical properties [35, 36]:
(1) For SDF graphs, the number of firings of each actor can be easily determined at compile time (a sketch of this computation follows this list). If the program is nonterminating, as, for example, in real-time DSP, then a periodic schedule is always possible, and the number of firings of actors within each cycle can be determined at compile time. In either case, knowing these numbers makes it possible to construct a deterministic acyclic precedence graph. If the execution time of each actor is deterministic and known, then the acyclic precedence graph can be used to construct optimal or near-optimal schedules.

(2) For nonterminating programs it is important to verify that memory requirements are bounded. This can be done at compile time for SDF graphs.

(3) Starvation conditions, in which a program halts due to deadlock, may not be intentional. For any SDF graph, it can be analytically determined whether deadlock conditions exist.

(4) If the execution time of each actor is known, then the maximum execution speed of an SDF graph can be determined at compile time. For terminating programs, this means finding the minimum makespan of a schedule. For nonterminating programs, this means finding the minimum period of a periodic schedule.

(5) For any nonterminating SDF graph executing according to a periodic schedule, it is possible to buffer data between actors statically. Static buffering means loosely that neither FIFO queues nor dynamically allocated memory are required. More specifically, it means that the compiler can statically associate memory locations with actor firings. These memory locations contain the input data and provide a repository for the output data.
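As an illustration of property (1), the following sketch (our code, in the spirit of the balance equations of [35], not the paper's own implementation) computes the number of firings of each actor in one periodic schedule. It assumes a connected graph; the arc rates used in the example correspond to a nested-iteration graph like the one discussed in Section 3.3.

    from fractions import Fraction
    from math import lcm

    def repetitions(arcs, actors):
        """Solve the balance equations: for each arc (src, p, dst, c),
        rate[src] * p == rate[dst] * c. Returns the smallest integer
        repetition vector, or raises if no periodic schedule exists."""
        rate = {a: None for a in actors}
        rate[actors[0]] = Fraction(1)
        changed = True
        while changed:
            changed = False
            for src, p, dst, c in arcs:
                if rate[src] is not None and rate[dst] is None:
                    rate[dst] = rate[src] * p / c
                    changed = True
                elif rate[dst] is not None and rate[src] is None:
                    rate[src] = rate[dst] * c / p
                    changed = True
                elif rate[src] is not None and rate[dst] is not None \
                        and rate[src] * p != rate[dst] * c:
                    raise ValueError("inconsistent rates: no periodic schedule")
        scale = lcm(*(r.denominator for r in rate.values()))
        return {a: int(r * scale) for a, r in rate.items()}

    # Nested iteration with rates as in Fig. 6 (Section 3.3):
    arcs = [("A", 10, "B", 1), ("B", 10, "C", 1),
            ("C", 1, "D", 10), ("D", 1, "E", 10)]
    print(repetitions(arcs, ["A", "B", "C", "D", "E"]))
    # -> {'A': 1, 'B': 10, 'C': 100, 'D': 10, 'E': 1}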
These properties are extremely useful for constructing parallelizing compilers, but they apply only to SDF graphs, and optimal schedules can be constructed only when the execution times of the actors are known. We have been developing techniques that weaken the SDF constraint, thus supporting more general dataflow graphs without resorting to fully dynamic control [23]. However, these techniques require modification of the MOMA controller of the ordered-memory architecture. There is still much work to be done to find the best design parameters, so in this paper we retain the SDF constraint.
Optimal compile-time scheduling of precedence graphs derived from SDF graphs is one of the classic NP-complete scheduling problems. Many simple heuristics have been developed over time, with some very effective ones having complexity n², where n is the number of actors (see for example [25]). However, even n² complexity can bog down a compiler. Fortunately, the granularity of dataflow actors in Gabriel and the small size of many signal processing applications mean that we can ignore this problem for now. To generalize these methods beyond signal processing applications, strategies will probably be needed to cluster sets of actors into macro actors, thus reducing the number of actors to be considered in constructing a schedule. For example, the clustering method proposed in [31] seems suitable.
Static scheduling promises low-cost architectures at the expense of compile-time complexity. For many applications, this is a very attractive tradeoff. However, only some applications can be statically scheduled. The SDF model, which can be statically scheduled, may appear to lack control constructs because it does not permit data-dependent firing of actors. However, this is not entirely true. Some control structures are possible within SDF, notably recurrences, manifest iteration, and conditional assignment.
3.2. Recurrences
The dataflow community has recognized the importance of supporting recursion, or self-referential function calls. To some extent, this ability has become a litmus test for the utility of a dataflow model. The most common implementation, however, dynamically creates and destroys instances of actors. This is clearly going to be problematic for a static scheduler.

In imperative languages, recursion is used to implement recurrences and iteration, usually in combination. If we avoid the notion of "function calls," at least some recurrences can be simply represented as feedback paths in a dataflow program graph. This section studies the representation of recurrences using feedback. This representation poses no difficulty for static scheduling, although to some it lacks the elegance of recursion.
Recurrences depend on the notion of "delays." Once understood, this notion can be used to explain fundamental limits on the concurrency in SDF graphs. It can also be used to relate SDF to static dataflow [14]. This is done below.
3.2.1. Delays
A dataflow graph with recurrence is represented schematically in Fig. 5. This graph is assumed to fire repeatedly. In terminology borrowed from the signal processing community, the feedback path has a delay, indicated with a diamond, which can be implemented simply as an initial token on the arc. A set of delays in a dataflow graph corresponds to a marking in Petri nets [45] or to the "D" tag manipulation operator in the U-interpreter [1]. In fact, the symbol D was selected to suggest delay [3]. A necessary (but not sufficient) condition for avoiding deadlock in an SDF graph is that any directed loop in the graph must have at least one delay.

FIG. 5. A dataflow graph with a recurrence. Recurrences are expressed using directed loops and delays.

A delay does not correspond to a unit time delay, but rather to a single token offset. Such delays are sometimes called logical delays to distinguish them from time delays [29]. For SDF graphs, a logical delay need not be a run-time operation. Consider for example the feedback arc in Fig. 5, which has a unit delay. The numbers adjacent to the arcs indicate the number of tokens produced or consumed when the corresponding actor fires. The initial token on the arc means that the corresponding input of actor A has sufficient data, so when a token arrives on its other input, it can fire. The second time it fires, it will consume data from the feedback arc that is produced by the first firing of actor B. In steady state, the nth firing of actor B will produce a token that will be consumed by actor A on its (n + 1)th firing; hence the arc has a unit token offset. The value of the initial token can be set by the programmer, so a delay can be used to initialize a recurrence. When the initial value is other than zero, we indicate it using the notation D(value). Since delays are simply initial conditions on the buffers, they require no run-time overhead. In Gabriel, a delay is a property of an arc in the dataflow graph, rather than an actor.
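The role of an initial token can be seen in a toy executable model (our construction, for a hypothetical homogeneous loop between actors A and B): with one delay on the feedback arc the loop keeps firing, and with none it deadlocks, as the necessary condition above predicts.

    def run(feedback_tokens, steps=6):
        # Arc state: external input to A, forward arc A->B, feedback arc B->A.
        tokens = {"in": 3, "AB": 0, "BA": feedback_tokens}
        fired = []
        for _ in range(steps):
            if tokens["in"] > 0 and tokens["BA"] > 0:   # A needs both inputs
                tokens["in"] -= 1; tokens["BA"] -= 1; tokens["AB"] += 1
                fired.append("A")
            elif tokens["AB"] > 0:                      # B consumes A's output
                tokens["AB"] -= 1; tokens["BA"] += 1
                fired.append("B")
            else:
                fired.append("deadlock")
                break
        return fired

    print(run(feedback_tokens=1))  # ['A', 'B', 'A', 'B', 'A', 'B']
    print(run(feedback_tokens=0))  # ['deadlock']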
3.2.2. Bounds on Concurrency
Consider nonterminating algorithms, or algorithms that operate on a large data set. For these, directed loops are the only fundamental limitation on the parallelizability of the algorithm. This is intuitive because any algorithm without recurrences can be pipelined. A special case of SDF, called homogeneous SDF, is one where every actor produces and consumes a single token on each input and output. For homogeneous SDF graphs, it is easy to compute the minimum period at which an actor can be fired. This is called the iteration period bound and is the reciprocal of the maximum computation rate. The iteration period bound may be much smaller than the time required to compute one pass through the dataflow graph; it is a limit on the time per pass if an infinite number of passes are computed.

Let R(L) be the sum of the execution times of the actors in a directed loop L. The iteration period bound is the maximum over all directed loops L of R(L)/D(L), where D(L) is the number of delays in L [47, 10]. The directed loop L that yields this maximum is called the critical loop. General SDF graphs can be systematically converted to homogeneous SDF graphs for the purpose of computing the iteration period bound [34]. If there are no directed loops in the graph, then we define the iteration period bound to be zero, since in principle all firings of each node could occur simultaneously. It is important to realize that there is nothing fundamental in the following discussion that prevents this; implementation considerations may make it impractical, however.

Another limitation on concurrency is the notion of state. Particularly in large- or medium-grain dataflow graphs, it is convenient to permit an actor to remember data from one invocation to the next. This is simply modeled as a self-loop with a unit delay. Such a self-loop precludes multiple simultaneous invocations of the actor; hence this self-loop may become the critical loop.

Once the iteration period bound is known, we can derive a bound on the performance of an ordered-memory architecture, on the basis of a set of (admittedly) unrealistic assumptions. First, assume that we have a completely deterministic dataflow graph, and assume that there are enough processors that a hypothetically optimal scheduler can meet the iteration period bound. The iteration period bound does not reflect bandwidth or latency limitations on interprocessor communication, however. For the ordered-memory architecture of Fig. 3, a memory transaction can occur in one cycle of the shared memory. If we assume that the shared-memory cycle time is the same as the local-memory cycle time,¹ then latency adds nothing to the iteration period. Bandwidth limitations, however, may add to the iteration period. Each time the ideal scheduler schedules two simultaneous memory transactions, one of them must be delayed. If one of them is not in the critical path, then that one should be delayed, and there may again be no effect on the iteration period. If both are in the critical path, then the iteration period will be extended by one cycle. If three transactions are scheduled simultaneously, then one of the transactions has to be delayed two cycles, increasing the iteration period by at most two cycles. If M simultaneous transactions are scheduled, then the iteration period increases by at most M - 1 cycles. If the total number of transactions is T, then the very worst situation increases the iteration bound by at most T - 1 cycles.

¹ This is actually not a bad assumption for the architecture in Fig. 3, if the number of processors is modest. The main limitation on shared-memory cycle time is likely to be capacitive loading on the shared bus, and the price of the memory, of course.
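To make the definitions concrete, the sketch below (ours, not from the paper) computes the iteration period bound of a homogeneous SDF graph by enumerating its simple cycles; the graph encoding and execution times are hypothetical.

    # Iteration period bound = max over directed loops L of R(L)/D(L).
    # Edges are (src, dst, delays); exec_time gives each actor's run time.

    def simple_cycles(edges, nodes):
        """Yield each simple cycle once, as a list of actors in loop order."""
        adj = {n: [d for s, d, _ in edges if s == n] for n in nodes}
        order = {n: i for i, n in enumerate(sorted(nodes))}

        def dfs(start, node, path):
            for nxt in adj[node]:
                if nxt == start:
                    yield list(path)
                elif order[nxt] > order[start] and nxt not in path:
                    path.append(nxt)
                    yield from dfs(start, nxt, path)
                    path.pop()

        for s in sorted(nodes):
            yield from dfs(s, s, [s])

    def iteration_period_bound(edges, exec_time):
        nodes = {s for s, _, _ in edges} | {d for _, d, _ in edges}
        delay = {(s, d): w for s, d, w in edges}  # assumes one arc per node pair
        bound = 0.0                               # zero if no directed loops
        for cyc in simple_cycles(edges, nodes):
            R = sum(exec_time[n] for n in cyc)
            D = sum(delay[(cyc[i], cyc[(i + 1) % len(cyc)])]
                    for i in range(len(cyc)))
            if D == 0:
                raise ValueError("directed loop with no delay: deadlock")
            bound = max(bound, R / D)
        return bound

    # Fig. 5-style recurrence: A -> B with no delay, B -> A with one delay.
    print(iteration_period_bound([("A", "B", 0), ("B", "A", 1)],
                                 {"A": 2, "B": 3}))   # -> 5.0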
Suppose now that 10 processors are each running a program that accesses shared memory 10% of the time. Then this bound tells us that the ordered shared-memory architecture can run this program in at most twice the time of the theoretical minimum. However, this result should not be taken very seriously, because the performance will depend much more heavily on the scheduling heuristics used. The performance can be better (since simultaneous transactions can be studiously avoided [48]) but can also be worse (if, as is likely, a suboptimal scheduling algorithm is used). Furthermore, the use of gateways completely undermines this analysis. Thus, this is not a very useful bound.
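The factor of two can be checked with a short calculation (our notation, writing T_ideal for the ideal schedule length in shared-bus cycles and using the worst-case count from Section 3.2.2):

\[
T_{\mathrm{trans}}
  = \underbrace{10}_{\text{processors}} \times
    \underbrace{0.1}_{\text{shared fraction}} \times T_{\mathrm{ideal}}
  = T_{\mathrm{ideal}},
\qquad
T \le T_{\mathrm{ideal}} + (T_{\mathrm{trans}} - 1) < 2\,T_{\mathrm{ideal}}.
\]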
3.2.3. Bounded Buffer Sizes
Although SDF actors cannot be created at run time, SDF is not the same as static dataflow [14]. For instance, in SDF there is no impediment to having multiple instances of an actor fire simultaneously, as long as the actor does not have state. A particular implementation, however, may impose such a constraint. Consider for example an implementation that permits no more than one memory location to be associated with each arc. This is the key limitation in static dataflow [14]. It can be modeled with the recurrence in Fig. 5. The feedback arc begins with an initial token. This token represents a "space" on the output buffer of actor A. After A fires and consumes that token, it cannot fire again until after B has fired. Any memory limitation on any arc in an SDF graph can be modeled as a feedback path with a fixed number of delays. To avoid unnecessarily sacrificing concurrency, enough memory should be allocated to each arc that the corresponding feedback path does not become the critical loop.
Suppose that, in Fig. 5, actors A and B are scheduled onto different processors. In a conventional shared-memory architecture, any buffer size limitation implies handshaking at run time. In effect, the feedback path in Fig. 5 has to be implemented at run time, just to carry semaphores that indicate when it is safe to write to the feedforward buffer. For the ordered-memory architecture, however, buffer size limitations can be statically modeled by the scheduler. They imply no additional run-time overhead. In Fig. 5, the scheduler knows that the write from A and the read by B must alternate. Since the order of the transactions is enforced at run time without semaphores, no additional overhead is incurred.
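This modeling step is mechanical, as the short sketch below suggests (our encoding; it reuses the iteration_period_bound() function from the sketch in Section 3.2.2, and the run times are hypothetical):

    # A buffer of capacity k on arc A -> B is modeled as a reverse arc
    # B -> A carrying k initial "space" tokens (delays). The loop analysis
    # then accounts for the buffer bound automatically.

    def with_buffer_bounds(edges, bounds):
        """bounds maps (src, dst) to buffer capacity; adds one reverse arc each."""
        return edges + [(dst, src, cap) for (src, dst), cap in bounds.items()]

    edges = [("A", "B", 0)]                             # forward arc, no delay
    edges = with_buffer_bounds(edges, {("A", "B"): 1})  # one memory location
    # The loop A -> B -> A now has R = R_A + R_B and D = 1: a one-slot
    # buffer forces A and B to alternate, as argued above.
    print(iteration_period_bound(edges, {"A": 2, "B": 3}))  # -> 5.0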
3.3. Manifest Iteration
In manifest iteration, the number of repetitions of a computation is known at compile time and hence is independent of the data. It can be statically scheduled. Furthermore, it can be expressed in dataflow graphs by specifying the number of tokens produced and consumed each time an actor fires. For example, actor A in Fig. 6 produces 10 tokens each time it fires, as indicated by the "10" adjacent to its output. Actor B consumes 1 token each time it fires, so it will fire 10 times for every firing of actor A. In conventional programming languages, this would be expressed with a for loop. Nested for loops are easily conceived as shown in Fig. 6. If actors A and E fire once each, then B and D will fire 10 times, and C will fire 100 times. Techniques for automatically constructing static parallel schedules for such graphs are given in [35] and [43].

FIG. 6. An SDF graph that contains nested iteration.

FIG. 7. A modification of Fig. 6 to model the effect of a buffer of length 10 between actors B and C.
There is no fundamental limitation on the parallelism in Fig. 6 (there are no directed loops). Hence, this scheme solves the first open problem listed by Dennis in [13], providing the semantics of a "parallel-for" in dataflow.
3.3.1. Bounded Buffers
Although there is no fundamental limitation on the parallelism in Fig. 6 (there are no directed loops), there may be practical limitations. In Fig. 7, we model a buffer of length 10 between actors B and C. Again, the tokens on the feedback path represent empty locations in the buffer. Actor B must have 10 tokens on the feedback path (i.e., 10 empty locations in the buffer) before it fires. Whenever actor C fires, it consumes 1 token from the forward path, freeing a buffer location, and indicating the free buffer location by putting a token on the feedback path. The minimum buffer size that avoids deadlock is 10.
This nonhomogeneous SDF graph could be converted to a homogeneous SDF graph and the iteration period bound computed, but in this simple example the iteration period bound is easily seen by inspection. It is clear that after each firing of B, C must fire 10 times before B can fire again. The 10 firings can occur in parallel, so the minimum period of a periodic schedule is R_B + R_C, where R_X is the run time of actor X. In other words, successive firings of B cannot occur in parallel because of the buffer space limitations. By contrast, if the buffer had length 100, then 10 invocations of B could fire simultaneously, assuming there are no other practical difficulties. Just as with unit-length buffers, no additional synchronization overhead is required in the ordered-memory architecture to support these bounded buffers.
3.3.2. Static Buffers

A second limitation on the parallelism can arise from the addressing mechanism of the buffers. Each buffer can be implemented as a FIFO queue, as was done in Davis' DDM [12]. Delays are correctly handled, but then access to the buffer becomes a critical section of the parallel code. FIFO queues are economically implemented as circular buffers with pointers to the read and write locations. However, parallel access to the pointers becomes a problem. If successive invocations of an actor are to fire simultaneously on several processors, then great care must be taken to ensure the integrity of the pointers. A typical approach would be to lock the pointers while one processor has control of the FIFO queue, but this partially serializes the implementation. Furthermore, this requires that the hardware support an indivisible test-and-set operation.

In the ordered-memory architecture, the FIFO implementation can be made simpler than in a general shared-memory architecture, but a less expensive alternative is static buffering [36]. Static buffering is based on the observation that there is a periodicity in the buffer access that a compiler can exploit. It preserves the behavior of FIFO queues (namely, it correctly handles delays and ordering of tokens), but avoids read and write pointers. Specifically, suppose that all buffers are implemented with fixed-length circular buffers implementing FIFO queues, where each length has been predetermined to be long enough to sustain the run without causing a deadlock. Then consider an input of any actor in an SDF graph. Every N firings, where N is to be determined, the actor will get its input token(s) from the same memory locations. The compiler can hard-code these memory locations into the implementation, bypassing the need for pointers to the buffer. Systematic methods for doing this, developed in [36], can be illustrated by example. Consider the graph in Fig. 7, which is a representation of Fig. 6 with the buffer between B and C assigned the length 10. A parallel implementation of this can be represented as follows:

FIRE A
DO ten times {
    FIRE B
    DO in parallel ten times {
        FIRE C
    }
    FIRE D
}
FIRE E

For each parallel firing of C, the compiler supplies a specific memory location for it to get its input tokens. Note that this would not be possible if the FIFO buffer had length 11, for example, because the second time the inner DO loop is executed the memory locations accessed by C would not be the same as the first time. But with a FIFO buffer of length 10, invocations of C need not access the buffer through pointers, so there is no contention for access to the pointers. The buffer data can be supplied to all 10 firings in parallel, assuming the hardware has a mechanism for doing this. In the ordered-memory architecture, the 10 firings cannot be initiated simultaneously, because of bus bandwidth limitations. However, they can be initiated at intervals of one shared-bus cycle. If this cycle time is small compared to the execution time of the actors, then the concurrency in the parallel-for is adequately exploited.
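The effect of hard-coding buffer addresses can be sketched as follows (our illustration, not Gabriel's code generator; BUF_LEN and the address helper are assumptions):

    # Static buffering: when the circular buffer length divides the number
    # of tokens moved per schedule period, each firing's address is a
    # compile-time constant, so no read/write pointers are needed.

    BUF_LEN = 10   # buffer between B and C, as in Fig. 7

    def address_for_firing(k, base=0):
        # The k-th firing of C (k = 0, 1, 2, ...) always reads this slot;
        # the pattern repeats every BUF_LEN firings, so the compiler can
        # emit the constant directly into the code for each firing.
        return base + (k % BUF_LEN)

    # Ten parallel firings of C in one pass of the inner loop:
    print([address_for_firing(k) for k in range(10)])      # [0, 1, ..., 9]
    # Second pass: the same locations again, which is what makes it static.
    print([address_for_firing(k) for k in range(10, 20)])  # [0, 1, ..., 9]
    # With BUF_LEN = 11 the second pass would hit different slots,
    # so the addresses could not be hard-coded.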
An alternative to static buffering that also permits parallel firings of successive instances of the same actor is token matching [1]. However, even the relatively low cost of some implementations of token matching [44] would be hard to justify for SDF graphs, where static buffering can be used.

In Fig. 6 we use actors that produce more tokens than they consume, or consume more tokens than they produce. Proper design of these actors can lead to iteration constructs semantically similar to those encountered in conventional programming languages. In Fig. 8 we show three such actors that have proved useful in DSP applications. The first, Fig. 8a, simply emits the last of N tokens, where N is a parameter of the actor. The second, Fig. 8b, takes one input token and repeats it on the output. The third, Fig. 8c, takes one input token each time it fires and emits the last N tokens that arrived. It has a self-loop used to remember the past tokens (and initialize them). This can be viewed as the state of the actor; it effectively prevents multiple simultaneous invocations of the actor.

FIG. 8. Three SDF actors useful for iteration.
A complete iteration model must include the ability to nest recurrences within the iteration and to initialize the recurrences when the iteration begins. The SDF model can handle this. We illustrate it with a finite impulse response (FIR) digital filter because it is a simple example. An FIR filter computes the inner product of a vector of coefficients and a vector with the last N input tokens, where N is the order of the filter. It is usually assumed to repeat forever, firing each time a new input token arrives. Consider the possible implementations using a dataflow graph. A large-grain approach is to define an actor with the implementation details hidden inside. This is the preferred approach in Gabriel. An alternative is a fine-grain implementation with multiple adders and multipliers and a delay line. A third possibility is to use iteration and a single adder and multiplier. The first and last possibilities have the advantage that the complexity of the dataflow graph is independent of the order of the filter. A good compiler should be able to do as well with any of the three structures. One implementation of the last possibility is shown in Fig. 9. The iteration actors are drawn from Fig. 8. The coefficients actor simply outputs a stream of N coefficients: it produces one coefficient each time it fires and reverts to the beginning of the coefficient list after reaching the end. It could be implemented with a directed loop with N delays, or a number of other ways. The product of the input data and the coefficients is accumulated by the adder with a feedback loop. The output of the filter is selected by the "last of N" actor.

FIG. 9. An FIR filter implemented using a single multiplier and adder.
The FIR filter in Fig. 9 has the advantage of exploitable concurrency combined with a graph complexity that is independent of the order of the filter. Note, however, that there is a difficulty with the feedback loop at the adder. Recall from the above that a delay is simply an initial token on the arc. If this initial token has value zero, then the first output of the FIR filter will be correct. However, after every N firings of the adder, we wish to reset the token on that arc to zero. This could be done with some extra actors, but a fundamental difficulty would remain. The presence of that feedback loop implies a limitation on the parallelism of the FIR filter, and that limitation would be an artifact of our implementation. Our solution is to introduce the notion of a resetting delay, indicated with a diamond containing an R.
3.3.3. Resetting Delays
A resetting delay is associated with a subgraph, which in Fig. 9 is surrounded by a dashed line. For each invocation of the subgraph, the delay token is reinitialized to zero. Furthermore, the scheduler knows that the precedence is broken when this occurs, and consequently it can schedule successive FIR output computations simultaneously on separate processors.
The resetting delay can be used in any SDF graph with nested iterations where the inner iterations involve recurrences that must be initialized. In other words, it handles anything of the form

DO some number of times {
    initialize X
    DO some number of times {
        new X = F(X)
    }
}
The implementation of a resetting delay is simple and general. For the purposes of implementation, the scheduler first treats the delay as if it were an actor that consumes one token and produces one token each time it fires. Recall that in practice no run-time operation is required to implement a delay, so there actually is no such actor. However, by inserting this mythical actor, the scheduler can determine how many times it would fire (if it did exist) for each firing of the associated subgraph. The method for doing this is given in [35] and consists of solving a simple system of equations. For each resetting delay, the scheduler obtains a number N of invocations between resets; this number is used to break the precedence of the arc for every Nth token and to insert object code that reinitializes the delay value. The method works even if the subgraph is not invoked as a unit and even if it is scattered among the available processors. It is particularly simple when inline code is generated. However, when the iteration is implemented by the compiler using loops, then a small amount of run-time overhead may have to be associated with some delays in order to count invocations.
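The bookkeeping can be sketched as follows (our own data structures; the number N comes from the balance equations of [35], and the FIR order used below is hypothetical):

    # Resetting-delay planning: treating the delay as a mythical one-in,
    # one-out actor, the balance equations give its firings N per subgraph
    # invocation; the scheduler then breaks the precedence at every Nth
    # token and emits a reset of the delay value there.

    def plan_resetting_delay(firings_per_invocation):
        """firings_per_invocation: N, the delay-actor firings per invocation
        of the enclosing subgraph (from the balance equations of [35])."""
        N = firings_per_invocation

        def precedence_broken(token_index):
            # Token indices 0, N, 2N, ... start a fresh invocation: the
            # scheduler reinitializes the delay value there instead of
            # carrying a dependence on the previous invocation.
            return token_index % N == 0

        return precedence_broken

    # For the FIR filter of Fig. 9 with hypothetical order N = 4:
    broken = plan_resetting_delay(4)
    print([i for i in range(12) if broken(i)])  # [0, 4, 8] -> reset points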
So far we have shown that neither manifest iteration nor recurrences present a fundamental problem for the SDF model. Resetting delays can be used to initialize recurrences within nested iterations. Hence the corresponding programming constructs can be efficiently and automatically implemented on an ordered-memory architecture. Conditionals are a bit more problematic.
3.4. Conditional Assignment
Conditionals in dataflow graphs are harder to describe and schedule statically. One attractive solution is a mixed-mode programming environment, where the programmer can use dataflow at the highest level and conventional languages such as C at a lower level. Gabriel is precisely such an environment. Conditionals would be expressed in the conventional language. This is only a partial solution, however, because conditionals would be restricted to lie entirely within one large-grain actor, and concurrency within such actors is difficult to exploit. If the complexity of the operations that ar…
