(12) United States Patent
Hammes

(10) Patent No.: US 8,589,666 B2
(45) Date of Patent: Nov. 19, 2013

(54) ELIMINATION OF STREAM CONSUMER LOOP OVERSHOOT EFFECTS

(75) Inventor: Jeffrey Hammes, Colorado Springs, CO (US)

(73) Assignee: SRC Computers, Inc., Colorado Springs, CO (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1985 days.

(21) Appl. No.: 11/456,466

(22) Filed: Jul. 10, 2006

(65) Prior Publication Data
US 2008/0010444 A1, Jan. 10, 2008

(51) Int. Cl.
G06F 5/00 (2006.01)
G06F 7/38 (2006.01)
G06F 9/00 (2006.01)
G06F 9/44 (2006.01)

(52) U.S. Cl.
USPC 712/241; 714/18

(58) Field of Classification Search
USPC 712/241, 201, 25
See application file for complete search history.

(56) References Cited
U.S. PATENT DOCUMENTS
7,085,955 B2 * 8/2006 Prabhu 714/6
2002/0124159 A1 * 9/2002 Bekooij et al. 712/226
* cited by examiner

Primary Examiner — Andrew Caldwell
Assistant Examiner — George Girl
Attorney, Agent, or Firm — William J. Kubida; Hogan

(57) ABSTRACT

A reconfigurable processor invoking data stream pipelining is configured to associate a restore buffer with each incoming data stream. The buffer is configured to be of sufficient size to maintain data values dispatched to a loop so as to restore values fetched and lost due to loop overshoots. The restore buffer stores the values that were recently fetched from the buffer to the loop. To determine how many data values should be restored, the loop counts the number of the data values it takes from each data stream and the number of valid loop iterations that take place. Once a loop termination is detected, the loop halts the fetching of values from the restore buffer and compares, for each stream, the number of loop iterations with the number of values fetched. The difference is the number of extra values that were taken from the restore buffer and are restored.

22 Claims, 6 Drawing Sheets
[Drawing sheet 1 of 6 — FIG. 1 (Prior Art): stream connection between a stream producer 120 and a stream consumer 130 through a FIFO buffer 110, showing the data 125/145, valid 135, stall 140, and take 150 signals.]
[Drawing sheet 2 of 6 — FIG. 2 (Prior Art): compiler-generated dataflow structure for a consumer inner loop, showing the FIFO buffer 110, stream 210, synchronization node 215, registers 235_1 ... 235_N, loop body 230, Compare node 240, Loop Valid node 250, and Terminate node 260.]
[Drawing sheet 3 of 6 — FIG. 3: dataflow structure with restore buffers 310_1 ... 310_N interposed between FIFO buffers 110_1 ... 110_N and the consumer 130, whose iteration loop structure 305 includes the Synchronization node 315, registers 335_1 ... 335_N, loop body 330, Compare node 340, Loop Valid node 350, and Terminate node 360.]
[Drawing sheet 4 of 6 — FIG. 4: one embodiment of the Synchronization node 315, coupled to restore buffers 310_1, 310_2 over data streams 145_1, 145_2, with data/valid/take signals, per-stream fetch counters 410_1, 410_2, a loop iteration counter 440, subtract nodes 450_1, 450_2, and restore enable / restore value outputs.]
[Drawing sheet 5 of 6 — FIG. 5A, flow diagram:
505: Configuring a dataflow pipelined structure to include a restore buffer, a first-in first-out (FIFO) buffer and an iteration loop structure, wherein the restore buffer is in communication with, and interposed between, the FIFO buffer and the iteration loop structure.
510: Fetching a data value from the restore buffer to the iteration loop structure, wherein the restore buffer replaces the data value fetched from the restore buffer with another data value from the FIFO buffer.
520: Maintaining an accounting of how many data values are fetched from the restore buffer to the iteration loop structure.
530: Executing an iterative computation process on the iteration loop structure, wherein each iteration failing to terminate the iterative computation process is a valid loop iteration.]
[Drawing sheet 6 of 6 — FIG. 5B, flow diagram (continued):
550: Counting the valid loop iterations of the iterative computation process and preserving a total number of valid loop iterations.
560: Identifying termination of the iterative computation process.
570: Halting fetching data values from the restore buffer based on identification of termination of the iterative computational process.
580: Restoring data values to the restore buffer that were fetched from the restore buffer between termination of the iterative computational process and halting fetching data values from the restore buffer.
595: End.]
ELIMINATION OF STREAM CONSUMER LOOP OVERSHOOT EFFECTS

COPYRIGHT NOTICE/PERMISSION

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the United States Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data described below, inclusive of the drawing figures where applicable: Copyright © 2002, SRC Computers, Inc.
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates, in general, to pipelined structures that are produced by reconfigurable hardware compilers, and more particularly to eliminating the effects of stream consumer pipelined loop overshoots in reconfigurable hardware.

2. Relevant Background

Reconfigurable computing is a technology receiving increased interest in the computing arts. Traditional general purpose computing is characterized by computer code executed serially on one or more general purpose processors. Reconfigurable computing is characterized by programming reconfigurable hardware, such as FPGAs, to execute logic routines.

Reconfigurable computing offers significant performance advances in computation-intensive processing. For example, the reconfigurable hardware may be programmed with a logic configuration that has more parallelism and pipelining characteristics than a conventional instruction processor. Also, the reconfigurable hardware may be programmed with a custom logic configuration that is very efficient for executing the tasks assigned by the program. Furthermore, dividing a program's processing requirements between the instruction processor and the reconfigurable hardware may increase the overall processing power of the computer.

Software programs that are written in a high level language like, for example, C or Fortran can be converted by compilers targeting reconfigurable hardware into software that is executable on that hardware. Likewise, loop structures in the high level language may be converted by a compiler into a form that exploits parallelism and pipelining characteristics of reconfigurable hardware.

Data streams provide a flexible method for connecting concurrent producer and consumer loops in C and Fortran programs written for execution on reconfigurable hardware. Streams are also useful as a general interface between a reconfigurable hardware chip (typically a Field Programmable Gate Array ("FPGA")) and other chips, boards, memories, disks, real-time signal sources, and the like.

Reconfigurable hardware thus allows logic configurations to be built based upon the desired implementation. A program written in a high level programming language such as C or Fortran is converted into dataflow graphs that express the interconnection of functional units, and that are used to create specialized logic configurations. The dataflow graphs are analyzed and optimized to provide an optimal set of logic configurations for each particular high level program. For example, a data buffer in a loop structure can be designed based on the program computations that are going to be run once the logic configurations are loaded into the FPGAs. In that manner, the logic configurations provide an optimal fit for the application program. Once a compiler has constructed and optimized the dataflow graph, the graph is translated to a hardware description, expressed in a hardware description language. This hardware description can be synthesized by hardware synthesis tools to create logic configurations to run on reconfigurable hardware. The focus of the present invention is data stream pipelining as applied to the interconnection of loops running on reconfigurable hardware.
When a C or Fortran loop is a consumer of one or more streams of data, and that loop is pipelined, overlapping loop iterations are spawned with the goal of doing a maximal amount of work in as short a time as possible. To activate loop iterations as fast as possible, an iteration of a loop is spawned before the loop has determined whether the loop's termination condition has been reached, resulting in the activation of extra loop iterations. Those extra iterations are benign if they don't affect the state of the computation. But when loop iterations take values from streams subsequent to the start of an iteration that meets termination criteria, these extra iterations do affect the state of the computation because they take values that should not have been taken.
For instance, consider this simplified illustration. A mathematical operation counting to the sum of 1000 by increments of 1 may require a loop iteration of 1000 loops (i.e., add 1 to the previous total until the total equals 1000). Upon each clock tick, a piece of data is taken from the data stream and added to the previous total. Assume that the actual computation of adding a single increment takes two clock ticks, one to add the number and one to compare the total to the value 1000. Therefore, when the total is 999 and the next piece of data is provided to the computation loop by the data stream, the total will reach its termination criteria. However, while the computation determines that the total is equal to 1000 and signals that the loop should be halted, the next value has already been retrieved from the data stream. That last value should not have been taken.
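A minimal software sketch of this timing effect follows. It is an illustration only, not the patent's hardware; the two-tick latency constant and the variable names are assumptions. The fetch logic keeps taking values until the termination result, computed two ticks after its value was fetched, becomes visible, and so overshoots by that latency.

    /* Illustrative only: models the count-to-1000 example in software.
     * TERMINATION_LATENCY is an assumed two-tick add-plus-compare delay. */
    #include <stdio.h>

    enum { TERMINATION_LATENCY = 2 };

    int main(void) {
        int total = 0;
        int fetched = 0;             /* values taken from the data stream        */
        int done_visible_at = -1;    /* tick when "done" reaches the fetch logic */

        for (int tick = 0; ; tick++) {
            /* Until the terminating comparison has drained through the
             * pipeline, a new value is fetched on every tick.            */
            if (done_visible_at >= 0 && tick > done_visible_at)
                break;
            fetched++;               /* take the next value from the stream      */
            total += 1;              /* loop body: add the increment             */
            if (total == 1000 && done_visible_at < 0)
                done_visible_at = tick + TERMINATION_LATENCY;
        }

        /* 1000 iterations were needed; anything beyond that is overshoot. */
        printf("fetched=%d overshoot=%d\n", fetched, fetched - 1000);
        return 0;
    }

Running this sketch reports an overshoot equal to the assumed latency, which is the same quantity the restore buffer described later must give back.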
A stream is a data path between a producer and consumer of data, where the producer and consumer run concurrently. The path between the producer and consumer is made up of a data connection, a "valid" signal, and a reverse direction "stall" signal. FIG. 1 shows typical signals used in a stream connection as is well known and will be recognized by one skilled in the relevant art. The use of a First-In-First-Out buffer 110, or "FIFO" buffer, removes the need for tight synchronization between the producer 120 and consumer 130. The producer 120 will generate data values 125 at its own rate, allowing them to accumulate in the FIFO buffer 110. As the FIFO buffer 110 approaches becoming full, it will issue a stall signal 140 to the producer 120 so that it will suspend the generation of data values 125 until the stall signal is released. The consumer 130 will take 150 values 145 from the FIFO buffer at its own rate and as the values 145 are available.

The use of the FIFO buffer, with the valid 135, stall 140 and take 150 signals, allows flexible coupling of stream producers and consumers. A stream's producer 120 and its consumers 130 may run at different speeds. For example, when the producer 120 runs faster than the consumer 130, then it will stall 140 from time to time as values fill the FIFO buffer. When the producer runs slower than the consumer, the FIFO will sometimes be empty and the consumer will wait for new values to be available.
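As a rough software analogy of this handshake (the depth, the stall threshold, and the function names below are assumptions, not taken from the patent or from any SRC library), a FIFO can expose "valid" as not-empty, "stall" as nearly-full, and "take" as a destructive read:

    /* Software analogy of a FIFO with valid/stall/take signals. */
    #include <stdbool.h>

    #define FIFO_DEPTH      64
    #define STALL_THRESHOLD 60     /* assert "stall" before the FIFO is full */

    typedef struct {
        int data[FIFO_DEPTH];
        int head, tail, count;
    } fifo_t;

    /* "valid": the consumer may take a value. */
    static bool fifo_valid(const fifo_t *f) { return f->count > 0; }

    /* "stall": the producer should pause generating values. */
    static bool fifo_stall(const fifo_t *f) { return f->count >= STALL_THRESHOLD; }

    /* Producer side: push a value (callers normally check fifo_stall first). */
    static bool fifo_put(fifo_t *f, int v) {
        if (f->count == FIFO_DEPTH)
            return false;
        f->data[f->tail] = v;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
        return true;
    }

    /* Consumer side: "take" removes the oldest value (check fifo_valid first). */
    static int fifo_take(fifo_t *f) {
        int v = f->data[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        return v;
    }

Asserting the stall threshold below the physical depth mirrors the hardware behavior described above: the producer gets a few cycles of slack to react before values would otherwise be dropped.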
When loops, in a high level language such as C or Fortran, are compiled for execution on reconfigurable hardware such as FPGAs, aggressive pipelining is employed, with the goal of achieving as great a throughput of data as possible.
The body of an inner loop is converted to a dataflow graph with many individual functional units that are derived from the operators in the loop body. The FPGA implementations of these functional units may have latencies of one or more clock ticks, and the interconnection of many functional units will result in a loop body that can have new data fed in on every clock tick, even though it may take many clock ticks for the results of an iteration's data to reach the bottom of the dataflow graph. Thus many iterations are active at any point in time.

Part of the loop body's dataflow graph consists of the expression that computes a loop termination criteria. If a new loop iteration is activated on every clock tick, then extra loop iterations are activated despite a termination value being achieved. For example, when it takes three clocks for the data of a given loop iteration to produce the true/false termination value, then by the time the termination expression becomes true, three additional iterations of the loop will already have started. These extra iterations are harmless as long as they are not allowed to affect the state of the overall computation. However, the data acquired by the loop from the data stream may be lost.

Consider the following producer-consumer loops:
#pragma src parallel sections
{
  #pragma src section
  {
    int i;
    for (i = 0; i < N*M; i++)
      put_stream(&S0, A[i], 1);
  }
  #pragma src section
  {
    /* consumer loop nest taking values from S0; body not legible in this copy */
  }
}
The first parallel section puts values to stream S0. The second parallel section takes values from S0, but it does this in a loop nest. The inner loop of the nest will be pipelined as the compiler determines that it can spawn a new loop iteration on every clock tick because there are neither loop-carried dependencies nor multiple accesses to any memory.
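The second section's body is not legible in this copy of the listing. Purely as an illustration of the kind of loop nest the description contemplates (the get_stream call, the array B, the value type, and the bounds N and M are assumptions, not the patent's text), it might resemble:

    #pragma src section
    {
      int i, j;
      long long v;
      for (j = 0; j < N; j++) {      /* outer loop of the nest                 */
        for (i = 0; i < M; i++) {    /* inner loop: pipelined by the compiler  */
          get_stream(&S0, &v);       /* take the next value from stream S0     */
          B[j][i] = v;               /* single memory write, no loop-carried   */
        }                            /* dependence, so one iteration can start */
      }                              /* on every clock tick                    */
    }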
FIG. 2 shows a high level block diagram of part of a simplified dataflow structure generated by a compiler for the consumer 130 inner loop as is known in the prior art. The compiler creates a FIFO buffer 110 for stream S0 210, which receives the values from the stream's producer. The loop has a synchronization module or node 215 that watches the FIFO buffer 110 and spawns a loop iteration whenever it takes a value from the FIFO buffer 110, sending into the loop body 230 the stream value along with an "iteration" signal that tells the loop that a new iteration is being started. The boxes labeled "R" represent registers 235_1, 235_2, ... 235_N that hold the various scalar variables and base address values referenced in the loop.

Part of the loop body 230 is composed of the loop termination expression. In the example presented above, termination is simply a comparison of i and N. The output of the Compare node 240 goes false when the comparison is false,
and the Loop Valid node 250 latches into a false state, telling the Terminate node 260 to emit a "done" pulse telling the Synchronization node 215 that the loop has reached its termination condition.

Because of pipelining, it can take multiple clock ticks for the effects of a value to move through the loop body 230, but when there are values available in the FIFO buffer 110, the Synchronization node 215 will take a value from the FIFO buffer and start a new loop iteration 235 on every clock tick. Assuming that the Compare node 240 and Loop Valid node 250 each have a one-clock tick latency, the Synchronization node 215 will spawn at least a couple of extra iterations before it sees that loop termination has occurred, and those extra loop iterations will have taken from the FIFO buffer values that should not have been taken.

This effect becomes even more pronounced when the termination expression is more complex. A rather extreme case arises from the following: for (i=0; i<w/v; i++). The divide operation is typically a many-clock latency operation; thirty or more clock ticks is not unusual, so the Synchronization node 215 of this loop 230 could spawn more than thirty extra iterations before it learns that the loop has reached termination.
The effect of these extra iterations on the consumer loop nest is twofold. First, when the inner loop is entered for the second time and it begins to fetch values from the stream, it will miss the values that had been erroneously fetched by the earlier extra spawned iterations. Thus the program will not compute correctly. Second, the parallel sections will eventually deadlock, because the loss of values means that the consumer will never see the number of values it is expecting.

There are important reasons for wanting a consumer loop nest to work correctly. In practical applications, there might be work to be done once a certain number of values have been read from a stream. In such a situation, the approach is to use data from the stream in a segmented fashion. It is, however, often impractical for the producer to be similarly segmented. For example, a data array from a distant memory might be the source of the stream, and the relatively high costs of starting up a data transfer of such an array make it beneficial to do one continuous stream rather than a series of small ones.

A straightforward solution to the problem of loop stream overshoot as is known in the prior art is to slow the iteration rate of the consumer loop so that it does not activate a new iteration until the current iteration's termination expression has been computed, but this would result in significant degradation of performance. The degradation would be especially bad in the cases of termination expressions that have large latencies, such as in the earlier example that does a divide operation. There the loop would be slowed by a factor of thirty or more, and such performance is unacceptable.

There is a need, therefore, for computer systems and computer implemented methodology for eliminating the effect of stream pipeline loop overshoot in reconfigurable data structures. Specifically, a need exists to preserve and restore values taken from a buffer as the result of ongoing loop operations during which time a valid termination criteria has been reached. These and other needs of the prior art are addressed by the present invention that is subsequently herein described in detail.
SUMMARY OF THE INVENTION

Briefly stated, the present invention involves computer implemented methods and computer systems for eliminating overshoot effects of pipelined consumer iterative computational loops.
In a reconfigurable processor environment that invokes data stream pipelining during iterative computations, an FPGA or other reconfigured processor is configured to place two buffers between the producer of a data stream and the data consumer. In addition to a typically found FIFO buffer, a restore buffer is interposed between the producer and the consumer of the data values. The restore buffer is configured to be of sufficient size to maintain a persistent memory of enough data values dispatched to the iterative loop structure so as to restore data values fetched and lost due to pipelined consumer loop overshoots.

In one aspect of the present invention, a restore buffer is placed between each incoming stream and its consumer loop. The restore buffer stores the data values that were recently fetched from the buffer to the consumer so that later they can be restored. To determine how many data values should be restored, the iterative loop structure counts the number of the data values it takes from each data stream. The iterative loop structure also counts the number of valid loop iterations that take place. Once a loop termination is detected, the iterative loop structure halts fetching data values from the restore buffer and compares, for each stream, the number of loop iterations with the number of values fetched from each data stream. The difference is the number of extra data values that were taken from the restore buffer and must be restored. The iterative loop structure then directs the restore buffers to restore those values.

The foregoing and other features, utilities and advantages of the invention will be apparent from the following more particular description of an embodiment of the invention as illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

The aforementioned and other features and objects of the present invention and the manner of attaining them will become more apparent and the invention itself will be best understood by reference to the following description of a preferred embodiment taken in conjunction with the accompanying drawings, wherein:

FIG. 1 shows a high level block diagram of a data structure having typical signals used in a stream connection between data producers and data consumers as known in the prior art;

FIG. 2 shows a high level block diagram of part of a simplified data structure generated by a compiler for a consumer of data having an inner iteration loop as known in the prior art;

FIG. 3 shows a high level block diagram of a data flow structure for eliminating the effect of stream pipeline loop overshoots in accordance with one embodiment of the present invention;

FIG. 4 shows a high level block diagram of one embodiment of a Synchronization node of a pipelined loop structure for eliminating the effect of pipeline loop overshoots in accordance with the present invention; and

FIG. 5A and FIG. 5B comprise a flow diagram of one method embodiment of eliminating the effects of pipeline loop overshoots according to the present invention.

The Figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Computer systems and computer implemented methods for eliminating the effects of loop overshoots in the consumption of pipelined data streams are discussed in detail herein. In a pipelined data stream environment comprising a producer of data and a consumer of that data, the effects of consumer loop overshoots are eliminated by interposing, in one embodiment of the present invention, two buffers between the producer and consumer. As previously described, a typical configuration of a pipelined data stream structure places a single buffer between the producer and the consumer. Such a single buffer is typically configured to absorb differences in the production and consumption rate of the data streams, but does little to prevent the loss of data due to consumer loop overshoots.
Consumer loops are simple iterative processes that operate to provide a particular result. As a simple example, an addition operation may necessitate an iterative loop until a certain value is obtained. Consumer loops receive or fetch data values from a buffer and begin the looping process. Once launched, the computations continue until completed. Thus, using our simple addition example, the loop computation may comprise fetching a value, adding the value to the existing total, and then comparing the value to a predetermined number to determine if a termination criteria has been reached. This process may take two or three clock ticks of the processor. Thus even though there is additional data in the buffer available to the consumer, there is a lag between when a value has been fetched and when the loop has determined that it should terminate.
In reconfigurable computing, iterative loops such as those described above are pipelined. Pipelining, as is well known to one skilled in the relevant art, is a process where instead of waiting for the completion of a particular loop iteration before fetching the next values and starting the next iteration, the loop aggressively fetches the next set of values on the very next clock tick, thereby starting a new iteration on each clock tick. Thus, there are several values, representing computations of several iterations, in the computational pipeline. Maintaining several loop iterations simultaneously for the same computation significantly enhances the efficiency of the computational process but, as previously mentioned, presents a problem when the termination criteria is eventually reached. When the termination criteria is reached, the loop has overshot the number of values it should have taken from the buffer. Under the prior art, this data would have been lost.
The present invention eliminates the loss of data due to consumer loop overshoot by providing, in one embodiment, an additional buffer interposed between the producer and the consumer that remembers and can restore the data values that were fetched from the buffer after the value that resulted in the loop meeting termination criteria was fetched.
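One way to picture this extra buffer in software (a hypothetical C model only; the depth, the names, and the int payload are assumptions, and the patent describes a hardware structure) is as a circular log of the values most recently handed to the loop, which can be told to re-issue the last few of them:

    /* Hypothetical software model of a restore buffer. */
    #include <assert.h>

    #define RESTORE_DEPTH 64        /* must cover the worst-case overshoot */

    typedef struct {
        int history[RESTORE_DEPTH]; /* circular log of values handed to the loop */
        int write;                  /* next log slot                             */
        int replay;                 /* overshoot values still to be re-issued    */
    } restore_buf_t;

    /* Is an overshoot value waiting to be re-issued? */
    static int restore_pending(const restore_buf_t *r) { return r->replay > 0; }

    /* Re-issue the oldest outstanding overshoot value. */
    static int restore_replay(restore_buf_t *r) {
        int idx = (r->write - r->replay + RESTORE_DEPTH) % RESTORE_DEPTH;
        r->replay--;
        return r->history[idx];
    }

    /* Log a fresh value taken from the FIFO as it is handed to the loop. */
    static int restore_record(restore_buf_t *r, int fresh_from_fifo) {
        r->history[r->write] = fresh_from_fifo;
        r->write = (r->write + 1) % RESTORE_DEPTH;
        return fresh_from_fifo;
    }

    /* After termination: the last `overshoot` values handed out were taken by
     * extra pipeline iterations and must be handed out again next time. */
    static void restore_values(restore_buf_t *r, int overshoot) {
        r->replay += overshoot;
        assert(r->replay <= RESTORE_DEPTH);
    }

In this sketch, each take by the loop would call restore_replay() while restore_pending() is true and restore_record() on a fresh FIFO value otherwise.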
A high level block diagram of a system for eliminating the effects of pipelined loop overshoots according to one embodiment of the present invention is shown in FIG. 3. As shown, a consumer 130 is coupled to a series of data streams 145_1, 145_2, ... 145_N providing data originating from a producer (not shown). Interposed between the consumer 130 and the producer, and associated with each data stream 145, are two buffers. The first buffer 110_1, 110_2, ... 110_N absorbs differences in the data flow rate between the producer 120 and the consumer 130. The second buffer, also referred to herein as the restore buffer, 310_1, 310_2, ... 310_N, maintains data values for a period sufficient to allow the computations conducted by the consumer 130 to utilize a pipeline dataflow architecture, without the risk of data value loss. The restore buffer 310 keeps a persistent memory of data values fetched to the consumer so as to restore any overshoot values sent to the consumer as a function of loop computations. In other embodiments, the functionality of the restore buffer and that of the first buffer can be combined into a single buffer.
The consumer 130 comprises, at least in part, an iteration loop structure 305 that performs a loop computation. Associated with the iteration loop structure 305 are a Synchronization node 315, various registers 335_1, 335_2, ... 335_N that hold various scalar variables and base address values referenced in the loop, the inner loop body 330 that conducts the computational loop and includes a Compare node 340, a Valid node 350 that determines when termination criteria has been met, and finally a Terminate node 360. According to one embodiment of the present invention, when the Compare node 340 goes false, the Valid node 350 communicates with the Terminate node 360, causing it to emit a message to the Synchronization node 315 informing it that the termination criteria has been reached. As will be apparent to one skilled in the relevant art, other implementations of a loop structure are possible without altering the scope of the present invention and are indeed contemplated by the present invention.

Upon receiving a message from the Terminate node 360 indicating that the termination criteria for the current computational loop has been met, the Synchronization node 315 halts fetching data values. During the period from when the value that resulted in the termination criteria being met was fetched and the Synchronization node 315 halting the fetching process, two possible edge cases, and any combination in between, can occur. The first edge case assumes that the data stream buffer 110, 310 that feeds the iteration loop structure 305 is never empty. In such a case, the iteration loop structure 315 will take a value from the data stream 145 on every clock tick. The number of extra data values taken from the restore buffer 310 prior to receiving a signal to halt is, in this scenario, at its maximum. The other edge case occurs when the data stream 145 empties with the last valid iteration. The last value taken from the data stream 145 results in the termination criteria being met and, before another value is removed from the buffer 310, or is available to be removed, the Synchronization node 315 halts the fetching process. Thus there are no data values that are lost due to loop overshoot by the iteration loop structure 305.
According to one embodiment of the present invention, these and other data value overshoot scenarios are eliminated by counting the number of data values fetched from the restore buffer 310 of each data stream and comparing it to the number of valid loop iterations. Since the number of data values lost depends on the number of times at which data values arrive in the data stream buffer 145, the compiler that configures the computational structure for each computation cannot predict the number of values that will be lost from loop overshoot. It can, however, determine the worst case scenario by examining the dataflow graph of the loop and summing the latencies (in clock ticks) of the nodes on the path from the Synchronization node 315 to the termination signal being generated by the Terminate node 360.

In the case of multiple incoming data streams 145, the dynamic behavior of each incoming data stream 145 may be different, so that the number of overshoot values may differ among the streams 145. Thus, an individual count of the number of data values fetched must be kept for each stream. But since the compiler can determine an upper bound for the number of extra values taken, it can specify a restore buffer 310 of a sufficient size to retain that many values, thereby guaranteeing that all of the required values will be available, even in the worst case.
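A compact way to restate this bookkeeping in software (a hypothetical sketch building on the restore-buffer model above; NUM_STREAMS and the function names are assumptions, and the patent implements this in hardware) is a per-stream fetch counter plus a shared valid-iteration counter, with the per-stream difference driving the restore:

    /* Hypothetical sketch of the per-stream overshoot accounting.
     * restore_buf_t and restore_values() are taken from the earlier
     * restore-buffer sketch; in the patent these roles are played by the
     * counters 410, 440 and subtract nodes 450 inside the Synchronization
     * node 315. */
    #define NUM_STREAMS 4

    typedef struct {
        long fetched[NUM_STREAMS];  /* values taken from each stream's restore buffer  */
        long valid_iterations;      /* loop iterations that did not trigger termination */
    } sync_counters_t;

    static void sync_note_fetch(sync_counters_t *c, int stream) { c->fetched[stream]++; }
    static void sync_note_valid(sync_counters_t *c)             { c->valid_iterations++; }

    /* When termination is detected: each stream's overshoot is the number of
     * fetches beyond the valid iteration count; that many values go back. */
    static void sync_on_terminate(sync_counters_t *c, restore_buf_t bufs[NUM_STREAMS]) {
        for (int s = 0; s < NUM_STREAMS; s++) {
            long overshoot = c->fetched[s] - c->valid_iterations;
            if (overshoot > 0)
                restore_values(&bufs[s], (int)overshoot);
            c->fetched[s] = 0;
        }
        c->valid_iterations = 0;
    }

In such a sketch, the compiler's worst-case bound (the sum of node latencies from the Synchronization node to the Terminate node) would set the RESTORE_DEPTH assumed earlier.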
FIG. 4 shows a high level block diagram of one embodiment of a Synchronization node 315 of a pipelined loop structure for eliminating the effect of pipeline loop overshoots in accordance with the present invention. The Synchronization node 315 is coupled to a restore buffer 310_1, 310_2 for each of a plurality of data streams. The Synchronization node 315 is further coupled to the Valid node 350 and the Terminate node 360 (both not shown) and is capable of receiving signals informing of termination of the computational loop. Each restore buffer 310 is associated with a data value fetch counter 410_1, 410_2 that counts the number of fetches, or takes, conducted by the Synchronization node 315 for that particular buffer 310, thus determining a data fetch value. Depending on the state of the data stream, this number of fetches could be as high as one data value fetch per clock cycle over the length of the loop computation. A loop iteration counter 440 is also maintained, in this embodiment of the present invention, in the Synchronization node 315.
