Hammes

(10) Patent No.: US 8,589,666 B2
(45) Date of Patent: Nov. 19, 2013

US008589666B2

(54) ELIMINATION OF STREAM CONSUMER LOOP OVERSHOOT EFFECTS

(75) Inventor: Jeffrey Hammes, Colorado Springs, CO (US)

(73) Assignee: SRC Computers, Inc., Colorado Springs, CO (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1985 days.

(21) Appl. No.: 11/456,466

(22) Filed: Jul. 10, 2006

(65) Prior Publication Data
    US 2008/0010444 A1    Jan. 10, 2008

(51) Int. Cl.
    G06F 5/00    (2006.01)
    G06F 7/38    (2006.01)
    G06F 9/00    (2006.01)
    G06F 9/44    (2006.01)

(52) U.S. Cl.
    USPC ............ 712/241; 714/18

(58) Field of Classification Search
    USPC ............ 712/241, 201, 25
    See application file for complete search history.

(56) References Cited

    U.S. PATENT DOCUMENTS
    7,085,955 B2 *    8/2006  Prabhu ................ 714/6
    2002/0124159 A1 * 9/2002  Bekooij et al. ....... 712/226

* cited by examiner

Primary Examiner — Andrew Caldwell
Assistant Examiner — George Girl
(74) Attorney, Agent, or Firm — William J. Kubida; Hogan
(57) ABSTRACT
`A reconfigurable processor invoking data stream pipelining is
`configured to associate a restore buffer with each incoming
`data stream. The buffer is configured to be of sufficient size to
`maintain data values dispatched to a loop so as to restore
`values fetched and lost due to loop overshoots. The restore
`buffer stores the values that were recently fetched from the
`buffer to the loop. To determine how many data values should
`be restored, the loop counts the number of the data values it
`takes from each data stream and the number of valid loop
`iterations that take place. Once a loop termination is detected,
`the loop halts the fetching of values from the restore buffer
`and compares, for each stream, the number of loop iterations
`with the number of values fetched. The difference is the
`number of extra values that were taken from the restore buffer
`and are restored.
`
`22 Claims, 6 Drawing Sheets
`
`
`
`3101
`RESTORE
`BUFFER
`
`3.192
`RESTORE
`BUFFER
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2027 - p. 1
`
`
`
U.S. Patent    Nov. 19, 2013    Sheet 1 of 6    US 8,589,666 B2

[FIG. 1 (drawing)]
`
`
`
U.S. Patent    Nov. 19, 2013    Sheet 2 of 6    US 8,589,666 B2

[Fig. 2 (Prior Art): FIFO BUFFER 110; STREAM; COMPARE; LOOP VALID; TERMINATE]
`
`
`
U.S. Patent    Nov. 19, 2013    Sheet 3 of 6    US 8,589,666 B2

[FIG. 3: RESTORE BUFFER 310₁, 310₂, ... 310ₙ; STREAM inputs; COMPARE; LOOP VALID; TERMINATE]
`
`
`
U.S. Patent    Nov. 19, 2013    Sheet 4 of 6    US 8,589,666 B2

[Fig. 4: streams 145₁, 145₂; restore buffers 310₁, 310₂; RESTORE ENABLE; RESTORE VALUE; fetch COUNTERs; loop iteration COUNTER 440; SUBTRACT nodes 450₁, 450₂]
`
`
`
U.S. Patent    Nov. 19, 2013    Sheet 5 of 6    US 8,589,666 B2

505
CONFIGURING A DATAFLOW PIPELINED STRUCTURE TO INCLUDE A RESTORE BUFFER, A FIRST IN FIRST OUT (FIFO) BUFFER AND AN ITERATION LOOP STRUCTURE WHEREIN THE RESTORE BUFFER IS IN COMMUNICATION WITH, AND INTERPOSED BETWEEN, THE FIFO BUFFER AND THE ITERATION LOOP STRUCTURE.

510
FETCHING A DATA VALUE FROM THE RESTORE BUFFER TO THE ITERATION LOOP STRUCTURE WHEREIN THE RESTORE BUFFER REPLACES THE DATA VALUE FETCHED FROM THE RESTORE BUFFER WITH ANOTHER DATA VALUE FROM THE FIFO BUFFER.

520
MAINTAINING AN ACCOUNTING OF HOW MANY DATA VALUES ARE FETCHED FROM THE RESTORE BUFFER TO THE ITERATION LOOP STRUCTURE.

530
EXECUTING AN ITERATIVE COMPUTATION PROCESS ON THE ITERATION LOOP STRUCTURE WHEREIN EACH ITERATION FAILING TO TERMINATE THE ITERATIVE COMPUTATION PROCESS IS A VALID LOOP ITERATION.

Fig. 5A
`
`
`
U.S. Patent    Nov. 19, 2013    Sheet 6 of 6    US 8,589,666 B2

COUNTING THE VALID LOOP ITERATIONS OF THE ITERATIVE COMPUTATION PROCESS AND PRESERVING A TOTAL NUMBER OF VALID LOOP ITERATIONS.

550
IDENTIFYING TERMINATION OF THE ITERATIVE COMPUTATION PROCESS.

560
HALTING FETCHING DATA VALUES FROM THE RESTORE BUFFER BASED ON IDENTIFICATION OF TERMINATION OF THE ITERATIVE COMPUTATIONAL PROCESS.

570
RESTORING DATA VALUES TO THE RESTORE BUFFER THAT WERE FETCHED FROM THE RESTORE BUFFER BETWEEN TERMINATION OF THE ITERATIVE COMPUTATIONAL PROCESS AND HALTING FETCHING DATA VALUES FROM THE RESTORE BUFFER.

595
END

Fig. 5B
`
`
`
ELIMINATION OF STREAM CONSUMER LOOP OVERSHOOT EFFECTS
`
COPYRIGHT NOTICE/PERMISSION

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the United States Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data described below, inclusive of the drawing figures where applicable: Copyright © 2002, SRC Computers, Inc.
`
`10
`
`15
`
`BACKGROUND OF THE INVENTION
`
`2
`once the logic configurations are loaded into the FPGAs. In
`that manner, the logic configurations provide an optimal fit
`for the application program. Once a compiler has constructed
`and optimized the dataflow graph, the graph is translated to a
`hardware description, expressed in a hardware description
`language. This hardware description can be synthesized by
`hardware synthesis tools to create logic configurations to run
`on reconfigurable hardware. The focus of the present inven
`tion is data stream pipelining as applied to the interconnection
`of loops running on reconfigurable hardware.
`When a C or Fortran loop is a consumer of one or more
`streams of data, and that loop is pipelined, overlapping loop
`iterations are spawned with the goal of doing a maximal
`amount of workinas short a time as possible. To activate loop
`iterations as fast as possible, an iteration of a loop is spawned
`before the loop has determined whether the loop's termina
`tion condition has been reached, resulting in the activation of
`extra loop iterations. Those extra iterations are benign if they
`don’t affect the state of the computation. But when loop
`iterations take values from streams Subsequent to the Star of
`an iteration that meets termination criteria, these extra itera
`tions do affect the state of the computation because they take
`values that should not have been taken.
`For instance, consider this simplified illustration. A math
`ematical operation counting to the sum of 1000 by increments
`of 1 may require a loop iteration of 1000 loops (i.e. add 1 to
`the previous total until the total equal 1000). Upon each clock
`tick, a piece of data is taken from the data stream and added to
`the previous total. Assume that actual computation of adding
`a single increment one number takes two clock ticks, one to
`add the number and one to compare the total to the value
`1000. Therefore, when the total is 999 and the next piece of
`data is provided to the computation loop by the data stream
`the total will reach its termination criteria. However, while the
`computation determines that the total is equal to 1000 and
`signal that the loop should be halted, the next value has
`already been retrieved from the data stream. That last value
`should not have been taken.
`A stream is a data path between a producer and consumer
`of data, where the producer and consumer run concurrently.
`The path between the producer and consumer is made up of a
`data connection, a “valid’ signal, and a reverse direction
`“stall signal. FIG. 1 shows typical signals used in a stream
`connection as is well known and will be recognized by one
`skilled in the relevant art. The use of a First-In-First-Out
`buffer 110, or “FIFO' buffer, removes the need for tight
`synchronization between the producer 120 and consumer
`130. The producer 120 will generate data values 125 at its own
`rate, allowing them to accumulate in the FIFO buffer 110. As
`the FIFO buffer 110 approaches becoming full, it will issue a
`stall signal 140 to the producer 120 so that it will suspend the
`generation of data values 125 until the stall signal is released.
`The consumer 130 will take 150 values 145 from the FIFO
`buffer at its own rate and as the values 145 are available.
`The use of the FIFO buffer, with the valid 135, stall 140 and
`take 150 signals, allows flexible coupling of stream producers
`and consumers. A stream's producer 120 and its consumers
`130 may run at different speeds. For example, when the
`producer 120 runs faster than the consumer 130, then it will
`Stall 140 from time to time as values fill the FIFO buffer.
`When the producer runs slower than the consumer, the FIFO
`will sometimes be empty and the consumer will wait for new
`values to be available.
`When loops, in a high level language such as C or Fortran,
`are compiled for execution on reconfigurable hardware Such
`as FPGAs, aggressive pipelining is employed, with the goal
`of achieving as great a throughput of data as possible. The
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`1. Field of the Invention
`The present invention relates, in general, to pipelined struc
`tures that are produced by reconfigurable hardware compil
`ers, and more particularly to eliminating the effects of stream
`consumer pipelined loop overshoots in reconfigurable hard
`Wa.
`2. Relevant Background
`Reconfigurable computing is a technology receiving
`increased interest in the computing arts. Traditional general
`purpose computing is characterized by computer code
`executed serially on one or more general purpose processors.
`Reconfigurable computing is characterized by programming
`reconfigurable hardware, such as FPGAs to execute logic
`routines.
`Reconfigurable computing offers significant performance
`advances in computation-intensive processing. For example,
`the reconfigurable hardware may be programmed with a logic
`configuration that has more parallelism and pipelining char
`acteristics than a conventional instruction processor. Also, the
`reconfigurable hardware may be programmed with a custom
`logic configuration that is very efficient for executing the
`tasks assigned by the program. Furthermore, dividing a pro
`gram’s processing requirements between the instruction pro
`cessor and the reconfigurable hardware may increase the
`overall processing power of the computer.
`Software programs that are written in a high level language
`like, for example, C or Fortran can be converted by compilers
`targeting reconfigurable hardware into Software that is
`executable on that hardware. Likewise, loop structures in the
`high level language may be converted by a compiler into a
`form that exploits parallelism and pipelining characteristics
`of reconfigurable hardware.
`Data streams provide a flexible method for connecting
`concurrent producer and consumer loops in C and Fortran
`programs written for execution on reconfigurable hardware.
`Streams are also useful as a general interface between a
`reconfigurable hardware chip (typically a Field Program
`55
`mable Gate Array (“FPGA')) and other chips, boards, memo
`ries, disks, real-time signal sources, and the like.
`Reconfigurable hardware thus allows logic configurations
`to be built based upon the desired implementation. A program
`written in a high level programming language such as C or
`Fortran is converted into dataflow graphs that express the
`interconnection of functional units, and that are used to create
`specialized logic configurations. The dataflow graphs are
`analyzed and optimized to provide an optimal set of logic
`configurations for each particular high level program. For
`example, a data buffer in a loop structure can be designed
`based on the program computations that are going to be run
`
`60
`
`65
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2027 - p. 8
`
`
`
`3
`body of an inner loop is converted to a dataflow graph with
`many individual functional units that are derived from the
`operators in the loop body. The FPGA implementations of
`these functional units may have latencies of one or more clock
`ticks, and the interconnection of many functional units will
`result in a loop body that can have new data fed in on every
`clock tick, even though it may take many clock ticks for the
`results of an iteration's data to reach the bottom of the data
`flow graph. Thus many iterations are active at any point in
`time.
`Part of the loop body's dataflow graph consists of the
`expression that computes a loop termination criteria. If a new
`loop iteration is activated on every clock tick, then extra loop
`iterations are activated despite a termination value being
`achieved. For example, when it takes three clocks for the data
`of a given loop iteration to produce the true/false termination
`value, then by the time the termination expression becomes
`true, three additional iterations of the loop will already have
`started. These extra iterations are harmless as long as they are
`not allowed to affect the state of the overall computation.
`However, the data acquired by the loop from the data stream
`may be lost.
`Consider the following producer-consumer loops:
`
`10
`
`15
`
#pragma src parallel sections
{
    #pragma src section
    {
        int i;
        for (i = 0; i < N*M; i++)
            put_stream(&S0, A[i], 1);
    }

    #pragma src section
    {
        ...
    }
}
`The first parallel section puts values to stream S0. The
`second parallel section takes values from S0, but it does this
`in a loop nest. The inner loop of the nest will be pipelined as
the compiler determines that it can spawn a new loop iteration
`on every clock tick because there are neither loop-carried
`dependencies nor multiple accesses to any memory.
`FIG. 2 shows a high level block diagram of part of a
`simplified dataflow structure generated by a compiler for the
`consumer 130 inner loop as is known in the prior art. The
`compiler creates a FIFO buffer 110 for stream S0 210,
`which receives the values from the stream's producer. The
`loop has a synchronization module or node 215 that watches
`the FIFO buffer 110 and spawns a loop iteration whenever it
`takes a value from the FIFO buffer 110, sending into the loop
`body 230 the stream value along with an “iteration' signal
`that tells the loop that a new iteration is being started. The
boxes labeled "R" represent registers 235₁, 235₂, ... 235ₙ that
`hold the various scalar variables and base address values
`referenced in the loop.
Part of the loop body 230 is composed of the loop termination expression. In the example presented above, termination is simply a comparison of i and N. The output of the Compare node 240 goes false when the comparison is false, and the Loop Valid node 250 latches into a false state, telling the Terminate node 260 to emit a "done" pulse telling the Synchronization node 215 that the loop has reached its termination condition.
`Because of pipelining, it can take multiple clock ticks for
`the effects of a value to move through the loop body 230, but
`when there are values available in the FIFO buffer 110, the
`Synchronization node 215 will take a value from the FIFO
`buffer and start a new loop iteration 235 on every clock tick.
`Assuming that the Compare node 240 and Loop Valid node
`250 each have a one-clock tick latency, the Synchronization
node 215 will spawn at least a couple of extra iterations
`before it sees that loop termination has occurred, and those
`extra loop iterations will have taken from the FIFO buffer
`values that should not have been taken.
This effect becomes even more pronounced when the termination expression is more complex. A rather extreme case arises from the following: for (i=0; i<w/v; i++). The divide operation is typically a many-clock latency operation; thirty or more clock ticks is not unusual, so the Synchronization node 215 of this loop 230 could spawn more than thirty extra iterations before it learns that the loop has reached termination.
`The effect of these extra iterations on the consumer loop
`nest is twofold. First, when the inner loop is entered for the
`second time and it begins to fetch values from the stream, it
`will miss the values that had been erroneously fetched by the
`earlier extra spawned iterations. Thus the program will not
`compute correctly. Second, the parallel sections will eventu
ally deadlock, because the loss of values means that the consumer will never see the number of values it is expecting.
`There are important reasons for wanting a consumer loop
`nest to work correctly. In practical applications, there might
`be work to be done once a certain number of values have been
`read from a stream. In such a situation, the approach is to use
`data from the stream in a segmented fashion. It is, however,
`often impractical for the producer to be similarly segmented.
For example, a data array from a distant memory might be the source of the stream, and the relatively high costs of starting up a data transfer of such an array make it beneficial to do one continuous stream rather than a series of small ones.
`A straightforward solution to the problem of loop stream
`overshoot as is known in the prior art is to slow the iteration
`rate of the consumer loop so that it does not activate a new
`iteration until the current iteration's termination expression
`has been computed, but this would result in significant deg
`radation of performance. The degradation would be espe
`cially bad in the cases of termination expressions that have
`large latencies, such as in the earlier example that does a
`divide operation. There the loop would be slowed by a factor
of thirty or more, and such performance is unacceptable.
`There is a need, therefore, for computer systems and com
`puter implemented methodology for eliminating the effect of
`stream pipeline loop overshoot in reconfigurable data struc
`tures. Specifically, a need exists to preserve and restore values
`taken from a buffer as the result of ongoing loop operations
`during which time a valid termination criteria has been
`reached. These and other needs of the prior art are addressed
`by the present invention that is subsequently herein described
`in detail.
`
`SUMMARY OF THE INVENTION
`
`Briefly stated, the present invention involves computer
`implemented methods and computer systems for eliminating
overshoot effects of pipelined consumer iterative computational loops. In a reconfigurable processor environment that invokes data stream pipelining during iterative computations,
`an FPGA or other reconfigured processor is configured to
`place two buffers between the producer of a data stream and
`the data consumer. In addition to a typically found FIFO
`buffer, a restore buffer is interposed between the producer and
`the consumer of the data values. The restore buffer is config
ured to be of sufficient size to maintain a persistent memory of enough data values dispatched to the iterative loop structure so as to restore data values fetched and lost due to pipelined
`consumer loop overshoots.
`In one aspect of the present invention, a restore buffer is
`placed between each incoming stream and its consumer loop.
`The restore buffer stores the data values that were recently
`fetched from the buffer to the consumer so that later they can
`be restored. To determine how many data values should be
`restored, the iterative loop structure counts the number of the
`data values it takes from each data stream. The iterative loop
`structure also counts the number of valid loop iterations that
`take place. Once a loop termination is detected, the iterative
`loop structure halts fetching data values from the restore
`buffer and compares, for each stream, the number of loop
`iterations with the number of values fetched from each data
`stream. The difference is the number of extra data values that
`were taken from the restore buffer and must be restored. The
`iterative loop structure then directs the restore buffers to
`restore those values.
`The foregoing and other features, utilities and advantages
`of the invention will be apparent from the following more
`particular description of an embodiment of the invention as
`illustrated in the accompanying drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The aforementioned and other features and objects of the
`present invention and the manner of attaining them will
`become more apparent and the invention itself will be best
`understood by reference to the following description of a
`preferred embodiment taken in conjunction with the accom
`panying drawings, wherein:
`FIG. 1 shows a high level block diagram of a data structure
`having typical signals used in a stream connection between
`data producers and data consumers as known in the prior art;
`FIG. 2 shows a high level block diagram of part of a
`simplified data structure generated by a compiler for a con
sumer of data having an inner iteration loop as known in the
`prior art;
`FIG. 3 shows a high level block diagram of a data flow
`structure for eliminating the effect of stream pipeline loop
`overshoots in accordance with one embodiment of the present
`invention;
`FIG. 4 shows a high level block diagram of one embodi
`ment of a Synchronization node of a pipelined loop structure
for eliminating the effect of pipeline loop overshoots in accor-
`dance with the present invention; and
`FIG. 5A and FIG. 5B comprise a flow diagram of one
`method embodiment of eliminating the effects of pipeline
`loop overshoots according to the present invention.
`The Figures depict embodiments of the present invention
`for purposes of illustration only. One skilled in the art will
`readily recognize from the following discussion that alterna
`tive embodiments of the structures and methods illustrated
`herein may be employed without departing from the prin
`ciples of the invention described herein.
`
`DETAILED DESCRIPTION OF THE PREFERRED
`EMBODIMENTS
`
`Computer systems and computer implemented methods
for eliminating the effects of loop overshoots in the consumption of pipelined data streams are discussed in detail herein. In
`a pipelined data stream environment comprising a producer
`of data and a consumer of that data, the effects of consumer
`loop overshoots are eliminated by interposing, in one
`embodiment of the present invention, two buffers between the
`producer and consumer. As previously described, a typical
`configuration of a pipelined data stream structure places a
`single buffer between the producer and the consumer. Such a
`single buffer is typically configured to absorb differences in
`the production and consumption rate of the data streams, but
does little to prevent the loss of data due to consumer loop
`overshoots.
`Consumer loops are simple iterative processes that operate
`to provide a particular result. As a simple example, an addi
`tion operation may necessitate an iterative loop until a certain
`value is obtained. Consumer loops receive or fetch data values
`from a buffer and begin the looping process. Once launched,
`the computations continue until completed. Thus, using our
`simple addition example, the loop computation may comprise
`fetching a value, adding the value to the existing total, and
then comparing the value to a predetermined number to deter-
`mine if a termination criteria has been reached. This process
`may take two or three clock ticks of the processor. Thus even
`though there is additional data in the buffer available to the
`consumer, there is a lag between when a value has been
`fetched and when the loop has determined that it should
`terminate.
`In reconfigurable computing, iterative loops such as those
`described above, are pipelined. Pipelining, as is well known
`to one skilled in the relevant art, is a process where instead of
waiting for the completion of a particular loop iteration before
`fetching the next values and starting the next iteration, the
`loop aggressively fetches the next set of values on the very
`next clock tick, thereby starting a new iteration on each clock
`tick. Thus, there are several values, representing computa
`tions of several iterations, in the computational pipeline.
`Maintaining several loop iterations simultaneously for the
`same computation significantly enhances the efficiency of the
`computational process but as previously mentioned presents
`a problem when the termination criteria is eventually reached.
`When the termination criteria is reached, the loop has over
`shot the number of values it should have taken from the buffer.
`Under the prior art, this data would have been lost.
The present invention eliminates the loss of data due to consumer loop overshoot by providing, in one embodiment, an
`additional buffer interposed between the producer and the
`consumer that remembers and can restore the data values that
`were fetched from the buffer after the value that resulted in the
`loop meeting termination criteria was fetched.
`A high level block diagram of a system for eliminating the
`effects of pipelined loop overshoots according to one embodi
ment of the present invention is shown in FIG. 3. As shown, a consumer 130 is coupled to a series of data streams 145₁, 145₂, ... 145ₙ providing data originating from a producer (not shown). Interposed between the consumer 130 and the producer, and associated with each data stream 145, are two buffers. The first buffer 110₁, 110₂, ... 110ₙ absorbs differences in the data flow rate between the producer 120 and the consumer 130. The second buffer, also referred to herein as the restore buffer, 310₁, 310₂, ... 310ₙ, maintains data values for a period sufficient to allow the computations conducted by the consumer 130 to utilize a pipeline dataflow architecture, without the risk of data value loss. The restore buffer 310
keeps a persistent memory of data values fetched to the consumer so as to restore any overshoot values sent to the consumer as a function of loop computations. In other embodiments, the functionality of the restore buffer and that of the first buffer can be combined into a single buffer.
`The consumer 130 comprises, at least in part, an iteration
`loop structure 305 that performs a loop computation. Associ
`ated with the iteration loop structure 305 is a Synchronization
node 315, various registers 335₁, 335₂, ... 335ₙ that hold
`various scalar variables and base address values referenced in
`the loop, the inner loop body 330 that conducts the computa
`tional loop and includes a Compare node 340, a Valid node
`350 that determines when termination criteria has been met,
`and finally a Terminate node 360. According to one embodi
`ment of the present invention, when the Compare node 340
`goes false, the Valid node 350 communicates with the Termi
`nate node 360 causing it to emit a message to the Synchroni
`zation node 315 informing it that the termination criteria has
`been reached. As will be apparent to one skilled in the relevant
`art, other implementations of a loop structure are possible
`without altering the scope of the present invention and are
`indeed contemplated by the present invention.
`Upon receiving a message from the Terminate node 360
`indicating that the termination criteria for the current compu
`tational loop has been met, the Synchronization node 315
`halts fetching data values. During the period from when the
`value that resulted in the termination criteria being met was
`fetched and the Synchronization node 315 halting the fetch
`ing process, two possible edge cases, and any combination in
`between, can occur. The first edge case assumes that the data
stream buffer 110, 310 that feeds the iteration loop structure
`305 is never empty. In such a case, the iteration loop structure
`315 will take a value from the data stream 145 on every clock
`tick. The number of extra data values taken from the restore
`buffer 310 prior to receiving a signal to halt is, in this scenario,
`at its maximum. The other edge case occurs when the data
stream 145 empties with the last valid iteration. The last value
`taken from the data stream 145 results in the termination
`criteria being met and before another value is removed from
`the buffer 310, or is available to be removed, the Synchroni
`zation node 315 halts the fetching process. Thus there are no
`data values that are lost due to loop overshoot by the iteration
`loop structure 305.
`According to one embodiment of the present invention,
these and other data value overshoot scenarios are eliminated
`by counting the number of data values fetched from the
`restore buffer 310 of each data stream and comparing it to the
`number of valid loop iterations. Since the number of data
values lost depends on the times at which data
`values arrive in the data stream buffer 145, the compiler that
`configures the computational structure for each computation
`cannot predict the number of values that will be lost from loop
`overshoot. It can, however, determine the worst case scenario
by examining the dataflow graph of the loop and summing the
`latencies (in clock ticks) of the nodes on the path from the
`Synchronization node 315 to the termination signal being
`generated by the Terminate node 360.
`In the case of multiple incoming data streams 145, the
`dynamic behavior of each incoming data stream 145 may be
`different, so that the number of overshoot values may differ
`among the streams 145. Thus, an individual count of the
`number of data values fetched must be kept for each stream.
`But since the compiler can determine an upper bound for the
`number of extra values taken, it can specify a restore buffer
`310 of a sufficient size to retain that many values, thereby
`guaranteeing that all of the required values will be available,
`even in the worst case.
`FIG. 4 shows a high level block diagram of one embodi
`ment of a Synchronization node 315 of a pipelined loop
structure for eliminating the effect of pipeline loop overshoots in accordance with the present invention. The Syn-
chronization node 315 is coupled to a restore buffer 310₁, 310₂ for each of a plurality of data streams. The Synchroni-
`zation node 315 is further coupled to the Valid node 350 and
`the Terminate node 360 (both not shown) and is capable of
`receiving signals informing of termination of the computa
`tional loop. Each restore buffer 310 is associated with a data
value fetch counter 410₁, 410₂ that counts the number of
`fetches, or takes, conducted by the Synchronization node 315
`for that particular buffer 310, thus determining a data fetch
`value. Depending on the state of the data stream, this number
`of fetches could be as high as one data value fetch per clock
`cycle over the length of the loop computation. A loop iteration
`counter 440 is also maintained, in this embodiment of the
`present invention, in the Synchronization node 315. The loop
`iteration counter 440 tracks the number of valid loop itera
`tions determining a loop iteration value.
As shown in FIG. 4, the loop iteration counter 440 and each data value fetch counter 410 are in communication with a subtract node 450₁, 450₂. A subtract node 450 is associated
`with each restore buffer 310. Upon the Synchronization node
`315 receiving a termination signal from the Terminate node
`360, the subtract node 450 calculates the difference between
`the data fetch value supplied by the data value fetch counter
`410 and the loop iteration value supplied by the loop iteration
`counter 440. The result is a restore value number that is
`communicated to the restore buffer 310 along with an
`enabling signal telling the restore buffer 310 to restore that
`number of data values.
`In one embodiment of the present invention, the size of the
`restore buffer 310 is determined by two characteristics of the
`iteration loop structure 305. The first characteristic is the
`distance, measured in clock ticks, between the Synchroniza
`tion node 315 and the Loop V