(12) Patent Application Publication    (10) Pub. No.: US 2010/0070730 A1
Pop et al.                             (43) Pub. Date: Mar. 18, 2010

(54) MINIMIZING MEMORY ACCESS CONFLICTS OF PROCESS COMMUNICATION CHANNELS

(76) Inventors: Sebastian Pop, Austin, TX (US); Jan Sjodin, Austin, TX (US); Harsha Jagasia, Austin, TX (US)

Correspondence Address:
MEYERTONS, HOOD, KIVLIN, KOWERT & GOETZEL (AMD)
P.O. BOX 398
AUSTIN, TX 78767-0398 (US)

(21) Appl. No.: 12/212,370

(22) Filed: Sep. 17, 2008

Publication Classification

(51) Int. Cl.
     G06F 12/00 (2006.01)
(52) U.S. Cl. ................................. 711/167; 711/E12.001

(57) ABSTRACT
A system and method for minimizing cache conflicts and synchronization support for generated parallel tasks within a compiler framework. A compiler comprises library functions to generate a queue for parallel applications and divides it into windows. A window may be sized to fit within a first-level cache of a processor. Application code with producer and consumer patterns within a loop construct has these patterns split into producer and consumer tasks. Within a producer task loop, a function call is placed for a push operation that modifies a memory location within a producer sliding window without a check for concurrent accesses. A consumer task loop has a similar function call. At the time a producer or consumer task is ready to move, or slide, to an adjacent window, its corresponding function call determines if the adjacent window is available.

[Front-page figure: Memory Hierarchy 400, duplicating FIG. 4: main memory 450 with consumer sliding window 462; shared cache memory 440 with consumer sliding window 442 and producer sliding window 444; units 420a and 420b, each with cache memories 416/412 holding data copies 418/414; processor cores 112a and 112b]

[FIG. 1 (Sheet 1 of 7): Processing Node 100, showing processing units 115a and 115b, each with a processor core 112a/112b and a cache memory subsystem 114a/114b; packet processing logic 116; and shared cache memory subsystem 118]

[FIG. 2 (Sheet 2 of 7): Method 200, a flow of four blocks: generate source code (210); compile (front-end) source code to intermediate representation (IR) (220); compile (back-end) IR to machine code (230); execute machine code (240)]

[FIG. 3A-3C (Sheet 3 of 7): Flow Dependences 300. FIG. 3A: written elements 302 connected to read elements 306 by one-to-one data dependences 304. FIG. 3B: written elements 312 connected to read elements 316 by shifted dependences 314, with dependences arriving from a prior producer and leaving to a later consumer. FIG. 3C: written elements 322 connected to read elements 326 by irregular data dependences 324]

[FIG. 4 (Sheet 4 of 7): Memory Hierarchy 400: main memory 450 holding consumer sliding window 462; shared cache memory 440 holding consumer sliding window 442 and producer sliding window 444; cache memories 416a/412a with data copies 418a/414a above processor core 112a; unit 420b with cache memories 416b/412b and data copies 418b/414b above processor core 112b]

[FIG. 5 (Sheet 5 of 7): Method 500: begin compile (back-end) of IR (502); encounter a loop in IR (504); are flow dependences computable? (506); if no, use other algorithm on loop (510); if yes, partition the computations into tasks and loops (508), enclose both loops in compiler directives for concurrent execution (512), generate calls for shared queue management (514), and generate any necessary alignment between producer/consumer tasks (516); finish generating machine code and link libraries into the final binary (518); begin execution (520)]

[FIG. 6 (Sheet 6 of 7): Method 600: open producer parallel task (602); encounter first push operation; is stream full? (606); if yes, wait for consumer window to slide (608); push element(s) into window and update pointers (610); execute a loop instruction (620); on encountering a push operation (612), push element(s) into window and update pointers (614); when the end of the loop is reached (618), set end-of-stream flag and update pointers (622)]

[FIG. 7 (Sheet 7 of 7): Method 700: open consumer parallel task (702); execute a loop instruction (704); encounter first pop operation; if stream empty, wait for producer window to slide (710); pop element(s) from window and update pointers (712); on encountering a pop operation (714), pop element(s) from window and update pointers (716); when the end of the loop is reached (720) and an alignment call is encountered, pop element(s) from window (724) and update pointers (726)]

MINIMIZING MEMORY ACCESS CONFLICTS OF PROCESS COMMUNICATION CHANNELS

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention
[0002] This invention relates to computer systems, and more particularly, to minimizing cache conflicts and synchronization support for generated parallel tasks within a compiler framework.
[0003] 2. Description of the Relevant Art
[0004] Both hardware and software determine the performance of computer systems. It is becoming difficult for hardware design to deliver more performance, due to cross-capacitance effects on wires, parasitic inductance effects on wires, and electrostatic field effects within transistors, all of which increase on-chip circuit noise and propagation delays. Additionally, continuing decreases in the geometric dimensions of devices and metal routes may increase these effects. Also, the number of switching nodes per clock period increases as more devices are placed on-chip, and, thus, power consumption increases. These noise and power effects limit the operational frequency, and, therefore, the performance of the hardware.
[0005] While the reduction in on-chip geometric dimensions discussed above may lead to larger caches and multiple cores placed on each processor, software and software programmers cannot continue to depend on ever-faster hardware to hide inefficient code. In some cases hardware traits may increase performance if parallel applications, such as multi-threaded applications, executing on multi-core processors and chips ensure that concurrent accesses by multiple threads to shared memory, such as the multiple first-level caches, are synchronized. This synchronization ensures correctness of operations but may limit peak performance; locking mechanisms, such as semaphores or otherwise, are one example that ensures correctness at a cost to peak performance.
[0006] Some lock-free mechanisms may be used that allow a thread to complete read and write operations to shared memory without regard for the operations of other threads. Each operation may be recorded in a log. Before a commit of the thread occurs, validation is performed. If it is found that other threads concurrently modified the same accessed memory locations, then re-execution is required. This re-execution may limit peak performance.
[0007] Another option for increasing the performance of parallel applications from a software point-of-view is the management of scratchpad memories, such as concurrent first-in-first-out (FIFO) queues implemented in a cache hierarchy. Many lock-free algorithms are based on compare-and-swap (CAS) operations. The CAS operation takes as arguments the address of a shared memory location, an expected value, and a new value. If the shared location currently holds the expected value, it is assigned the new value atomically (an atomic operation may generally represent one or more operations that have the appearance of a single operation to the rest of the system). A Boolean return value may be used to indicate whether the replacement occurred.
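In C, the primitive just described can be sketched as follows. This is an illustrative sketch only, shown with the GCC/Clang __atomic builtin; real implementations map to hardware instructions such as x86 CMPXCHG.

#include <stdbool.h>
#include <stdint.h>

/* CAS: if *addr still holds expected, atomically assign new_val and
 * return true; otherwise leave *addr unchanged and return false. */
static bool cas(uintptr_t *addr, uintptr_t expected, uintptr_t new_val)
{
    return __atomic_compare_exchange_n(addr, &expected, new_val,
                                       false /* strong, not weak */,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}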
[0008] However, CAS algorithms must deal with the "ABA" problem, wherein a process reads a value A from a shared location, computes a new value, and then attempts a CAS operation. Between the read operation and the CAS operation, other processes may change the value in the shared location from A to B, do other work, and then change the value in the shared location back to A again. Now the first thread is still executing under the assumption that the value has not changed, even though another thread did work that violates that assumption. The CAS operation will succeed when it should not. Another example occurs within a lock-free queue: a data element may be removed from a list within the queue and deleted, and then a new data element is allocated and added to the list. It is common for the allocated new data element to be at the same location as the deleted data element, due to optimizations in the memory manager. A pointer to the new data element may be equal to a pointer to the old data element.
[0009] One solution to the above problem is to associate tag bits, or a modification counter, with a data element. The counter is incremented with each successful CAS operation. Due to wraparound, the problem is reduced, but not eliminated. Some architectures provide a double-word CAS operation, which allows for a larger tag. However, on-chip real estate increases, as do matching-circuitry delays. Another solution performs a CAS operation in a fetch-and-store-modify-CAS sequence, rather than the usual read-modify-CAS sequence. However, this solution makes the algorithm blocking rather than non-blocking.
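The tag-bit scheme of the preceding paragraph can be sketched in C11 as follows. This is an illustrative sketch under stated assumptions: the names are hypothetical, and it presumes the target supports an atomic compare-exchange on the 16-byte pair (for example, x86-64 CMPXCHG16B, possibly requiring -latomic).

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A pointer paired with a modification counter: even if the pointer
 * value recurs (the ABA case), the counter has advanced, so a stale
 * CAS fails as it should. */
typedef struct {
    void    *ptr;
    uint64_t count;
} tagged_ptr;

static bool tagged_cas(_Atomic tagged_ptr *loc,
                       tagged_ptr expected, void *new_ptr)
{
    /* the counter is incremented on every successful CAS */
    tagged_ptr desired = { new_ptr, expected.count + 1 };
    return atomic_compare_exchange_strong(loc, &expected, desired);
}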
[0010] A wait-free algorithm is both non-blocking and starvation-free: the algorithm guarantees that every active process will make progress within a bounded number of time steps. An example of a wait-free algorithm is given in L. Lamport, Specifying Concurrent Program Modules, ACM Transactions on Programming Languages and Systems, Vol. 5, No. 2, April 1983, pp. 190-222. However, Lamport presents a wait-free algorithm that restricts concurrency to a single enqueued element and a single dequeued element, and the frequency of occurrence of required synchronization is not reduced.
[0011] In view of the above, efficient methods and mechanisms for improving the generation of parallel tasks from sequential code, and the execution of the parallel tasks while minimizing cache conflicts, are desired.

SUMMARY OF THE INVENTION

[0012] Systems and methods for minimizing cache conflicts and synchronization support for generated parallel tasks within a compiler framework are contemplated. Application code has producer and consumer patterns in a loop construct divided into two corresponding loops, or tasks. An array's elements, or a subset of the elements, are updated or modified in the producer task. The same array elements, or a subset of them, are read out in the consumer task in order to compute a value for another variable within the loop construct. In one embodiment, a method comprises dividing a stream into windows, wherein a stream is a circular first-in, first-out (FIFO) shared storage queue. In one window, a producer task is able to modify memory locations within a producer sliding window without checking for concurrent accesses to the corresponding elements.
[0013] Upon filling the producer sliding window, the producer task may move to an adjacent window to continue computational work. However, at this moment of moving, or sliding, to an adjacent window, the producer task verifies that a consumer task is not reading values from this adjacent window. The method performs similar operations for a consumer task, wherein a consumer task is able to read memory locations within a consumer sliding window without checking for concurrent accesses to the corresponding elements. Since synchronization is limited to the times when a producer or a consumer task completes its corresponding operations in a window and the task is ready to move to an adjacent window, the penalty for synchronization may be reduced.
[0014] In various embodiments, a compiler comprises library functions that may be either placed in an intermediate representation of an application code by back-end compilation or placed in source code by a software programmer. These library functions are configured to generate a stream and divide it into windows. A window is sized both to fit within a first-level cache of a processor and to reduce the chances of eviction from this first-level cache. Within a producer task loop, a function call may be placed for a push operation that modifies a memory location within a producer sliding window without checking for concurrent accesses to the corresponding elements. Likewise, within a consumer task loop, a function call may be placed for a pop operation that reads a memory location within a consumer sliding window without checking for concurrent accesses to the corresponding elements. At the time a window is filled by a producer task, or a window is emptied by a consumer task, the corresponding function call determines whether or not an adjacent window is available for continued work. If an adjacent window is available, the corresponding task moves, or slides, from its current window to the adjacent window. Otherwise, the corresponding task waits until the adjacent window is available.

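As a concrete illustration of the push-and-slide behavior just described, consider the following minimal single-producer, single-consumer sketch. All names (stream_t, stream_push, the sizes) are hypothetical, not the actual library interface, and the synchronization is simplified to C11 atomics.

#include <stdatomic.h>

#define WINDOW_ELEMS 2048                 /* sized to fit an L1 cache */
#define NUM_WINDOWS  8                    /* must exceed 2 in this
                                           * simplified scheme        */
#define STREAM_ELEMS (WINDOW_ELEMS * NUM_WINDOWS)

typedef struct {
    long       buf[STREAM_ELEMS];  /* circular FIFO stream            */
    int        prod_pos;           /* producer's local write index    */
    int        cons_pos;           /* consumer's local read index     */
    atomic_int prod_win;           /* window the producer occupies    */
    atomic_int cons_win;           /* window the consumer occupies    */
} stream_t;

/* Push one element. The write itself performs no concurrency check;
 * synchronization happens only when the current window fills and the
 * producer must slide into the adjacent window. */
void stream_push(stream_t *s, long e)
{
    s->buf[s->prod_pos] = e;
    s->prod_pos = (s->prod_pos + 1) % STREAM_ELEMS;
    if (s->prod_pos % WINDOW_ELEMS == 0) {        /* window filled */
        int next = s->prod_pos / WINDOW_ELEMS;
        while (atomic_load(&s->cons_win) == next)
            ;   /* adjacent window still being read; wait to slide */
        atomic_store(&s->prod_win, next);
    }
}

The matching pop operation is sketched near the end of the detailed description below.
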
BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a generalized block diagram illustrating one embodiment of an exemplary processing node.
[0016] FIG. 2 is a flow diagram of one embodiment of a static compiler method.
[0017] FIG. 3A is a generalized block diagram illustrating one embodiment of a source code pattern with regular data dependences within an array.
[0018] FIG. 3B is a generalized block diagram illustrating one embodiment of a source code pattern with shifted data dependences within an array.
[0019] FIG. 3C is a generalized block diagram illustrating one embodiment of a source code pattern with irregular data dependences within an array.
[0020] FIG. 4 is a generalized block diagram illustrating one embodiment of a memory hierarchy that supports producer and consumer sliding windows.
[0021] FIG. 5 is a flow diagram illustrating one embodiment of a method for automatic parallelization.
[0022] FIG. 6 is a flow diagram illustrating one embodiment of a method for executing a producer task.
[0023] FIG. 7 is a flow diagram illustrating one embodiment of a method for executing a consumer task.
[0024] While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION

[0025] In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, one having ordinary skill in the art should recognize that the invention may be practiced without these specific details. In some instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the present invention.
[0026] FIG. 1 is a block diagram of one embodiment of an exemplary processing node 100. Processing node 100 may include a memory controller 120, interface logic 140, and one or more processing units 115a-115b. As used herein, elements referred to by a reference numeral followed by a letter may be collectively referred to by the numeral alone. For example, processing units 115a-115b may be collectively referred to as processing units 115. Processing units 115 may each include a processor core 112 and a corresponding cache memory subsystem 114. Processing node 100 may further include packet processing logic 116 and a shared cache memory subsystem 118. In one embodiment, the illustrated functionality of processing node 100 is incorporated upon a single integrated circuit.
[0027] Generally, packet processing logic 116 is configured to respond to control packets received on the links to which processing node 100 is coupled, to generate control packets in response to processor cores 112 and/or cache memory subsystems 114, to generate probe commands and response packets in response to transactions selected by memory controller 120 for service, and to route packets for which node 100 is an intermediate node to other nodes through interface logic 140. Interface logic 140 may include logic to receive packets and synchronize the packets to an internal clock used by packet processing logic 116.
[0028] Cache subsystems 114 and 118 may comprise high-speed cache memories configured to store blocks of data. Cache memory subsystems 114 may be integrated within respective processor cores 112. Alternatively, cache memory subsystems 114 may be coupled to processor cores 112 in a backside cache configuration or an inline configuration, as desired. Still further, cache memory subsystems 114 may be implemented as a hierarchy of caches. Caches that are nearer processor cores 112 (within the hierarchy) may be integrated into processor cores 112, if desired. In one embodiment, cache memory subsystems 114 each represent L2 cache structures, and shared cache subsystem 118 represents an L3 cache structure.
[0029] Both the cache memory subsystems 114 and the shared cache memory subsystem 118 may include a cache memory coupled to a corresponding cache controller. For the shared cache memory subsystem 118, the cache controller may include programmable logic in order to programmably enable storage of directory entries within locations of subsystem 118.
[0030] Processor cores 112 include circuitry for executing instructions according to a predefined instruction set. For example, the x86 instruction set architecture may be selected. Alternatively, the Alpha, PowerPC, or any other instruction set architecture may be selected. Generally, each processor core 112 accesses its cache memory subsystem 114 for data and instructions. If the requested block is not found in cache memory subsystem 114 or in shared cache memory subsystem 118, then a read request may be generated and transmitted to the memory controller within the node to which the missing block is mapped.
[0031] Referring to FIG. 2, one embodiment of a static compiler method 200 is shown. Software applications may be written by a designer in a high-level language such as C, C++, Fortran, or other in block 210. This source code may be stored on a computer readable medium. A command instruction, which may be entered at a prompt by a user, with any necessary options, may be executed in order to compile the source code. One example of a compiler is the GNU Compiler Collection (GCC), which is a set of compilers produced for various programming languages. However, other examples of compilers are possible and contemplated.
[0032] In block 220, in one embodiment, the front-end compilation translates the source code to an intermediate representation (IR). Syntactic and semantic processing, as well as some optimizations, may be performed at this step. The back-end compilation in block 230 translates the IR to machine code. The back-end may perform more transformations and optimizations for a particular computer architecture and processor design. For example, a processor is designed to execute instructions of a particular instruction set architecture (ISA), but the processor may have one or more processor cores. The manner in which a software application is executed (block 240) in order to reach peak performance may differ greatly between a single-, a dual-, or a quad-core processor. Other designs may have eight cores. Regardless, the manner in which to compile the software application in order to achieve peak performance may need to vary between a single-core and a multi-core processor.
[0033] Lock contention may be used to prevent potential overlapped accesses to shared memory, such as the caches in memory subsystems 114 and 118 in FIG. 1. However, it also reduces performance when cores are in a wait state until the lock is removed. Transactional memory may be used to prevent halted execution. However, if a memory conflict is later found during a validation stage, a particular thread may roll back its operations to a last validated checkpoint or the start of the thread and begin re-execution. In another embodiment, the thread may be aborted and rescheduled for execution at a later time.
[0034] The above examples of synchronization are used to ensure a producer and a consumer do not concurrently access the same memory location. A producer may correspond to a processor core, such as processor core 112a in FIG. 1, supplying data, such as in an array, while executing a task or a thread. A consumer may correspond to a processor core, such as processor core 112b in FIG. 1, retrieving data, such as in an array, while executing a parallel task or a parallel thread. For example, referring to FIG. 3A, one embodiment of a source code pattern with regular data dependences within an array is shown. Automatic parallelization techniques are based on data dependence analysis information.
[0035] Here, an array has its elements updated, such as written elements 302, and subsequently these elements are read out, which are represented by read elements 306. FIG. 3A represents a regular data flow relation in which all the elements written by the producer are read by a same consumer. This one-to-one correlation is depicted by data dependences 304. For example, suppose a designer has written source code that contains the below code segment, now in the IR:

for (i = 0; i < n; i++) {            /* line 1 */
    a[i] = variable1 op variable2;   /* line 2 */
    variable3 = a[i] op variable4;   /* line 3 */
}                                    /* line 4 */

[0036] The compiler may split this loop into two parallel loops:

/* Producer Task */
for (i = 0; i < n; i++) {            /* line 5 */
    a[i] = variable1 op variable2;   /* line 6 */
}                                    /* line 7 */

/* Consumer Task */
for (i = 0; i < n; i++) {            /* line 9 */
    variable3 = a[i] op variable4;   /* line 10 */
}                                    /* line 11 */

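In the stream form contemplated later in this description, the compiler would additionally replace the shared-array accesses with queue calls. A hedged sketch, reusing the hypothetical stream_push/stream_pop names from the sketch in the summary above, where s denotes the stream created for these two tasks (the actual library function names are not specified here):

/* Producer Task: each write becomes a push into the stream */
for (i = 0; i < n; i++) {
    stream_push(s, variable1 op variable2);
}

/* Consumer Task: each read becomes a pop from the stream */
for (i = 0; i < n; i++) {
    variable3 = stream_pop(s) op variable4;
}
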
[0037] The automatic parallelization techniques of a compiler, such as in a runtime parallelization library, are based on data dependence analysis information. This static analysis determines the relations between memory accesses and allows the analysis of dependences between computations via memory accesses. This in turn allows task partitioning, and data and computation privatization. Data dependences represent a relation between two tasks. In the case of flow dependences, the dependence relation is between a task that writes data and another task that is reading it.
[0038] In FIG. 3B, only a part of the elements of an array are written, making a part of the read elements dependent on a previous producer. Likewise, it may occur that only a part of the elements of an array are read, making part of the written elements dependent on a later consumer. Here, an array has its elements partially updated, such as written elements 312, and subsequently these elements are read out, which are represented by partial read elements 316. FIG. 3B represents a shifted data flow relation in which part of the elements written by the producer are read by a same consumer. This type of shifted data dependence is depicted by data dependences 314. For example, suppose a designer has written source code that contains the below code segment, now in the IR:

for (i = 0; i < n; i++) {            /* line 12 */
    a[i+2] = variable1 op variable2; /* line 13 */
    variable3 = a[i] op variable4;   /* line 14 */
}                                    /* line 15 */

[0039] The compiler may split this loop into two parallel loops:

/* Producer Task */
for (i = 1; i <= n; i++) {           /* line 17 */
    a[i+2] = variable1 op variable2; /* line 18 */
}                                    /* line 19 */

/* Consumer Task */
for (i = 1; i <= n; i++) {           /* line 21 */
    variable3 = a[i] op variable4;   /* line 22 */
}                                    /* line 23 */

[0040] FIG. 3C represents a more difficult dependence relation that may not be determined at compile time. Here, an array may have its elements fully or partially updated, such as written elements 322, and subsequently these elements are read out, which are represented by partial read elements 326. However, the reading out of the elements may not be in array-order in the sequential code. FIG. 3C represents an irregular data flow relation not known at compile time. This type of irregular data dependence is depicted by data dependences 324. The read elements 326 may be accessed by an indirection. For example, suppose a designer has written source code that contains the below code segment, now in the IR:

for (i = 0; i < n; i++) {              /* line 24 */
    a[i] = variable1 op variable2;     /* line 25 */
    variable3 = a[b[i]] op variable4;  /* line 26 */
}                                      /* line 27 */

[0041] In the above example, the read elements are accessed by an indirection and, accordingly, the consumer task may then be considered dependent on the completion of the producer task.
[0042] Turning now to FIG. 4, one embodiment of a memory hierarchy 400 that supports producer and consumer sliding windows is shown. Processing units 420 may be similar to processing units 115 of FIG. 1. The same circuitry as for processor cores 112 of FIG. 1 may be used here as well. In one embodiment, the cache subsystem is shown as two levels, 412 and 416, but other embodiments may be used as well. Cache memory 412 may be implemented as an L1 cache structure and may be integrated into processor core 112, if desired. Cache memory 416 may be implemented as an L2 cache structure. Other embodiments are possible and contemplated. Interfaces are not shown here, as they are in FIG. 1, for simpler illustrative purposes. A shared cache memory 440 may be implemented as an L3 cache structure. Here, the shared cache memory 440 is shown as one level, but other levels and implementations are possible.
[0043] Again, an interface, such as a memory controller, to main memory 450 is not shown for simpler illustrative purposes. Main memory 450 may be implemented as dynamic random-access memory (DRAM), dual in-line memory modules (DIMMs), a hard disk, or otherwise.
[0044] The transfer of data between tasks as shown in FIGS. 3A-3C may occur via a communication channel called a stream. A stream may be a circular buffer managed as a FIFO concurrent lock-free queue. Concurrent FIFO queues are widely used in parallel applications and operating systems. A stream may be implemented in the memory hierarchy 400, such as in stream copies 440 and 460. The most updated contents of a stream may be in stream copies located closest to a processor core, such as stream copy 440.
[0045] As shown above in FIGS. 3A-3C, a loop in source code with a regular or shifted flow dependence may be split into two parallel tasks, such as a producer task and a consumer task. Each task may be provided a sliding window, or local buffer, within the stream. For example, in one embodiment, processor core 112b may be assigned a producer task as depicted by lines of code 5-7 above, and processor core 112a may be assigned a consumer task as depicted by lines of code 9-11 above. Stream 460 may be created for the parallel tasks in code lines 5-11. Stream 440 may be a more up-to-date copy of stream 460.
[0046] Producer sliding window 444, and correspondingly 464, may be empty and designated for storing data produced by core 112b. In fact, a snapshot in the middle of code execution may show core 112b has produced data for the loop of lines 5-7, which is presently stored in filled space 446. The pointers within stream 440 may have been updated, and core 112b is now allowed to begin filling producer sliding window 444 with new data. More up-to-date copies of producer sliding window 444 may be found in the cache hierarchy, such as data copies 418b and 414b. In fact, the size of the producer sliding window and its copies may be chosen so that the window remains located in the cache closest to processor core 112b with a low chance of being evicted. Therefore, cache conflicts may be reduced.
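For instance (illustrative numbers only; the description fixes no sizes), the window size might be derived from the first-level cache capacity:

/* Illustrative sizing: with an assumed 64 KB L1 data cache and 8-byte
 * elements, capping a window at half the cache leaves room for the
 * task's other working set, lowering the chance of eviction. */
enum {
    L1_BYTES     = 64 * 1024,
    ELEM_BYTES   = 8,
    WINDOW_ELEMS = (L1_BYTES / 2) / ELEM_BYTES   /* 4096 elements */
};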
[0047] While processor core 112b is executing a producer task, producing data for stream copies 440 and 460, and filling data copy 414b to be subsequently sent to producer sliding windows 444 and 464, processor core 112a may be concurrently executing a consumer task, reading data from stream copy 440, and reading from data copy 414a, which was previously read from consumer sliding window 442. Consumer sliding window 442, and correspondingly 462, may be full and designated for reading data by core 112a. In fact, a snapshot in the middle of code execution may show core 112a is reading data for the loop of lines 9-11, which is presently stored in filled space 446. The pointers within stream 440 may have been updated, and core 112a is now allowed to begin reading consumer sliding window 442, which has new data. More up-to-date copies of consumer sliding window 442 may be found in the cache hierarchy, such as data copies 418a and 414a. The size of the consumer sliding window may be the same as the size of the producer sliding window for ease of implementation. Alternatively, when core 112a has permission to read from a window due to no overlap of the producer and consumer sliding windows, rather than read data from stream copy 440, core 112a may send a communicative probe to locate the required data. If a copy of the required updated data is in cache 416b or 412b of processing unit 420b, then a copy of the required data may be sent from processing unit 420b to cache 412a.
[0048] As can be seen in FIG. 4, as long as the producer sliding window 444 does not overlap the consumer sliding window 442, both cores 112a-112b may execute in parallel without conflicting memory accesses. In fact, the only penalty for synchronization occurs when a producer sliding window 444 or a consumer sliding window 442 needs to slide within stream 440. A check must be performed to ensure there is no overlap between the windows. Further details are provided later. Of course, consumer sliding window 442 cannot begin reading and sliding until producer sliding window 444 has filled at least one window within stream 440 and subsequently moved to another window within stream 440. This is a small overhead price to be paid during initial execution of the parallel tasks. By allowing cores 112a-112b to concurrently execute on data within copies of sliding windows 442 and 444, and only needing to perform synchronization when a window needs to slide, peak performance of a system may be more easily achieved.
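The consumer side of the hypothetical stream_t sketched in the summary above is symmetric; again, this is an illustrative sketch, not the actual library code. Reads within the current window are unchecked, and the consumer blocks only at a window boundary while the producer still occupies the next window, which also covers the start-up case just described:

/* Pop one element. At a window boundary the consumer waits until the
 * producer has slid out of the next window; this also delays the very
 * first read until the producer has filled window 0 and moved on. */
long stream_pop(stream_t *s)
{
    if (s->cons_pos % WINDOW_ELEMS == 0) {       /* window boundary */
        int next = s->cons_pos / WINDOW_ELEMS;
        while (atomic_load(&s->prod_win) == next)
            ;   /* producer still filling this window; wait */
        atomic_store(&s->cons_win, next);
    }
    long e = s->buf[s->cons_pos];
    s->cons_pos = (s->cons_pos + 1) % STREAM_ELEMS;
    return e;
}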
[0049] The embodiment shown in FIG. 4 is for illustrative purposes only. In other embodiments, more than two processing units 420 may be included in a system and more than one