(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2010/0070730 A1
     Pop et al.                        (43) Pub. Date: Mar. 18, 2010

(54) MINIMIZING MEMORY ACCESS CONFLICTS OF PROCESS COMMUNICATION CHANNELS

(76) Inventors: Sebastian Pop, Austin, TX (US); Jan Sjodin, Austin, TX (US); Harsha Jagasia, Austin, TX (US)

Correspondence Address:
MEYERTONS, HOOD, KIVLIN, KOWERT & GOETZEL (AMD)
P.O. BOX 398
AUSTIN, TX 78767-0398 (US)

(21) Appl. No.: 12/212,370

(22) Filed: Sep. 17, 2008

Publication Classification

(51) Int. Cl.  G06F 12/00  (2006.01)
(52) U.S. Cl.  711/167; 711/E12.001

(57) ABSTRACT
A system and method for minimizing cache conflicts and synchronization support for generated parallel tasks within a compiler framework. A compiler comprises library functions to generate a queue for parallel applications and divide it into windows. A window may be sized to fit within a first-level cache of a processor. Application code with producer and consumer patterns within a loop construct has these patterns split into producer and consumer tasks. Within a producer task loop, a function call is placed for a push operation that modifies a memory location within a producer sliding window without a check for concurrent accesses. A consumer task loop has a similar function call. At the time a producer or consumer task is ready to move, or slide, to an adjacent window, its corresponding function call determines whether the adjacent window is available.
[Front-page figure, Memory Hierarchy 400: main memory 450 with consumer sliding window 462; shared cache memory 440 with consumer sliding window 442 and producer sliding window 444; units 420a-420b, each with cache memories 416a/416b (data copies 418a/418b) and 412a/412b (data copies 414a/414b); processor cores 112a-112b.]
[Sheet 1 of 7, FIG. 1: Processing Node 100, showing shared cache memory subsystem 118; packet processing logic 116; units 115a-115b, each with a cache memory subsystem 114a/114b and a processor core 112a/112b.]
[Sheet 2 of 7, FIG. 2: Method 200: generate source code (210); compile (front-end) source code to intermediate representation (IR) (220); compile (back-end) IR to machine code (230); execute machine code (240).]
[Sheet 3 of 7, FIGS. 3A-3C: Flow Dependences 300. FIG. 3A: written elements 302 linked to read elements 306 by dependences 304. FIG. 3B: written elements 312 linked to read elements 316 by dependences 314, with additional edges from a prior producer and to a later consumer. FIG. 3C: written elements 322 linked to read elements 326 by data dependences 324.]
[Sheet 4 of 7, FIG. 4: Memory Hierarchy 400: main memory 450 with consumer sliding window 462; shared cache memory 440 with consumer sliding window 442 and producer sliding window 444; units 420a-420b with cache memories 416a/416b (data copies 418a/418b) and 412a/412b (data copies 414a/414b); processor cores 112a-112b.]
[Sheet 5 of 7, FIG. 5: Method 500: begin compile (back-end) of IR (502); encounter a loop in the IR (504); are flow dependences computable? (506); if no, use another algorithm on the loop (510); if yes, partition the computations into tasks and loops (508), enclose both loops in compiler directives for concurrent execution (512), generate calls for shared queue management (514), and generate any necessary alignment between producer/consumer tasks (516); finish generating machine code and link libraries into the final binary (518); begin execution (520).]
[Sheet 6 of 7, FIG. 6: Method 600: open producer parallel task (602); encounter first push operation; is stream full? (606); if yes, wait for consumer window to slide (608); push element(s) into window and update pointers (610); execute a loop instruction (620); encounter a push operation? (612); if yes, push element(s) into window and update pointers (614); reached end of loop? (618); if yes, set end-of-stream flag and update pointers (622).]
[Sheet 7 of 7, FIG. 7: Method 700: open consumer parallel task (702); execute a loop instruction (704); encounter first pop operation?; stream empty?; if yes, wait for producer window to slide (710); pop element(s) from window and update pointers (712); encounter a pop operation? (714); if yes, pop element(s) from window and update pointers (716); reached end of loop? (720); encounter an alignment call?; if yes, pop element(s) from window (724) and update pointers (726).]
MINIMIZING MEMORY ACCESS CONFLICTS OF PROCESS COMMUNICATION CHANNELS
BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] This invention relates to computer systems, and more particularly, to minimizing cache conflicts and synchronization support for generated parallel tasks within a compiler framework.

[0003] 2. Description of the Relevant Art
[0004] Both hardware and software determine the performance of computer systems. It is becoming difficult for hardware design to deliver more performance due to cross-capacitance effects on wires, parasitic inductance effects on wires, and electrostatic field effects within transistors, all of which increase on-chip circuit noise and propagation delays. Additionally, continuing decreases in the geometric dimensions of devices and metal routes may increase these effects. Also, the number of switching nodes per clock period increases as more devices are placed on-chip, and, thus, power consumption increases. These noise and power effects limit the operational frequency and, therefore, the performance of the hardware.
[0005] While the reduction in on-chip geometric dimensions discussed above may lead to larger caches and multiple cores placed on each processor, software and software programmers cannot continue to depend on ever-faster hardware to hide inefficient code. In some cases, hardware traits may increase performance if parallel applications, such as multi-threaded applications executed on multi-core processors and chips, ensure that concurrent accesses by multiple threads to shared memory, such as the multiple first-level caches, are synchronized. This synchronization ensures correctness of operations, but it may also limit peak performance. For example, locking mechanisms, such as semaphores, may ensure correctness of operations but may also limit peak performance.
[0006] Some lock-free mechanisms may be used that allow a thread to complete read and write operations to shared memory without regard for the operations of other threads. Each operation may be recorded in a log. Before a commit of the thread occurs, validation is performed. If it is found that other threads concurrently modified the same accessed memory locations, then re-execution is required. This re-execution may limit peak performance.
[0007] Another option for increasing the performance of parallel applications from a software point of view is the management of the use of scratchpad memories, such as concurrent first-in-first-out (FIFO) queues implemented in a cache hierarchy. Many lock-free algorithms are based on compare-and-swap (CAS) operations. The CAS operation takes as arguments the address of a shared memory location, an expected value, and a new value. If the shared location currently holds the expected value, it is assigned the new value atomically (an atomic operation may generally represent one or more operations that have the appearance of a single operation to the rest of the system). A Boolean return value may be used to indicate whether the replacement occurred.
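
As a minimal sketch of these CAS semantics, using the C11 atomics library (the function name cas and its parameter names are illustrative only):

    #include <stdbool.h>
    #include <stdatomic.h>

    /* Sketch: if *addr still holds `expected`, store `new_value`
     * atomically and report success; otherwise leave *addr unchanged. */
    static bool cas(_Atomic int *addr, int expected, int new_value)
    {
        /* The compare and the conditional store occur as one atomic
         * operation; the Boolean result reports whether the
         * replacement occurred, as described above. */
        return atomic_compare_exchange_strong(addr, &expected, new_value);
    }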
[0008] However, CAS algorithms are subject to the "ABA" problem, wherein a process reads a value A from a shared location, computes a new value, and then attempts a CAS operation. Between the read operation and the CAS operation, other processes may change the value in the shared location from A to B, do other work, and then change the value in the shared location back to A again. The first thread is then still executing under the assumption that the value has not changed, even though another thread did work that violates that assumption, and the CAS operation will succeed when it should not. As another example, within a lock-free queue, a data element may be removed from a list within the queue and deleted, and then a new data element is allocated and added to the list. It is common for the newly allocated data element to be at the same location as the deleted data element due to optimizations in the memory manager, so a pointer to the new data element may be equal to a pointer to the old data element.
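
A comment-only timeline makes the hazard concrete; the thread names and values are illustrative:

    /* Shared location initially holds A.
     *
     * Thread 1: reads *addr, sees A, computes new value C
     * Thread 2: CAS(addr, A, B)  -- succeeds; *addr == B
     * Thread 2: ...other work, e.g. recycling the element...
     * Thread 2: CAS(addr, B, A)  -- succeeds; *addr == A again
     * Thread 1: CAS(addr, A, C)  -- succeeds, although the location
     *           changed twice since Thread 1's read; the assumption
     *           "nothing changed" was silently violated.
     */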
[0009] One solution to the above problem is to associate tag bits, or a modification counter, with each data element. The counter is incremented on each successful CAS operation. Due to wrap-around, the problem is reduced, but not eliminated. Some architectures provide a double-word CAS operation, which allows for a larger tag. However, on-chip real estate increases, as do matching-circuitry delays. Another solution performs a CAS operation in a fetch-and-store-modify-CAS sequence, rather than the usual read-modify-CAS sequence. However, this solution makes the algorithm blocking rather than non-blocking.
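
A rough sketch of the modification-counter idea, packing a counter beside the value so that one double-word-sized CAS compares both (names are illustrative; as noted above, counter wrap-around reduces the problem without eliminating it):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdatomic.h>

    /* Pack a 32-bit value and a 32-bit modification counter into one
     * 64-bit word so both are compared and swapped together. */
    static bool tagged_cas(_Atomic uint64_t *addr,
                           uint32_t expected_value, uint32_t expected_tag,
                           uint32_t new_value)
    {
        uint64_t expected = ((uint64_t)expected_tag << 32) | expected_value;
        /* Increment the tag on every successful CAS so that a recycled
         * element holding the same value no longer matches. */
        uint64_t desired  = ((uint64_t)(expected_tag + 1) << 32) | new_value;
        return atomic_compare_exchange_strong(addr, &expected, desired);
    }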
[0010] A wait-free algorithm is both non-blocking and starvation-free: it guarantees that every active process makes progress within a bounded number of time steps. An example of a wait-free algorithm is given in L. Lamport, "Specifying Concurrent Program Modules," ACM Transactions on Programming Languages and Systems, Vol. 5, No. 2, April 1983, pp. 190-222. However, Lamport's wait-free algorithm restricts concurrency to a single enqueued element and a single dequeued element, and it does not reduce the frequency of occurrence of required synchronization.
[0011] In view of the above, efficient methods and mechanisms are desired for improving the generation of parallel tasks from sequential code, and for executing those parallel tasks while minimizing cache conflicts.
SUMMARY OF THE INVENTION

[0012] Systems and methods for minimizing cache conflicts and synchronization support for generated parallel tasks within a compiler framework are contemplated. Application code has producer and consumer patterns in a loop construct divided into two corresponding loops, or tasks. An array's elements, or a subset of the elements, are updated or modified in the producer task. The same array elements, or a subset of them, are read out in the consumer task in order to compute a value for another variable within the loop construct. In one embodiment, a method comprises dividing a stream into windows, wherein a stream is a circular first-in, first-out (FIFO) shared storage queue. Within one window, a producer task is able to modify memory locations within a producer sliding window without checking for concurrent accesses to the corresponding elements.
[0013] Upon filling the producer sliding window, the producer task may move to an adjacent window to continue computational work. However, at this moment of moving, or sliding, to an adjacent window, the producer task verifies that a consumer task is not reading values from this adjacent window. The method performs similar operations for a consumer task, wherein a consumer task is able to read memory locations within a consumer sliding window without checking for concurrent accesses to the corresponding elements. Since synchronization is limited to the times when a producer or a consumer task completes its corresponding operations in a window and is ready to move to an adjacent window, the penalty for synchronization may be reduced.
[0014] In various embodiments, a compiler comprises library functions that may be either placed in an intermediate representation of an application code by back-end compilation or placed in source code by a software programmer. These library functions are configured to generate a stream and divide it into windows. A window is sized both to fit within a first-level cache of a processor and to reduce the chances of eviction from this first-level cache. Within a producer task loop, a function call may be placed for a push operation that modifies a memory location within a producer sliding window without checking for concurrent accesses to the corresponding elements. Likewise, within a consumer task loop, a function call may be placed for a pop operation that reads a memory location within a consumer sliding window without checking for concurrent accesses to the corresponding elements. At the time a window is filled by a producer task, or a window is emptied by a consumer task, the corresponding function call performs a determination of whether or not an adjacent window is available for continued work. If an adjacent window is available, the corresponding task moves, or slides, from its current window to the adjacent window. Otherwise, the corresponding task waits until the adjacent window is available.
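As a rough, self-contained sketch of the shape such library functions might take, consider the following; every name, type, and field here is an illustrative assumption, not the actual interface contemplated by this disclosure:

    #include <stddef.h>
    #include <stdatomic.h>

    /* Hypothetical stream: a circular FIFO queue divided into windows,
     * with the window size chosen to fit in a first-level cache. */
    typedef struct {
        int            *buf;          /* circular storage                */
        size_t          window_size;  /* elements per window             */
        size_t          num_windows;
        _Atomic size_t  prod_window;  /* window owned by the producer    */
        _Atomic size_t  cons_window;  /* window owned by the consumer    */
        size_t          prod_pos;     /* producer's offset in its window */
    } stream_t;

    /* Push one element; no concurrency check while inside the window. */
    static void stream_push(stream_t *s, int v)
    {
        size_t w = atomic_load(&s->prod_window);
        s->buf[w * s->window_size + s->prod_pos] = v;
        if (++s->prod_pos == s->window_size) {
            /* The only synchronization point: before sliding, wait
             * until the consumer does not occupy the adjacent window. */
            size_t next = (w + 1) % s->num_windows;
            while (atomic_load(&s->cons_window) == next)
                ;  /* adjacent window unavailable; wait */
            atomic_store(&s->prod_window, next);
            s->prod_pos = 0;
        }
    }

A consumer-side pop would mirror this sketch, reading freely within its own window and waiting on prod_window only when it is ready to slide.
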
BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a generalized block diagram illustrating one embodiment of an exemplary processing node.

[0016] FIG. 2 is a flow diagram of one embodiment of a static compiler method.

[0017] FIG. 3A is a generalized block diagram illustrating one embodiment of a source code pattern with regular data dependences within an array.

[0018] FIG. 3B is a generalized block diagram illustrating one embodiment of a source code pattern with shifted data dependences within an array.

[0019] FIG. 3C is a generalized block diagram illustrating one embodiment of a source code pattern with irregular data dependences within an array.

[0020] FIG. 4 is a generalized block diagram illustrating one embodiment of a memory hierarchy that supports producer and consumer sliding windows.

[0021] FIG. 5 is a flow diagram illustrating one embodiment of a method for automatic parallelization.

[0022] FIG. 6 is a flow diagram illustrating one embodiment of a method for executing a producer task.

[0023] FIG. 7 is a flow diagram illustrating one embodiment of a method for executing a consumer task.
[0024] While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
DETAILED DESCRIPTION

[0025] In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, one having ordinary skill in the art should recognize that the invention may be practiced without these specific details. In some instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the present invention.
[0026] FIG. 1 is a block diagram of one embodiment of an exemplary processing node 100. Processing node 100 may include memory controller 120, interface logic 140, and one or more processing units 115a-115b. As used herein, elements referred to by a reference numeral followed by a letter may be collectively referred to by the numeral alone. For example, processing units 115a-115b may be collectively referred to as processing units 115. Processing units 115 may include a processor core 112 and a corresponding cache memory subsystem 114. Processing node 100 may further include packet processing logic 116 and a shared cache memory subsystem 118. In one embodiment, the illustrated functionality of processing node 100 is incorporated upon a single integrated circuit.
[0027] Generally, packet processing logic 116 is configured to respond to control packets received on the links to which processing node 100 is coupled, to generate control packets in response to processor cores 112 and/or cache memory subsystems 114, to generate probe commands and response packets in response to transactions selected by memory controller 120 for service, and to route packets for which node 100 is an intermediate node to other nodes through interface logic 140. Interface logic 140 may include logic to receive packets and synchronize the packets to an internal clock used by packet processing logic 116.
[0028] Cache subsystems 114 and 118 may comprise high-speed cache memories configured to store blocks of data. Cache memory subsystems 114 may be integrated within respective processor cores 112. Alternatively, cache memory subsystems 114 may be coupled to processor cores 112 in a backside cache configuration or an inline configuration, as desired. Still further, cache memory subsystems 114 may be implemented as a hierarchy of caches. Caches that are nearer processor cores 112 (within the hierarchy) may be integrated into processor cores 112, if desired. In one embodiment, cache memory subsystems 114 each represent L2 cache structures, and shared cache subsystem 118 represents an L3 cache structure.
[0029] Both the cache memory subsystems 114 and the shared cache memory subsystem 118 may include a cache memory coupled to a corresponding cache controller. For the shared cache memory subsystem 118, the cache controller may include programmable logic in order to programmably enable storage of directory entries within locations of subsystem 118.
[0030] Processor cores 112 include circuitry for executing instructions according to a predefined instruction set. For example, the x86 instruction set architecture may be selected. Alternatively, the Alpha, PowerPC, or any other instruction set architecture may be selected. Generally, the processor cores 112 access the cache memory subsystems 114, respectively, for data and instructions. If the requested block is not found in cache memory subsystem 114 or in shared cache memory subsystem 118, then a read request may be generated and transmitted to the memory controller within the node to which the missing block is mapped.
[0031] Referring to FIG. 2, one embodiment of a static compiler method 200 is shown. Software applications may be written by a designer in a high-level language such as C, C++, Fortran, or another language, as in block 210. This source code may be stored on a computer readable medium. A command instruction, which may be entered at a prompt by a user with any necessary options, may be executed in order to compile the source code. One example of a compiler is the GNU Compiler Collection (GCC), which is a set of compilers produced for various programming languages. However, other examples of compilers are possible and contemplated.
[0032] In block 220, in one embodiment, front-end compilation translates the source code to an intermediate representation (IR). Syntactic and semantic processing, as well as some optimizations, may be performed at this step. Back-end compilation in block 230 translates the IR to machine code. The back-end may perform more transformations and optimizations for a particular computer architecture and processor design. For example, a processor is designed to execute instructions of a particular instruction set architecture (ISA), but the processor may have one or more processor cores. The manner in which a software application is executed (block 240) in order to reach peak performance may differ greatly between a single-, dual-, or quad-core processor. Other designs may have eight cores. Regardless, the manner in which to compile the software application in order to achieve peak performance may need to vary between a single-core and a multi-core processor.
[0033] Locking may be used to prevent potentially overlapped accesses to shared memory, such as the caches in memory subsystems 114 and 118 in FIG. 1. However, locking also reduces performance when cores sit in a wait state until the lock is released. Transactional memory may be used to prevent halted execution. However, if a memory conflict is later found during a validation stage, a particular thread may roll back its operations to the last validated checkpoint, or to the start of the thread, and begin re-execution. In another embodiment, the thread may be aborted and rescheduled for execution at a later time.
[0034] The above examples of synchronization are used to ensure a producer and a consumer do not concurrently access the same memory location. A producer may correspond to a processor core, such as processor core 112a in FIG. 1, supplying data, such as in an array, while executing a task or a thread. A consumer may correspond to a processor core, such as processor core 112b in FIG. 1, retrieving data, such as in an array, while executing a parallel task or a parallel thread. For example, referring to FIG. 3A, one embodiment of a source code pattern with regular data dependences within an array is shown. Automatic parallelization techniques are based on data dependence analysis information.
[0035] Here, an array has its elements updated, such as written elements 302, and subsequently these elements are read out, which is represented by read elements 306. FIG. 3A represents a regular data flow relation in which all the elements written by the producer are read by a same consumer. This one-to-one correlation is depicted by data dependences 304. For example, suppose a designer has written source code that contains the below code segment, now in the IR:

    for (i = 0; i < n; i++) {                  /* line 1 */
        a[i] = variable1 op variable2;
        variable3 = a[i] op variable4;
    }                                          /* line 4 */
[0036] The compiler may split this loop into two parallel loops:

    /* Producer Task */
    for (i = 0; i < n; i++) {                  /* line 5 */
        a[i] = variable1 op variable2;
    }                                          /* line 7 */

    /* Consumer Task */
    for (i = 0; i < n; i++) {                  /* line 9 */
        variable3 = a[i] op variable4;
    }                                          /* line 11 */
[0037] The automatic parallelization techniques of a compiler, such as in a runtime parallelization library, are based on data dependence analysis information. This static analysis determines the relations between memory accesses and allows the analysis of dependences between computations via memory accesses. This, in turn, allows task partitioning, and data and computation privatization. Data dependences represent a relation between two tasks. In the case of flow dependences, the dependence relation is between a task that writes data and another task that reads it.
[0038] In FIG. 3B, only a part of the elements of an array are written, making a part of the read elements dependent on a previous producer. Likewise, it may occur that only a part of the elements of an array are read, making a part of the written elements dependent on a later consumer. Here, an array has its elements partially updated, such as written elements 312, and subsequently these elements are read out, which is represented by partial read elements 316. FIG. 3B represents a shifted data flow relation in which a part of the elements written by the producer are read by a same consumer. This type of shifted data dependence is depicted by data dependences 314. For example, suppose a designer has written source code that contains the below code segment, now in the IR:

    for (i = 0; i < n; i++) {                  /* line 12 */
        a[i+2] = variable1 op variable2;
        variable3 = a[i] op variable4;
    }                                          /* line 15 */
[0039] The compiler may split this loop into two parallel loops:

    /* Producer Task */
    for (i = 1; i <= n; i++) {                 /* line 17 */
        a[i+2] = variable1 op variable2;
    }                                          /* line 19 */

    /* Consumer Task */
    for (i = 1; i <= n; i++) {                 /* line 21 */
        variable3 = a[i] op variable4;
    }
[0040] FIG. 3C represents a more difficult dependence relation that may not be determined at compile time. Here, an array may have its elements fully or partially updated, such as written elements 322, and subsequently these elements are read out, which is represented by partial read elements 326. However, the reading out of the elements may not be in array order in the sequential code. FIG. 3C represents an irregular data flow relation not known at compile time. This type of irregular data dependence is depicted by data dependences 324. The read elements 326 may be accessed by an indirection. For example, suppose a designer has written source code that contains the below code segment, now in the IR:

    for (i = 0; i < n; i++) {                  /* line 24 */
        a[i] = variable1 op variable2;
        variable3 = a[b[i]] op variable4;
    }                                          /* line 27 */
[0041] In the above example, the read elements are accessed by an indirection and, accordingly, the consumer task may then be considered dependent on the completion of the producer task.
[0042] Turning now to FIG. 4, one embodiment of a memory hierarchy 400 that supports producer and consumer sliding windows is shown. Processing units 420 may be similar to processing units 115 of FIG. 1. The same circuitry as in processor cores 112 of FIG. 1 may be used here as well. In one embodiment, the cache subsystem is shown as two levels, 412 and 416, but other embodiments may be used as well. Cache memory 412 may be implemented as an L1 cache structure and may be integrated into processor core 112, if desired. Cache memory 416 may be implemented as an L2 cache structure. Other embodiments are possible and contemplated. Interfaces are not shown here, as they are in FIG. 1, for simpler illustrative purposes. A shared cache memory 440 may be implemented as an L3 cache structure. Here, the shared cache memory 440 is shown as one level, but other levels and implementations are possible.
[0043] Again, an interface, such as a memory controller, to main memory 450 is not shown for simpler illustrative purposes. Main memory 450 may be implemented as dynamic random-access memory (DRAM), dual in-line memory modules (DIMMs), a hard disk, or otherwise.
[0044] The transfer of data between tasks, as shown in FIGS. 3A-3C, may occur via a communication channel called a stream. A stream may be a circular buffer managed as a FIFO concurrent lock-free queue. Concurrent FIFO queues are widely used in parallel applications and operating systems. A stream may be implemented in the memory hierarchy 400, such as in stream copies 440 and 460. The most updated contents of a stream may be in stream copies located closest to a processor core, such as stream copy 440.
[0045] As shown above in FIGS. 3A-3C, a loop in source code with a regular or shifted flow dependence may be split into two parallel tasks, such as a producer task and a consumer task. Each task may be provided a sliding window, or local buffer, within the stream. For example, in one embodiment, processor core 112b may be assigned a producer task as depicted by lines of code 5-7 above, and processor core 112a may be assigned a consumer task as depicted by lines of code 9-11 above. Stream 460 may be created for the parallel tasks in code lines 5-11. Stream 440 may be a more up-to-date copy of stream 460.
[0046] Producer sliding window 444, and correspondingly 464, may be empty and designated for storing data produced by core 112b. In fact, a snapshot in the middle of code execution may show core 112b has produced data for the loop of lines 5-7, which is presently stored in filled space 446. The pointers within stream 440 may have been updated, and core 112b is now allowed to begin filling producer sliding window 444 with new data. More up-to-date copies of producer sliding window 444 may be found in the cache hierarchy, such as data copies 418b and 414b. In fact, the size of the producer sliding window and its copies may be chosen so that the window remains located in the cache closest to processor core 112b with a low chance of being evicted. Therefore, cache conflicts may be reduced.
[0047] While processor core 112b is executing a producer task, producing data for stream copies 440 and 460 and filling data copy 414b to be subsequently sent to producer sliding windows 444 and 464, processor core 112a may be concurrently executing a consumer task, reading data from stream copy 440 and from data copy 414a, which was previously read from consumer sliding window 442. Consumer sliding window 442, and correspondingly 462, may be full and designated for reading data by core 112a. In fact, a snapshot in the middle of code execution may show core 112a is reading data for the loop of lines 9-11, which is presently stored in filled space 446. The pointers within stream 440 may have been updated, and core 112a is now allowed to begin reading consumer sliding window 442, which has new data. More up-to-date copies of consumer sliding window 442 may be found in the cache hierarchy, such as data copies 418a and 414a. The size of the consumer sliding window may be the same as the size of the producer sliding window for ease of implementation. Alternatively, when core 112a has permission to read from a window, due to no overlap of the producer and consumer sliding windows, rather than read data from stream copy 440, core 112a may send a communicative probe to locate the required data. If a copy of the required updated data is in cache 416b or 412b of processing unit 420b, then a copy of the required data may be sent from processing unit 420b to cache 412a.
[0048] As can be seen in FIG. 4, as long as the producer sliding window 444 does not overlap the consumer sliding window 442, both cores 112a-112b may execute in parallel without conflicting memory accesses. In fact, the only penalty for synchronization occurs when a producer sliding window 444 or a consumer sliding window 442 needs to slide within stream 440. A check must be performed to ensure there is no overlap between the windows. Further details are provided later. Of course, consumer sliding window 442 cannot begin reading and sliding until producer sliding window 444 has filled at least one window within stream 440 and subsequently moved to another window within stream 440. This is a small overhead price to be paid during initial execution of the parallel tasks. By allowing cores 112a-112b to concurrently execute on data within copies of sliding windows 442 and 444, and only needing to perform synchronization during times when a window needs to slide, peak performance of a system may be more easily achieved.
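
A minimal sketch of this slide-time overlap check follows; the function and parameter names are illustrative assumptions, and window indices wrap around the circular stream:

    #include <stdbool.h>
    #include <stddef.h>

    /* A task may slide from its current window to the adjacent one only
     * if the other task does not currently occupy that adjacent window. */
    static bool may_slide(size_t my_window, size_t other_window,
                          size_t num_windows)
    {
        size_t next = (my_window + 1) % num_windows;
        return next != other_window;  /* no overlap: safe to slide */
    }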
[0049] The embodiment shown in FIG. 4 is for illustrative purposes only. In other embodiments, more than two processing units 420 may be included in a system and more than one