Cilk: An Efficient Multithreaded Runtime System

Robert D. Blumofe    Christopher F. Joerg    Bradley C. Kuszmaul
Charles E. Leiserson    Keith H. Randall    Yuli Zhou

MIT Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139

Abstract

Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical path" of a Cilk computation can be used to accurately model performance. Consequently, a Cilk programmer can focus on reducing the work and critical path of his computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of "fully strict" (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.

The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship.

1 Introduction

Multithreading has become an increasingly popular way to implement dynamic, highly asynchronous, concurrent programs [1, 8, 9, 10, 11, 12, 15, 19, 21, 22, 24, 25, 28, 33, 34, 36, 39, 40]. A multithreaded system provides the programmer with a means to create, synchronize, and schedule threads. Although the schedulers in many of these runtime systems seem to perform well in practice, none provide users with a guarantee of application performance. Cilk is a runtime system whose work-stealing scheduler is efficient in theory as well as in practice. Moreover, it gives the user an algorithmic model of application performance based on the measures of "work" and "critical path" which can be used to predict the runtime of a Cilk program accurately.

A Cilk multithreaded computation can be viewed as a directed acyclic graph (dag) that unfolds dynamically, as is shown schematically in Figure 1. A Cilk program consists of a collection of Cilk procedures, each of which is broken into a sequence of threads, which form the vertices of the dag. Each thread is a nonblocking C function, which means that it can run to completion without waiting or suspending once it has been invoked.

This research was supported in part by the Advanced Research Projects Agency under Grants N00014-94-1-0985 and N00014-92-J-1310. Robert Blumofe is supported in part by an ARPA High-Performance Computing Graduate Fellowship. Keith Randall is supported in part by a Department of Defense NDSEG Fellowship.

Figure 1: The Cilk model of multithreaded computation. Threads are shown as circles, which are grouped into procedures. Each downward edge corresponds to a spawn of a child, each horizontal edge corresponds to a spawn of a successor, and each upward, curved edge corresponds to a data dependency. The numbers in the figure indicate the levels of procedures in the spawn tree.

As one of the threads from a Cilk procedure runs, it can spawn a child thread which begins a new child procedure. In the figure, downward edges connect threads and their procedures with the children they have spawned. A spawn is like a subroutine call, except that the calling thread may execute concurrently with its child, possibly spawning additional children. Since threads cannot block in the Cilk model, a thread cannot spawn children and then wait for values to be returned. Rather, the thread must additionally spawn a successor thread to receive the children's values when they are produced. A thread and its successors are considered to be parts of the same Cilk procedure. In the figure, sequences of successor threads that form Cilk procedures are connected by horizontal edges. Return values, and other values sent from one thread to another, induce data dependencies among the threads, where a thread receiving a value cannot begin until another thread sends the value. Data dependencies are shown as upward, curved edges in the figure. Thus, a Cilk computation unfolds as a spawn tree composed of procedures and the spawn edges that connect them to their children, but the execution is constrained to follow the precedence relation determined by the dag of threads.

The execution time of any Cilk program on a parallel computer with P processors is constrained by two parameters of the computation: the work and the critical path. The work, denoted T1, is the time used by a one-processor execution of the program, which corresponds to the sum of the execution times of all the threads. The critical path length, denoted T∞, is the total amount of time required by an infinite-processor execution, which corresponds to the largest sum of thread execution times along any path. With P processors, the execution time cannot be less than T1/P or less than T∞. The Cilk scheduler uses "work stealing" [3, 7, 13, 14, 15, 19, 27, 28, 29, 34, 40] to achieve execution time very near to the sum of these two measures.

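Stated compactly in LaTeX notation (a restatement of the two lower bounds and the goal just described, not an additional claim):

    T_P \;\ge\; \max\!\left(\frac{T_1}{P},\; T_\infty\right),
    \qquad\text{while the Cilk scheduler aims for}\qquad
    T_P \;\approx\; \frac{T_1}{P} + T_\infty .
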
Off-line techniques for computing such efficient schedules have been known for a long time [5, 16, 17], but this efficiency has been difficult to achieve on-line in a distributed environment while simultaneously using small amounts of space and communication.

We demonstrate the efficiency of the Cilk scheduler both empirically and analytically. Empirically, we have been able to document that Cilk works well for dynamic, asynchronous, tree-like, MIMD-style computations. To date, the applications we have programmed include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship. Many of these applications pose problems for more traditional parallel environments, such as message passing [38] and data parallel [2, 20], because of the unpredictability of the dynamic workloads on processors. Analytically, we prove that for "fully strict" (well-structured) programs, Cilk's work-stealing scheduler achieves execution space, time, and communication bounds all within a constant factor of optimal. To date, all of the applications that we have coded are fully strict.

The Cilk language is an extension to C that provides an abstraction of threads in explicit continuation-passing style. A Cilk program is preprocessed to C and then linked with a runtime library to run on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, or the MIT Phish [4] network of workstations. In this paper, we focus on the Connection Machine CM5 implementation of Cilk. The Cilk scheduler on the CM5 is written in about 30 pages of C, and it performs communication among processors using the Strata [6] active-message library.

The remainder of this paper is organized as follows. Section 2 describes Cilk's runtime data structures and the C language extensions that are used for programming. Section 3 describes the work-stealing scheduler. Section 4 documents the performance of several Cilk applications. Section 5 shows how the work and critical path of a Cilk computation can be used to model performance. Section 6 shows analytically that the scheduler works well. Finally, Section 7 offers some concluding remarks and describes our plans for the future.

2 The Cilk programming environment and implementation

In this section we describe a C language extension that we have developed to ease the task of coding Cilk programs. We also explain the basic runtime data structures that Cilk uses.

In the Cilk language, a thread T is defined in a manner similar to a C function definition:

    thread T (arg-decls ...) { stmts ... }

The Cilk preprocessor translates T into a C function of one argument and void return type. The one argument is a pointer to a closure data structure, illustrated in Figure 2, which holds the arguments for T. A closure consists of a pointer to the C function for T, a slot for each of the specified arguments, and a join counter indicating the number of missing arguments that need to be supplied before T is ready to run. A closure is ready if it has obtained all of its arguments, and it is waiting if some arguments are missing. To run a ready closure, the Cilk scheduler invokes the thread as a procedure using the closure itself as its sole argument. Within the code for the thread, the arguments are copied out of the closure data structure into local variables. The closure is allocated from a simple runtime heap when it is created, and it is returned to the heap when the thread terminates.

The Cilk language supports a data type called a continuation, which is specified by the type modifier keyword cont. A continuation is essentially a global reference to an empty argument slot of a closure, implemented as a compound data structure containing a pointer to a closure and an offset that designates one of the closure's argument slots. Continuations can be created and passed among threads, which enables threads to communicate and synchronize with each other. Continuations are typed with the C data type of the slot in the closure.

Figure 2: The closure data structure, showing a waiting closure and a ready closure, each with a pointer to the thread's code.

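To make the layout concrete, here is a minimal C sketch of how such a closure and continuation might be declared. The field names and the fixed-size argument array are our own illustrative assumptions; the paper does not specify the actual runtime representation beyond the description above.

    #include <stddef.h>

    #define MAX_ARGS 8              /* illustrative bound; an assumption */

    typedef struct Closure Closure;

    /* Each Cilk thread compiles to a C function taking the closure
       itself as its sole argument and returning void. */
    typedef void (*ThreadFn)(Closure *);

    struct Closure {
        ThreadFn  code;            /* pointer to the C function for T      */
        int       join_counter;    /* argument slots still to be supplied  */
        int       level;           /* spawn depth, used by the ready queue */
        long      args[MAX_ARGS];  /* one slot per declared argument       */
        Closure  *next;            /* link in the ready queue's level list */
    };

    /* A continuation is a global reference to one empty argument slot:
       a pointer to a closure plus the slot's offset. */
    typedef struct {
        Closure *closure;
        int      slot;
    } Continuation;

    /* A closure is ready when all of its arguments have arrived. */
    static int closure_is_ready(const Closure *c)
    {
        return c->join_counter == 0;
    }
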
At runtime, a thread can spawn a child thread by creating a closure for the child. Spawning is specified in the Cilk language as follows:

    spawn T (args ...)

This statement creates a child closure, fills in all available arguments, and initializes the join counter to the number of missing arguments. Available arguments are specified as in C. To specify a missing argument, the user specifies a continuation variable (type cont) preceded by a question mark. For example, if the second argument is ?k, then Cilk sets the variable k to a continuation that refers to the second argument slot of the created closure. If the closure is ready, that is, it has no missing arguments, then spawn causes the closure to be immediately posted to the scheduler for execution. In typical applications, child closures are usually created with no missing arguments.

To create a successor thread, a thread executes the following statement:

    spawn_next T (args ...)

This statement is semantically identical to spawn, but it informs the scheduler that the new closure should be treated as a successor, as opposed to a child. Successor closures are usually created with some missing arguments, which are filled in by values produced by the children.

A Cilk procedure does not ever return values in the normal way to a parent procedure. Instead, the programmer must code the parent procedure as two threads. The first thread spawns the child procedure, passing it a continuation pointing to the successor thread's closure. The child sends its "return" value explicitly as an argument to the waiting successor. This strategy of communicating between threads is called explicit continuation passing. Cilk provides primitives of the following form to send values from one closure to another:

    send_argument (k, value)

This statement sends the value value to the argument slot of a waiting closure specified by the continuation k. The types of the continuation and the value must be compatible. The join counter of the waiting closure is decremented, and if it becomes zero, then the closure is ready and is posted to the scheduler.

Figure 3 shows the familiar recursive Fibonacci procedure written in Cilk. It consists of two threads, fib and its successor sum. Reflecting the explicit continuation passing style that Cilk supports, the first argument to each thread is the continuation specifying where the "return" value should be placed.

    thread fib (cont int k, int n)
    {
        if (n < 2)
            send_argument (k, n);
        else
        {
            cont int x, y;
            spawn_next sum (k, ?x, ?y);
            spawn fib (x, n-1);
            spawn fib (y, n-2);
        }
    }

    thread sum (cont int k, int x, int y)
    {
        send_argument (k, x+y);
    }

Figure 3: A Cilk procedure, consisting of two threads, to compute the nth Fibonacci number.

When the fib function is invoked, it first checks to see if the boundary case has been reached, in which case it uses send_argument to "return" the value of n to the slot specified by continuation k. Otherwise, it spawns the successor thread sum, as well as two children to compute the two subcases. Each of these two children is given a continuation specifying to which argument in the sum thread it should send its result. The sum thread simply adds the two arguments when they arrive and sends this result to the slot designated by k.

Although writing in explicit continuation passing style is somewhat onerous for the programmer, the decision to break procedures into separate nonblocking threads simplifies the Cilk runtime system. Each Cilk thread leaves the C runtime stack empty when it completes. Thus, Cilk can run on top of a vanilla C runtime system. A common alternative [19, 25, 32, 34] is to support a programming style in which a thread suspends whenever it discovers that required values have not yet been computed, resuming when the values become available. When a thread suspends, however, it may leave temporary values on the runtime stack which must be saved, or each thread must have its own runtime stack. Consequently, this alternative strategy requires changes to the runtime system that depend on the C calling stack layout and register usage conventions. Another advantage of Cilk's strategy is that it allows multiple children to be spawned from a single nonblocking thread, which saves on context switching. In Cilk, r children can be spawned and executed with only r + 1 context switches, whereas the alternative of suspending whenever a thread is spawned causes 2r context switches. Since our primary interest is in understanding how to build efficient multithreaded runtime systems, but without redesigning the basic C runtime system, we chose the alternative of burdening the programmer with a requirement which is perhaps less elegant linguistically, but which yields a simple and portable runtime implementation.

Cilk supports a variety of features that give the programmer greater control over runtime performance. For example, when the last action of a thread is to spawn a ready thread, the programmer can use the keyword call instead of spawn, which produces a "tail call" that runs the new thread immediately without invoking the scheduler. Cilk also allows arrays and subarrays to be passed as arguments to closures. Other features include various abilities to override the scheduler's decisions, including on which processor a thread should be placed and how to pack and unpack data when a closure is migrated from one processor to another.

3 The Cilk work-stealing scheduler

Cilk's scheduler uses the technique of work-stealing [3, 7, 13, 14, 15, 19, 27, 28, 29, 34, 40] in which a processor (the thief) who runs out of work selects another processor (the victim) from whom to steal work, and then steals the shallowest ready thread in the victim's spawn tree. Cilk's strategy is for thieves to choose victims at random [3, 27, 37].

At runtime, each processor maintains a local ready queue to hold ready closures. Each closure has an associated level, which corresponds to the number of spawn's (but not spawn_next's) on the path from the root of the spawn tree. The ready queue is an array in which the Lth element contains a linked list of all ready closures having level L.

Cilk begins executing the user program by initializing all ready queues to be empty, placing the root thread into the level-0 list of Processor 0's queue, and then starting a scheduling loop on each processor. Within a scheduling loop, a processor first checks to see whether its ready queue is empty. If it is, the processor commences "work stealing," which will be described shortly. Otherwise, the processor performs the following steps:

1. Remove the thread at the head of the list of the deepest nonempty level in the ready queue.

2. Extract the thread from the closure, and invoke it.

As a thread executes, it may spawn or send arguments to other threads. When the thread terminates, control returns to the scheduling loop.

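To make the loop concrete, here is a minimal C sketch of the per-processor ready queue and scheduling loop just described. It is an illustration under our own assumptions (type and function names, a fixed bound on levels, and a try_to_steal routine sketched at the end of this section), not the actual CM5 implementation.

    #include <stddef.h>

    #define MAX_LEVELS 64               /* illustrative bound on spawn depth */

    typedef struct Closure Closure;
    typedef void (*ThreadFn)(Closure *);

    struct Closure {
        ThreadFn  code;                 /* the C function for the thread */
        int       level;                /* number of spawns from the root */
        Closure  *next;                 /* link within one level's list  */
    };

    /* The ready queue is an array indexed by level; element L holds a
       linked list of all ready closures having level L. */
    typedef struct {
        Closure *head[MAX_LEVELS];
    } ReadyQueue;

    /* Step 1: remove the closure at the head of the deepest nonempty level. */
    static Closure *pop_deepest(ReadyQueue *q)
    {
        for (int L = MAX_LEVELS - 1; L >= 0; L--) {
            if (q->head[L] != NULL) {
                Closure *c = q->head[L];
                q->head[L] = c->next;
                return c;
            }
        }
        return NULL;                    /* queue empty: commence work stealing */
    }

    Closure *try_to_steal(void);        /* sketched at the end of this section */

    /* Each processor runs this loop.  Step 2: invoke the thread, passing
       the closure itself as the sole argument. */
    static void scheduling_loop(ReadyQueue *q)
    {
        for (;;) {
            Closure *c = pop_deepest(q);
            if (c == NULL)
                c = try_to_steal();     /* may come back empty-handed */
            if (c != NULL)
                c->code(c);
        }
    }
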
When a thread at level L spawns a child thread T, the scheduler executes the following operations:

1. Allocate and initialize a closure for T.

2. Copy the available arguments into the closure, initialize any continuations to point to missing arguments, and initialize the join counter to the number of missing arguments.

3. Label the closure with level L + 1.

4. If there are no missing arguments, post the closure to the ready queue by inserting it at the head of the level-(L + 1) list.

Execution of spawn_next is similar, except that the closure is labeled with level L and, if it is ready, posted to the level-L list.

A processor that executes send_argument(k, value) performs the following steps:

1. Find the closure and argument slot referenced by the continuation k.

2. Place value in the argument slot, and decrement the join counter of the closure.

3. If the join counter goes to zero, post the closure to the ready queue at the appropriate level.

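A sketch of these two operations on the closure structure follows, again under illustrative assumptions: the names are ours, argument passing is simplified to an array of words, and missing slots are assumed to come first.

    #include <stddef.h>

    #define MAX_ARGS   8                /* illustrative bounds, as above */
    #define MAX_LEVELS 64

    typedef struct Closure Closure;
    typedef void (*ThreadFn)(Closure *);

    struct Closure {
        ThreadFn  code;
        int       level;                /* spawn depth                */
        int       join_counter;         /* argument slots still empty */
        long      args[MAX_ARGS];
        Closure  *next;
    };

    typedef struct { Closure *closure; int slot; } Continuation;
    typedef struct { Closure *head[MAX_LEVELS]; } ReadyQueue;

    Closure    *alloc_closure(void);    /* simple runtime heap (assumed) */
    ReadyQueue *my_ready_queue(void);   /* this processor's ready queue  */

    /* Post a ready closure at the head of its level's list. */
    static void post(Closure *c)
    {
        ReadyQueue *q = my_ready_queue();
        c->next = q->head[c->level];
        q->head[c->level] = c;
    }

    /* spawn: build a child closure at level L + 1, handing back one
       continuation per missing argument.  spawn_next differs only in
       labeling the closure with level L instead of L + 1. */
    static Closure *do_spawn(ThreadFn T, int L, int nmissing,
                             Continuation *conts)
    {
        Closure *c = alloc_closure();
        c->code = T;
        c->level = L + 1;
        c->join_counter = nmissing;
        for (int i = 0; i < nmissing; i++) {
            conts[i].closure = c;       /* continuation names slot i of c */
            conts[i].slot = i;
        }
        if (c->join_counter == 0)
            post(c);                    /* no missing arguments: ready now */
        return c;
    }

    /* send_argument(k, value): fill the slot; when the join counter
       hits zero, the closure becomes ready and is posted. */
    static void send_argument(Continuation k, long value)
    {
        Closure *c = k.closure;
        c->args[k.slot] = value;
        if (--c->join_counter == 0)
            post(c);
    }
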
When the continuation k refers to a closure on a remote processor, network communication ensues. The processor that initiated the send_argument function sends a message to the remote processor to perform the operations. The only subtlety occurs in step 3. If the closure must be posted, it is posted to the ready queue of the initiating processor, rather than to that of the remote processor. This policy is necessary for the scheduler to be provably good, but as a practical matter, we have also had success with posting the closure to the remote processor's queue, which can sometimes save a few percent in overhead.

If the scheduler attempts to remove a thread from an empty ready queue, the processor becomes a thief and commences work stealing as follows:

1. Select a victim processor uniformly at random.

2. If the victim's ready queue is empty, go to step 1.

3. If the victim's ready queue is nonempty, extract a thread from the tail of the list in the shallowest nonempty level of the ready queue, and invoke it.

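A sketch of the thief's loop, using the Closure, ReadyQueue, and MAX_LEVELS declarations from the sketches above. Direct access to a victim's queue stands in for the request-reply protocol of the real implementation, and rand() stands in for its random number generator; both are simplifying assumptions.

    #include <stdlib.h>                 /* rand() */

    extern int self;                    /* this processor's number     */
    extern int nprocs;                  /* total number of processors  */
    ReadyQueue *queue_of(int proc);     /* victim's queue (assumed)    */

    Closure *try_to_steal(void)
    {
        for (;;) {
            /* Step 1: select a victim uniformly at random. */
            int victim = rand() % nprocs;
            if (victim == self)
                continue;
            /* Steps 2-3: scan from the shallowest level; if the whole
               queue is empty, go back to step 1. */
            ReadyQueue *q = queue_of(victim);
            for (int L = 0; L < MAX_LEVELS; L++) {
                Closure **pp = &q->head[L];
                if (*pp == NULL)
                    continue;
                while ((*pp)->next != NULL)   /* walk to the *tail* ...  */
                    pp = &(*pp)->next;        /* ... of the shallowest list */
                Closure *c = *pp;
                *pp = NULL;                   /* unlink the stolen closure */
                return c;                     /* caller invokes it         */
            }
        }
    }
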
Work stealing is implemented with a simple request-reply communication protocol between the thief and victim.

Why steal work from the shallowest level of the ready queue? The reason is two-fold. First, we would like to steal large amounts of work, and shallow closures are likely to execute for longer than deep ones. Stealing large amounts of work tends to lower the communication cost of the program, because fewer steals are necessary. Second, the closures at the shallowest level of the ready queue are also the ones that are shallowest in the dag, a key fact proven in Section 6. Consequently, if processors are idle, the work they steal tends to make progress along the critical path.

4 Performance of Cilk applications

This section presents several applications that we have used to benchmark the Cilk scheduler. We also present empirical evidence from experiments run on a CM5 supercomputer to document the efficiency of our work-stealing scheduler. The CM5 is a massively parallel computer based on 32MHz SPARC processors with a fat-tree interconnection network [30]. The applications are described below:

• fib is the same as was presented in Section 2, except that the second recursive spawn is replaced by a "tail call" that avoids the scheduler. This program is a good measure of Cilk overhead, because the thread length is so small.

• queens is a backtrack search program that solves the problem of placing N queens on an N × N chessboard so that no two queens attack each other. The Cilk program is based on serial code by R. Sargent of the MIT Media Laboratory. Thread length was enhanced by serializing the bottom 7 levels of the search tree.

• pfold is a protein-folding program [35] written in conjunction with V. Pande of MIT's Center for Material Sciences and Engineering. This program finds hamiltonian paths in a three-dimensional grid using backtrack search. It was the first program to enumerate all hamiltonian paths in a 3 × 4 × 4 grid. We timed the enumeration of all paths starting with a certain sequence.

• ray is a parallel program for graphics rendering based on the serial POV-Ray program, which uses a ray-tracing algorithm. The entire POV-Ray system contains over 20,000 lines of C code, but the core of POV-Ray is a simple doubly nested loop that iterates over each pixel in a two-dimensional image. For ray we converted the nested loops into a 4-ary divide-and-conquer control structure using spawns.¹ Our measurements do not include the approximately 2.4 seconds of startup time required to read and process the scene description file.

• knary(k,n,r) is a synthetic benchmark whose parameters can be set to produce a variety of values for work and critical path. It generates a tree of branching factor k and depth n in which the first r children at every level are executed serially and the remainder are executed in parallel. At each node of the tree, the program runs an empty "for" loop for 400 iterations. (A C sketch of this computation appears after this list.)

• *Socrates is a parallel chess program that uses the Jamboree search algorithm [23, 29] to parallelize a minmax tree search. The work of the algorithm varies with the number of processors, because it does speculative work that may be aborted during runtime. *Socrates is a production-quality program that won third prize in the 1994 ACM International Computer Chess Championship running on the 512-node CM5 in the National Center for Supercomputing Applications at the University of Illinois, Urbana-Champaign.

¹ Initially, the serial POV-Ray program was about 5 percent slower than the Cilk version running on one processor. The reason was that the divide-and-conquer decomposition performed by the Cilk code provides better locality than the doubly nested loop of the serial code. Modifying the serial code to imitate the Cilk decomposition improved its performance. Timings for the improved version are given in the table.

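To make the benchmark's structure concrete, here is a plain C sketch of the computation knary(k, n, r) performs, reconstructed from the description above and written as an ordinary recursive function; comments mark which calls the Cilk version spawns in parallel.

    /* knary(k, n, r): a tree of branching factor k and depth n; the
       first r children at each level run serially, the rest in parallel. */
    void knary_node(int k, int depth, int r)
    {
        volatile int i;                 /* volatile keeps the empty loop */
        for (i = 0; i < 400; i++)       /* the empty "for" loop at each node */
            ;
        if (depth <= 1)
            return;
        for (int child = 0; child < k; child++) {
            /* In the Cilk version, children with child < r execute
               serially here, while the remaining k - r children are
               spawned to run in parallel. */
            knary_node(k, depth - 1, r);
        }
    }
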
Table 4 shows typical performance measures for these Cilk applications. Each column presents data from a single run of a benchmark application. We adopt the following notations, which are used in the table. For each application, we have an efficient serial C implementation, compiled using gcc -O2, whose measured runtime is denoted Tserial. The work T1 is the measured execution time for the Cilk program running on a single node of the CM5.² The critical path length T∞ of the Cilk computation is measured by timestamping each thread and does not include scheduling or communication costs. The measured P-processor execution time of the Cilk program running on the CM5 is given by TP, which includes all scheduling and communication costs. The row labeled "threads" indicates the number of threads executed, and "thread length" is the average thread length (work divided by the number of threads).

Certain derived parameters are also displayed in the table. The ratio Tserial/T1 is the efficiency of the Cilk program relative to the C program. The ratio T1/T∞ is the average parallelism. The value T1/P + T∞ is a simple model of the runtime, which will be discussed in the next section. The speedup is T1/TP, and the parallel efficiency is T1/(P · TP). The row labeled "space/proc." indicates the maximum number of closures allocated at any time on any processor. The row labeled "requests/proc." indicates the average number of steal requests made by a processor during the execution, and "steals/proc." gives the average number of closures actually stolen.

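As a worked instance of the model, plugging the queens numbers from Table 4 into it gives

    \frac{T_1}{P} + T_\infty \;=\; \frac{254.6}{32} + 0.0345 \;\approx\; 7.99 \text{ seconds},

which is within one percent of the measured 32-processor runtime TP = 8.012 seconds.
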
The data in Table 4 shows two important relationships: one between efficiency and thread length, and another between speedup and average parallelism.

Considering the relationship between efficiency Tserial/T1 and thread length, we see that for programs with moderately long threads, the Cilk scheduler induces very little overhead. The queens, pfold, ray, and knary programs have threads with average length greater than 50 microseconds and have efficiency greater than 90 percent. On the other hand, the fib program has low efficiency, because the threads are so short: fib does almost nothing besides spawn and send_argument.

`Despite it’s long threads, the *Socrates program has low efficiency,
`because its parallel Jamboree search algorithm [29]
`is based on specu-
`latively searching subtrees that are not searched by a serial algorithm.
`Consequently, as we increase the number of processors,
`the program
`executes more threads and, hence, does more work.
`For example,
`the 256-processor execution did 7023 seconds of work whereas the
`32-processor execution did only 3644 seconds of work. Both of these
`executions did considerably more work than the serial program’s 1665
`seconds of work. Thus, although we observe low efficiency,
`it is due
`to the parallel algorithm and not to Cilk overhead.
Looking at the speedup T1/TP measured on 32 and 256 processors, we see that when the average parallelism T1/T∞ is large compared with the number P of processors, Cilk programs achieve nearly perfect linear speedup, but when the average parallelism is small, the speedup is much less. The fib, queens, pfold, and ray programs, for example, have in excess of 7000-fold parallelism and achieve more than 99 percent of perfect linear speedup on 32 processors and more than 95 percent of perfect linear speedup on 256 processors.³ The *Socrates program exhibits somewhat less parallelism and also somewhat less speedup.

² For the *Socrates program, T1 is not the measured execution time, but rather it is an estimate of the work obtained by summing the execution times of all threads, which yields a slight underestimate. *Socrates is an unusually complicated application, because its speculative execution yields unpredictable work and critical path. Consequently, the measured runtime on one processor does not accurately reflect the work on P > 1 processors.

³ In fact, the ray program achieves superlinear speedup even when comparing to the efficient serial implementation. We suspect that cache effects cause this phenomenon.

                    fib        queens    pfold      ray        knary      knary      *Socrates   *Socrates
                    (33)       (15)      (3,3,4)    (500,500)  (10,5,2)   (10,4,1)   (depth 10,  (depth 10,
                                                                                      32 proc.)   256 proc.)

(application parameters)
Tserial             8.487      252.1     615.15     729.2      288.6      40.993     1665        1665
T1                  73.16      254.6     647.8      732.5      314.6      45.43      3644        7023
Tserial/T1          0.116      0.9902    0.9496     0.9955     0.9174     0.9023     0.4569      0.2371
T∞                  0.000326   0.0345    0.04354    0.0415     4.458      0.255      3.134       3.24
T1/T∞               224417     7380      14879      17650      70.56      178.2      1163        2168
threads             17,108,660 210,740   9,515,098  424,475    5,859,374  873,812    26,151,774  51,685,823
thread length       4.276µs    1208µs    68.08µs    1726µs     53.69µs    51.99µs    139.3µs     135.9µs

(32-processor experiments)
TP                  2.298      8.012     20.26      21.68      15.13      1.633      126.1       -
T1/P + T∞           2.287      7.991     20.29      22.93      14.28      1.675      117.0       -
T1/TP               31.84      31.78     31.97      33.79      20.78      27.81      28.90       -
T1/(P·TP)           0.9951     0.9930    0.9992     1.0558     0.6495     0.8692     0.9030      -
space/proc.         70         95        47         41         39         42         386         -
requests/proc.      185.8      48.0      88.6       218.1      92639      3127       23484       -
steals/proc.        56.63      18.47     26.06      79.25      18031      1034       2395        -

(256-processor experiments)
TP                  0.2892     1.045     2.590      2.765      8.590      0.4636     -           34.32
T1/P + T∞           0.2861     1.029     2.574      2.903      5.687      0.4325     -           30.67
T1/TP               253.0      243.7     250.1      265.0      36.62      98.00      -           204.6
T1/(P·TP)           0.9882     0.9519    0.9771     1.035      0.1431     0.3828     -           0.7993
space/proc.         66         76        47         48         32         40         -           405
requests/proc.      73.66      80.40     97.79      82.75      151803     7527       -           30646
steals/proc.        24.10      21.20     23.05      18.34      6378       550        -           1540

Table 4: Performance of Cilk on various applications. All times are in seconds, except where noted.

On 32 processors the *Socrates program has 1163-fold parallelism, yielding 90 percent of perfect linear speedup, while on 256 processors it has 2168-fold parallelism yielding 80 percent of perfect linear speedup. With even less parallelism, as exhibited in the knary benchmarks, less speedup is obtained. For example, the knary(10,5,2) benchmark exhibits only 70-fold parallelism, and it realizes barely more than 20-fold speedup on 32 processors (less than 65 percent of perfect linear speedup). With 178-fold parallelism, knary(10,4,1) achieves 27-fold speedup on 32 processors (87 percent of perfect linear speedup), but only 98-fold speedup on 256 processors (38 percent of perfect linear speedup).

Although these speedup measures reflect the Cilk scheduler's ability to exploit parallelism, to obtain application speedup, we must factor in the efficiency of the Cilk program compared with the serial C program. Specifically, the application speedup Tserial/TP is the product of efficiency Tserial/T1 and speedup T1/TP. For example, applications such as fib and *Socrates with low efficiency generate correspondingly low application speedup. The *Socrates program, with efficiency 0.2371 and speedup 204.6 on 256 processors, exhibits application speedup of 0.2371 · 204.6 = 48.51. For the purpose of performance prediction, we prefer to decouple the efficiency of the application from the efficiency of the scheduler.

Looking more carefully at the cost of a spawn in Cilk, we find that it takes a fixed overhead of about 50 cycles to allocate and initialize a closure, plus about 8 cycles for each word argument. In comparison, a C function call on a CM5 processor takes 2 cycles of fixed overhead (assuming no register window overflow) plus 1 cycle for each word argument (assuming all arguments are transferred in registers). Thus, a spawn in Cilk is roughly an order of magnitude more expensive than a C function call. This Cilk overhead is quite apparent in the fib program, which does almost nothing besides spawn and send_argument. Based on fib's measured efficiency of 0.116, we can conclude that the aggregate average cost of a spawn/send_argument in Cilk is between 8 and 9 times the cost of a function call/return in C.

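The arithmetic behind that conclusion: fib performs essentially one spawn/send_argument pair in place of each C call/return, so the cost ratio is just the inverse of the measured efficiency,

    \frac{T_1}{T_{\mathrm{serial}}} \;=\; \frac{1}{0.116} \;\approx\; 8.6 .
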
Efficient execution of programs with short threads requires a low-overhead spawn operation. As can be observed from Table 4, the vast majority of threads execute on the same processor on which they are spawned. For example, the fib program executed over 17 million threads but migrated only 6170 (24.10 per processor) when run with 256 processors. Taking advantage of this property, other researchers [25, 32] have developed techniques for implementing spawns such that when the child thread executes on the same processor as its parent, the cost of the spawn operation is roughly equal to the cost of a C function call. We hope to incorporate such techniques into future implementations of Cilk.

Finally, we make two observations about the space and communication measures in Table 4.

Looking at the "space/proc." rows, we observe that the space per processor is generally quite small and does not grow with the number of processors. For example, *Socrates on 32 processors executes over 26 million threads, yet no processor ever has more than 386 allocated closures. On 256 processors, the number of executed threads nearly doubles to over 51 million, but the space per processor barely changes. In Section 6 we show formally that for Cilk programs, the space per processor does not grow as we add processors.

Looking at the "requests/proc." and "steals/proc." rows in Table 4, we observe that the amount of communication grows with the critical path but does not grow with the work. For example, fib, queens, pfold, and ray all have critical paths under a tenth of a second long and perform fewer than 220 requests and 80 steals per processor, whereas knary(10,5,2) and *Socrates have critical paths more than 3 seconds long and perform more than 20,000 requests and 1500 steals per processor. The table does not show any clear correlation between work and either requests or steals. For example,
ray does more than twice as much work as knary(10,5,2), yet it performs far fewer requests and steals per processor.