`
Cilk: An Efficient Multithreaded Runtime System

Robert D. Blumofe    Christopher F. Joerg    Bradley C. Kuszmaul
Charles E. Leiserson    Keith H. Randall    Yuli Zhou

MIT Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139
`
`Abstract
`
Cilk (pronounced “silk”) is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to accurately model performance. Consequently, a Cilk programmer can focus on reducing the work and critical path of his computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.

The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship.
`
`1
`
`Introduction
`
Multithreading has become an increasingly popular way to implement dynamic, highly asynchronous, concurrent programs [1, 8, 9, 10, 11, 12, 15, 19, 21, 22, 24, 25, 28, 33, 34, 36, 39, 40]. A multithreaded system provides the programmer with a means to create, synchronize, and schedule threads. Although the schedulers in many of these runtime systems seem to perform well in practice, none provide users with a guarantee of application performance. Cilk is a runtime system whose work-stealing scheduler is efficient in theory as well as in practice. Moreover, it gives the user an algorithmic model of application performance based on the measures of “work” and “critical path,” which can be used to predict the runtime of a Cilk program accurately.

A Cilk multithreaded computation can be viewed as a directed acyclic graph (dag) that unfolds dynamically, as is shown schematically in Figure 1. A Cilk program consists of a collection of Cilk procedures, each of which is broken into a sequence of threads, which form the vertices of the dag. Each thread is a nonblocking C function, which means that it can run to completion without waiting or suspending once it has been invoked.
`
This research was supported in part by the Advanced Research Projects Agency under Grants N00014-94-1-0985 and N00014-92-J-1310. Robert Blumofe is supported in part by an ARPA High-Performance Computing Graduate Fellowship. Keith Randall is supported in part by a Department of Defense NDSEG Fellowship.
`
`
Figure 1: The Cilk model of multithreaded computation. Threads are shown as circles, which are grouped into procedures. Each downward edge corresponds to a spawn of a child, each horizontal edge corresponds to a spawn of a successor, and each upward, curved edge corresponds to a data dependency. The numbers in the figure indicate the levels of procedures in the spawn tree.
`
As one of the threads from a Cilk procedure runs, it can spawn a child thread which begins a new child procedure. In the figure, downward edges connect threads and their procedures with the children they have spawned. A spawn is like a subroutine call, except that the calling thread may execute concurrently with its child, possibly spawning additional children. Since threads cannot block in the Cilk model, a thread cannot spawn children and then wait for values to be returned. Rather, the thread must additionally spawn a successor thread to receive the children's return values when they are produced. A thread and its successors are considered to be parts of the same Cilk procedure. In the figure, sequences of successor threads that form Cilk procedures are connected by horizontal edges. Return values, and other values sent from one thread to another, induce data dependencies among the threads, where a thread receiving a value cannot begin until another thread sends the value. Data dependencies are shown as upward, curved edges in the figure. Thus, a Cilk computation unfolds as a spawn tree composed of procedures and the spawn edges that connect them to their children, but the execution is constrained to follow the precedence relation determined by the dag of threads.
The execution time of any Cilk program on a parallel computer with P processors is constrained by two parameters of the computation: the work and the critical path. The work, denoted T1, is the time used by a one-processor execution of the program, which corresponds to the sum of the execution times of all the threads. The critical path length, denoted T∞, is the total amount of time required by an infinite-processor execution, which corresponds to the largest sum of thread execution times along any path. With P processors, the execution time cannot be less than T1/P or less than T∞. The Cilk scheduler uses “work stealing” [3, 7, 13, 14, 15, 19, 27, 28, 29, 34, 40] to achieve execution time very near to the sum of these two measures.
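As a concrete illustration of how well this simple model works (using the queens measurements from Table 4 in Section 4): with work T1 = 254.6 seconds and critical path T∞ = 0.0345 seconds, the prediction for P = 32 processors is T1/P + T∞ = 254.6/32 + 0.0345 ≈ 7.99 seconds, within one percent of the measured 32-processor runtime TP = 8.012 seconds.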
`
`
`
`
`Off-line techniques for computing such efficient schedules have been
`known for a long time [5, 16, 17], but this efficiency has been difficult
`to achieve on-line in a distributed environment while simultaneously
`using small amounts of space and communication.
We demonstrate the efficiency of the Cilk scheduler both empirically and analytically. Empirically, we have been able to document that Cilk works well for dynamic, asynchronous, tree-like, MIMD-style computations. To date, the applications we have programmed include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship. Many of these applications pose problems for more traditional parallel environments, such as message passing [38] and data parallel [2, 20], because of the unpredictability of the dynamic workloads on processors. Analytically, we prove that for “fully strict” (well-structured) programs, Cilk's work-stealing scheduler achieves execution space, time, and communication bounds all within a constant factor of optimal. To date, all of the applications that we have coded are fully strict.
The Cilk language is an extension to C that provides an abstraction of threads in explicit continuation-passing style. A Cilk program is preprocessed to C and then linked with a runtime library to run on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, or the MIT Phish [4] network of workstations. In this paper, we focus on the Connection Machine CM5 implementation of Cilk. The Cilk scheduler on the CM5 is written in about 30 pages of C, and it performs communication among processors using the Strata [6] active-message library.
The remainder of this paper is organized as follows. Section 2 describes Cilk's runtime data structures and the C language extensions that are used for programming. Section 3 describes the work-stealing scheduler. Section 4 documents the performance of several Cilk applications. Section 5 shows how the work and critical path of a Cilk computation can be used to model performance. Section 6 shows analytically that the scheduler works well. Finally, Section 7 offers some concluding remarks and describes our plans for the future.
`
`2
`
`The Cilk
`tion
`
`programming
`
`environment
`
`and
`
`implementa-
`
In this section we describe a C language extension that we have developed to ease the task of coding Cilk programs. We also explain the basic runtime data structures that Cilk uses.

In the Cilk language, a thread T is defined in a manner similar to a C function definition:

    thread T (arg-decls ...) { stmts ... }
`
The Cilk preprocessor translates T into a C function of one argument and void return type. The one argument is a pointer to a closure data structure, illustrated in Figure 2, which holds the arguments for T. A closure consists of a pointer to the C function for T, a slot for each of the specified arguments, and a join counter indicating the number of missing arguments that need to be supplied before T is ready to run. A closure is ready if it has obtained all of its arguments, and it is waiting if some arguments are missing. To run a ready closure, the Cilk scheduler invokes the thread as a procedure using the closure itself as its sole argument. Within the code for the thread, the arguments are copied out of the closure data structure into local variables. The closure is allocated from a simple runtime heap when it is created, and it is returned to the heap when the thread terminates.
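To make this layout concrete, the following C sketch shows one plausible representation of a closure; the field names are ours, not those of the Cilk sources, and the real runtime surely stores additional bookkeeping, so treat this as illustration rather than the actual data structure.

    /* One plausible closure layout (a sketch; field names are ours). */
    typedef struct Closure {
        void (*code)(struct Closure *);  /* the C function for thread T */
        int join_counter;                /* number of missing arguments */
        /* argument slots for T follow, one per declared argument */
    } Closure;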
The Cilk language supports a data type called a continuation, which is specified by the type modifier keyword cont. A continuation is essentially a global reference to an empty argument slot of a closure, implemented as a compound data structure containing a pointer to a closure and an offset that designates one of the closure's argument slots.
`
`208
`
Figure 2: The closure data structure, showing a waiting closure and a ready closure, each with its code pointer and argument slots.
`
Continuations can be created and passed among threads, which enables threads to communicate and synchronize with each other. Continuations are typed with the C data type of the slot in the closure.
At runtime, a thread can spawn a child thread by creating a closure for the child. Spawning is specified in the Cilk language as follows:

    spawn T (args ...)

This statement creates a child closure, fills in all available arguments, and initializes the join counter to the number of missing arguments. Available arguments are specified as in C. To specify a missing argument, the user specifies a continuation variable (type cont) preceded by a question mark. For example, if the second argument is ?k, then Cilk sets the variable k to a continuation that refers to the second argument slot of the created closure. If the closure is ready, that is, it has no missing arguments, then spawn causes the closure to be immediately posted to the scheduler for execution. In typical applications, child closures are usually created with no missing arguments.
To create a successor thread, a thread executes the following statement:

    spawn_next T (args ...)

This statement is semantically identical to spawn, but it informs the scheduler that the new closure should be treated as a successor, as opposed to a child. Successor closures are usually created with some missing arguments, which are filled in by values produced by the children.
A Cilk procedure does not ever return values in the normal way to a parent procedure. Instead, the programmer must code the parent procedure as two threads. The first thread spawns the child procedure, passing it a continuation pointing to the successor thread's closure. The child sends its “return” value explicitly as an argument to the waiting successor. This strategy of communicating between threads is called explicit continuation passing. Cilk provides primitives of the following form to send values from one closure to another:
`
    send_argument (k, value)

This statement sends the value value to the argument slot of a waiting closure specified by the continuation k. The types of the continuation and the value must be compatible. The join counter of the waiting closure is decremented, and if it becomes zero, then the closure is ready and is posted to the scheduler.
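In terms of the closure layout sketched above, the local-case semantics of send_argument can be summarized by the following C sketch; the Continuation type shown here, the slot arithmetic, and the helper post_ready are our assumptions, not the actual Cilk routines, and the real runtime must also handle continuations that refer to closures on remote processors (see Section 3).

    /* Sketch of send_argument; assumes the Closure layout sketched
       earlier and a hypothetical helper post_ready().              */
    typedef struct {
        Closure *closure;   /* closure containing the empty slot */
        int      slot;      /* index of the designated slot      */
    } Continuation;

    extern void post_ready(Closure *closure);  /* hypothetical */

    void send_argument(Continuation k, int value)
    {
        int *slots = (int *)(k.closure + 1);  /* slots follow the header */
        slots[k.slot] = value;                /* fill the empty slot     */
        if (--k.closure->join_counter == 0)   /* last missing argument?  */
            post_ready(k.closure);            /* closure becomes ready   */
    }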
Figure 3 shows the familiar recursive Fibonacci procedure written in Cilk. It consists of two threads, fib and its successor sum. Reflecting the explicit continuation-passing style that Cilk supports, the first argument to each thread is the continuation specifying where the “return” value should be placed.
`
`
`
`
`3
`
`The Cilk work-stealing
`
`scheduler
`
`[3,7, 13, 14, 15,
`Cilk’s scheduler uses the technique of work-stealing
`19,27,28,29,34,
`40]
`in which a processor (the thief) who runs out of
`work selects another processor (the victim)
`from whom to steal work,
`and then steals the shallowest
`ready thread in the victim’s
`spawn tree.
`Cilk’s strategy is for thieves to choose victims at random [3, 27, 37].
`ready queue to hold
`At runtime, each processor maintains a local
`ready closures. Each closure has an associated level, which corre-
`sponds to the number of spawn’s (but not spawnnext
`‘s) on the path
`from the root of the spawn tree. The ready queue is an array in which
`the .Ltb element contains a linked list of all
`ready closures having
`level L.
`ready
`all
`Cilk begins executing the user program by initializing
`queues to be empty, placing the root
`thread into the level-O list of
`Processor O’s queue, and then starting a scheduling
`loop on each
`processor. Within a scheduling loop, a processor
`first checks to see
`whether
`its ready queue is empty.
`If
`it
`is, the processor commences
`“work
`stealing,” which will be described shortly.
`Otherwise,
`the
`processor performs the following
`steps:
`1. Remove the thread at the head of the list of the deepest nonempty
`level
`in the ready queue.
`2. Extract
`the thread from the closure, and invoke it.
`As a thread executes, it may spawn or send arguments to other threads,
`When the thread terminates, control
`returns to the scheduling loop.
`When a thread at level L spawns a child thread T,
`the scheduler
`executes the following
`operations:
`1. Allocate and initialize
`a closure for T.
`initialize
`2. Copy the available arguments into the closure,
`continuations
`to point
`to missing arguments, and initialize
`join counter
`to the number of missing arguments.
`3. Label
`the closure with level L + 1.
`4.
`If
`there are no missing arguments, post the closure to the ready
`+ 1)list.
`queue by inserting it at the head of the level-(L
`Execution of spawnnext
`is similar, except
`that the closure is labeled
`with level L and, if
`it is ready, posted to the level-L list.
`value) performs the
`A processor that executes send-argument(k,
`following
`steps:
`
`any
`the
`
`the join counter
`
`1. Find the closure and argument slot referenced by the continua
`tion k.
`2. Place value in the argument slot, and decrement
`of the cIosure.
`If
`the join counter goes to zero, post
`queue at the appropriate level.
`
`3.
`
`the closure to the ready
`
`k refers to a closure on a remote processor,
`When the continuation
`ensues.
`The processor
`that
`initiated
`ths
`network
`communication
`sends a message to the remote processor
`send-argument
`function
`to perform the operations. The only subtlety occurs in step 3.
`If
`the
`closure must be posted,
`it is posted to the ready queue of the initiating
`processor,
`rather than to that of
`the remote processor. This policy is
`necessary for
`the scheduler
`to be provably good, but as a practical
`matter, we have also had success with posting the closure to the re-
`mote processor’s queue, which can sometimes save a few percent
`in
`overhead.
`If
`the scheduler attempts to remove a thread from an empty ready
`queue,
`the processor becomes a thief and commences work
`srealing
`as follows:
`at random.
`1. Select a victim processor uniformly
`2.
`If
`the victim’s
`ready queue is empty, go to step 1.
`3.
`If
`the victim’s
`ready queue is nonempty, extract a thread from
`the tail of
`the list
`in the shallowest
`nonempty
`level of
`the ready
`queue, and invoke
`it.
`
    thread fib (cont int k, int n)
    {
        if (n < 2)
            send_argument (k, n);
        else
        {
            cont int x, y;
            spawn_next sum (k, ?x, ?y);
            spawn fib (x, n-1);
            spawn fib (y, n-2);
        }
    }

    thread sum (cont int k, int x, int y)
    {
        send_argument (k, x+y);
    }

Figure 3: A Cilk procedure, consisting of two threads, to compute the nth Fibonacci number.
`
When the fib function is invoked, it first checks to see if the boundary case has been reached, in which case it uses send_argument to “return” the value of n to the slot specified by continuation k. Otherwise, it spawns the successor thread sum, as well as two children to compute the two subcases. Each of these two children is given a continuation specifying to which argument in the sum thread it should send its result. The sum thread simply adds the two arguments when they arrive and sends this result to the slot designated by k.
Although writing in explicit continuation-passing style is somewhat onerous for the programmer, the decision to break procedures into separate nonblocking threads simplifies the Cilk runtime system. Each Cilk thread leaves the C runtime stack empty when it completes. Thus, Cilk can run on top of a vanilla C runtime system. A common alternative [19, 25, 32, 34] is to support a programming style in which a thread suspends whenever it discovers that required values have not yet been computed, resuming when the values become available. When a thread suspends, however, it may leave temporary values on the runtime stack which must be saved, or each thread must have its own runtime stack. Consequently, this alternative strategy requires changes to the runtime system that depend on the C calling stack layout and register usage conventions. Another advantage of Cilk's strategy is that it allows multiple children to be spawned from a single nonblocking thread, which saves on context switching. In Cilk, r children can be spawned and executed with only r + 1 context switches, whereas the alternative of suspending whenever a thread is spawned causes 2r context switches. Since our primary interest is in understanding how to build efficient multithreaded runtime systems, but without redesigning the basic C runtime system, we chose the alternative of burdening the programmer with a requirement which is perhaps less elegant linguistically, but which yields a simple and portable runtime implementation.
Cilk supports a variety of features that give the programmer greater control over runtime performance. For example, when the last action of a thread is to spawn a ready thread, the programmer can use the keyword call instead of spawn that produces a “tail call” to run the new thread immediately without invoking the scheduler. Cilk also allows arrays and subarrays to be passed as arguments to closures. Other features include various abilities to override the scheduler's decisions, including on which processor a thread should be placed and how to pack and unpack data when a closure is migrated from one processor to another.
`
`209
`
`SAMSUNG-1014
`Page 3 of 10
`
`
`
`communi-
`
`Work stealing is implemented with a simple request-reply
`cation protocol between the thief and victim.
`Why steal work from the shallowest
`level of the ready queue? The
`reason is two-fold.
`First, we would like to steal large amounts of work,
`and shallow closures are likely to execute for longer
`than deep ones.
`Stealing large amounts of work tends to lower
`the communication
`cost of the program, because fewer steals are necessary. Second,
`the
`closures at the shallowest
`level of the ready queue are also the ones that
`are shallowest
`in the dag, a key fact proven in Section 6. Consequently,
`If processors are idle,
`the work they steal tends to make progress along
`the critical path.
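To summarize the control flow, here is a C sketch of the per-processor scheduling loop; the helper names (pop_deepest, steal_shallowest, random_processor) are our own shorthand for the operations described above, not the actual Cilk routines, and the Closure type is the one sketched in Section 2.

    /* Hypothetical helpers; names are ours, not Cilk's. */
    extern Closure *pop_deepest(void);            /* head of deepest nonempty
                                                     level, or NULL if empty */
    extern Closure *steal_shallowest(int victim); /* tail of victim's shallowest
                                                     nonempty level, or NULL  */
    extern int random_processor(void);

    void scheduler(void)
    {
        for (;;) {
            Closure *c = pop_deepest();           /* try the local queue first */
            while (c == NULL) {                   /* queue empty: become a     */
                int victim = random_processor();  /* thief and retry until a   */
                c = steal_shallowest(victim);     /* steal succeeds            */
            }
            c->code(c);   /* invoke the thread with its closure as argument */
        }
    }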
`
`4
`
`Performance
`
`of Cilk applications
`
This section presents several applications that we have used to benchmark the Cilk scheduler. We also present empirical evidence from experiments run on a CM5 supercomputer to document the efficiency of our work-stealing scheduler. The CM5 is a massively parallel computer based on 32MHz SPARC processors with a fat-tree interconnection network [30]. The applications are described below:
• fib is the same as was presented in Section 2, except that the second recursive spawn is replaced by a “tail call” that avoids the scheduler. This program is a good measure of Cilk overhead, because the thread length is so small.
`
• queens is a backtrack search program that solves the problem of placing N queens on an N × N chessboard so that no two queens attack each other. The Cilk program is based on serial code by R. Sargent of the MIT Media Laboratory. Thread length was enhanced by serializing the bottom 7 levels of the search tree.
`
• pfold is a protein-folding program [35] written in conjunction with V. Pande of MIT's Center for Material Sciences and Engineering. This program finds hamiltonian paths in a three-dimensional grid using backtrack search. It was the first program to enumerate all hamiltonian paths in a 3 × 4 × 4 grid. We timed the enumeration of all paths starting with a certain sequence.
• ray is a parallel program for graphics rendering based on the serial POV-Ray program, which uses a ray-tracing algorithm. The entire POV-Ray system contains over 20,000 lines of C code, but the core of POV-Ray is a simple doubly nested loop that iterates over each pixel in a two-dimensional image. For ray we converted the nested loops into a 4-ary divide-and-conquer control structure using spawns.¹ Our measurements do not include the approximately 2.4 seconds of startup time required to read and process the scene description file.
• knary(k, n, r) is a synthetic benchmark whose parameters can be set to produce a variety of values for work and critical path. It generates a tree of branching factor k and depth n in which the first r children at every level are executed serially and the remainder are executed in parallel. At each node of the tree, the program runs an empty “for” loop for 400 iterations. (A serial sketch of this tree structure appears after this list.)
• *Socrates is a parallel chess program that uses the Jamboree search algorithm [23, 29] to parallelize a minmax tree search. The work of the algorithm varies with the number of processors, because it does speculative work that may be aborted during runtime. *Socrates is a production-quality program that won third prize in the 1994 ACM International Computer Chess Championship running on the 512-node CM5 in the National Center for Supercomputing Applications at the University of Illinois, Urbana-Champaign.
`
¹ Initially, the serial POV-Ray program was about 5 percent slower than the Cilk version running on one processor. The reason was that the divide-and-conquer decomposition performed by the Cilk code provides better locality than the doubly nested loop of the serial code. Modifying the serial code to imitate the Cilk decomposition improved its performance. Timings for the improved version are given in the table.
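As promised in the knary description above, the following serial C sketch shows the shape of the tree that the benchmark generates; in the actual Cilk benchmark the children after the first r at each node are spawned in parallel rather than called serially, and the function name here is our own.

    /* Serial sketch of the tree generated by knary(k, n, r). */
    void knary(int k, int n, int r)
    {
        volatile int i;
        int child;
        (void)r;  /* r matters only in the parallel (Cilk) version */
        if (n == 0)
            return;
        for (i = 0; i < 400; i++)        /* empty work loop at each node */
            ;
        for (child = 0; child < k; child++)
            knary(k, n - 1, r);          /* branching factor k, depth n  */
    }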
Table 4 shows typical performance measures for these Cilk applications. Each column presents data from a single run of a benchmark application. We adopt the following notations, which are used in the table. For each application, we have an efficient serial C implementation, compiled using gcc -O2, whose measured runtime is denoted Tserial. The work T1 is the measured execution time for the Cilk program running on a single node of the CM5.² The critical path length T∞ of the Cilk computation is measured by timestamping each thread and does not include scheduling or communication costs. The measured P-processor execution time of the Cilk program running on the CM5 is given by TP, which includes all scheduling and communication costs. The row labeled “threads” indicates the number of threads executed, and “thread length” is the average thread length (work divided by the number of threads).
Certain derived parameters are also displayed in the table. The ratio Tserial/T1 is the efficiency of the Cilk program relative to the C program. The ratio T1/T∞ is the average parallelism. The value T1/P + T∞ is a simple model of the runtime, which will be discussed in the next section. The speedup is T1/TP, and the parallel efficiency is T1/(P·TP). The row labeled “space/proc.” indicates the maximum number of closures allocated at any time on any processor. The row labeled “requests/proc.” indicates the average number of steal requests made by a processor during the execution, and “steals/proc.” gives the average number of closures actually stolen.
The data in Table 4 shows two important relationships: one between efficiency and thread length, and another between speedup and average parallelism.
Considering the relationship between efficiency Tserial/T1 and thread length, we see that for programs with moderately long threads, the Cilk scheduler induces very little overhead. The queens, pfold, ray, and knary programs have threads with average length greater than 50 microseconds and have efficiency greater than 90 percent. On the other hand, the fib program has low efficiency, because the threads are so short: fib does almost nothing besides spawn and send_argument.
`Despite it’s long threads, the *Socrates program has low efficiency,
`because its parallel Jamboree search algorithm [29]
`is based on specu-
`latively searching subtrees that are not searched by a serial algorithm.
`Consequently, as we increase the number of processors,
`the program
`executes more threads and, hence, does more work.
`For example,
`the 256-processor execution did 7023 seconds of work whereas the
`32-processor execution did only 3644 seconds of work. Both of these
`executions did considerably more work than the serial program’s 1665
`seconds of work. Thus, although we observe low efficiency,
`it is due
`to the parallel algorithm and not to Cilk overhead.
Looking at the speedup T1/TP measured on 32 and 256 processors, we see that when the average parallelism T1/T∞ is large compared with the number P of processors, Cilk programs achieve nearly perfect linear speedup, but when the average parallelism is small, the speedup is much less. The fib, queens, pfold, and ray programs, for example, have in excess of 7000-fold parallelism and achieve more than 99 percent of perfect linear speedup on 32 processors and more than 95 percent of perfect linear speedup on 256 processors.³ The *Socrates program exhibits somewhat less parallelism and also somewhat less speedup.
`
² For the *Socrates program, T1 is not the measured execution time, but rather it is an estimate of the work obtained by summing the execution times of all threads, which yields a slight underestimate. *Socrates is an unusually complicated application, because its speculative execution yields unpredictable work and critical path. Consequently, the measured runtime on one processor does not accurately reflect the work on P > 1 processors.

³ In fact, the ray program achieves superlinear speedup even when comparing to the efficient serial implementation. We suspect that cache effects cause this phenomenon.
`
`210
`
`SAMSUNG-1014
`Page 4 of 10
`
`
`
                 fib         queens      pfold       ray         knary       knary       *Socrates   *Socrates
                 (33)        (15)        (3,3,4)     (500,500)   (10,5,2)    (10,4,1)    (depth 10)  (depth 10)
                                                                                         (32 proc.)  (256 proc.)

(application parameters)
Tserial          8.487       252.1       615.15      729.2       288.6       40.993      1665        1665
T1               73.16       254.6       647.8       732.5       314.6       45.43       3644        7023
Tserial/T1       0.116       0.9902      0.9496      0.9955      0.9174      0.9023      0.4569      0.2371
T∞               0.000326    0.0345      0.04354     0.0415      4.458       0.255       3.134       3.24
T1/T∞            224417      7380        14879       17650       70.56       178.2       1163        2168
threads          17,108,660  210,740     9,515,098   424,475     5,859,374   873,812     26,151,774  51,685,823
thread length    4.276µs     1208µs      68.08µs     1726µs      53.69µs     51.99µs     139.3µs     135.9µs

(32-processor experiments)
TP               2.298       8.012       20.26       21.68       15.13       1.633       126.1       —
T1/P + T∞        2.287       7.991       20.29       22.93       14.28       1.675       117.0       —
T1/TP            31.84       31.78       31.97       33.79       20.78       27.81       28.90       —
T1/(P·TP)        0.9951      0.9930      0.9992      1.0558      0.6495      0.8692      0.9030      —
space/proc.      70          95          47          41          39          42          386         —
requests/proc.   185.8       48.0        88.6        218.1       92639       3127        23484       —
steals/proc.     56.63       18.47       26.06       79.25       18031       1034        2395        —

(256-processor experiments)
TP               0.2892      1.045       2.590       2.765       8.590       0.4636      —           34.32
T1/P + T∞        0.2861      1.029       2.574       2.903       5.687       0.4325      —           30.67
T1/TP            253.0       243.7       250.1       265.0       36.62       98.00       —           204.6
T1/(P·TP)        0.9882      0.9519      0.9771      1.035       0.1431      0.3828      —           0.7993
space/proc.      66          76          47          48          32          40          —           405
requests/proc.   73.66       80.40       97.79       82.75       151803      7527        —           30646
steals/proc.     24.10       21.20       23.05       18.34       6378        550         —           1540

Table 4: Performance of Cilk on various applications. All times are in seconds, except where noted.
`
On 32 processors the *Socrates program has 1163-fold parallelism, yielding 90 percent of perfect linear speedup, while on 256 processors it has 2168-fold parallelism yielding 80 percent of perfect linear speedup. With even less parallelism, as exhibited in the knary benchmarks, less speedup is obtained. For example, the knary(10,5,2) benchmark exhibits only 70-fold parallelism, and it realizes barely more than 20-fold speedup on 32 processors (less than 65 percent of perfect linear speedup). With 178-fold parallelism, knary(10,4,1) achieves 27-fold speedup on 32 processors (87 percent of perfect linear speedup), but only 98-fold speedup on 256 processors (38 percent of perfect linear speedup).
Although these speedup measures reflect the Cilk scheduler's ability to exploit parallelism, to obtain application speedup, we must factor in the efficiency of the Cilk program compared with the serial C program. Specifically, the application speedup Tserial/TP is the product of efficiency Tserial/T1 and speedup T1/TP. For example, applications such as fib and *Socrates with low efficiency generate correspondingly low application speedup. The *Socrates program, with efficiency 0.2371 and speedup 204.6 on 256 processors, exhibits application speedup of 0.2371 · 204.6 = 48.51. For the purpose of performance prediction, we prefer to decouple the efficiency of the application from the efficiency of the scheduler.
Looking more carefully at the cost of a spawn in Cilk, we find that it takes a fixed overhead of about 50 cycles to allocate and initialize a closure, plus about 8 cycles for each word argument. In comparison, a C function call on a CM5 processor takes 2 cycles of fixed overhead (assuming no register window overflow) plus 1 cycle for each word argument (assuming all arguments are transferred in registers). Thus, a spawn in Cilk is roughly an order of magnitude more expensive than a C function call. This Cilk overhead is quite apparent in the fib program, which does almost nothing besides spawn and send_argument. Based on fib's measured efficiency of 0.116, we can conclude that the aggregate average cost of a spawn/send_argument in Cilk is between 8 and 9 times the cost of a function call/return in C, since 1/0.116 ≈ 8.6.
Efficient execution of programs with short threads requires a low-overhead spawn operation. As can be observed from Table 4, the vast majority of threads execute on the same processor on which they are spawned. For example, the fib program executed over 17 million threads but migrated only 6170 (24.10 per processor) when run with 256 processors. Taking advantage of this property, other researchers [25, 32] have developed techniques for implementing spawns such that when the child thread executes on the same processor as its parent, the cost of the spawn operation is roughly equal to the cost of a C function call. We hope to incorporate such techniques into future implementations of Cilk.
Finally, we make two observations about the space and communication measures in Table 4.

Looking at the “space/proc.” rows, we observe that the space per processor is generally quite small and does not grow with the number of processors. For example, *Socrates on 32 processors executes over 26 million threads, yet no processor ever has more than 386 allocated closures. On 256 processors, the number of executed threads nearly doubles to over 51 million, but the space per processor barely changes. In Section 6 we show formally that for Cilk programs, the space per processor does not grow as we add processors.
Looking at the “requests/proc.” and “steals/proc.” rows in Table 4, we observe that the amount of communication grows with the critical path but does not grow with the work. For example, fib, queens, pfold, and ray all have critical paths under a tenth of a second long and perform fewer than 220 requests and 80 steals per processor, whereas knary(10,5,2) and *Socrates have critical paths more than 3 seconds long and perform more than 20,000 requests and 1500 steals per processor. The table does not show any clear correlation between work and either requests or steals. For example, ray does more than
`211
`
`SAMSUNG-1014
`Page 5 of 10
`
`
`
`( 1