Programming with Threads

Steve Kleiman
Devang Shah
Bart Smaalders

SunSoft Press
A Prentice Hall Title

© 1996 Sun Microsystems, Inc. Printed in the United States of America.
2550 Garcia Avenue, Mountain View, California 94043-1100 USA.

All rights reserved. This book is protected by copyright and distributed under licenses
restricting its use, copying, distribution, and decompilation. No part of this book may be
reproduced in any form by any means without prior written authorization of Sun and its
licensors, if any.

Portions of the products described in this book may be derived from the UNIX® and Berkeley
4.3 BSD systems, licensed from UNIX System Laboratories, Inc., a wholly owned subsidiary
of Novell, Inc., and the University of California, respectively. Third-party font software in
this product is protected by copyright and licensed from Sun’s font suppliers.

RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States
government is subject to restrictions as set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR
52.227-19.

The products described in this book may be protected by one or more U.S. patents, foreign
patents, or pending applications.
TRADEMARKS: Sun, Sun Microsystems, the Sun logo, SunSoft, Solaris, Solaris Sunburst
Design, OpenWindows, ONC, ONC+, SunOS, AnswerBook, Sun FORTRAN, Wabi, ToolTalk,
NFS, XView, SunView, and The Network is the Computer are trademarks or registered
trademarks of Sun Microsystems, Inc. UNIX is a registered trademark in the United States
and other countries, exclusively licensed through X/Open Company, Ltd. OPEN LOOK® is a
registered trademark of Novell, Inc. Adobe, PostScript, Display PostScript, and PhotoShop
are trademarks or registered trademarks of Adobe Systems Incorporated. PowerPC is a
trademark of International Business Machines Corporation. Xenix, Microsoft Windows, and
Windows NT are trademarks or registered trademarks of Microsoft Corporation. All other
product names mentioned herein are the trademarks of their respective owners.

SuperSPARC and all SPARC trademarks, including the SCD Compliant Logo, are trademarks
or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver,
SPARCengine, SPARCworks, SPARCworks iMPact, and SPARCompiler are licensed
exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon
an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun
Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of
Xerox in researching and developing the concept of visual or graphical user interfaces for the
computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User
Interface, which license also covers Sun’s licensees who implement OPEN LOOK GUIs and
otherwise comply with Sun’s written license agreements.

X Window System is a trademark of X Consortium, Inc.

The publisher offers discounts on this book when ordered in bulk quantities. For more
information, contact: Corporate Sales Department, Prentice Hall PTR, One Lake Street,
Upper Saddle River, NJ 07458. Phone: 800-382-3419 or 201-236-7156, Fax: 201-236-7141,
email: corpsales@prenhall.com

Cover designer: M & K Design, Palo Alto, California
Manufacturing manager: Alexis R. Heydt
Acquisitions editor: Gregory C. Doench
Cover Design Director: Jerry Votta
Production Supervisor: Joanne Anzalone

10 9 8 7 6 5 4 3 2 1

ISBN 0-13-172389-8

SunSoft Press
A Prentice Hall Title

16
Parallel Computation

One of the most powerful uses of threads is to speed up the execution of
computationally intensive programs on shared memory multiprocessors.
Unfortunately, effectively applying threads in this environment usually requires a
deep understanding of the structure of the algorithm and how a proposed
parallelized version of the algorithm is affected by the overheads of threads and
the shared memory multiprocessor environment. Fortunately, many different
algorithms have similar structures. This chapter covers in more detail a number
of the thread paradigms introduced in Chapter 7, “Using Threads,” that are
useful in many parallel computation situations. Detailed templates and examples
are presented.

Using Threads to Parallelize Algorithms

Threads running on separate processors in a shared-memory multiprocessor
allow you to use “parallel processing algorithms” in your program. Unlike the
other uses of threads described in this book, using threads to implement parallel
algorithms can be frustrating:

• There are lots of techniques for “parallelizing” your program. How do you
  choose one that’s not too hard to program and that offers substantial speedups
  compared to uniprocessor execution? Does the performance of the technique
  scale up in proportion to the number of processors you use?

• The overheads involved in synchronizing threads and sharing data among
  multiple processors may actually reduce the performance of your program.
  How can you anticipate and mitigate these problems?

• Like many performance improvements, parallelizing increases the complexity
  of your program. How can you be sure it’s still correct?

These are all tough problems: we do not yet know how to solve an arbitrary
problem efficiently on a multiprocessor of arbitrary size. This section does not
offer universal solutions, but tries to demonstrate some simple ways to get
started. By sticking with some common “paradigms” for parallel algorithms and
threads, you can avoid a lot of errors and aggravation.

Though it may seem simplistic, the most important step in writing a parallel
program is to think carefully about the global structure of your program and the
computing structures that threads offer. To speed up the program, we’re looking
for a way to divide the work into a set of tasks so that:

• The tasks interact little with each other;

• The data shared by separate tasks is contained in a minimum of simple data
  structures that can be protected by locking mechanisms to prevent
  unsynchronized access;

• The number of tasks can be varied so as to match the number of processors;

• All tasks have equal computing requirements, or, instead, are configured in
  such a way that the set of tasks can keep all the processors fairly busy.

As we’ve seen, Amdahl’s Law (Figure 7-3 on page 103) sets limits on the
scalability of parallelized computations. Scalability is limited due to three kinds
of overheads: synchronization overhead, contention, and balance. The
synchronization operations required for correct multithreaded operation take
time to execute, even when there is no contention. When two or more threads
share data or locks, the system slows down due to contention for shared
resources. And finally, balance refers to the ability of the algorithm to divide work
evenly among the available processors. Balance is worst in the “serial” sections of
your program, where only a single processor is used. Poor balance is often the
result of dividing the work into large, unequal chunks: when the smaller chunks
finish, they leave processors idle. If there are always at least as many runnable
threads as available processors, the thread scheduler can keep the processors
busy. While balance can be improved by dividing work into many small chunks
that can be easily scheduled onto idle processors, making the grain size of the
processing chunks small is usually accompanied by an increase in
synchronization overhead, which hurts scalability.
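
To recall the form of the relation behind Figure 7-3 (stated here in our own notation,
with p the fraction of the work that can be parallelized and N the number of
processors, rather than quoting the earlier chapter):

    \text{speedup}(N) = \frac{1}{(1 - p) + p/N}, \qquad
    \lim_{N \to \infty} \text{speedup}(N) = \frac{1}{1 - p}

For example, a program whose work is 95% parallelizable (p = 0.95) can never run
more than 1/0.05 = 20 times faster, no matter how many processors are added; the
three overheads discussed above only push real programs further below this bound.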

Thread Paradigms

There are many ways to make your program parallel. Some techniques are very
complex, depend on complicated data structures and locking strategies, depend
on clever non-obvious algorithms, or require compiler support. We present a
different approach in this section: three simple control structures, or paradigms,
that can be applied to a wide range of applications.

Each paradigm can be characterized by:

• How the work is divided among parallel threads, and whether each thread
  executes the same code;

• How the threads are synchronized;

• What data is shared among the threads and how it is protected by locks or
  other means to avoid data races.

The simplest paradigm is the master-slave. The main master thread launches a set
of slave threads and allocates to each slave a portion of the work to be done; the
amount of work is known in advance by the master and divided evenly among
the slaves. The master starts the slaves and waits for all of them to reach a
synchronization point, or barrier. The master may then compute and release the
slaves again if needed. This pattern is repeated until the work is done. Example: a
loop requiring 1,000 iterations is divided among four threads, each of which does
250 iterations.

Another versatile paradigm is the workpile, in which a set of worker threads
request chunks of work to do from a “workpile,” usually some form of queue. As
part of the work, a worker may cause additional entries to be added to the
workpile. The pattern terminates when the workpile is finally emptied;
alternatively, a worker thread may choose to terminate the process. Example: a
search algorithm uses a workpile of promising paths that are extended by worker
threads.
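
The workpile paradigm is developed in detail later in the chapter; purely as a
preview, the following minimal sketch (not one of the book’s numbered code
examples) treats the workpile as nothing more than a shared counter of remaining,
independent item indices, protected by a mutex. The names workpile_t, get_work,
and worker are invented for this sketch, and it omits the ability of a worker to add
new entries to the pile.

#include <pthread.h>

#define N_WORKERS 4
#define N_ITEMS   100

typedef struct {
    pthread_mutex_t lock;   /* protects next_item */
    int next_item;          /* next unclaimed item index */
    int n_items;            /* total number of items in the pile */
} workpile_t;

/* Return the next unclaimed item index, or -1 if the pile is empty. */
static int
get_work(workpile_t *wp)
{
    int item = -1;

    pthread_mutex_lock(&wp->lock);
    if (wp->next_item < wp->n_items)
        item = wp->next_item++;
    pthread_mutex_unlock(&wp->lock);
    return (item);
}

static void *
worker(void *arg)
{
    workpile_t *wp = (workpile_t *)arg;
    int item;

    while ((item = get_work(wp)) != -1) {
        /* process item here; items must be independent of each other */
    }
    return (NULL);
}

int
main(void)
{
    workpile_t wp;
    pthread_t workers[N_WORKERS];
    int i;

    pthread_mutex_init(&wp.lock, NULL);
    wp.next_item = 0;
    wp.n_items = N_ITEMS;
    for (i = 0; i < N_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)&wp);
    for (i = 0; i < N_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return (0);
}

In a full workpile, get_work() would instead dequeue a work record from a queue,
and workers would be able to enqueue new records, as the search example above
suggests.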

The final paradigm is the pipeline, in which a task is passed to a succession of
threads, each of which performs part of the overall work required; the threads are
much like workers on an assembly line. In most cases, the processing in each
pipeline stage is different, but there are applications for homogeneous pipelines in
which the code executed in each stage is the same. Example: a graphics rendering
pipeline in which stages successively transform, clip, and render geometry to
form an image.
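
Again purely as a preview (the book’s own pipeline examples appear later), this
sketch connects two pipeline stages with a small bounded queue protected by a
mutex and a pair of condition variables; a sentinel value flushed through the
queues shuts the pipeline down. The queue type, the stage names, and the queue
size are all invented for the sketch.

#include <pthread.h>
#include <stdio.h>

#define QSIZE 4
#define DONE  (-1)              /* sentinel: no more items will arrive */

/* A small bounded queue used to hand items from one stage to the next. */
typedef struct {
    int buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

static void
q_init(queue_t *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void
q_put(queue_t *q, int v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int
q_get(queue_t *q)
{
    int v;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    v = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return (v);
}

static queue_t q_in, q_mid;     /* main -> stage 1, stage 1 -> stage 2 */

static void *
stage_scale(void *arg)          /* first stage: transform each item */
{
    int v;

    while ((v = q_get(&q_in)) != DONE)
        q_put(&q_mid, v * 10);
    q_put(&q_mid, DONE);        /* pass the shutdown sentinel downstream */
    return (NULL);
}

static void *
stage_print(void *arg)          /* second stage: consume the result */
{
    int v;

    while ((v = q_get(&q_mid)) != DONE)
        printf("%d\n", v);
    return (NULL);
}

int
main(void)
{
    pthread_t t1, t2;
    int i;

    q_init(&q_in);
    q_init(&q_mid);
    pthread_create(&t1, NULL, stage_scale, NULL);
    pthread_create(&t2, NULL, stage_print, NULL);
    for (i = 0; i < 20; i++)    /* feed work into the head of the pipeline */
        q_put(&q_in, i);
    q_put(&q_in, DONE);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return (0);
}

A real pipeline would carry richer work records than an int, but the handoff
discipline between stages is the same.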

Each of these three paradigms is explained in more detail in the following
sections, and each is illustrated with one or more examples.

Master-Slave Paradigm

The simplest paradigm for parallel algorithms is the “master-slave” arrangement.
A single master thread, usually the main thread of your program, partitions work
evenly among a set of slave threads. The slave threads are launched together, and
each one is assigned to a separate partition of work. The master thread waits for
all slave threads to complete their work and then continues.

[Figure 16-1 Master-Slave Thread Execution: a timeline of the master thread and its slave
threads; the legend distinguishes blocked, runnable, and active threads.]

This pattern is illustrated by the following template:

Master:

    Initialization
    for (i = 0; i < n; i++)
        pthread_create(..., slave, ...);
    /* slaves are now running */
    Master processing if any
    for (i = 0; i < n; i++)
        pthread_join(...);
    Finalization if any

Slave:

    slave(void *params);
        Slave processing: index i (in params) tells each slave what work
        to do. Put results into global memory or params
        pthread_exit();

This simple kind of master-slave paradigm is used when the amount of work to
be done is known in advance and when it’s easy to partition the work into n
roughly equal parts that don’t depend on each other. The most common
application is to a loop, where m iterations of the loop are partitioned among the
n threads. Usually m is much greater than n, so each thread invocation executes
quite a few iterations of the loop. A key requirement is that each iteration of the
loop must be independent of every other iteration of the loop; thus no
synchronization is required among the slaves. Here are some examples where this
method applies:

• Multiplying two matrices. Each row in the product matrix can be computed by
  a separate slave. The result for each row depends only on the values in the
  matrices being multiplied, and not on the results of other rows.

• Transform-coding an image, for example in computing the discrete cosine
  transform (DCT) of every 8x8 pixel block in an image that is 640 by 480 pixels.
  There are m = 4,800 blocks to be done, and the blocks are independent of each
  other.

• Computing the cross-correlation of two signals, e.g., two images, by iterating
  over all samples in the signal, computing the mean-square distance between
  the sample and its correlate, and summing the distances. This example
  computes a running sum over the iterations of the loop. In the master-slave
  paradigm, each slave will compute a sum for the iterations it is handling, and
  store the sum in the thread’s parameter record. After all slave threads have
  finished, the master thread computes the final sum by summing the entries in
  the parameter records of all threads.

Master-Slave Example: Matrix Multiply

To illustrate the master-slave paradigm, we present an example of a matrix
multiply routine that also computes the trace of the resulting matrix. The trace is
the sum of the diagonal elements. The outermost loop, which iterates over rows
in the result matrix, is partitioned among the slave threads.

#define MAX_PARALLEL 20     /* maximum number of threads to use */
#define SCHED_FUDGE  1.2    /* Threads/processor ratio (see text) */

#define MREF(mt, row, col)  mt->m[(row) * mt->n_cols + (col)]

typedef struct {            /* Matrix of variable size */
    int n_rows;             /* # rows in matrix: 0 <= row < n_rows */
    int n_cols;             /* # columns in matrix: 0 <= col < n_cols */
    float m[1];             /* MREF(m,row,col) elt in row-major order */
} matrix_t;

/* parameter record for slave procedure */
typedef struct {
    int i;                  /* index of slave thread 0 <= i < n */
    int n;                  /* number of slave threads */
    matrix_t *a, *b, *c;    /* compute c = a*b */
    float trace_ans;        /* trace contribution for this part of matrix */
} mm_params;

/*
 * This is the "slave procedure" that each slave thread will execute.
 * Each thread is passed a structure of type mm_params; these parameter
 * records differ only in the value of index i for each slave.
 */
void *
matrix_multiply_slave(void *param)
{
    mm_params *p = (mm_params *)param;
    matrix_t *a = p->a;
    matrix_t *b = p->b;
    matrix_t *c = p->c;     /* c[row,col] is what is being computed */
    int row, col, j;
    float trace = 0.0;

    /*
     * Calculate which rows the p->i th thread will process. Since
     * the number of threads may not divide m = c->n_rows exactly, we
     * adopt the convention that for index 0 <= i < m%n, the thread
     * will process floor(m/n)+1 rows, while all others will process
     * floor(m/n) rows.
     */
    int quot = c->n_rows / p->n;
    int rem = c->n_rows % p->n;
    int do_rows = quot + ((p->i < rem) ? 1 : 0);
    int first_row = quot * p->i + ((p->i < rem) ? p->i : rem);

    for (row = first_row; row < first_row + do_rows; row++) {
        for (col = 0; col < c->n_cols; col++) {
            /* Compute c[row,col] */
            float sum = 0.0;

            for (j = 0; j < a->n_cols; j++)
                sum += MREF(a, row, j) * MREF(b, j, col);
            MREF(c, row, col) = sum;
            if (row == col)
                trace += sum;
        }
    }
    /* Record partial trace answer in parameter record */
    p->trace_ans = trace;
    return (NULL);
}

/* Compute c = a * b, return trace of c */
float
matrix_multiply_trace(matrix_t *a, matrix_t *b, matrix_t *c)
{
    mm_params params[MAX_PARALLEL];
    pthread_t thread[MAX_PARALLEL];
    int thread_present[MAX_PARALLEL];
    int n_threads = min(MAX_PARALLEL,
        (int)(sysconf(_SC_NPROCESSORS_ONLN) * SCHED_FUDGE));
    int i, res;
    float trace = 0.0;

    for (i = 0; i < n_threads; i++) {
        thread_present[i] = TRUE;
        params[i].i = i;
        params[i].n = n_threads;
        params[i].a = a;
        params[i].b = b;
        params[i].c = c;
        res = pthread_create(&thread[i],
            NULL, matrix_multiply_slave, (void *)&params[i]);
        if (res != 0) {
            thread_present[i] = FALSE;          /* flag no thread created */
            matrix_multiply_slave(&params[i]);  /* just do it */
        }
    }
    for (i = 0; i < n_threads; i++) {
        if (thread_present[i] != FALSE)
            pthread_join(thread[i], NULL);      /* wait for thread to exit */
        trace += params[i].trace_ans;
    }
    return (trace);
}

Code Example 16-1 Master-Slave Matrix Multiply

The matrix_multiply_trace() routine shown above correctly computes the
product of the two matrices (provided no arithmetic exceptions occur). Although
the code does not show the reasoning, it has been constructed carefully to avoid
data races. Let’s categorize the data that is shared among the threads, and argue
in each case that there are no data races.

1. Master->slave.
   The matrices a and b, and the parameter records, are all written by the
   master thread and read by the slave threads. In the absence of
   synchronization, this would cause a data race. But all writing by the master
   is separated from reading by a slave with a call to pthread_create(), a
   function that ensures all writes have propagated to main memory (see “Data
   Races” on page 30). So all slaves will see the values written by the master.

2. Slave->slave.
   In this example, no slave writes a memory location that another slave reads.
   Each slave is writing a separate region of the result matrix, c, and the partial
   results of the trace computation are stored by each slave in a separate
   memory location. Thus there is no need for locks or any other mechanism
   for synchronizing the slaves to each other; they are independent of one
   another.

3. Slave->master.
   The slaves write the result matrix, c, and the trace_ans entry in the
   parameter record corresponding to each slave. The master refrains from
   reading this data until after executing pthread_join(), a function that
   ensures that all writes initiated by the exiting thread have propagated to
   main memory (see “Data Races” on page 30). So the master will correctly see
   the values written by the slaves.

Note one other feature of the program: we have been careful to deal with the case
when a thread cannot be created. There’s a simple way to ensure a correct result
when this happens: the master thread calls the slave procedure directly.
Performance will suffer, but the result will be correct. Note too that the main
thread is idle while the slaves work; an alternative would be to use the main
thread to do part of the work as well as the slaves.
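
For example, the routine could be restructured as sketched below so that the master
runs the last partition itself. This variant is not the book’s code: it reuses the
definitions that accompany Code Example 16-1 (matrix_t, mm_params,
matrix_multiply_slave, MAX_PARALLEL, SCHED_FUDGE, and the min() helper) and, for
brevity, omits the fallback for a failed pthread_create().

/*
 * Sketch only: like matrix_multiply_trace(), but the master computes the
 * last partition in its own thread instead of waiting idle for the slaves.
 */
float
matrix_multiply_trace_mw(matrix_t *a, matrix_t *b, matrix_t *c)
{
    mm_params params[MAX_PARALLEL];
    pthread_t thread[MAX_PARALLEL];
    int n_threads = min(MAX_PARALLEL,
        (int)(sysconf(_SC_NPROCESSORS_ONLN) * SCHED_FUDGE));
    int i;
    float trace = 0.0;

    for (i = 0; i < n_threads; i++) {
        params[i].i = i;
        params[i].n = n_threads;
        params[i].a = a;
        params[i].b = b;
        params[i].c = c;
    }
    /* Launch all but the last slave... */
    for (i = 0; i < n_threads - 1; i++)
        pthread_create(&thread[i], NULL, matrix_multiply_slave,
            (void *)&params[i]);
    /* ...and run the last partition in the master thread itself. */
    matrix_multiply_slave(&params[n_threads - 1]);
    for (i = 0; i < n_threads - 1; i++)
        pthread_join(thread[i], NULL);
    for (i = 0; i < n_threads; i++)
        trace += params[i].trace_ans;
    return (trace);
}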

The program is modular about its use of threads: every thread that is created is
also explicitly destroyed by waiting at pthread_join(). The routine is thus
reentrant, i.e., several separate threads could call matrix_multiply_trace(),
and each call would create a set of slave threads that would be correctly managed
by their respective master.

Note the use of stack allocation for the slave parameter records used by the slave
threads. The example is careful to ensure that all of the threads that reference the
parameter records exit before the procedure that allocated the stack storage
(matrix_multiply_trace()) exits.

How many slave threads should matrix_multiply_trace() create? To
exploit parallelism to the maximum extent, we should use at least as many
threads as there are processors available. But if we use exactly one thread per
processor, a balance problem may reduce performance. To see why, consider what
happens on a four-processor system running four matrix multiply threads when
the kernel scheduler decides to run another job on one of the processors, as
shown in Figure 16-2. One thread now becomes delayed relative to the other
three, causing the entire parallel section, when all slaves are expected to be
running, to be delayed. During some of this time, several processors may be idle.
So the ideal balance, in which all processors are busy, is not achieved.

[Figure 16-2 Master-Slave with Scheduling Delay: a timeline of four slave threads in which one
slave is delayed by the kernel scheduler, so the end of the parallel section is delayed while other
processors sit idle; the legend distinguishes blocked, runnable, and active threads.]

If the computation is broken up into a larger number of smaller parts, the
scheduler is better able to achieve good balance, as shown in Figure 16-3. Too
many threads, however, incur overheads in creating and destroying them, in
memory for stack space, and in other system resources. You may wish to
experiment with different strategies for choosing the right number of threads;
your experiments need to consider both the number of processors available and
the amount of work required for each thread (sometimes called the grain size).
Allocating a few more threads than the number of processors is a good starting
point. The thread scheduler will time-slice runnable threads, which helps avoid
this problem: in effect, the grain size of the computation is replaced by that of a
scheduling quantum.

How does our algorithm scale? That is, does the program speed up in proportion
to the number of processors available to share the work? Of course, if the
matrices to be multiplied are small, the overheads of creating and joining threads
will exceed the computational work. Indeed, a general-purpose routine would
test the matrix size and use a conventional serial implementation for small
matrices. But for large matrices, we should expect the program to scale well, in
part because there is no slave-to-slave synchronization required and because the
data sharing is modest.

Consider the effects of caching parts of the result matrix, c, which is shared
among all the slaves. The program above allocates to each slave responsibility for
computing a certain number of rows of the result matrix. The matrix is in
row-major order, which tends to put nearby elements of the same row in the same
cache line and therefore keeps the chances of false sharing to a minimum.

[Figure 16-3 Master-Slave with Scheduling Delay and Extra Threads: the same work divided
among more, smaller slave threads, so that when one thread is delayed the others keep the
processors busy; the legend distinguishes blocked, runnable, and active threads.]

Compilers sometimes use the master-slave paradigm to automatically parallelize
loops within a program. Typically a parallelizing compiler will generate code
such that the main flow of the computation is controlled by the master. When the
master encounters a parallelizable loop, it divides up the loop’s work and gives it
to a set of slaves. When the slaves finish, the master continues with the main
computation. These techniques are illustrated more fully in “Parallelizing Using
Compiler Techniques” on page 311.

Barrier Synchronization

The key element of the master-slave paradigm is that every slave thread must
complete its task before the computation can proceed. Rather than requiring each
thread to exit in order to synchronize, we can allow a pool of threads to remain
ready to work on subsequent computations, as shown in Figure 16-4. Note that
each thread must synchronize to start the computation and to end it. When each
thread resumes execution, it can retain its state and local variables, which is not
possible when threads exit and are recreated.

A synchronization structure that conveniently implements this pattern is called a
barrier. A barrier is initialized by specifying how many threads share the barrier;
in the master-slave paradigm, this number is the number of slaves plus one for
the master. In the example of Figure 16-4, five threads share the barrier, one
master and four slaves. The main operation on a barrier is barrier_wait(),
which will block until the number of threads specified in the initialization are
waiting at the barrier. When the last thread arrives at the barrier, all the threads
are released.

[Figure 16-4 Master-Slave with Slave Pool: the master and a pool of slave threads alternate
between sections labeled A and B, separated by barrier waits; the legend distinguishes blocked,
runnable, and active threads.]

The master-slave paradigm then becomes:

Master:

    Initialization
    bar = barrier_init(n+1);
    for (i = 0; i < n; i++)
        pthread_create(..., slave, ...);
    /* Slaves are now running */
    while (not done) {
        A: Fill in work for slave to do
        barrier_wait(bar);
        /* Slaves are now working */
        B: Master processing if any
        /* Wait for slaves to finish */
        barrier_wait(bar);
        A: Process results
    }
    for (i = 0; i < n; i++)
        pthread_join(...);
    Finalization if any

Slave:

    slave(void *params);
    while (not done) {
        /* A: wait for work */
        barrier_wait(bar);
        B: Slave processing: data in global memory or params tells
           what to do. Put results into global memory or params
        /* Wait till all are done */
        barrier_wait(bar);
    }
    pthread_exit();

This paradigm is structured so that the slaves are in a loop in which two calls to
barrier_wait() divide the loop into two sections we have labeled A and B.
During section A, the master is running and sets up work for the slaves to do; this
section corresponds to the region where the master thread is active in Figure 16-4.
When the master and all n slaves have entered the barrier by calling
barrier_wait(), the barrier releases all the threads and all return from the
call. Now both master and slaves are in section B, where the slaves perform work,
and the master may be idle. (However, this structure makes it easy for the master
thread to assume a share of the work during section B, in effect participating as
another slave.) At the end of section B, the master and slaves call
barrier_wait() again; now all threads are in section A and the process repeats.

The barriers provide the synchronization necessary to avoid data races between
the master and the slaves. Data written by the master or slaves during one section
can be safely read by the master and all slaves in any subsequent section, because
barrier_wait() not only contains the necessary synchronization function that
forces data writes to memory, but also sequences all of the threads so that a
reader of shared data can know that the data was written during a previous
section. The template shown above uses this synchronization so that the master
can safely pass work parameters to the slaves and the slaves can safely pass
results back to the master. For example, the master could specify a different
matrix multiplication task each time, or could use a more general parameter
record to specify different slave tasks with each invocation. This design allows a
pool of slave threads, waiting in the barrier, to be used by the master in a wide
variety of ways at different times during the execution of the program.

An implementation of a symmetric summing barrier is shown below. The design is
symmetric in that master and slave threads are not distinguished. An asymmetric
barrier, in which master and slaves are treated differently, is more appropriate in
the special case where the master and slaves truly alternate processing (i.e., the
master does no work in section B and the slaves do no work in section A). In this
case, you will notice that both master and slaves make two immediately adjacent
calls to barrier_wait() with no intervening computation; these can be
collapsed into a single call that requires less synchronization overhead.

The summing nature of the barrier allows all the threads to exchange an important
piece of information as part of the synchronization at the barrier. Each caller of
barrier_wait() specifies an integer. When barrier_wait() returns, it
provides a result that is the sum of all the values specified in the calls to
barrier_wait() by threads entering the barrier. This feature is not used in the
template shown above, but will be important in a more complex example that
follows. It’s obvious how to remove the code used for summing if you don’t need
it.
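
As a hint of how the relaxation example that follows uses this feature, a slave can
report whether its own portion of the work has converged and learn, from the
returned sum, whether every portion has. The fragment below is only illustrative;
the names bar, local_change, and epsilon are invented here rather than taken from
the book’s code.

    /* Each thread contributes 1 if its portion has NOT yet converged, 0 if it has. */
    int my_not_converged = (local_change > epsilon) ? 1 : 0;

    if (barrier_wait(bar, my_not_converged) == 0)
        break;      /* the sum is 0 only when every portion converged; every
                       thread sees the same sum, so all of them stop together */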

typedef struct barrier_struct {
    pthread_mutex_t lock;   /* Mutex lock for the entire structure */
    int n_clients;          /* Number of threads to wait for at barrier */
    int n_waiting;          /* Number of threads that have called barrier_wait */
    int phase;              /* Flag to separate waiters from fast workers */
    int sum;                /* Sum of arguments passed to barrier_wait */
    int result;             /* Answer to be returned by barrier_wait */
    pthread_cond_t wait_cv; /* Clients wait on condition variable to proceed */
} *barrier_t;

/* Create & initialize a barrier with the given number of client threads */
barrier_t
barrier_init(int n_clients)
{
    barrier_t barrier =
        (barrier_t) malloc(sizeof (struct barrier_struct));

    if (barrier != NULL) {
        barrier->n_clients = n_clients;
        barrier->n_waiting = 0; barrier->phase = 0; barrier->sum = 0;
        pthread_mutex_init(&barrier->lock, NULL);
        pthread_cond_init(&barrier->wait_cv, NULL);
    }
    return (barrier);
}

/* Destroy a barrier */
void
barrier_destroy(barrier_t barrier)
{
    pthread_mutex_destroy(&barrier->lock);
    pthread_cond_destroy(&barrier->wait_cv);
    free(barrier);
}

/* Wait until the required number of threads enter the barrier */
int
barrier_wait(barrier_t barrier, int increment)
{
    int my_phase;

    pthread_mutex_lock(&barrier->lock);
    my_phase = barrier->phase;
    barrier->sum += increment;
    barrier->n_waiting++;
    if (barrier->n_waiting == barrier->n_clients) {
        barrier->result = barrier->sum;
        barrier->sum = 0;
        barrier->n_waiting = 0;
        barrier->phase = 1 - my_phase;
        pthread_cond_broadcast(&barrier->wait_cv);
    }
    /* Wait for the end of this synchronization phase */
    while (barrier->phase == my_phase) {
        pthread_cond_wait(&barrier->wait_cv, &barrier->lock);
    }
    pthread_mutex_unlock(&barrier->lock);
    return (barrier->result);
}

Code Example 16-2 Symmetric Summing Barrier

There is a subtle point in the design of the barrier in the use of the phase element,
which can have two values, 0 or 1. When the final thread arrives at the barrier, it
flips the value of phase. This tells the other threads waiting at the barrier that the
current phase of waiting has ended and the next has begun. A fast thread may
wake up, and hit the next use of the barrier, before a slower thread wakes up and
tests the condition. A slow thread can never see the value of phase flip twice
while it is waiting, since only the final thread arriving at the barrier will change
the value of phase.

Master-Slave Example: Relaxation

Barrier synchronization can be used to enforce sequencing among slaves to avoid
data races when one slave must write data that another slave must read. The
matrix multiplication example does not show this form of data dependency: the
work of each thread is independent of that of the other threads. This section
presents a more general example that uses the symmetric summing barrier to
coordinate slave dependencies.

The example is a relaxation algorithm or “Jacobi iteration” of the form often used
to solve Laplace’s equation to determine, for example, the steady-state
temperature over a strangely shaped body when the temperature of its
boundaries is held fixed. The algorithm approaches a solution iteratively; in each
iteration, the temperature of a point is computed to be the average of the
temperatures of its four neighbors, except that temperatures at the boundary are
not changed. If T_i(x, y) is the temperature at location (x, y) after the ith iteration of
the algorithm, assuming that space has been quantized into a two-dimensional
grid, then the algorithm computes the temperature values for the next iteration as
follows:

    T_{i+1}(x, y) = (T_i(x-1, y) + T_i(x+1, y) + T_i(x, y-1) + T_i(x, y+1)) / 4

The algorithm below uses a matrix t (the matrix_t type introduced in the
matrix multiplication example) to represent the solution matrix. It identifies
boundary points using another matrix, boundary. On each iteration, it computes
a new solution matrix and tests for convergence.

void
relax(matrix_t *boundary, matrix_t *t)
{
    matrix_t *t_old = t;
    matrix_t *t_new = matrix_alloc(t->n_rows, t->n_cols);
    matrix_t *temp;
    int row, col, converged;

    do {
        converged = TRUE;
        for (row = 0; row < boundary->n_rows; row++) {
            for (col = 0; col < boundary->n_cols; col++) {
                if (MREF(boundary, row, col) == 0) {
                    /* Not on boundary */
                    float v = (MREF(t_old, row-1, col) +
                        MREF(t_old, row+1, co