`
`Microsoft Corp. Exhibit 1057
`
`
`
`Programming with Threads
`
`Steve Kleiman
`Devang Shah
`Bart Smaalders
`
A Prentice Hall Title
`
`SunSoft Press
`
`
© 1996 Sun Microsystems, Inc. Printed in the United States of America.
2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A.

All rights reserved. This book is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this book may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.
Portions of the products described in this book may be derived from the UNIX® and Berkeley 4.3 BSD systems, licensed from UNIX System Laboratories, Inc., a wholly owned subsidiary of Novell, Inc., and the University of California, respectively. Third-party font software in this product is protected by copyright and licensed from Sun's font suppliers.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States government is subject to restrictions as set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR 52.227-19.
`
The products described in this book may be protected by one or more U.S. patents, foreign patents, or pending applications.
TRADEMARKS: Sun, Sun Microsystems, the Sun logo, SunSoft, Solaris, Solaris Sunburst Design, OpenWindows, ONC, ONC+, SunOS, AnswerBook, Sun FORTRAN, Wabi, ToolTalk, NFS, XView, SunView, and The Network is the Computer are trademarks or registered trademarks of Sun Microsystems, Inc. UNIX is a registered trademark in the United States and other countries exclusively licensed through X/Open Company, Ltd. OPEN LOOK® is a registered trademark of Novell, Inc. Adobe, PostScript, Display PostScript, and PhotoShop are trademarks or registered trademarks of Adobe Systems Incorporated. PowerPC is a trademark of International Business Machines Corporation. Xenix, Microsoft Windows, and Windows NT are trademarks or registered trademarks of Microsoft Corporation. All other product names mentioned herein are the trademarks of their respective owners.
SuperSPARC and all SPARC trademarks, including the SCD Compliant Logo, are trademarks or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver, SPARCengine, SPARCworks, SPARCworks iMPact, and SPARCompiler are licensed exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.
X Window System is a trademark of X Consortium, Inc.
The publisher offers discounts on this book when ordered in bulk quantities. For more information, contact: Corporate Sales Department, Prentice Hall PTR, One Lake Street, Upper Saddle River, NJ 07458. Phone: 800-382-3419 or 201-236-7156, Fax: 201-236-7141, email: corpsales@prenhall.com
Cover designer: M & K Design, Palo Alto, California
Manufacturing manager: Alexis R. Heydt
Acquisitions editor: Gregory G. Doench
Cover Design Director: Jerry Votta
Production Supervisor: Joanne Anzalone
10 9 8 7 6 5 4 3 2 1
`
`ISBN 0-13-172389-8
`
`SunSoft Press
`A Prentice Hall Title
`
`
`
`
Parallel Computation

16
`
`
One of the most powerful uses of threads is to speed up the execution of computationally intensive programs on shared memory multiprocessors. Unfortunately, effectively applying threads in this environment usually requires a deep understanding of the structure of the algorithm and how a proposed parallelized version of the algorithm is affected by the overheads of threads and the shared memory multiprocessor environment. Fortunately, many different algorithms have similar structures. This chapter covers in more detail a number of the thread paradigms introduced in Chapter 7, “Using Threads,” that are useful in many parallel computation situations. Detailed templates and examples are presented.
`
Using Threads to Parallelize Algorithms
Threads running on separate processors in a shared-memory multiprocessor allow you to use “parallel processing algorithms” in your program. Unlike the other uses of threads described in this book, using threads to implement parallel algorithms can be frustrating:
• There are lots of techniques for “parallelizing” your program. How do you choose one that’s not too hard to program and that offers substantial speedups compared to uniprocessor execution? Does the performance of the technique scale up in proportion to the number of processors you use?
• The overheads involved in synchronizing threads and sharing data among multiple processors may actually reduce the performance of your program. How can you anticipate and mitigate these problems?
• Like many performance improvements, parallelizing increases the complexity of your program. How can you be sure it’s still correct?
These are all tough problems: we do not yet know how to solve an arbitrary problem efficiently on a multiprocessor of arbitrary size. This section does not offer universal solutions, but tries to demonstrate some simple ways to get started. By sticking with some common “paradigms” for parallel algorithms and threads, you can avoid a lot of errors and aggravation.
`
`
`
`
`
`
Though it may seem simplistic, the most important step in writing a parallel program is to think carefully about the global structure of your program and the computing structures that threads offer. To speed up the program, we’re looking for a way to divide the work into a set of tasks so that:
• The tasks interact little with each other;
• The data shared by separate tasks is contained in a minimum of simple data structures that can be protected by locking mechanisms to prevent unsynchronized access;
• The number of tasks can be varied so as to match the number of processors;
• All tasks have equal computing requirements, or, instead, are configured in such a way that the set of tasks can keep all the processors fairly busy.
As we’ve seen, Amdahl’s Law (Figure 7-3 on page 103) sets limits on the scalability of parallelized computations. Scalability is limited due to three kinds of overheads: synchronization overhead, contention, and balance. The synchronization operations required for correct multithreaded operation take time to execute, even when there is no contention. When two or more threads share data or locks, the system slows down due to contention for shared resources. And finally, balance refers to the ability of the algorithm to divide work evenly among the available processors. In the “serial” sections of your program, where only a single processor is used, balance is worst. Poor balance is often the result of dividing the work into large, unequal chunks: when the smaller chunks finish, they leave processors idle. If there are always at least as many runnable threads as available processors, the thread scheduler can keep the processors busy. While balance can be improved by dividing work into many small chunks that can be easily scheduled onto idle processors, making the grain size of the processing chunks small is usually accompanied by an increase in synchronization overhead, which hurts scalability.
`
Thread Paradigms
There are many ways to make your program parallel. Some techniques are very complex, depend on complicated data structures and locking strategies, depend on clever non-obvious algorithms, or require compiler support. We present a different approach in this section: three simple control structures, or paradigms, that can be applied to a wide range of applications.
Each paradigm can be characterized by:
• How the work is divided among parallel threads, and whether each thread executes the same code;
• How the threads are synchronized;
`
`
`
`
• What data is shared among the threads and how it is protected by locks or other means to avoid data races.

The simplest paradigm is the master-slave. The main master thread launches a set of slave threads and allocates to each slave a portion of the work to be done; the amount of work is known in advance by the master and divided evenly among the slaves. The master starts the slaves and waits for all of them to reach a synchronization point, or barrier. The master may then compute and release the slaves again if needed. This pattern is repeated until the work is done. Example: a loop requiring 1,000 iterations is divided among four threads, each of which does 250 iterations.
`
Another versatile paradigm is the workpile, in which a set of worker threads request chunks of work to do from a “workpile,” usually some form of queue. As part of the work, a worker may cause additional entries to be added to the workpile. The pattern terminates when the workpile is finally emptied; alternatively, a worker thread may choose to terminate the process. Example: a search algorithm uses a workpile of promising paths that are extended by worker threads.
`
The final paradigm is the pipeline, in which a task is passed to a succession of threads, each of which performs part of the overall work required; the threads are much like workers on an assembly line. In most cases, the processing in each pipeline stage is different, but there are applications for homogeneous pipelines in which the code executed in each stage is the same. Example: a graphics rendering pipeline in which stages successively transform, clip, and render geometry to form an image.
Each of these three paradigms is explained in more detail in the following sections, and each is illustrated with one or more examples.
`
Master-Slave Paradigm
The simplest paradigm for parallel algorithms is the “master-slave” arrangement. A single master thread, usually the main thread of your program, partitions work evenly among a set of slave threads. The slave threads are launched together, and each one is assigned to a separate partition of work. The master thread waits for all slave threads to complete their work and then continues.
`
`
`
`
`
[Figure 16-1: Master-Slave Thread Execution. Timeline of the master and slave threads, each shown as blocked, runnable, or active over time.]
`
This pattern is illustrated by the following template:

Master:
    Initialization
    for (i = 0; i < n; i++)
        pthread_create(..., slave, ...);
    /* slaves are now running */
    Master processing if any
    for (i = 0; i < n; i++)
        pthread_join(...);
    Finalization if any

Slave:
    slave(void *params);
    Slave processing: index i (in params) tells each slave what
        work to do. Put results into global memory or params
    pthread_exit();
`
This simple kind of master-slave paradigm is used when the amount of work to be done is known in advance and when it’s easy to partition the work into n roughly equal parts that don’t depend on each other. The most common application is to a loop, where m iterations of the loop are partitioned among the n threads. Usually m is much greater than n, so each thread invocation executes quite a few iterations of the loop. A key requirement is that each iteration of the loop must be independent of every other iteration of the loop; thus no synchronization is required among the slaves. Here are some examples where this method applies:
• Multiplying two matrices. Each row in the product matrix can be computed by a separate slave. The result for each row depends only on the values in the matrices being multiplied, and not on the results of other rows.
`
`
`
`
• Transform-coding an image, for example in computing the discrete cosine transform (DCT) of every 8x8 pixel block in an image that is 640 by 480 pixels. There are m = 4,800 blocks to be done, and the blocks are independent of each other.
• Computing the cross-correlation of two signals, e.g., two images, by iterating over all samples in the signal, computing the mean-square distance between the sample and its correlate, and summing the distances. This example computes a running sum over the iterations of the loop. In the master-slave paradigm, each slave will compute a sum for the iterations it is handling, and store the sum in the thread’s parameter record. After all slave threads have finished, the master thread computes the final sum by summing the entries in the parameter records of all threads.
`
Master-Slave Example: Matrix Multiply
To illustrate the master-slave paradigm, we present an example of a matrix multiply routine that also computes the trace of the resulting matrix. The trace is the sum of the diagonal elements. The outermost loop, which iterates over rows in the result matrix, is partitioned among the slave threads.
#define MAX_PARALLEL 20     /* maximum number of threads to use */
#define SCHED_FUDGE  1.2    /* Threads/processor ratio (see text) */

#define MREF(mt, row, col)  mt->m[(row) * mt->n_cols + (col)]

typedef struct {            /* Matrix of variable size */
    int n_rows;             /* # rows in matrix: 0 <= row < n_rows */
    int n_cols;             /* # columns in matrix: 0 <= col < n_cols */
    float m[1];             /* MREF(m,row,col) elt in row-major order */
} matrix_t;

typedef struct {            /* parameter record for slave procedure */
    int i;                  /* index of slave thread 0 <= i < n */
    int n;                  /* number of slave threads */
    matrix_t *a, *b, *c;    /* compute c = a*b */
    float trace_ans;        /* trace contribution for this part of matrix */
} mm_params;
`
`
`
`
`
/*
 * This is the "slave procedure" that each slave thread will execute.
 * Each thread is passed a structure of type mm_params; these parameter
 * records differ only in the value of index i for each slave.
 */
void *
matrix_multiply_slave(void *param)
{
    mm_params *p = (mm_params *)param;
    matrix_t *a = p->a;
    matrix_t *b = p->b;
    matrix_t *c = p->c;         /* c[row,col] is what is being computed */
    int row, col, j;
    float trace = 0.0;

    /*
     * Calculate which rows the p->i th thread will process. Since
     * the number of threads may not divide m = c->n_rows exactly, we
     * adopt the convention that for index 0 <= i < m mod n, the thread
     * will process floor(m/n)+1 rows, while all others will process
     * floor(m/n) rows.
     */
    int quot = c->n_rows / p->n;
    int rem = c->n_rows % p->n;
    int do_rows = quot + ((p->i < rem) ? 1 : 0);
    int first_row = quot * p->i + ((p->i < rem) ? p->i : rem);

    for (row = first_row; row < first_row + do_rows; row++) {
        for (col = 0; col < c->n_cols; col++) {
            /* Compute c[row,col] */
            float sum = 0.0;
            for (j = 0; j < a->n_cols; j++)
                sum += MREF(a, row, j) * MREF(b, j, col);
            MREF(c, row, col) = sum;
            if (row == col)
                trace += sum;
        }
    }
    /* Record partial trace answer in parameter record */
    p->trace_ans = trace;
    return (NULL);
}
`
`
`
`
`
/* Compute c = a * b, return trace of c */
float
matrix_multiply_trace(matrix_t *a, matrix_t *b, matrix_t *c)
{
    mm_params params[MAX_PARALLEL];
    pthread_t thread[MAX_PARALLEL];
    int thread_present[MAX_PARALLEL];
    int n_threads = min(MAX_PARALLEL,
        (int)(sysconf(_SC_NPROCESSORS_ONLN) * SCHED_FUDGE));
    int i, res;
    float trace = 0.0;

    for (i = 0; i < n_threads; i++) {
        thread_present[i] = TRUE;
        params[i].i = i;
        params[i].n = n_threads;
        params[i].a = a;
        params[i].b = b;
        params[i].c = c;
        res = pthread_create(&thread[i], NULL, matrix_multiply_slave,
            (void *)&params[i]);
        if (res != 0) {
            thread_present[i] = FALSE;          /* flag no thread created */
            matrix_multiply_slave(&params[i]);  /* just do it */
        }
    }
    for (i = 0; i < n_threads; i++) {
        if (thread_present[i] != FALSE)
            pthread_join(thread[i], NULL);  /* wait for thread to exit */
        trace += params[i].trace_ans;
    }
    return (trace);
}
Code Example 16-1 Master-Slave Matrix Multiply
The matrix_multiply_trace() routine shown above correctly computes the product of the two matrices (provided no arithmetic exceptions occur). Although the code does not show the reasoning, it has been constructed carefully to avoid data races. Let’s categorize the data that is shared among the threads, and argue in each case that there are no data races.

1. Master->slave.
The matrices a and b, and the parameter records, are all written by the master thread and read by the slave threads. In the absence of synchronization, this would cause a data race. But all writing by the master
`
`
`
`
`
`
is separated from reading by a slave with a call to pthread_create(), a function that ensures all writes have propagated to main memory (see “Data Races” on page 30). So all slaves will see the values written by the master.

2. Slave->slave.
In this example, no slave writes a memory location that another slave reads. Each slave is writing a separate region of the result matrix, c, and the partial results of the trace computation are stored by each slave in a separate memory location. Thus there is no need for locks or any other mechanism for synchronizing the slaves to each other; they are independent of one another.

3. Slave->master.
The slaves write the result matrix, c, and the trace_ans entry in the parameter record corresponding to each slave. The master refrains from reading this data until after executing pthread_join(), a function that ensures that all writes initiated by the exiting thread have propagated to main memory (see “Data Races” on page 30). So the master will correctly see the values written by the slaves.
Note one other feature of the program: we have been careful to deal with the case when a thread cannot be created. There’s a simple way to ensure a correct result when this happens: the master thread calls the slave procedure directly. Performance will suffer, but the result will be correct. Note too that the main thread is idle while the slaves work; an alternative would be to use the main thread to do part of the work as well as the slaves.
The program is modular about its use of threads: every thread that is created is also explicitly destroyed by waiting at pthread_join(). The routine is thus reentrant, i.e., several separate threads could call matrix_multiply_trace(), and each call would create a set of slave threads that would be correctly managed by their respective master.
Note the use of stack allocation for the slave parameter records used by the slave threads. The example is careful to ensure that all of the threads that reference the parameter records exit before the procedure that allocated the stack storage (matrix_multiply_trace()) exits.
How many slave threads should matrix_multiply_trace() create? To exploit parallelism to the maximum extent, we should use at least as many threads as there are processors available. But if we use exactly one thread per processor a balance problem may reduce performance. To see why, consider what happens on a four-processor system running four matrix multiply threads when the kernel scheduler decides to run another job on one of the processors, as shown in Figure 16-2. One thread now becomes delayed relative to the other
`
`
`
`
`
three, causing the entire parallel section, when all slaves are expected to be running, to be delayed. During some of this time, several processors may be idle. So the ideal balance, in which all processors are busy, is not achieved.
`
[Figure 16-2: Master-Slave with Scheduling Delay. Timeline of the slave threads, each shown as blocked, runnable, or active over time.]
If the computation is broken up into a larger number of smaller parts, the scheduler is better able to achieve good balance, as shown in Figure 16-3. Too many threads, however, incur overheads in creating and destroying them, in memory for stack space, and in other system resources. You may wish to experiment with different strategies for choosing the right number of threads; your experiments need to consider both the number of processors available and the amount of work required for each thread (sometimes called the grain size). Allocating a few more threads than the number of processors is a good starting point. The thread scheduler will time-slice runnable threads, which helps avoid this problem, in effect replacing the grain size of the computation with that of a scheduling quantum.
How does our algorithm scale? That is, does the program speed up in proportion to the number of processors available to share the work? Of course, if the matrices to be multiplied are small, the overheads of creating and joining threads will exceed the computational work. Indeed, a general-purpose routine would test the matrix size and use a conventional serial implementation for small matrices. But for large matrices, we should expect the program to scale well, in part because there is no slave-to-slave synchronization required and because the data sharing is modest.
Consider the effects of caching parts of the result matrix, c, which is shared among all the slaves. The program above allocates to each slave responsibility for computing a certain number of rows of the result matrix. The matrix is in row-major order, which would tend to put nearby elements in the same row in the same cache line, and therefore keeps the chances of false sharing to a minimum.
`
`
`
`
`
`
[Figure 16-3: Master-Slave with Scheduling Delay and Extra Threads. Timeline of the slave threads, each shown as blocked, runnable, or active over time.]
Compilers sometimes use the master-slave paradigm to automatically parallelize loops within a program. Typically a parallelizing compiler will generate code such that the main flow of the computation is controlled by the master. When the master encounters a parallelizable loop, it divides up the loop’s work and gives it to a set of slaves. When the slaves finish, the master continues with the main computation. These techniques are illustrated more fully in “Parallelizing Using Compiler Techniques” on page 311.
`
Barrier Synchronization
The key element of the master-slave paradigm is that every slave thread must complete its task before the computation can proceed. Rather than requiring each thread to exit in order to synchronize, we can allow a pool of threads to remain ready to work on subsequent computations, as shown in Figure 16-4. Note that each thread must synchronize to start the computation and to end it. When each thread resumes execution, it can retain its state and local variables, which is not possible when threads exit and are recreated.
A synchronization structure that conveniently implements this pattern is called a barrier. A barrier is initialized by specifying how many threads share the barrier; in the master-slave paradigm, this number is the number of slaves plus one for the master. In the example of Figure 16-4, five threads share the barrier, one master and four slaves. The main operation on a barrier is barrier_wait()
`
`
`
`
[Figure 16-4: Master-Slave with Slave Pool. Timeline of one master and four slaves alternating between sections A and B, each thread shown as blocked, runnable, or active.]
which will block until the number of threads specified in the initialization are waiting at the barrier. When the last thread arrives at the barrier, all the threads are released. The master-slave paradigm then becomes:
Master:
    Initialization
    bar = barrier_init(n+1);
    for (i = 0; i < n; i++)
        pthread_create(..., slave, ...);
    /* Slaves are now running */
    while (not done) {
        A: Fill in work for slave to do
        barrier_wait(bar);
        /* Slaves are now working */
        B: Master processing if any
        /* Wait for slaves to finish */
        barrier_wait(bar);
        A: Process results
    }
    for (i = 0; i < n; i++)
        pthread_join(...);
    Finalization if any

Slave:
    slave(void *params);
    while (not done) {
        /* A: wait for work */
        barrier_wait(bar);
        B: Slave processing: data in global memory or params
            tells what to do. Put results into global memory
            or params
        /* Wait till all are done */
        barrier_wait(bar);
    }
    pthread_exit();
`
This paradigm is structured so that the slaves are in a loop in which two calls to barrier_wait() divide the loop into two sections we have labeled A and B. During section A, the master is running and sets up work for the slaves to do; this section corresponds to the region where the master thread is active in Figure 16-4. When the master and all n slaves have entered the barrier by calling barrier_wait(), the barrier releases all the threads and all return from the
`
`
`
`
`
call. Now both master and slaves are in section B, where the slaves perform work, and the master may be idle. (However, this structure makes it easy for the master thread to assume a share of the work during section B, in effect participating as another slave.) At the end of section B, the master and slaves call barrier_wait() again; now all threads are in section A and the process repeats.
The barriers provide the synchronization necessary to avoid data races between the master and the slaves. Data written by the master or slaves during one section can be safely read by the master and all slaves in any subsequent section, because barrier_wait() not only contains the necessary synchronization function that forces data writes to memory, but also sequences all of the threads so that a reader of shared data can know that the data was written during a previous section. The template shown above uses this synchronization so that the master can safely pass work parameters to the slaves and the slaves can safely pass results back to the master. For example, the master could specify a different matrix multiplication task each time, or could use a more general parameter record to specify different slave tasks with each invocation. This design allows a pool of slave threads, waiting in the barrier, to be used by the master in a wide variety of ways at different times during the execution of the program.
An implementation of a symmetric summing barrier is shown below. The design is symmetric in that master and slave threads are not distinguished. An asymmetric barrier, in which master and slaves are treated differently, is more appropriate in the special case where the master and slaves truly alternate processing (i.e., the master does no work in section B and the slaves do no work in section A). In this case, you will notice that both master and slaves make two immediately adjacent calls to barrier_wait() with no intervening computation; these can be collapsed into a single call that requires less synchronization overhead.
The summing nature of the barrier allows all the threads to exchange an important piece of information as part of the synchronization at the barrier. Each caller of barrier_wait() specifies an integer. When barrier_wait() returns, it provides a result that is the sum of all the values specified in the calls to barrier_wait() by threads entering the barrier. This feature is not used in the template shown above, but will be important in a more complex example that follows. It’s obvious how to remove the code used for summing if you don’t need it.
`
`
`
`
`
typedef struct barrier_struct {
    pthread_mutex_t lock;   /* Mutex lock for the entire structure */
    int n_clients;          /* Number of threads to wait for at barrier */
    int n_waiting;          /* Number of threads have called barrier_wait */
    int phase;              /* Flag to separate waiters from fast workers */
    int sum;                /* Sum of arguments passed to barrier_wait */
    int result;             /* Answer to be returned by barrier_wait */
    pthread_cond_t wait_cv; /* Clients wait on condition var. to proceed */
} *barrier_t;

/* Create & initialize a barrier with the given number of client threads */
barrier_t
barrier_init(int n_clients)
{
    barrier_t barrier =
        (barrier_t) malloc(sizeof (struct barrier_struct));
    if (barrier != NULL) {
        barrier->n_clients = n_clients;
        barrier->n_waiting = 0; barrier->phase = 0; barrier->sum = 0;
        pthread_mutex_init(&barrier->lock, NULL);
        pthread_cond_init(&barrier->wait_cv, NULL);
    }
    return (barrier);
}

/* Destroy a barrier */
void
barrier_destroy(barrier_t barrier)
{
    pthread_mutex_destroy(&barrier->lock);
    pthread_cond_destroy(&barrier->wait_cv);
    free(barrier);
}
`
`
`
`
`
`
/* Wait until the required number of threads enter the barrier */
int
barrier_wait(barrier_t barrier, int increment)
{
    int my_phase;

    pthread_mutex_lock(&barrier->lock);
    my_phase = barrier->phase;
    barrier->sum += increment;
    barrier->n_waiting++;
    if (barrier->n_waiting == barrier->n_clients) {
        barrier->result = barrier->sum;
        barrier->sum = 0;
        barrier->n_waiting = 0;
        barrier->phase = 1 - my_phase;
        pthread_cond_broadcast(&barrier->wait_cv);
    }
    /* Wait for the end of this synchronization phase */
    while (barrier->phase == my_phase) {
        pthread_cond_wait(&barrier->wait_cv, &barrier->lock);
    }
    pthread_mutex_unlock(&barrier->lock);
    return (barrier->result);
}
Code Example 16-2 Symmetric Summing Barrier
There is a subtle point in the design of the barrier in the use of the phase element, which can have two values, 0 or 1. When the final thread arrives at the barrier it flips the value of phase. This tells the other threads waiting at the barrier that the current phase of waiting has ended and the next has begun. A fast thread may wake up and hit the next use of the barrier before a slower thread wakes up and tests the condition. A slow thread can never see the value of phase flip twice while it is waiting since only the final thread arriving at the barrier will change the value of phase.
`
Master-Slave Example: Relaxation
Barrier synchronization can be used to enforce sequencing among slaves to avoid data races when one slave must write data that another slave must read. The matrix multiplication example does not show this form of data dependency: the work of each thread is independent of that of the other threads. This section presents a more general example that uses the symmetric summing barrier to coordinate slave dependencies.
The example is a relaxation algorithm or “Jacobi iteration” of the form often used to solve Laplace’s equation to determine, for example, the steady-state temperature over a strangely shaped body when the temperature of its boundaries is held fixed. The algorithm approaches a solution iteratively; in each iteration, the temperature of a point is computed to be the average of the
`
`
`
temperatures of its four neighbors, except that temperatures at the boundary are not changed. If T_i(x, y) is the temperature at location (x, y) after the i-th iteration of the algorithm, assuming that space has been quantized into a two-dimensional grid, then the algorithm computes the temperature values for the next iteration as follows:
T_{i+1}(x, y) = (T_i(x-1, y) + T_i(x+1, y) + T_i(x, y-1) + T_i(x, y+1)) / 4
The algorithm below uses a matrix t (the matrix_t type introduced in the matrix multiplication example) to represent the solution matrix. It identifies boundary points using another matrix, boundary. On each iteration, it computes a new solution matrix and tests for convergence.
void
relax(matrix_t *boundary, matrix_t *t)
{
    matrix_t *t_old = t;
    matrix_t *t_new = matrix_alloc(t->n_rows, t->n_cols);
    matrix_t *temp;
    int row, col, converged;

    do {
        converged = TRUE;
        for (row = 0; row < boundary->n_rows; row++) {
            for (col = 0; col < boundary->n_cols; col++) {
                if (MREF(boundary, row, col) == 0) {
                    /* Not on boundary */
                    float v = (MREF(t_old, row-1, col) +
                        MREF(t_old, row+1, col) +
                        MREF(t_old, row, col-1) +
                        MREF(t_old, row, col+1)) / 4.0;
                    if (fabs(v - MREF(t_old, row, col)) > 0.000001)
                        converged = FALSE;
                    MREF(t_new, row, col) = v;
                } else {
                    MREF(t_new, row, col) = MREF(t_old, row, col);
                }
            }
        }
        if (converged == TRUE && t_new == t) break;
        /* Swap old and n