Microsoft Corp. Exhibit 1057
Programming with Threads

Steve Kleiman
Devang Shah
Bart Smaalders

SunSoft Press
A Prentice Hall Title

© 1996 Sun Microsystems, Inc. — Printed in the United States of America.
2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A.

All rights reserved. This book is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this book may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.

Portions of the products described in this book may be derived from the UNIX® and Berkeley 4.3 BSD systems, licensed from UNIX System Laboratories, Inc., a wholly owned subsidiary of Novell, Inc., and the University of California, respectively. Third-party font software in this product is protected by copyright and licensed from Sun's font suppliers.

RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States government is subject to restrictions as set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR 52.227-19.

The products described in this book may be protected by one or more U.S. patents, foreign patents, or pending applications.

TRADEMARKS — Sun, Sun Microsystems, the Sun logo, SunSoft, Solaris, Solaris Sunburst Design, OpenWindows, ONC, ONC+, SunOS, AnswerBook, Sun FORTRAN, Wabi, ToolTalk, NFS, XView, SunView, and The Network is the Computer are trademarks or registered trademarks of Sun Microsystems, Inc. UNIX is a registered trademark in the United States and other countries exclusively licensed through X/Open Company, Ltd. OPEN LOOK® is a registered trademark of Novell, Inc. Adobe, PostScript, Display PostScript, and PhotoShop are trademarks or registered trademarks of Adobe Systems Incorporated. PowerPC is a trademark of International Business Machines Corporation. Xenix, Microsoft Windows, and Windows NT are trademarks or registered trademarks of Microsoft Corporation. All other product names mentioned herein are the trademarks of their respective owners.

SuperSPARC and all SPARC trademarks, including the SCD Compliant Logo, are trademarks or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver, SPARCengine, SPARCworks, SPARCworks iMPact, and SPARCompiler are licensed exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.

X Window System is a trademark of X Consortium, Inc.

The publisher offers discounts on this book when ordered in bulk quantities. For more information, contact: Corporate Sales Department, Prentice Hall PTR, One Lake Street, Upper Saddle River, NJ 07458. Phone: 800-382-3419 or 201-236-7156, Fax: 201-236-7141, email: corpsales@prenhall.com

Cover designer: M & K Design, Palo Alto, California
Manufacturing manager: Alexis R. Heydt
Acquisitions editor: Gregory G. Deench
Cover Design Director: Jerry Votta
Production Supervisor: Joanne Anzalone

10 9 8 7 6 5 4 3 2 1

ISBN 0-13-172389-8

SunSoft Press
A Prentice Hall Title

16
Parallel Computation
One of the most powerful uses of threads is to speed up the execution of computationally intensive programs on shared-memory multiprocessors. Unfortunately, effectively applying threads in this environment usually requires a deep understanding of the structure of the algorithm and of how a proposed parallelized version of the algorithm is affected by the overheads of threads and the shared-memory multiprocessor environment. Fortunately, many different algorithms have similar structures. This chapter covers in more detail a number of the thread paradigms introduced in Chapter 7, "Using Threads," that are useful in many parallel computation situations. Detailed templates and examples are presented.
Using Threads to Parallelize Algorithms

Threads running on separate processors in a shared-memory multiprocessor allow you to use "parallel processing algorithms" in your program. Unlike the other uses of threads described in this book, using threads to implement parallel algorithms can be frustrating:
• There are lots of techniques for "parallelizing" your program. How do you choose one that's not too hard to program and that offers substantial speedups compared to uniprocessor execution? Does the performance of the technique scale up in proportion to the number of processors you use?
• The overheads involved in synchronizing threads and sharing data among multiple processors may actually reduce the performance of your program. How can you anticipate and mitigate these problems?
• Like many performance improvements, parallelizing increases the complexity of your program. How can you be sure it's still correct?
These are all tough problems: we do not yet know how to solve an arbitrary problem efficiently on a multiprocessor of arbitrary size. This section does not offer universal solutions, but tries to demonstrate some simple ways to get started. By sticking with some common "paradigms" for parallel algorithms and threads, you can avoid a lot of errors and aggravation.
`
Though it may seem simplistic, the most important step in writing a parallel program is to think carefully about the global structure of your program and the computing structures that threads offer. To speed up the program, we're looking for a way to divide the work into a set of tasks so that:
• The tasks interact little with each other;
• The data shared by separate tasks is contained in a minimum of simple data structures that can be protected by locking mechanisms to prevent unsynchronized access;
• The number of tasks can be varied so as to match the number of processors;
• All tasks have equal computing requirements, or, instead, are configured in such a way that the set of tasks can keep all the processors fairly busy.
As we've seen, Amdahl's Law (Figure 7-3 on page 103) sets limits on the scalability of parallelized computations. Scalability is limited by three kinds of overhead: synchronization overhead, contention, and balance. The synchronization operations required for correct multithreaded operation take time to execute, even when there is no contention. When two or more threads share data or locks, the system slows down due to contention for shared resources. And finally, balance refers to the ability of the algorithm to divide work evenly among the available processors. In the "serial" sections of your program, where only a single processor is used, balance is worst. Poor balance is often the result of dividing the work into large, unequal chunks: when the smaller chunks finish, they leave processors idle. If there are always at least as many runnable threads as available processors, the thread scheduler can keep the processors busy. While balance can be improved by dividing work into many small chunks that can be easily scheduled onto idle processors, making the grain size of the processing chunks small is usually accompanied by an increase in synchronization overhead, which hurts scalability.
Thread Paradigms

There are many ways to make your program parallel. Some techniques are very complex, depend on complicated data structures and locking strategies, depend on clever non-obvious algorithms, or require compiler support. We present a different approach in this section: three simple control structures, or paradigms, that can be applied to a wide range of applications.

Each paradigm can be characterized by:
• How the work is divided among parallel threads, and whether each thread executes the same code;
• How the threads are synchronized;
• What data is shared among the threads and how it is protected by locks or other means to avoid data races.
The simplest paradigm is the master-slave. The main master thread launches a set of slave threads and allocates to each slave a portion of the work to be done; the amount of work is known in advance by the master and divided evenly among the slaves. The master starts the slaves and waits for all of them to reach a synchronization point, or barrier. The master may then compute and release the slaves again if needed. This pattern is repeated until the work is done. Example: a loop requiring 1,000 iterations is divided among four threads, each of which does 250 iterations.
Another versatile paradigm is the workpile, in which a set of worker threads request chunks of work to do from a "workpile," usually some form of queue. As part of the work, a worker may cause additional entries to be added to the workpile. The pattern terminates when the workpile is finally emptied; alternatively, a worker thread may choose to terminate the process. Example: a search algorithm uses a workpile of promising paths that are extended by worker threads.
The final paradigm is the pipeline, in which a task is passed to a succession of threads, each of which performs part of the overall work required; the threads are much like workers on an assembly line. In most cases, the processing in each pipeline stage is different, but there are applications for homogeneous pipelines in which the code executed in each stage is the same. Example: a graphics rendering pipeline in which stages successively transform, clip, and render geometry to form an image.
Each of these three paradigms is explained in more detail in the following sections, and each is illustrated with one or more examples.
Master-Slave Paradigm

The simplest paradigm for parallel algorithms is the "master-slave" arrangement. A single master thread, usually the main thread of your program, partitions work evenly among a set of slave threads. The slave threads are launched together, and each one is assigned to a separate partition of work. The master thread waits for all slave threads to complete their work and then continues.
Figure 16-1 Master-Slave Thread Execution (timeline of the master and slave threads, each shown as blocked, runnable, or active)
This pattern is illustrated by the following template:

Master:

    Initialization
    for (i = 0; i < n; i++)
        pthread_create(..., slave, ...);
    /* slaves are now running */
    Master processing if any
    for (i = 0; i < n; i++)
        pthread_join(...);
    Finalization if any

Slave:

    slave(void *params);
    Slave processing: index i (in params)
    tells each slave what work to do.
    Put results into global memory or params.
    pthread_exit();
This simple kind of master-slave paradigm is used when the amount of work to be done is known in advance and when it's easy to partition the work into n roughly equal parts that don't depend on each other. The most common application is to a loop, where m iterations of the loop are partitioned among the n threads. Usually m is much greater than n, so each thread invocation executes quite a few iterations of the loop. A key requirement is that each iteration of the loop must be independent of every other iteration of the loop; thus no synchronization is required among the slaves. Here are some examples where this method applies:
• Multiplying two matrices. Each row in the product matrix can be computed by a separate slave. The result for each row depends only on the values in the matrices being multiplied, and not on the results of other rows.
• Transform-coding an image, for example in computing the discrete cosine transform (DCT) of every 8x8 pixel block in an image that is 640 by 480 pixels. There are m = 4,800 blocks to be done, and the blocks are independent of each other.
• Computing the cross-correlation of two signals, e.g., two images, by iterating over all samples in the signal, computing the mean-square distance between the sample and its correlate, and summing the distances. This example computes a running sum over the iterations of the loop. In the master-slave paradigm, each slave will compute a sum for the iterations it is handling, and store the sum in the thread's parameter record. After all slave threads have finished, the master thread computes the final sum by summing the entries in the parameter records of all threads.
Master-Slave Example: Matrix Multiply

To illustrate the master-slave paradigm, we present an example of a matrix multiply routine that also computes the trace of the resulting matrix. The trace is the sum of the diagonal elements. The outermost loop, which iterates over rows in the result matrix, is partitioned among the slave threads.
#define MAX_PARALLEL 20     /* maximum number of threads to use */
#define SCHED_FUDGE  1.2    /* Threads/processor ratio (see text) */

#define MREF(mt, row, col)  mt->m[(row) * mt->n_cols + (col)]

typedef struct {            /* Matrix of variable size */
    int n_rows;             /* # rows in matrix: 0 <= row < n_rows */
    int n_cols;             /* # columns in matrix: 0 <= col < n_cols */
    float m[1];             /* MREF(m,row,col) elt in row-major order */
} matrix_t;

typedef struct {            /* parameter record for slave procedure */
    int i;                  /* index of slave thread 0 <= i < n */
    int n;                  /* number of slave threads */
    matrix_t *a, *b, *c;    /* compute c = a*b */
    float trace_ans;        /* trace contribution for this part of matrix */
} mm_params;
/*
 * This is the "slave procedure" that each slave thread will execute.
 * Each thread is passed a structure of type mm_params; these parameter
 * records differ only in the value of index i for each slave.
 */
void *
matrix_multiply_slave(void *param)
{
    mm_params *p = (mm_params *)param;
    matrix_t *a = p->a;
    matrix_t *b = p->b;
    matrix_t *c = p->c;
    int row, col, j;        /* c[row,col] is what is being computed */
    float trace = 0.0;

    /*
     * Calculate which rows the p->i th thread will process. Since
     * the number of threads may not divide m = c->n_rows exactly, we
     * adopt the convention that for index 0 <= i < m%n, the thread
     * will process floor(m/n)+1 rows, while all others will process
     * floor(m/n) rows.
     */
    int quot = c->n_rows / p->n;
    int rem = c->n_rows % p->n;
    int do_rows = quot + ((p->i < rem)? 1 : 0);
    int first_row = quot * p->i + ((p->i < rem)? p->i : rem);

    for (row = first_row; row < first_row + do_rows; row++) {
        for (col = 0; col < c->n_cols; col++) {
            /* Compute c[row,col] */
            float sum = 0.0;
            for (j = 0; j < a->n_cols; j++)
                sum += MREF(a, row, j) * MREF(b, j, col);
            MREF(c, row, col) = sum;
            if (row == col)
                trace += sum;
        }
    }
    /* Record partial trace answer in parameter record */
    p->trace_ans = trace;
    return (NULL);
}
/* Compute c = a * b, return trace of c */
float
matrix_multiply_trace(matrix_t *a, matrix_t *b, matrix_t *c)
{
    mm_params params[MAX_PARALLEL];
    pthread_t thread[MAX_PARALLEL];
    int thread_present[MAX_PARALLEL];
    int n_threads = min(MAX_PARALLEL,
        (int)(sysconf(_SC_NPROCESSORS_ONLN) * SCHED_FUDGE));
    int i, res;
    float trace = 0.0;

    for (i = 0; i < n_threads; i++) {
        thread_present[i] = TRUE;
        params[i].i = i;
        params[i].n = n_threads;
        params[i].a = a;
        params[i].b = b;
        params[i].c = c;
        res = pthread_create(&thread[i],
            NULL, matrix_multiply_slave, (void *)&params[i]);
        if (res != 0) {
            /* flag no thread created */
            thread_present[i] = FALSE;
            matrix_multiply_slave(&params[i]);  /* just do it */
        }
    }
    for (i = 0; i < n_threads; i++) {
        if (thread_present[i] != FALSE)
            pthread_join(thread[i], NULL);  /* wait for thread to exit */
        trace += params[i].trace_ans;
    }
    return (trace);
}

Code Example 16-1 Master-Slave Matrix Multiply
The matrix_multiply_trace() routine shown above correctly computes the product of the two matrices (provided no arithmetic exceptions occur). Although the code does not show the reasoning, it has been constructed carefully to avoid data races. Let's categorize the data that is shared among the threads, and argue in each case that there are no data races.

1. Master->slave.
The matrices a and b, and the parameter records, are all written by the master thread and read by the slave threads. In the absence of synchronization, this would cause a data race. But all writing by the master is separated from reading by a slave with a call to pthread_create(), a function that ensures all writes have propagated to main memory (see "Data Races" on page 30). So all slaves will see the values written by the master.

2. Slave->slave.
In this example, no slave writes a memory location that another slave reads. Each slave is writing a separate region of the result matrix, c, and the partial results of the trace computation are stored by each slave in a separate memory location. Thus there is no need for locks or any other mechanism for synchronizing the slaves to each other; they are independent of one another.

3. Slave->master.
The slaves write the result matrix, c, and the trace_ans entry in the parameter record corresponding to each slave. The master refrains from reading this data until after executing pthread_join(), a function that ensures that all writes initiated by the exiting thread have propagated to main memory (see "Data Races" on page 30). So the master will correctly see the values written by the slaves.
Note one other feature of the program: we have been careful to deal with the case when a thread cannot be created. There's a simple way to ensure a correct result when this happens: the master thread calls the slave procedure directly. Performance will suffer, but the result will be correct. Note too that the main thread is idle while the slaves work; an alternative would be to use the main thread to do part of the work as well as the slaves.

The program is modular about its use of threads: every thread that is created is also explicitly destroyed by waiting at pthread_join(). The routine is thus reentrant, i.e., several separate threads could call matrix_multiply_trace(), and each call would create a set of slave threads that would be correctly managed by their respective master.

Note the use of stack allocation for the slave parameter records used by the slave threads. The example is careful to ensure that all of the threads that reference the parameter records exit before the procedure that allocated the stack storage (matrix_multiply_trace()) exits.

How many slave threads should matrix_multiply_trace() create? To exploit parallelism to the maximum extent, we should use at least as many threads as there are processors available. But if we use exactly one thread per processor a balance problem may reduce performance. To see why, consider what happens on a four-processor system running four matrix multiply threads when the kernel scheduler decides to run another job on one of the processors, as shown in Figure 16-2. One thread now becomes delayed relative to the other three, causing the entire parallel section, when all slaves are expected to be running, to be delayed. During some of this time, several processors may be idle. So the ideal balance, in which all processors are busy, is not achieved.
Figure 16-2 Master-Slave with Scheduling Delay (timeline of the slave threads, each shown as blocked, runnable, or active)
If the computation is broken up into a larger number of smaller parts, the scheduler is better able to achieve good balance, as shown in Figure 16-3. Too many threads, however, incur overheads in creating and destroying them, in memory for stack space, and in other system resources. You may wish to experiment with different strategies for choosing the right number of threads; your experiments need to consider both the number of processors available and the amount of work required for each thread (sometimes called the grain size). Allocating a few more threads than the number of processors is a good starting point. The thread scheduler will time-slice runnable threads, which helps avoid this problem, in effect replacing the grain size of the computation with that of a scheduling quantum.
How does our algorithm scale? That is, does the program speed up in proportion to the number of processors available to share the work? Of course, if the matrices to be multiplied are small, the overheads of creating and joining threads will exceed the computational work. Indeed, a general-purpose routine would test the matrix size and use a conventional serial implementation for small matrices. But for large matrices, we should expect the program to scale well, in part because there is no slave-to-slave synchronization required and because the data sharing is modest.

Consider the effects of caching parts of the result matrix, c, which is shared among all the slaves. The program above allocates to each slave responsibility for computing a certain number of rows of the result matrix. The matrix is in row-major order, which tends to put nearby elements in the same row in the same cache line, and therefore keeps the chances of false sharing to a minimum.
Figure 16-3 Master-Slave with Scheduling Delay and Extra Threads (timeline of the slave threads, each shown as blocked, runnable, or active)
Compilers sometimes use the master-slave paradigm to automatically parallelize loops within a program. Typically a parallelizing compiler will generate code such that the main flow of the computation is controlled by the master. When the master encounters a parallelizable loop, it divides up the loop's work and gives it to a set of slaves. When the slaves finish, the master continues with the main computation. These techniques are illustrated more fully in "Parallelizing Using Compiler Techniques" on page 311.
Barrier Synchronization

The key element of the master-slave paradigm is that every slave thread must complete its task before the computation can proceed. Rather than requiring each thread to exit in order to synchronize, we can allow a pool of threads to remain ready to work on subsequent computations, as shown in Figure 16-4. Note that each thread must synchronize to start the computation and to end it. When each thread resumes execution, it can retain its state and local variables, which is not possible when threads exit and are recreated.

A synchronization structure that conveniently implements this pattern is called a barrier. A barrier is initialized by specifying how many threads share the barrier; in the master-slave paradigm, this number is the number of slaves plus one for the master. In the example of Figure 16-4, five threads share the barrier, one master and four slaves.

Figure 16-4 Master-Slave with Slave Pool (timeline of the master and four slaves alternating between sections A and B, each thread shown as blocked, runnable, or active)

The main operation on a barrier is barrier_wait(), which will block until the number of threads specified in the initialization are waiting at the barrier. When the last thread arrives at the barrier, all the threads are released. The master-slave paradigm then becomes:
Master:

    Initialization
    bar = barrier_init(n+1);
    for (i = 0; i < n; i++)
        pthread_create(..., slave, ...);
    /* Slaves are now running */
    while (not done) {
        A: Fill in work for slaves to do
        barrier_wait(bar);
        /* Slaves are now working */
        B: Master processing if any
        /* Wait for slaves to finish */
        barrier_wait(bar);
        A: Process results
    }
    for (i = 0; i < n; i++)
        pthread_join(...);
    Finalization if any

Slave:

    slave(void *params);
    while (not done) {
        /* A: wait for work */
        barrier_wait(bar);
        B: Slave processing: data in
        global memory or params tells
        what to do. Put results into
        global memory or params.
        /* Wait till all are done */
        barrier_wait(bar);
    }
    pthread_exit();
This paradigm is structured so that the slaves are in a loop in which two calls to barrier_wait() divide the loop into two sections we have labeled A and B. During section A, the master is running and sets up work for the slaves to do; this section corresponds to the region where the master thread is active in Figure 16-4. When the master and all n slaves have entered the barrier by calling barrier_wait(), the barrier releases all the threads and all return from the call. Now both master and slaves are in section B, where the slaves perform work, and the master may be idle. (However, this structure makes it easy for the master thread to assume a share of the work during section B, in effect participating as another slave.) At the end of section B, the master and slaves call barrier_wait() again; now all threads are in section A and the process repeats.

The barriers provide the synchronization necessary to avoid data races between the master and the slaves. Data written by the master or slaves during one section can be safely read by the master and all slaves in any subsequent section, because barrier_wait() not only contains the necessary synchronization function that forces data writes to memory, but also sequences all of the threads so that a reader of shared data can know that the data was written during a previous section. The template shown above uses this synchronization so that the master can safely pass work parameters to the slaves and the slaves can safely pass results back to the master. For example, the master could specify a different matrix multiplication task each time, or could use a more general parameter record to specify different slave tasks with each invocation. This design allows a pool of slave threads, waiting in the barrier, to be used by the master in a wide variety of ways at different times during the execution of the program.
An implementation of a symmetric summing barrier is shown below. The design is symmetric in that master and slave threads are not distinguished. An asymmetric barrier, in which master and slaves are treated differently, is more appropriate in the special case where the master and slaves truly alternate processing (i.e., the master does no work in section B and the slaves do no work in section A). In this case, you will notice that both master and slaves make two immediately adjacent calls to barrier_wait() with no intervening computation; these can be collapsed into a single call that requires less synchronization overhead.

The summing nature of the barrier allows all the threads to exchange an important piece of information as part of the synchronization at the barrier. Each caller of barrier_wait() specifies an integer. When barrier_wait() returns, it provides a result that is the sum of all the values specified in the calls to barrier_wait() by threads entering the barrier. This feature is not used in the template shown above, but will be important in a more complex example that follows. It's obvious how to remove the code used for summing if you don't need it.
typedef struct barrier_struct {
    pthread_mutex_t lock;   /* Mutex lock for the entire structure */
    int n_clients;          /* Number of threads to wait for at barrier */
    int n_waiting;          /* Number of threads that have called barrier_wait */
    int phase;              /* Flag to separate waiters from fast workers */
    int sum;                /* Sum of arguments passed to barrier_wait */
    int result;             /* Answer to be returned by barrier_wait */
    pthread_cond_t wait_cv; /* Clients wait on condition var. to proceed */
} *barrier_t;

/* Create & initialize a barrier with the given number of client threads */
barrier_t
barrier_init(int n_clients)
{
    barrier_t barrier =
        (barrier_t) malloc(sizeof (struct barrier_struct));

    if (barrier != NULL) {
        barrier->n_clients = n_clients;
        barrier->n_waiting = 0; barrier->phase = 0; barrier->sum = 0;
        pthread_mutex_init(&barrier->lock, NULL);
        pthread_cond_init(&barrier->wait_cv, NULL);
    }
    return (barrier);
}

/* Destroy a barrier */
void
barrier_destroy(barrier_t barrier)
{
    pthread_mutex_destroy(&barrier->lock);
    pthread_cond_destroy(&barrier->wait_cv);
    free(barrier);
}
/* Wait until the required number of threads enter the barrier */
int
barrier_wait(barrier_t barrier, int increment)
{
    int my_phase;

    pthread_mutex_lock(&barrier->lock);
    my_phase = barrier->phase;
    barrier->sum += increment;
    barrier->n_waiting++;
    if (barrier->n_waiting == barrier->n_clients) {
        barrier->result = barrier->sum;
        barrier->sum = 0;
        barrier->n_waiting = 0;
        barrier->phase = 1 - my_phase;
        pthread_cond_broadcast(&barrier->wait_cv);
    }
    /* Wait for the end of this synchronization phase */
    while (barrier->phase == my_phase) {
        pthread_cond_wait(&barrier->wait_cv, &barrier->lock);
    }
    pthread_mutex_unlock(&barrier->lock);
    return (barrier->result);
}

Code Example 16-2 Symmetric Summing Barrier
There is a subtle point in the design of the barrier in the use of the phase element, which can have two values, 0 or 1. When the final thread arrives at the barrier, it flips the value of phase. This tells the other threads waiting at the barrier that the current phase of waiting has ended and the next has begun. A fast thread may wake up and hit the next use of the barrier before a slower thread wakes up and tests the condition. A slow thread can never see the value of phase flip twice while it is waiting, since only the final thread arriving at the barrier will change the value of phase.
`
`Master-Slave Example: Relaxation
`Barrier synchronization can be used to enforce sequencing among slaves to avoid
`data races when one slave must write data that another slave must read. The
`matrix multiplication example does not show this form of data dependency: the
`work of each thread is independentof that of the other threads. This section
`presents a more general example that uses the symmetric summingbarrier to
`coordinate slave dependencies.
The example is a relaxation algorithm or “Jacobi iteration” of the form often used to solve Laplace’s equation to determine, for example, the steady-state temperature over a strangely shaped body when the temperature of its boundaries is held fixed. The algorithm approaches a solution iteratively; in each iteration, the temperature of a point is computed to be the average of the
temperatures of its four neighbors, except that temperatures at the boundary are not changed. If T_i(x, y) is the temperature at location (x, y) after the i-th iteration of the algorithm, assuming that space has been quantized into a two-dimensional grid, then the algorithm computes the temperature values for the next iteration as follows:

T_{i+1}(x, y) = (T_i(x - 1, y) + T_i(x + 1, y) + T_i(x, y - 1) + T_i(x, y + 1)) / 4

The algorithm below uses a matrix t (the matrix_t type introduced in the matrix multiplication example) to represent the solution matrix. It identifies boundary points using another matrix, boundary. On each iteration, it computes a new solution matrix and tests for convergence.
void
relax(matrix_t *boundary, matrix_t *t)
{
    matrix_t *t_old = t;
    matrix_t *t_new = matrix_alloc(t->n_rows, t->n_cols);
    matrix_t *temp;
    int row, col, converged;

    do {
        converged = TRUE;
        for (row = 0; row < boundary->n_rows; row++) {
            for (col = 0; col < boundary->n_cols; col++) {
                if (MREF(boundary, row, col) == 0) {
                    /* Not on boundary */
                    float v = (MREF(t_old, row - 1, col) +
                        MREF(t_old, row + 1, col) +
                        MREF(t_old, row, col - 1) +
                        MREF(t_old, row, col + 1)) / 4.0;
                    if (fabs(v - MREF(t_old, row, col)) > 0.000001)
                        converged = FALSE;
                    MREF(t_new, row, col) = v;
                } else {
                    MREF(t_new, row, col) = MREF(t_old, row, col);
                }
            }
        }
        if (converged == TRUE && t_new == t) break;
        /* Swap old and n
