`
`8.8 NONBLOCKING POINT-TO-POINT COMMUNICATION
`
`279
`
`8.8 NONBLOCKING POINT-TO-POINT COMMUNICATION
`The point-to-point communication calls introduced in Sections 8.5.1 and 8.5.3, MPI_Send and
`MPI_Recv, do not return from the respective function call until the send and receive operations have
`completed. While this ensures that the send and receive buffers used in the MPI_Send and MPI_Recv
`arguments are safe to use or reuse after the function call, it also means that unless there is a simul(cid:173)
`taneously matching send for each receive, the code will deadlock, resulting in the code hanging. This
`common type of bug is examined in Chapter 14. One way to avoid this is by using nonblocking point(cid:173)
`to-point communication.
`Nonblocking point-to-point communication returns immediately from the function call before
`confirming that the send or the receive has completed. These nonblocking calls are MPI_Isend and
`MPI_Irecv. They are used coupled with MPI_ Wait, which will wait until the operation is completed.
`When querying whether a nonblocking point-to-point communication has completed, MPI_Test is
`often paired with MPl_lsend and MPI_Irecv. Nonblocking point-to-point calls can simplify code
`development to avoid such deadlocks more easily and also potentially enable the overlap of useful
`computation while checking to see if the communication has completed.
`The syntax of each of these calls is the same as for the blocking calls except for the addition of a
`request argument and the elimination of the status output in the MPI_Recv arguments.
`
`int MPI_I1end (void •message, i~t count, MPI_Datatype datatype. int desti int tag,
`MPI_Comm comm. MPI_Request •send_request)
`
`int MPI_Irecv (void *message, tnt count, MPI_Datatype datatype, tnt Source~ int tag,
`MPI_Comm comm, MPI_Request •recv_requestl
`
`Because both MPl_lsend and MPI_Irecv return immediately after calling without confirming that
`the message-passing operations have completed, the application user needs a way to specify when
`these operations must complete. This is done with MPI_ Wait:
`
`int MPI_Wait(MPI_Request •request, MPI_Status •status)
`
`When MPI_ Wait is called, the nonblocking request originating from MPl_lsend or MPI_Irecv is
`provided as an argument. The status that was previously provided directly from MPI_Recv is now
`supplied as an output from MPI_ Wait.
`Similar to MPI_ Wait, MPI_Test can be paired with an MPI_lsend or MPI_Irecv call to query
`whether the message passing has completed while performing other work. MPI_ Test shares similar
`syntax to MPI_ Wait, adding only a flag that is set to true if the request being queried has completed.
`
`int MPCTest(MPI.:..Request Hequest, int •flag, MPI_Status •status l
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 301
`
`
`
`280 CHAPTER 8 THE ESSENTIAL MPI
`
`An example of using nonblocking communication is presented in Code 8.12. In this example, the
`send commands are issued first, followed by the receive commands. If using blocking communication
`and sending a sufficiently large message this would normally result in a deadlock, but nonblocking
`communication avoids this pitfall.
`
`1 #include <stdlib.h>
`2 #include <stdio.h>
`3 #include <mpi .h>
`4
`
`,,,.I
`
`}else{
`
`a=667;
`
`int main(int argc, char* argv[J) {
`int a, b;
`int size. rank;
`int tag= O; // Pick a tag arbitrarily
`8
`9 MPI_Status status;
`10 MPI_Request send_request, recv_request;
`11
`12 MPI_Init(&argc, &argv);
`13 MPI_Comm_size(MPI_COMM_WORLD, &size);
`14 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
`15
`16 if (size !=2) {
`17
`printf("Example is designed for 2 processes\n");
`18 MPI_Finalize();
`19
`exit(O);
`20
`21 if (rank=O) {
`22
`a=314159; //Value picked arbitrarily
`23
`MP I_! send (&a, 1, MP!_! NT, 1, tag, MP l_COMM_WORLD, &send_reques t);
`24
`25 MPI_Irecv (&b, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &recv_request);
`26
`27 MPI_Wait(&send_request, &status);
`28 MPI_Wait(&recv_request, &status);
`29
`printf ("Process %d received value %d\n", rank, b);
`30
`31
`32
`33
`34
`35
`36
`37
`38 MPI_WaitC&send_request, &status);
`39 MPI_Wait(&recv_request, &status):
`40
`printf("Process%dreceivedvalue%d\n", rank, b);
`41
`42
`
`MPI_Isend (&a, 1, MPI_INT, 0, tag, MP!_COMM_WORLD, &send_request);
`MPI_Irecv (&b, 1, MP!_!NT, 0, tag, MPI_COMM_WORLD. &recv_request);
`
`l
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 302
`
`
`
`l,-,
`
`1~
`
`8•9 USER-DEFINED DATA TYPES
`
`281
`
`43 MPI_Finalize(J:
`return o:
`44
`45 I
`
`·
`·
`Code 8.12. Example of nonblocking point-to-point co
`3 I 4 I 59 to process I while process I sends the inte er 6~~umcat1on. Process O . sends the integer
`o process 0. The particular order of the
`oes not matter because the calls are nonblocking.
`
`listing oflsend and Irecv in lines 24-25 and 35-36 J
`
`> mpirun -np 2 ./code12
`Process O rec~ived value .667
`Pr'ocess 1 received value 314159
`
`8.9 USER-DEFINED DATA TYPES
`Application develop~rs wil_l frequently wish_ to_ create a user-defined data type built out of the pre(cid:173)
`defined MPI types hsted m Table 8.1. This 1s accomplished using MPI_Type_create_struct and
`MPI_Type_commit.
`
`MPI_Type'""create_struct(int number_items;
`const int *blocklengths,
`canst MPI_Ai nt •arraLof _offsets.
`constMPI_Oatatype *arraLof_types,
`MPCDatatype *new_datatype_name)
`
`;M PI_Type_commi t(MPI_Datatype · *new_datatype_name)
`
`Creating a user-defined data type consists of providing the number of different partitions of existing
`MPI data-type elements (number _items), three separate arrays of length number _items containing the
`number of elements per block, byte offsets of each block and the MPI data types of each block, and the
`new name for the user-defined type. This name is then passed as an argument to MPI_Type_commit,
`after which it can be used in all existing MPI functions.
`An example of creating a user-defined data type from a C struct and broadcasting it to all processes
`is provided in Code 8.13. In this example, a C struct containing some typical variable names for a
`simulation is populated with values on process 0. The user-defined data type for this C struct, mpi_par,
`is created and committed on lines 38-39. The values for the structure are then broadcast to all other
`processes in line 41.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 303
`
`
`
`CHAPTER 8 iHE ESSENTIAL MP\
`
`282
`
`1 4tinclude <stdio.h>
`2 jfai ncl ude <s tddef, h>
`3 #include "mpi .h"
`4
`5 typedef struct \
`intmax_iter:
`6
`7 doubletO:
`8 doubletf:
`9 daub le xmi n:
`10 l Pars:
`11
`12 int main(int argc,char **argv)
`13
`14 MPI_lnit(&argc,&argv):
`15
`int rank:
`16
`; nt root= O: //define the root process
`17 MP I_Comm_rank ( MP J_(OMM_WORLD, &rank) : / / identify the rank
`18
`19 Pars pars:
`20
`if ( rank= root l
`21
`pars.max_iter=lO;
`22
`pars.tO=O.O;
`23
`pars.tf=l.O;
`24
`pars.xmin=-5.0;
`25
`26
`intnitems=4:
`27
`28 MPI_Datatype types[nitems]:
`29 MPI_Datatype mpi_par: // give my new type a name
`30 MPI_Aint offsets[nitems]: // an array for storing the element offsets
`31
`int blocklengths[nitemsJ;
`32
`33
`34
`35
`36
`37
`38 MPI_Type_create_struct(nitems,blocklengths,offsets,types,&mpi_par);
`39 MPI_Type_commi t(&mpi_par);
`40
`41 MPI_Bcast(&pars,1,mpi_par,root,MPI_COMM_WORLD);
`42
`43
`44
`45
`46
`47
`48
`
`types[OJ = MPI_INT: offsets[OJ = offsetof( Pars ,max_iter) ;bl ockl engths[O] = 1;
`types[l] =MPI_DOUBLE; offsets[l] =offsetof(Pars,tOJ;blocklengths[lJ = 1;
`types [2] = MP !_DOUBLE; offsets [2] = offsetof (Pars, t f); b 1 ockl engths [2] = 1;
`types [3] = MPI_DOUBLE; offsets [3] = offset of (Pars, xmi n l; bl ockl engths [3] = 1;
`
`printf("Hello from rank %d; my max_iter value is %d\n" ,rank,pars.max_iter);
`
`MPI_Final i ze();
`
`return O;
`
`Code 8.13. Example of creating and using a user-defined data type in an MPI collective.
`
`_I l J l
`
`.\
`
`I
`'I
`
`)
`
`I
`I
`I
`
`j
`! L
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 304
`
`
`
`./
`''-,
`:;;:'
`
`8.11 EXERCISES
`
`283
`
`> mpirun -np 4 . /c-0del3
`Hello from rank 0: my max_iter value s 10
`Hello from rank 2; my max_i-ter value s 10
`Hello from rank 1; my max_J ter value s 10
`Hello from rank 3; my max_iter value s 10
`
`8.10 SUMMARY AND OUTCOMES OF CHAPTER 8
`• There was probably no greater achievement of practical utility for the advancement of HPC than
`the development of MPI.
`• MPI is a community-driven specification that continues to evolve.
`• MPI is a library with an API, not a language.
`• MPICH was the first reduction to practice the MPI standard.
`• Key elements of MPI are point-to-point communication and collective communication.
`• MPI has a set of predefined data types for use in library calls.
`• Point-to-point communication calls are typified by the MPI_Send and MPI_Recv calls.
`• Collective communication is typified by the broadcast, gather, and scatter operations.
`Important extensions of these collective operations are allgather, reduce, and alltoall.
`•
`• Nonblocking point-to-point communications are frequently used to simplify code development
`and avoid deadlocks.
`• User-defined data types can be built up starting from existing MPI data types and used in MPI
`function calls.
`
`8. 11 EXERCISES
`1. Modify Code 8.13 to send and receive the user-defined data type mpi_par multiple times between
`two processes. Add an integer to the Par struct to count how many times the data have been passed
`back and forth.
`2. Modify Code 8.13 so that each process sends mpi_par to the process with rank+ 2 and receives
`data from rank - 2. For example, if there were 16 processes, process O would send to process 2
`and receive from process 14 while process 1 would send to process 3 and receive from process 15.
`3. Write a distributed matrix-vector multiplication code using MPI. Use a dense matrix and a dense
`vector. Call C language Basic Linear Algebra Subprograms (CBLAS) on each process for the
`local matrix-vector multiplication.
`4. Rewrite Code 8.11 using point-to-point communication. Generalize the code to run on an
`arbitrary number of processes. Compare the performance of MPI_Alltoall with your point-to(cid:173)
`point communication implementation.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 305
`
`
`
`284 CHAPTER 8 THE ESSENTIAL MPI
`
`5. Rewrite Code 8.9 so that the global vector sizes stay the same as the number of processes varies.
`This will require changing the local vector size depending on the number of processes on which
`MPI is launched. Plot the time to solution for your code as a function of the number of processes
`for various global vector sizes.
`
`REFERENCES
`[IJ MPI Forum, MPI Standardization Forum, [Online]. http://mpi-forum.org/.
`[2] Argonne National Laboratory, MPICH, [Online]. www.mpich.org.
`[3] B. Kernighan, D. Ritchie, The C Programming Language, Prentice Hall, s.1., 1988.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 306
`
`
`
`PARALLEL ALGORITHMS
`
`9
`
`LINE
`................. 285
`.. ......... 286
`...........................................................
`CHAPTER OUT
`287
`......................................................
`9 1 Introduction ...................................................
`9. 2 Fork-Join•······· ................................ : ......................................................................................... 291
`. ..................................................... 292
`9. 3 Divide and Conquer ...................................................
`9:4 Manager-Worker ........................ ::::::::::::::::::::::::: ...................................................... •::::::::::::::: 294
`295
`.............................................
`9.5 Embarrassingly Parallel...............
`,,,,,,,,.,,,,,,.
`9 6 Halo Exchange........................................................
`·
`1 The Advection Equation Using Finite Difference........................................
`·
`297
`tor Multiplication ......................................... ······························ .. .
`. V
`9.6.
`.. ......................... 30,
`•.
`g 6.2 Sparse Matrix ec
`9. 7 P~rmutation: Cannon"s Algorithm······· ............................................... ::::::::::: .. ::: ........................... 306
`9.s Task Dataflow: Breadth First Searc\ .. ····· .. ···················:::::::::::::::::::: ........................................... 310
`.. .......... 311
`9.9 summary and Outcomes of Chapter
`............................
`:~~~re!::~c.~~~~.:::::::::::::::::::::::::::::::::::::::::·:········ ··········. · ....................... ::::::::::::::::::::::::::::::::: ............ 311
`
`INTRODUCTION
`9.1
`Modern supercomputers employ several different modalities of ?peration to .ta~e advant~ge of
`parallelism both to exploit enabling technologies and to contribute to ach1evmg the highest
`. performance possible across a wide spectrum of algorithms and applications. Three of these most
`common hardware architecture fonns present in a supercomputer are single-instruction multiple data
`Sh d
`/l /'
`(SIMD) parallelism, shared memo,y parallelism, and dlstril,ure.f memory p
`are
`memory and distributed memory parallelism are subclasses of the multiple-inst a~ e I(~..
`rue 100 mu hp)e data
`(MIMD) class of Flynn's computer architecture taxonomv.
`Each of these modalities is present in a modern supe~omputer. \VhiJ
`. .
`e th~ urufymg theme of a
`parallel computer architecture is parallelism, the ways in which a P~l
`I
`parallelis~ in each of the~ modalities can differ substantially. Some Paral~I algon~hm exploits physical
`algonthms are beuer SUited
`for one kind of parallelism versus another. Often a completely difti
`needed, depending on the targeted parallel computer architecflae Sfnlc erenr Parallel algorithm will be
`all three will be necessary for a parallel algorithm to achiee the hi~ure. Frequently a combination of
`,,._
`e5i f'OSsibJe performan
`supercomputer can provide.
`ce ut,ll a
`
`•
`High Perrormance Co
`Capyfigh1 © 20/8 lilscv:;utmg. https://d~O.IOJ6!B97i
`'8.{).J2.42QJ56-J./)(JfJ(J,;.J
`Inc, All rights n:>cntd_
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 307
`
`
`
`../
`\_,.ff l
`
`~
`Table 9.1 Examples of G
`.
`· ,~-.-- . ---
`enenc Classes f p
`Generic,:~ot
`.
`-
`o arallel Algorithms
`J;l~~~r~l'#Jun
`··• ,,Ji;·
`I
`·•··--~~t
`_ Fork-join
`
`-
`
`Divide and conquer
`Halo exchange
`
`-
`Permutation
`
`-
`
`Embarrassingly parallel
`Manager worker
`Task dataflow
`
`OpenMP parallel for-loop
`F~s-t Fourier Transfonn, parallel sort
`Fm1te difti
`.
`1
`.
`. erence finite element Partial
`differential equation solvers
`Cannon's algorithm Fast F
`.
`Transfonn
`'
`ouner
`
`Monte Carlo
`Simple adaptive mesh refinement
`Breadth first search
`
`In 2004 an influential set of seven classes of numerical
`th d
`computers were identified [1] Th
`me O s commonly used on super-
`kn
`"
`I'
`.
`algebra
`ese are
`own as the seven dwarfs" or "seven motifs"· dense linear
`' sparse mear algebra, spectral methods, N-body methods, structured grids unst~ctured grids
`d M
`onte Carlo meth d Th
`I
`an
`·
`'
`'
`0 s.
`ese seven c asses of algonthms represent a large segment of super-
`.
`.
`.
`com~utmg apphcat10ns today and many high performance computing (HPC) benchmarks are built
`specifically to target them. In addition to the original "seven dwarfs", researchers have added other
`important emerging classes of numerical methods found in supercomputing applications, including
`graph traversal, finite state machines, combinational logic, and statistical machine learning [2].
`Optimally mapping these numerical methods to a parallel algorithm implementation is a key challenge
`for supercomputing application developers.
`Several classes of parallel algorithms share key characteristics and are driven by the same underlying
`mechanism from which the parallelism is derived. Some examples of these generic classes of parallel
`algorithms include fork-join, divide and conquer, manager-worker, embarrassingly parallel, task
`dataflow, permutation, and halo exchange. Some examples of each class are listed in Table 9.1.
`This chapter examines a wide variety of parallel algorithms and the means by w~ich the
`parallelism is exposed and exploited. While the specific implementation of t~e algonthm _for
`SIMD or MIMD parallel computer architectures will differ. the co~c~ptual ba~1s_ for extractmg
`·
`'thm wi\\ not. The chapter begins by exammmg fork-Jom type parallel
`parallelism from the a\gon
`'d
`/,a·
`d-conquer class of parallel algorithms. parallel sort.
`.
`le from the u1v1 e-an
`.
`.
`I
`algonthms and an examp
`l
`..lthms and a specific subclass of it, embarrassmg y
`worker type a go..
`· 1 ct·
`·
`Examples from mana~er-
`cnangtpparallel algorithm examples are examm~d also me u mg
`
`n\\r~\\t\ 'M.t \\\t\\ t\'3.\'(\\l\ec\. H.a\o-e.t matrix vector multiplication. A permut~t1on example t
`
`ihe ad~ection equation and sp~o~ example of a breadth first search algonthm complete t e
`Cannon's a\gorithm and a task
`chapter.
`
`10\N
`MP execution model presented in
`9 2 fQR\(-
`is a~ component of the Ope~
`d
`ry parallelism. In
`•
`. . nara\\e\ design paitern .1. •
`\'.!Vamming models targetmg share memo
`ihe tork-1oin ~ \ ue\\\\':j em\l\l)':jeu \1\
`C,\\a\?~t 1 an\\ \'i, tt~
`
`i ,-.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 308
`
`
`
`9.3 DIVIDE AND CONQUER
`
`287
`
`:al°f ..
`
`d_..~ ..
`
`FIGURE 9.1
`
`This is an illustration of the fork-join parallel design pattern frequently used in parallel algorith~s intended for
`shared memory parallelism. The empty boxes indicate work that is serial (i.e., not parallelizable), while the filled
`boxes indicate work that can be performed concuITently. In the "fork" phase, concurrent operators known as
`threads (denoted here by the branching lines) are created to perform the concurrent work. In the "join" phase, the
`results of those concurrent operators are acc umulated into a single resulting operator.
`
`regions of a sequential algorithm where work can proceed concurrently, a group of lightweight concurrent
`operators, frequent! y called "threads", are created to perform that work. Once the work is completed, the
`results from each of these operators are accumulated during the "join" phase. This process is illustrated in
`Fig. 9.1.
`The OpenMP parallel for-loop construct is a simple example of this type of parallel algorithm.
`Consider the example of parallel work sharing presented in Code l. A previously initialized array b is
`added to another expression to initialize array a. Because each work element in the for-loop (see line 3)
`is independent of every other element, the work in this loop can proceed concurrently. Consequently, a
`parallel for-loop construct is added in line I. Fig. 9.2 illustrates the fork- join behavior of the resulting
`concurrent operators.
`
`1 #pragma omp parallel for
`for ( i =O ;i <30 ;i++)
`2
`a[i J = b[i] + s in (i l
`3
`
`Code I: An OpenMP example of fork-join for work sharing.
`While the fork-join parallel design pattern is the main execution model for OpenMP, it is also
`found in other parallel programming models, especially those which target shared memory parallelism.
`
`9.3 DIVIDE AND CONQUER
`Algorithms denoted as "divide and conquer" break a problem into smaller subproblems which
`share similar enough algorithmic prope11ies to the original problem that they can in tum al~o be
`subdivided. Using recursion, the larger problem is broken down into small enough pieces that ,t can
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 309
`
`
`
`288 CHAPTER g PARALLEL ALGORITHMS
`
`Tim, I
`
`i=0 .. 5
`a[i] = ...
`
`,-,~~.,,·
`
`,1,1•A.;,Jit
`
`FIGURE 9.2
`The resulting fork-join structure from the work-sharing example in Code 9.1 for five threads. Each concurrent
`operator performs an independent computation that is then joined into the final serial operator.
`
`be easily solved with minimal computation. Because the original problem has been broken down into
`several smaller computations that are independent of one another, there is a natural concurrency for
`exploiting parallel computation resources. Frequently, divide-and-conquer type algorithms are also
`naturally parallel algorithms because of this concurrency and, like fork- join type algorithms, can
`perform very well on shared memory architectures. On distributed memory architectures, however,
`network latency and load imbalance can complicate the direct application of divide-and-conquer
`type algorithms.
`One well-studied example of a divide-and-conquer algorithm with natural concurrency is quicksort
`[3]. As a s01ting algorithm, it aims to sort a list of numbers in order of increasing value. To start, a
`random element of the array is selected to serve as a pivot point. Using this pivot, the rest of the list is
`divided into a list containing numbers smaller than the pivot and a list containing numbers larger than
`the pivot. This process is then repeated recursively for each of the two lists. Upon completion of
`recursion the resulting sorted child subproblems are concatenated for the final result. An example is
`given in Fig. 9.3.
`The efficiency of the algorithm is significantly impacted by which element is chosen as the pivot
`point. If the array has N data items, the worst-case performance will be proportional to N2
`; however,
`for most cases the performance is much faster, proportional to N log N. Because the two branched
`lists in quicksort can be s01ted independently, there is a natural concurrency of computation that can
`be used for parallelization. On a distributed memory architecture, exploiting this concurrency incurs
`a significant communication cost as sorted lists are passed from one process to another during
`recursion. This makes direct application of quicksort on a distributed memory architecture
`undesirable. However, a modification to the approach based on sampling can be made to improve this
`situation .
`The regular sampling parallel sort algorithm is designed for better perfonnance on distributed
`memory architectures with quicksort underlyi ng the approach [4]. The algorithm is detailed below.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 310
`
`
`
`9.3 DIVIDE AND CONQUER
`
`289
`
`Given List of numbers { 3, 14, 15, 12, 9, 7, 5 }
`
`Random choice of pivot: 12
`
`Low list
`
`{3,9,7,5)
`
`Random choice of pivot: 7
`
`{14,15)
`
`High list
`
`Random choice of pivot: 15
`
`{3,5)
`
`{9}
`
`{14)
`
`{)
`
`Random choice (!f pivot: 3
`
`{}
`
`{5}
`
`Concatenated result: {3,5,7,9,12,14,15}
`
`FIGURE 9.3
`Example of serial quicksort algorithm.
`
`• An array of numbers to be sorted is distributed equally among P processes. Thus if the array size
`is N, each process will have NIP local elements.
`
`FIGURE 9.4A
`
`• Each process runs sequential quicksort on its local data.
`
`FIGURE 9.48
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 311
`
`
`
`29a CHAPTER 9 PARALLEL ALGORITHMS
`
`• db th global array size N and the
`. .
`d
`• The resulting sorted arrays are sampled at intervals et~rlmme_
`y t ~-
`t O i e . array element
`number of processes P. Samples are taken at every NIP ocat1on s aI mg a
`, · ·•· ,
`• ct·
`0 NIP2 2NIP2
`(P _ I) N/P2 form the sample array from each local data.
`1n ices
`,
`,
`, ... ,
`
`{3,14}
`Proc:essO
`
`FIGURE 9.4C
`
`• The resulting samples are gathered to a root process and sorted sequentially with quicksort.
`
`FIGURE 9.4D
`
`{3,5,9,14}
`ProcessO
`
`• Regularly sampled P -
`l pivot values computed from the sample set are broadcast to the other
`processes. Thus NIP2
`, 2NIP2
`, •.• , (P - 1) N/P2 indices form the sample P - 1 pivot points. In this
`example, the only pivot point broadcast is 9.
`
`FIGURE 9.4E
`
`• Each process divides its sorted segment of the array into P segments using the broadcast P - 1
`pivot values.
`
`Pivot: 9
`{ 3, 12, 14, 15}
`
`✓\
`
`{3}
`
`{12,14,1S}
`
`FIGURE 9.4F
`
`ProcessO
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 312
`
`
`
`9.4 MANAGER-WORKER
`
`291
`
`• Each process performs an all-to-all operation on the P segments. Thus the ith process keeps the ith
`segment and sends the jth segment to the jth process.
`
`FIGURE 9.4G
`
`• The arriving segments are merged into a single list and locally sorted.
`r-. ~• r:~ •-:~7
`.
`'
`. . -~ -··' ~- ,!
`Final result: {3,5, 7,9,10,12,14,15}
`
`{3,5,7}
`ProcessO
`
`Pivot: 9
`
`FIGURE 9.4H
`
`An example of this algorithm for P = 2 and N = 8 is illustrated in Fig. 9.4A-H.
`
`9.4 MANAGER-WORKER
`Manager-worker incorporates two different workflows in its execution: one intended for execution by just
`one process called the manager process, and another intended for execution by several other processes
`called worker processes. This approach has also historically been called "master-slave". Applications
`that are dynamic in nature frequently use this type of parallel design algorithm so that the manager process
`can coordinate and issue task actions to worker processes in response to changes in a simulation outcome.
`Many adaptive mesh refinement applications also use this parallel design algorithm because the meshes
`and data placement patterns change in response to a solution value. Such an adaptive mesh refinement is
`illustrated in Fig. 9.5 . Manager-worker codes frequently take the form illustrated in Code 9.2, where an
`"if' statement distinguishes the workflow between manager and worker ta~ks.
`
`for (i nt i=O ; i<num_timesteps ; i++l
`send_action(REFINEl ;
`send_acti on( INTEGRATE ) ;
`send_action(OUTPUT) ;
`
`1 if ( my_rank == master l {
`2
`send_action(INITIALIZEl ;
`3
`4
`5
`6
`7
`8
`9
`10
`11
`
`else I
`listen_for_actions(l :
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 313
`
`
`
`292 CHAPTER 9 PARALLEL ALGORITHMS
`
`FIGURE 9.5
`
`Example of a manager-worker adaptive mesh refinement code evolving a dynamic system of two compact objects
`orbiting each other. The meshes follow the orbiting compact objects as they move clockwise in the computational
`domain.
`
`Code 9.2 is manager-worker example code adapted from the adaptive mesh refinement code used to
`generate Fig. 9.5. The manager process (called "master" in this example) directs the refinement
`characteristics and sends actions to worker processes that are always listening for additional
`instructions from the manager.
`
`9.5 EMBARRASSINGLY PARALLEL
`The term "embarrassingly parallel" is a common phrase in scientific computing that is both widely
`used and poorly defined. It suggests lots of parallelism with essentially no intertask communication
`or coordination, as well as a highly partitionable workload with minimal overhead. In general, embar(cid:173)
`rassingly parallel algorithms are a subclass of manager-worker algorithms. They are called embarrass(cid:173)
`ingly parallel because the available concun-ency is trivially extracted from the workflow. These algorithms
`sometimes require reduction operation at the end to gather the results into a manager process. While this
`does require some minimal coordination and intertask communication, these "almost embarrassingly
`parallel" algorithms are still generally referred to as embarrassingly parallel.
`Monte Carlo simulations mainly fall into the category of embarrassingly parallel. Monte Carlo
`methods are statistical approaches for studying systems with a large number of coupled degrees of
`freedom, modeling phenomena with significant uncertainty in the inputs, and solving partial differential
`equations with more than four dimensions. Computing the value of 1r is a simple example.
`
`• Define a square domain and inscribe a circle inside that domain.
`• Randomly generate the coordinates of points lying inside the square domain; count the points that
`also lie in the circle.
`rr/4 is the ratio of the number of points that lie in the circle to the total number of random points
`generated.
`
`•
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 314
`
`
`
`9.5 EMBARRASSINGLY PARALLEL
`
`293
`
`2r
`
`2
`
`Acirc1, = 1tr
`A,qua.e = 4r2
`
`FIGURE 9.6
`When generating random coordinates inside a square, the ratio of the number of points lying inside an inscribed
`circle to the total number of random points will be 1r/4.
`
`The reasoning behind this algorithm is as follows. A circle with radius r inscribed in a square will
`have an area of 1r? while the square will have an area of (2r)2 = 4r2
`, as seen in Fig. 9.6. The ratio of the
`area of the circle to the area of the square will also be the probability that a random point generated in
`the square lies in the circle. The ratio of these two areas is 1r/4.
`The parallel version of this algorithm is illustrated in Fig. 9.7.
`
`Initialize variables
`~":.<.:·,.·.:Z:/J:.N,,<'.:..:·;·• .. ;.,.,~·
`
`Generate random x,y I
`
`Generate random x,y
`
`Generate random x,y
`
`FIGURE 9.7
`Embarrassingly parallel example: computing 7r using statistical methods. The manager is in light gray while the
`various workers are in black.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 315
`
`
`
`294 CHAPTER 9 PARALLEL ALGORITHMS
`
`9.6 HALO EXCHANGE
`Many parallel algorithms fall into a problem class where every parallel task is executing the same
`algmithm on different data without any manager algorithm present. This is sometimes referred to as
`the data parallel model. Data parallelism is frequently used in applications that are static in nature
`because a computational task can be mapped to particular subset of data throughout the life of the
`simulation. However, in all but the most simple of applications, some information in each data subset
`mapped to the parallel task has to be exchanged and synchronized for the application algorithm to
`function properly. This exchange of intertask information is called halo exchange.
`As the name implies, a halo is a region exterior to the data subset mapped to a parallel task. It acts
`as an artificial boundary to that data subset and contains infonnation that miginates from the data
`subsets of neighboring parallel tasks. A halo is illustrated in Fig. 9.8.
`
`Process O
`
`•
`
`• •
`
`Process 1
`
`Process O
`
`Process 2
`
`• ••
`• • •
`••••
`
`Process 3
`
`• • • •
`•
`
`/
`
`Halo for process 0
`
`FIGURE 9.8
`
`Illustration of a one-deep halo. In this illustration, various data points (colored dark) are split across four different
`processes (top figure) . For each process there are two boundaries in the data that a.re interprocess boundaries. For
`process 0, these are the right and bottom edges of the square. A one-deep halo for process 0 (bottom figure)
`consists of those data points that are closest to the interprocess boundary of process 0 but not mapped to process 0.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2136, p. 316
`
`
`
`9.6 HALO EXCHANGE
`
`295
`
`Halo exchange enables each task to perform computations and update the subset of data mapped to
`that task while having access to any data necessary for such computations that may not be local. Halo
`exchange is extremely common in parallel toolkits for solving partial differential equations and in
`linear algebra computations. Two parallel algorithm examples are presented in this section using halo
`exchange: the advection equation and sparse matrix vector multiplication.
`
`9.6. 1 THE ADVECTION EQUATION USING FINITE DIFFERENCE
`Wavelike phenomena permeate nature: examples include light, sound, gravitation, fluid flow, and
`weather, to name just a few. The study of wavelike phenomena is ubiquitous in supercomputing systems
`and is frequently modeled using a partial differential equation, or an expression involving derivatives
`taken against different independent variables. One of the simplest ways to solve these wavelike partial
`differential equations on a supercomputer is through the use of finite differencing and halo exchange.
`Finite differencing involves replacing the derivative expressions in the partial differential equation with
`approximations originating from estimating the slope between neighboring points on a uniform grid.
`As an example of this parallel algorithm, consider the advection equation in Eq. (9.1).
`aJ
`aJ
`at= -v ax
`This advection equation transports a scalar field fix, t) toward increasing x with speed v. The
`analytic solution to this partial differential equation is
`f(x, t) = F(x - vt)
`where F(x) is an arbitrary function describing the initial condition of the system. So if the initial
`condition of the wavelike phenomenon for solution is
`
`(9.1)
`
`(9.2)
`
`then the analytic solution to Eq. (9.1) would be
`f(x, t) = e-(x-vt)2
`
`F(x) = e-x2
`
`(9.3)
`
`(9.4)
`
`as plotted in Fig. 9.9.
`While the advection equation in Eq. (9.1) can be solved analytically with the solution shown in
`Fig. 9.9, a parallel algorithm based on halo exchange can be crafted to solve this equation numerically.
`The left- and right-hand partial derivatives in Eq. (9.1) are replaced with finite difference approximations
`to those derivatives:
`
`-J;.n
`J;!'
`f;n+ I -J;.n
`l = -V~
`dx
`dt
`where the fieldf(x,t) has been discretized to a