8.8 NONBLOCKING POINT-TO-POINT COMMUNICATION
The point-to-point communication calls introduced in Sections 8.5.1 and 8.5.3, MPI_Send and MPI_Recv, do not return from the respective function call until the send and receive operations have completed. While this ensures that the send and receive buffers used in the MPI_Send and MPI_Recv arguments are safe to use or reuse after the function call, it also means that unless there is a simultaneously matching send for each receive, the code will deadlock, resulting in the code hanging. This common type of bug is examined in Chapter 14. One way to avoid this is by using nonblocking point-to-point communication.

Nonblocking point-to-point communication returns immediately from the function call before confirming that the send or the receive has completed. These nonblocking calls are MPI_Isend and MPI_Irecv. They are used coupled with MPI_Wait, which will wait until the operation is completed. When querying whether a nonblocking point-to-point communication has completed, MPI_Test is often paired with MPI_Isend and MPI_Irecv. Nonblocking point-to-point calls can simplify code development to avoid such deadlocks more easily and also potentially enable the overlap of useful computation while checking to see if the communication has completed.

The syntax of each of these calls is the same as for the blocking calls except for the addition of a request argument and the elimination of the status output in the MPI_Recv arguments.

int MPI_Isend(void *message, int count, MPI_Datatype datatype, int dest, int tag,
              MPI_Comm comm, MPI_Request *send_request)

int MPI_Irecv(void *message, int count, MPI_Datatype datatype, int source, int tag,
              MPI_Comm comm, MPI_Request *recv_request)

Because both MPI_Isend and MPI_Irecv return immediately after calling without confirming that the message-passing operations have completed, the application user needs a way to specify when these operations must complete. This is done with MPI_Wait:

int MPI_Wait(MPI_Request *request, MPI_Status *status)

When MPI_Wait is called, the nonblocking request originating from MPI_Isend or MPI_Irecv is provided as an argument. The status that was previously provided directly from MPI_Recv is now supplied as an output from MPI_Wait.

Similar to MPI_Wait, MPI_Test can be paired with an MPI_Isend or MPI_Irecv call to query whether the message passing has completed while performing other work. MPI_Test shares similar syntax to MPI_Wait, adding only a flag that is set to true if the request being queried has completed.

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

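A common use of MPI_Test is a polling loop that performs useful local computation until the pending operation completes. The following short program is a sketch of that pattern, not an example from the text; the do_useful_work routine is a placeholder for application work, and the program assumes it is launched with at least two processes.

#include <stdio.h>
#include <mpi.h>

/* Placeholder for computation that does not depend on the incoming message. */
static void do_useful_work(void) { /* ... */ }

int main(int argc, char *argv[]) {
    int rank, a = 42, b = 0, flag = 0, tag = 0;
    MPI_Request recv_request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&a, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&b, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &recv_request);
        while (!flag) {
            do_useful_work();                        /* overlap computation */
            MPI_Test(&recv_request, &flag, &status); /* query for completion */
        }
        printf("Process %d received value %d\n", rank, b);
    }
    MPI_Finalize();
    return 0;
}
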
An example of using nonblocking communication is presented in Code 8.12. In this example, the send commands are issued first, followed by the receive commands. If using blocking communication and sending a sufficiently large message, this would normally result in a deadlock, but nonblocking communication avoids this pitfall.

 1 #include <stdlib.h>
 2 #include <stdio.h>
 3 #include <mpi.h>
 4
 5 int main(int argc, char* argv[]) {
 6   int a, b;
 7   int size, rank;
 8   int tag = 0;  // Pick a tag arbitrarily
 9   MPI_Status status;
10   MPI_Request send_request, recv_request;
11
12   MPI_Init(&argc, &argv);
13   MPI_Comm_size(MPI_COMM_WORLD, &size);
14   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
15
16   if (size != 2) {
17     printf("Example is designed for 2 processes\n");
18     MPI_Finalize();
19     exit(0);
20   }
21   if (rank == 0) {
22     a = 314159;  // Value picked arbitrarily
23
24     MPI_Isend(&a, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &send_request);
25     MPI_Irecv(&b, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &recv_request);
26
27     MPI_Wait(&send_request, &status);
28     MPI_Wait(&recv_request, &status);
29
30     printf("Process %d received value %d\n", rank, b);
31   } else {
32     a = 667;
33
34
35     MPI_Isend(&a, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &send_request);
36     MPI_Irecv(&b, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &recv_request);
37
38     MPI_Wait(&send_request, &status);
39     MPI_Wait(&recv_request, &status);
40
41     printf("Process %d received value %d\n", rank, b);
42   }
43   MPI_Finalize();
44   return 0;
45 }

Code 8.12. Example of nonblocking point-to-point communication. Process 0 sends the integer 314159 to process 1 while process 1 sends the integer 667 to process 0. The particular order of the listing of Isend and Irecv in lines 24-25 and 35-36 does not matter because the calls are nonblocking.

> mpirun -np 2 ./code12
Process 0 received value 667
Process 1 received value 314159

8.9 USER-DEFINED DATA TYPES
Application developers will frequently wish to create a user-defined data type built out of the predefined MPI types listed in Table 8.1. This is accomplished using MPI_Type_create_struct and MPI_Type_commit.

int MPI_Type_create_struct(int number_items,
                           const int *blocklengths,
                           const MPI_Aint *array_of_offsets,
                           const MPI_Datatype *array_of_types,
                           MPI_Datatype *new_datatype_name)

int MPI_Type_commit(MPI_Datatype *new_datatype_name)

Creating a user-defined data type consists of providing the number of different partitions of existing MPI data-type elements (number_items); three separate arrays of length number_items containing the number of elements per block, the byte offsets of each block, and the MPI data types of each block; and the new name for the user-defined type. This name is then passed as an argument to MPI_Type_commit, after which it can be used in all existing MPI functions.

An example of creating a user-defined data type from a C struct and broadcasting it to all processes is provided in Code 8.13. In this example, a C struct containing some typical variable names for a simulation is populated with values on process 0. The user-defined data type for this C struct, mpi_par, is created and committed on lines 38-39. The values for the structure are then broadcast to all other processes in line 41.
 1 #include <stdio.h>
 2 #include <stddef.h>
 3 #include "mpi.h"
 4
 5 typedef struct {
 6   int max_iter;
 7   double t0;
 8   double tf;
 9   double xmin;
10 } Pars;
11
12 int main(int argc, char **argv)
13 {
14   MPI_Init(&argc, &argv);
15   int rank;
16   int root = 0;  // define the root process
17   MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // identify the rank
18
19   Pars pars;
20   if (rank == root) {
21     pars.max_iter = 10;
22     pars.t0 = 0.0;
23     pars.tf = 1.0;
24     pars.xmin = -5.0;
25   }
26
27   int nitems = 4;
28   MPI_Datatype types[nitems];
29   MPI_Datatype mpi_par;      // give my new type a name
30   MPI_Aint offsets[nitems];  // an array for storing the element offsets
31   int blocklengths[nitems];
32
33   types[0] = MPI_INT;    offsets[0] = offsetof(Pars, max_iter); blocklengths[0] = 1;
34   types[1] = MPI_DOUBLE; offsets[1] = offsetof(Pars, t0);       blocklengths[1] = 1;
35   types[2] = MPI_DOUBLE; offsets[2] = offsetof(Pars, tf);       blocklengths[2] = 1;
36   types[3] = MPI_DOUBLE; offsets[3] = offsetof(Pars, xmin);     blocklengths[3] = 1;
37
38   MPI_Type_create_struct(nitems, blocklengths, offsets, types, &mpi_par);
39   MPI_Type_commit(&mpi_par);
40
41   MPI_Bcast(&pars, 1, mpi_par, root, MPI_COMM_WORLD);
42
43   printf("Hello from rank %d; my max_iter value is %d\n", rank, pars.max_iter);
44
45   MPI_Finalize();
46
47   return 0;
48 }

Code 8.13. Example of creating and using a user-defined data type in an MPI collective.
> mpirun -np 4 ./code13
Hello from rank 0; my max_iter value is 10
Hello from rank 2; my max_iter value is 10
Hello from rank 1; my max_iter value is 10
Hello from rank 3; my max_iter value is 10
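
Once committed, mpi_par behaves like any predefined MPI data type and can appear in point-to-point calls as well as collectives. The fragment below is a sketch, not part of Code 8.13; it assumes the Pars struct, the pars variable, the committed mpi_par type, and rank from Code 8.13 are in scope and that at least two processes are running. MPI_Type_free releases the type when it is no longer needed.

// Sketch: using the committed user-defined type in point-to-point communication.
// Assumes Pars, pars, mpi_par, and rank from Code 8.13 are in scope.
if (rank == 0) {
  MPI_Send(&pars, 1, mpi_par, 1, 0, MPI_COMM_WORLD);  // send one Pars struct
} else if (rank == 1) {
  Pars incoming;
  MPI_Recv(&incoming, 1, mpi_par, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  printf("Rank 1 received tf = %f\n", incoming.tf);
}
MPI_Type_free(&mpi_par);  // release the data type when it is no longer needed
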
8.10 SUMMARY AND OUTCOMES OF CHAPTER 8
• There was probably no greater achievement of practical utility for the advancement of HPC than the development of MPI.
• MPI is a community-driven specification that continues to evolve.
• MPI is a library with an API, not a language.
• MPICH was the first reduction to practice of the MPI standard.
• Key elements of MPI are point-to-point communication and collective communication.
• MPI has a set of predefined data types for use in library calls.
• Point-to-point communication calls are typified by the MPI_Send and MPI_Recv calls.
• Collective communication is typified by the broadcast, gather, and scatter operations.
• Important extensions of these collective operations are allgather, reduce, and alltoall.
• Nonblocking point-to-point communications are frequently used to simplify code development and avoid deadlocks.
• User-defined data types can be built up starting from existing MPI data types and used in MPI function calls.

8.11 EXERCISES
1. Modify Code 8.13 to send and receive the user-defined data type mpi_par multiple times between two processes. Add an integer to the Pars struct to count how many times the data have been passed back and forth.
2. Modify Code 8.13 so that each process sends mpi_par to the process with rank + 2 and receives data from rank - 2. For example, if there were 16 processes, process 0 would send to process 2 and receive from process 14 while process 1 would send to process 3 and receive from process 15.
3. Write a distributed matrix-vector multiplication code using MPI. Use a dense matrix and a dense vector. Call C language Basic Linear Algebra Subprograms (CBLAS) on each process for the local matrix-vector multiplication.
4. Rewrite Code 8.11 using point-to-point communication. Generalize the code to run on an arbitrary number of processes. Compare the performance of MPI_Alltoall with your point-to-point communication implementation.
5. Rewrite Code 8.9 so that the global vector sizes stay the same as the number of processes varies. This will require changing the local vector size depending on the number of processes on which MPI is launched. Plot the time to solution for your code as a function of the number of processes for various global vector sizes.

REFERENCES
[1] MPI Forum, MPI Standardization Forum. [Online]. http://mpi-forum.org/.
[2] Argonne National Laboratory, MPICH. [Online]. www.mpich.org.
[3] B. Kernighan, D. Ritchie, The C Programming Language, Prentice Hall, 1988.

9 PARALLEL ALGORITHMS

CHAPTER OUTLINE
9.1 Introduction
9.2 Fork-Join
9.3 Divide and Conquer
9.4 Manager-Worker
9.5 Embarrassingly Parallel
9.6 Halo Exchange
    9.6.1 The Advection Equation Using Finite Difference
    9.6.2 Sparse Matrix Vector Multiplication
9.7 Permutation: Cannon's Algorithm
9.8 Task Dataflow: Breadth First Search
9.9 Summary and Outcomes of Chapter 9
9.10 Exercises
References

9.1 INTRODUCTION
Modern supercomputers employ several different modalities of operation to take advantage of parallelism, both to exploit enabling technologies and to contribute to achieving the highest performance possible across a wide spectrum of algorithms and applications. Three of the most common hardware architecture forms present in a supercomputer are single-instruction multiple data (SIMD) parallelism, shared memory parallelism, and distributed memory parallelism. Shared memory and distributed memory parallelism are subclasses of the multiple-instruction multiple data (MIMD) class of Flynn's computer architecture taxonomy.

Each of these modalities is present in a modern supercomputer. While the unifying theme of a parallel computer architecture is parallelism, the ways in which a parallel algorithm exploits physical parallelism in each of these modalities can differ substantially. Some parallel algorithms are better suited for one kind of parallelism versus another. Often a completely different parallel algorithm will be needed, depending on the targeted parallel computer architecture structure. Frequently a combination of all three will be necessary for a parallel algorithm to achieve the highest possible performance that a supercomputer can provide.

Table 9.1 Examples of Generic Classes of Parallel Algorithms

Generic Class              Examples
Fork-join                  OpenMP parallel for-loop
Divide and conquer         Fast Fourier transform, parallel sort
Halo exchange              Finite difference, finite element partial differential equation solvers
Permutation                Cannon's algorithm, fast Fourier transform
Embarrassingly parallel    Monte Carlo
Manager-worker             Simple adaptive mesh refinement
Task dataflow              Breadth first search

`th d
`computers were identified [1] Th
`me O s commonly used on super-
`kn
`"
`I'
`.
`algebra
`ese are
`own as the seven dwarfs" or "seven motifs"· dense linear
`' sparse mear algebra, spectral methods, N-body methods, structured grids unst~ctured grids
`d M
`onte Carlo meth d Th
`I
`an

`'
`'
`0 s.
`ese seven c asses of algonthms represent a large segment of super-
`.
`.
`.
`com~utmg apphcat10ns today and many high performance computing (HPC) benchmarks are built
`specifically to target them. In addition to the original "seven dwarfs", researchers have added other
`important emerging classes of numerical methods found in supercomputing applications, including
`graph traversal, finite state machines, combinational logic, and statistical machine learning [2].
`Optimally mapping these numerical methods to a parallel algorithm implementation is a key challenge
`for supercomputing application developers.
`Several classes of parallel algorithms share key characteristics and are driven by the same underlying
`mechanism from which the parallelism is derived. Some examples of these generic classes of parallel
`algorithms include fork-join, divide and conquer, manager-worker, embarrassingly parallel, task
`dataflow, permutation, and halo exchange. Some examples of each class are listed in Table 9.1.
`This chapter examines a wide variety of parallel algorithms and the means by w~ich the
`parallelism is exposed and exploited. While the specific implementation of t~e algonthm _for
`SIMD or MIMD parallel computer architectures will differ. the co~c~ptual ba~1s_ for extractmg

`'thm wi\\ not. The chapter begins by exammmg fork-Jom type parallel
`parallelism from the a\gon
`'d
`/,a·
`d-conquer class of parallel algorithms. parallel sort.
`.
`le from the u1v1 e-an
`.
`.
`I
`algonthms and an examp
`l
`..lthms and a specific subclass of it, embarrassmg y
`worker type a go..
`· 1 ct·

`Examples from mana~er-
`cnangtpparallel algorithm examples are examm~d also me u mg
`
`n\\r~\\t\ 'M.t \\\t\\ t\'3.\'(\\l\ec\. H.a\o-e.t matrix vector multiplication. A permut~t1on example t
`
`ihe ad~ection equation and sp~o~ example of a breadth first search algonthm complete t e
`Cannon's a\gorithm and a task
`chapter.
`
9.2 FORK-JOIN
The fork-join parallel design pattern is a component of the OpenMP execution model presented in Chapter 7 and is frequently employed in programming models targeting shared memory parallelism. In regions of a sequential algorithm where work can proceed concurrently, a group of lightweight concurrent operators, frequently called "threads", are created to perform that work. Once the work is completed, the results from each of these operators are accumulated during the "join" phase. This process is illustrated in Fig. 9.1.

FIGURE 9.1
This is an illustration of the fork-join parallel design pattern frequently used in parallel algorithms intended for shared memory parallelism. The empty boxes indicate work that is serial (i.e., not parallelizable), while the filled boxes indicate work that can be performed concurrently. In the "fork" phase, concurrent operators known as threads (denoted here by the branching lines) are created to perform the concurrent work. In the "join" phase, the results of those concurrent operators are accumulated into a single resulting operator.

The OpenMP parallel for-loop construct is a simple example of this type of parallel algorithm. Consider the example of parallel work sharing presented in Code 9.1. A previously initialized array b is added to another expression to initialize array a. Because each work element in the for-loop (see line 3) is independent of every other element, the work in this loop can proceed concurrently. Consequently, a parallel for-loop construct is added in line 1. Fig. 9.2 illustrates the fork-join behavior of the resulting concurrent operators.

1 #pragma omp parallel for
2 for (i=0; i<30; i++)
3   a[i] = b[i] + sin(i);

Code 9.1. An OpenMP example of fork-join for work sharing.

FIGURE 9.2
The resulting fork-join structure from the work-sharing example in Code 9.1 for five threads. Each concurrent operator performs an independent computation that is then joined into the final serial operator.

While the fork-join parallel design pattern is the main execution model for OpenMP, it is also found in other parallel programming models, especially those which target shared memory parallelism.
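
For instance, the same pattern can be written explicitly with POSIX threads. The following sketch (not from the text) performs the loop of Code 9.1 by creating threads in the "fork" phase and waiting for them in the "join" phase; the thread count and the initialization of b are arbitrary choices.

#include <pthread.h>
#include <math.h>

#define N 30
#define NUM_THREADS 5

static double a[N], b[N];

/* Each thread computes a contiguous slice of the loop from Code 9.1. */
static void *work(void *arg) {
  long t = (long)arg;                             /* thread index 0..NUM_THREADS-1 */
  for (int i = t * (N / NUM_THREADS); i < (t + 1) * (N / NUM_THREADS); i++)
    a[i] = b[i] + sin(i);
  return NULL;
}

int main(void) {
  pthread_t threads[NUM_THREADS];
  for (int i = 0; i < N; i++) b[i] = i;           /* previously initialized array b */

  for (long t = 0; t < NUM_THREADS; t++)          /* "fork" phase */
    pthread_create(&threads[t], NULL, work, (void *)t);
  for (int t = 0; t < NUM_THREADS; t++)           /* "join" phase */
    pthread_join(threads[t], NULL);
  return 0;
}
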
9.3 DIVIDE AND CONQUER
Algorithms denoted as "divide and conquer" break a problem into smaller subproblems which share similar enough algorithmic properties to the original problem that they can in turn also be subdivided. Using recursion, the larger problem is broken down into small enough pieces that it can
be easily solved with minimal computation. Because the original problem has been broken down into several smaller computations that are independent of one another, there is a natural concurrency for exploiting parallel computation resources. Frequently, divide-and-conquer type algorithms are also naturally parallel algorithms because of this concurrency and, like fork-join type algorithms, can perform very well on shared memory architectures. On distributed memory architectures, however, network latency and load imbalance can complicate the direct application of divide-and-conquer type algorithms.

One well-studied example of a divide-and-conquer algorithm with natural concurrency is quicksort [3]. As a sorting algorithm, it aims to sort a list of numbers in order of increasing value. To start, a random element of the array is selected to serve as a pivot point. Using this pivot, the rest of the list is divided into a list containing numbers smaller than the pivot and a list containing numbers larger than the pivot. This process is then repeated recursively for each of the two lists. Upon completion of recursion the resulting sorted child subproblems are concatenated for the final result. An example is given in Fig. 9.3.
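
A compact serial quicksort in C is sketched below for reference (this is not the book's code). It partitions in place around a randomly chosen pivot rather than building explicit "low" and "high" lists, but it implements the same divide-and-conquer recursion.

#include <stdlib.h>

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

void quicksort(int *a, int lo, int hi) {            // sorts a[lo..hi] inclusive
  if (lo >= hi) return;
  swap(&a[lo + rand() % (hi - lo + 1)], &a[hi]);    // random pivot moved to the end
  int pivot = a[hi], i = lo;
  for (int j = lo; j < hi; j++)                     // partition around the pivot
    if (a[j] < pivot) swap(&a[i++], &a[j]);
  swap(&a[i], &a[hi]);                              // place pivot in its final position
  quicksort(a, lo, i - 1);                          // recurse on the "low" list
  quicksort(a, i + 1, hi);                          // recurse on the "high" list
}

Called as quicksort(a, 0, n - 1), this runs in time proportional to N log N on average, consistent with the discussion that follows.
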
The efficiency of the algorithm is significantly impacted by which element is chosen as the pivot point. If the array has N data items, the worst-case performance will be proportional to N^2; however, for most cases the performance is much faster, proportional to N log N. Because the two branched lists in quicksort can be sorted independently, there is a natural concurrency of computation that can be used for parallelization. On a distributed memory architecture, exploiting this concurrency incurs a significant communication cost as sorted lists are passed from one process to another during recursion. This makes direct application of quicksort on a distributed memory architecture undesirable. However, a modification to the approach based on sampling can be made to improve this situation.

The regular sampling parallel sort algorithm is designed for better performance on distributed memory architectures, with quicksort underlying the approach [4]. The algorithm is detailed below; an MPI sketch of its main steps follows the step-by-step description.

FIGURE 9.3
Example of the serial quicksort algorithm applied to the list {3, 14, 15, 12, 9, 7, 5}, recursively partitioning around randomly chosen pivots to produce the concatenated result {3, 5, 7, 9, 12, 14, 15}.

• An array of numbers to be sorted is distributed equally among P processes. Thus if the array size is N, each process will have N/P local elements. (Fig. 9.4A)
• Each process runs sequential quicksort on its local data. (Fig. 9.4B)
• The resulting sorted arrays are sampled at intervals determined by the global array size N and the number of processes P. Samples are taken at regular intervals starting at index 0; i.e., the elements at local indices 0, N/P^2, 2N/P^2, ..., (P - 1)N/P^2 form the sample array from each local data. (Fig. 9.4C)
• The resulting samples are gathered to a root process and sorted sequentially with quicksort. (Fig. 9.4D)
• Regularly sampled P - 1 pivot values computed from the sample set are broadcast to the other processes; the entries at indices N/P^2, 2N/P^2, ..., (P - 1)N/P^2 of the sorted sample array form the P - 1 pivot points. In this example, the only pivot point broadcast is 9. (Fig. 9.4E)
• Each process divides its sorted segment of the array into P segments using the broadcast P - 1 pivot values. (Fig. 9.4F)
• Each process performs an all-to-all operation on the P segments. Thus the ith process keeps the ith segment and sends the jth segment to the jth process. (Fig. 9.4G)
• The arriving segments are merged into a single list and locally sorted. (Fig. 9.4H)

An example of this algorithm for P = 2 and N = 8 is illustrated in Fig. 9.4A-H.
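
The sketch below (not the book's code) shows how these steps map onto MPI calls for integer data, assuming the global array size is a multiple of P^2 so that the regular sampling indices are well defined; the function and variable names are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

static int cmp_int(const void *a, const void *b) {
  return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

// Regular sampling parallel sort of n_local integers per process.
void psrs_sort(int *local, int n_local, MPI_Comm comm) {
  int rank, P;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &P);

  qsort(local, n_local, sizeof(int), cmp_int);      // local quicksort

  int *samples = malloc(P * sizeof(int));           // P regular samples per process
  for (int i = 0; i < P; i++) samples[i] = local[i * n_local / P];

  int *all_samples = NULL;                          // gather samples on the root
  if (rank == 0) all_samples = malloc(P * P * sizeof(int));
  MPI_Gather(samples, P, MPI_INT, all_samples, P, MPI_INT, 0, comm);

  int *pivots = malloc((P - 1) * sizeof(int));      // pick and broadcast P-1 pivots
  if (rank == 0) {
    qsort(all_samples, P * P, sizeof(int), cmp_int);
    for (int i = 1; i < P; i++) pivots[i - 1] = all_samples[i * P];
  }
  MPI_Bcast(pivots, P - 1, MPI_INT, 0, comm);

  int *send_counts = calloc(P, sizeof(int));        // split sorted data into P segments
  for (int i = 0, seg = 0; i < n_local; i++) {
    while (seg < P - 1 && local[i] > pivots[seg]) seg++;
    send_counts[seg]++;
  }

  int *recv_counts = malloc(P * sizeof(int));       // all-to-all exchange of segments
  MPI_Alltoall(send_counts, 1, MPI_INT, recv_counts, 1, MPI_INT, comm);
  int *sdispls = malloc(P * sizeof(int)), *rdispls = malloc(P * sizeof(int));
  sdispls[0] = rdispls[0] = 0;
  for (int i = 1; i < P; i++) {
    sdispls[i] = sdispls[i - 1] + send_counts[i - 1];
    rdispls[i] = rdispls[i - 1] + recv_counts[i - 1];
  }
  int total = 0;
  for (int i = 0; i < P; i++) total += recv_counts[i];
  int *merged = malloc(total * sizeof(int));
  MPI_Alltoallv(local, send_counts, sdispls, MPI_INT,
                merged, recv_counts, rdispls, MPI_INT, comm);

  qsort(merged, total, sizeof(int), cmp_int);       // final local sort of arriving segments
  printf("Process %d holds %d sorted values after PSRS\n", rank, total);

  free(samples); free(pivots); free(send_counts); free(recv_counts);
  free(sdispls); free(rdispls); free(merged);
  if (rank == 0) free(all_samples);
}

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, P;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &P);

  int n_local = 4 * P;                              // global size N = 4*P*P
  int *local = malloc(n_local * sizeof(int));
  srand(rank + 1);
  for (int i = 0; i < n_local; i++) local[i] = rand() % 1000;

  psrs_sort(local, n_local, MPI_COMM_WORLD);
  free(local);
  MPI_Finalize();
  return 0;
}
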
9.4 MANAGER-WORKER
Manager-worker incorporates two different workflows in its execution: one intended for execution by just one process, called the manager process, and another intended for execution by several other processes, called worker processes. This approach has also historically been called "master-slave". Applications that are dynamic in nature frequently use this type of parallel design algorithm so that the manager process can coordinate and issue task actions to worker processes in response to changes in a simulation outcome. Many adaptive mesh refinement applications also use this parallel design algorithm because the meshes and data placement patterns change in response to a solution value. Such an adaptive mesh refinement is illustrated in Fig. 9.5. Manager-worker codes frequently take the form illustrated in Code 9.2, where an "if" statement distinguishes the workflow between manager and worker tasks.

if (my_rank == master) {
  send_action(INITIALIZE);
  for (int i=0; i<num_timesteps; i++) {
    send_action(REFINE);
    send_action(INTEGRATE);
    send_action(OUTPUT);
  }
} else {
  listen_for_actions();
}

FIGURE 9.5
Example of a manager-worker adaptive mesh refinement code evolving a dynamic system of two compact objects orbiting each other. The meshes follow the orbiting compact objects as they move clockwise in the computational domain.

Code 9.2 is manager-worker example code adapted from the adaptive mesh refinement code used to generate Fig. 9.5. The manager process (called "master" in this example) directs the refinement characteristics and sends actions to worker processes that are always listening for additional instructions from the manager.
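
The send_action and listen_for_actions routines in Code 9.2 are application specific. One possible realization, sketched below and not taken from the text, sends an integer action code from the manager to each worker, with the workers looping on MPI_Recv until a STOP code arrives (the manager would send STOP after its time-step loop completes); the action codes and the STOP action are illustrative assumptions.

#include <mpi.h>

enum { INITIALIZE, REFINE, INTEGRATE, OUTPUT, STOP };

// Manager side: send an action code to every worker process (manager is rank 0).
static void send_action(int action) {
  int size;
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  for (int w = 1; w < size; w++)
    MPI_Send(&action, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
}

// Worker side: receive and carry out action codes until STOP arrives.
static void listen_for_actions(void) {
  int action = -1;
  while (action != STOP) {
    MPI_Recv(&action, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    switch (action) {
      case INITIALIZE: /* set up local meshes */   break;
      case REFINE:     /* refine local meshes */   break;
      case INTEGRATE:  /* advance the solution */  break;
      case OUTPUT:     /* write local results */   break;
      default:                                     break;
    }
  }
}
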
9.5 EMBARRASSINGLY PARALLEL
The term "embarrassingly parallel" is a common phrase in scientific computing that is both widely used and poorly defined. It suggests lots of parallelism with essentially no intertask communication or coordination, as well as a highly partitionable workload with minimal overhead. In general, embarrassingly parallel algorithms are a subclass of manager-worker algorithms. They are called embarrassingly parallel because the available concurrency is trivially extracted from the workflow. These algorithms sometimes require a reduction operation at the end to gather the results into a manager process. While this does require some minimal coordination and intertask communication, these "almost embarrassingly parallel" algorithms are still generally referred to as embarrassingly parallel.

Monte Carlo simulations mainly fall into the category of embarrassingly parallel. Monte Carlo methods are statistical approaches for studying systems with a large number of coupled degrees of freedom, modeling phenomena with significant uncertainty in the inputs, and solving partial differential equations with more than four dimensions. Computing the value of π is a simple example.

• Define a square domain and inscribe a circle inside that domain.
• Randomly generate the coordinates of points lying inside the square domain; count the points that also lie in the circle.
• π/4 is the ratio of the number of points that lie in the circle to the total number of random points generated.

FIGURE 9.6
When generating random coordinates inside a square, the ratio of the number of points lying inside an inscribed circle to the total number of random points will be π/4. (The figure shows a circle of radius r inscribed in a square of side 2r, with A_circle = πr^2 and A_square = 4r^2.)

The reasoning behind this algorithm is as follows. A circle with radius r inscribed in a square will have an area of πr^2 while the square will have an area of (2r)^2 = 4r^2, as seen in Fig. 9.6. The ratio of the area of the circle to the area of the square will also be the probability that a random point generated in the square lies in the circle. The ratio of these two areas is π/4.
The parallel version of this algorithm is illustrated in Fig. 9.7.

FIGURE 9.7
Embarrassingly parallel example: computing π using statistical methods. The manager is in light gray while the various workers are in black. (In the figure, the manager initializes the variables while each worker repeatedly generates random x, y coordinates and counts the hits inside the circle.)
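
A minimal MPI sketch of this computation is given below; it is not the book's code. Each process samples points in the unit square and counts hits inside the quarter circle, which gives the same π/4 ratio as the inscribed circle, and MPI_Reduce accumulates the counts on the manager (rank 0). The sample count and seeding scheme are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  long local_samples = 1000000;            // samples per process (arbitrary)
  long local_hits = 0, total_hits = 0;
  srand(rank + 1);                         // give each process a different seed

  for (long i = 0; i < local_samples; i++) {
    double x = (double)rand() / RAND_MAX;  // random point in the unit square
    double y = (double)rand() / RAND_MAX;
    if (x * x + y * y <= 1.0) local_hits++;  // point lies inside the quarter circle
  }

  // Reduction at the end: gather the hit counts into the manager process
  MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    double pi = 4.0 * (double)total_hits / (double)(local_samples * size);
    printf("Estimate of pi: %f\n", pi);
  }
  MPI_Finalize();
  return 0;
}
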
9.6 HALO EXCHANGE
Many parallel algorithms fall into a problem class where every parallel task is executing the same algorithm on different data without any manager algorithm present. This is sometimes referred to as the data parallel model. Data parallelism is frequently used in applications that are static in nature because a computational task can be mapped to a particular subset of data throughout the life of the simulation. However, in all but the most simple of applications, some information in each data subset mapped to the parallel task has to be exchanged and synchronized for the application algorithm to function properly. This exchange of intertask information is called halo exchange.

As the name implies, a halo is a region exterior to the data subset mapped to a parallel task. It acts as an artificial boundary to that data subset and contains information that originates from the data subsets of neighboring parallel tasks. A halo is illustrated in Fig. 9.8.

FIGURE 9.8
Illustration of a one-deep halo. In this illustration, various data points (colored dark) are split across four different processes (top figure). For each process there are two boundaries in the data that are interprocess boundaries. For process 0, these are the right and bottom edges of the square. A one-deep halo for process 0 (bottom figure) consists of those data points that are closest to the interprocess boundary of process 0 but not mapped to process 0.

Halo exchange enables each task to perform computations and update the subset of data mapped to that task while having access to any data necessary for such computations that may not be local. Halo exchange is extremely common in parallel toolkits for solving partial differential equations and in linear algebra computations. Two parallel algorithm examples are presented in this section using halo exchange: the advection equation and sparse matrix vector multiplication.
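
Before turning to those examples, the fragment below sketches a one-deep halo exchange in MPI for a one-dimensional domain decomposition; this sketch is not from the text, and the array layout and size are illustrative assumptions. Each process owns f[1..NLOCAL] and keeps copies of its neighbors' edge values in the halo cells f[0] and f[NLOCAL+1]; MPI_PROC_NULL turns the exchanges at the ends of the domain into no-ops.

#include <mpi.h>

#define NLOCAL 100

// Exchange one-deep halos with the left and right neighbors in a 1D decomposition.
void halo_exchange(double f[NLOCAL + 2], MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
  int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

  // Send the rightmost owned value to the right neighbor and receive the
  // left neighbor's rightmost value into the left halo cell.
  MPI_Sendrecv(&f[NLOCAL], 1, MPI_DOUBLE, right, 0,
               &f[0],      1, MPI_DOUBLE, left,  0,
               comm, MPI_STATUS_IGNORE);

  // Send the leftmost owned value to the left neighbor and receive the
  // right neighbor's leftmost value into the right halo cell.
  MPI_Sendrecv(&f[1],          1, MPI_DOUBLE, left,  1,
               &f[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
               comm, MPI_STATUS_IGNORE);
}
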
9.6.1 THE ADVECTION EQUATION USING FINITE DIFFERENCE
Wavelike phenomena permeate nature: examples include light, sound, gravitation, fluid flow, and weather, to name just a few. The study of wavelike phenomena is ubiquitous in supercomputing systems and is frequently modeled using a partial differential equation, or an expression involving derivatives taken against different independent variables. One of the simplest ways to solve these wavelike partial differential equations on a supercomputer is through the use of finite differencing and halo exchange. Finite differencing involves replacing the derivative expressions in the partial differential equation with approximations originating from estimating the slope between neighboring points on a uniform grid.
As an example of this parallel algorithm, consider the advection equation in Eq. (9.1):

∂f/∂t = -v ∂f/∂x    (9.1)

This advection equation transports a scalar field f(x, t) toward increasing x with speed v. The analytic solution to this partial differential equation is

f(x, t) = F(x - vt)    (9.2)

where F(x) is an arbitrary function describing the initial condition of the system. So if the initial condition of the wavelike phenomenon for solution is

F(x) = e^{-x^2}    (9.3)

then the analytic solution to Eq. (9.1) would be

f(x, t) = e^{-(x - vt)^2}    (9.4)

as plotted in Fig. 9.9.
While the advection equation in Eq. (9.1) can be solved analytically with the solution shown in Fig. 9.9, a parallel algorithm based on halo exchange can be crafted to solve this equation numerically. The left- and right-hand partial derivatives in Eq. (9.1) are replaced with finite difference approximations to those derivatives:

(f_i^{n+1} - f_i^n) / dt = -v (f_i^n - f_{i-1}^n) / dx

where the field f(x, t) has been discretized to a
