`
`
`
`
`
`Exhibit D
`Part 2 of 2
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 2 of 185 PageID #: 2425
`
`INFORMATION TO USERS
`
`This manuscript has been reproduced from the microfilm master. UMI
`films the text directly from the original or copy submitted. Thus, some
`thesis and dissertation copies are in typewriter face, while others may
`be from any type of computer printer.
`
`The quality of this reproduction is dependent upon the quality of the
`copy submitted. Broken or indistinct print, colored or poor quality
`illustrations and photographs, print bleedthrough, substandard margins,
`and improper alignment can adversely affect reproduction.
`
`In the unlikely event that the author did not send UMI a complete
`manuscript and there are missing pages, these will be noted. Also, if
`unauthorized copyright material had to be removed, a note will indicate
`the deletion.
`
`Oversize materials (e.g., maps, drawings, charts) are reproduced by
`sectioning the original, beginning at the upper left-hand corner and
`continuing from left to right in equal sections with small overlaps. Each
`original is also photographed in one exposure and is included in
`reduced form at the back of the book.
`
`Photographs included in the original manuscript have been reproduced
`xerographically in this copy. Higher quality 6" x 9" black and white
`photographic prints are available for any photographs or illustrations
`appearing in this copy for an additional charge. Contact UMI directly
`to order.
`
`UMI
`
`University Microfilms International
`A Bell & Howell Information Company
`300 North Zeeb Road Ann Arbor. Ml 48106-1346 USA
`313'761-4700 800/521-0600
`
`BECKMAN00000189
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 3 of 185 PageID #: 2426
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 3 of 185 PagelD #: 2426
`
`BECKMANO00000190
`
`BECKMAN00000190
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 4 of 185 PageID #: 2427
`
`Order Number 9503201
`
`Simplified expression of message-driven p r o g r a ms a nd
`quantification of t h e ir i m p a ct on p e r f o r m a n ce
`
`Gursoy, Attila, Ph.D.
`
`University of Illinois at Urbana-Champaign, 1994
`
`Copyright ©1994 by Gursoy, Attila. All rights reserved.
`
`UMI
`
`300N.ZeebRd.
`Ann Aibor, MI 48106
`
`BECKMAN00000191
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 5 of 185 PageID #: 2428
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 5 of 185 PagelD #: 2428
`
`BECKMANO00000192
`
`BECKMAN00000192
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 6 of 185 PageID #: 2429
`
`SIMPLIFIED EXPRESSION OF MESSAGE-DRIVEN PROGRAMS AND
`QUANTIFICATION OF THEIR IMPACT ON PERFORMANCE
`
`BY
`
`ATTILA GURSOY
`
`B.Sc, Middle East Technical University, 1986
`M.S., Bilkent University, 1988
`
`THESIS
`
`Submitted in partial fulfillment of the requirements
`for the degree of Doctor of Philosophy in Computer Science
`in the Graduate College of the
`University of Illinois at Urbana-Champaign, 1994
`
`Urbana, Illinois
`
`BECKMAN00000193
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 7 of 185 PageID #: 2430
`
`UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
`
`T HE GRADUATE COLLEGE
`
`APRIL 1994
`
`WE HEREBY RECOMMEND THAT T HE THESIS BY
`
`ATTILA GURSOY
`
`E N T I T L ED
`
`SIMPLIFIED EXPRESSION OF MESSAGE-DRIVEN PROGRAMS AND
`
`QUANTIFICATION OF THEIR IMPACT ON PERFORMANCE
`
`BE ACCEPTED IN PARTIAL FULFILLMENT OF T HE REQUIREMENTS FOR
`
`T HE DEGREE OF_
`
`DOCTOR OF PHILOSOPHY
`
`Director oi TTiesis Research
`
`Head of Department
`
`Committee on Final Examination!
`AViU,
`
`/W&
`
`Chairperson
`
`WLmJL ^ l l C ol
`
`-jam
`
`t Required for doctor's degree but not for master's
`
`BECKMAN00000194
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 8 of 185 PageID #: 2431
`
`©Copyright by
`
`Attila Gursoy
`
`1994
`
`BECKMAN00000195
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 9 of 185 PageID #: 2432
`
`Communication latency and unpredictable delays in remote response times constitute sig(cid:173)
`nificant impediments to achieving high performance on massively parallel computers. Message-
`driven execution is a promising technique to improve the performance of parallel computations
`by overlapping these delays with useful computation. This thesis explores message-driven exe(cid:173)
`cution for improving performance of parallel programs. Programming in message-driven style
`is difficult due to the split-phase transactions it requires and due to the nondeterministic ar(cid:173)
`rival of messages. We developed language constructs to express dependences between messages
`and computations in order to simplify expression of message-driven programs. Predicting the
`performance of message-driven programs via simulations is difficult because the arrival order
`of messages changes as the machine characteristics change. We developed a trace-driven sim(cid:173)
`ulation methodology based on the those language constructs. We also conducted an extensive
`performance study of message-driven programs.
`
`iii
`
`BECKMAN00000196
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 10 of 185 PageID #: 2433
`
`ACKNOWLEDGEMENTS
`
`I would first like to thank my advisor, Professor L.V. Kale, for his guidance and support
`
`during my doctoral research. He was a constant source of assistance, and his brilliant ideas
`
`were always stimulating. I also wish to thank Professor S.P. Vanka for being my co-advisor and
`
`for his support. I thank the other members of my committee: Professor M. Heath, D. Reed,
`
`and F. Saied, for their helpful recommendations.
`
`Thanks to the members of the Parallel Programming Laboratory, including S.Krishnan,
`
`E.Kornkven, and A. Sinha, for providing a wonderful working environment and for listening to
`
`my practice presentations.
`
`I also want to express my thanks to the Scientific and Technical Research Council of Turkey
`
`for providing me with the opportunity to pursue my doctoral studies at the University of Illinois.
`
`Many of my friends deserve credit for their help. I would especially like to thank Maria Jose
`
`Gonzalez for her loving support and encouragement.
`
`Finally, my profound thanks go to my parents: Ali and Aysegiil Gursoy, and to my sister,
`
`Ayla Gursoy. I cannot express with words how grateful I am for their constant support and
`
`love.
`
`iv
`
`BECKMAN00000197
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 11 of 185 PageID #: 2434
`
`TABLE OF CONTENTS
`
`Chapter
`
`1
`
`Introduction
`1.1 Remote information access latency
`1.2 Reducing the impact of latency: message-driven execution
`1.3 Thesis objectives
`1.3.1 Selection of computation domain for the performance study
`1.4 Main contribution of the thesis
`1.5 Thesis organization
`
`1
`2
`2
`4
`5
`6
`8
`
`2 Message-driven execution and Charm
`2.1 Traditional SPMD model
`2.1.1 Overlapping communication in SPMD
`2.1.2 Traditional SPMD is inadequate to develop efficient large programs
`2.2 Message-driven execution
`2.2.1 Message-driven execution supports modularity
`2.3 The potential benefits of message-driven execution
`2.4 Emulating message-driven style in SPMD
`2.4.1 Using nested if blocks
`2.4.2 Using a global while-switch loop
`2.5 Charm - a message-driven system
`2.5.1 The Charm language
`2.5.2 The Charm runtime
`2.6 Related work on latency tolerance
`
`10
`11
`11
`. .. 14
`15
`15
`16
`20
`20
`21
`24
`24
`27
`28
`
`3 Two obstacles
`3.1 Programming difficulties of message-driven style
`3.1.1 Nondeterministic message arrival
`3.1.2 Obscure flow of control
`3.2 Performance prediction: how to simulate message-driven computations
`
`4 Controlling the complexity of message-driven programs
`4.1 Divide-And-Conquer: a simpler context
`4.1.1 Language definition
`4.1.2 Data declarations
`4.1.3 Blocks
`4.1.4 Statements
`
`31
`31
`32
`35
`36
`
`38
`39
`40
`40
`41
`42
`
`v
`
`BECKMAN00000198
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 12 of 185 PageID #: 2435
`
`4.1.5 An example
`4.2 Dagger
`4.2.1 Expecting a message
`4.2.2 An example
`4.2.3 Basic language
`4.2.4 Dag-chare example
`4.2.5 Extended language
`4.2.6 Expressing loops in dag-chare
`4.2.7 Reference numbers
`4.2.8 Pipelining independent iterations of a loop
`4.2.9 Multiple message entry points
`4.2.10 Translation and runtime of Dagger
`4.3 Message-driven libraries
`4.3.1 Problems with libraries in SPMD style
`4.3.2 Message-driven execution and library interface techniques
`4.4 Related work
`
`5 Simulating message-driven programs
`5.1
`Introduction
`5.2 A sufficient condition for accurate simulation
`5.3 Dagger programs and automatic trace generation
`5.4 The parallel machine model
`5.5 Simulator
`5.5.1 Preprocessor
`5.5.2 Parallel machine simulator
`5.5.3
`Interpreting the traces
`
`6 Performance studies
`6.1 Description of benchmarks
`6.1.1 Synthetic benchmarks
`6.1.2 Concurrent reductions
`6.1.3 Harlow-Welch
`6.1.4 Conjugate Gradient
`6.2 Effects of network latency
`6.2.1 Synthetic benchmarks
`6.2.2 Concurrent reductions
`6.2.3 Harlow-Welch
`6.2.4 Conjugate Gradient
`6.3 Effects of coprocessor
`6.4 Random variations in latencies
`6.5 Load balance versus critical path
`6.5.1 Load balanced spanning trees and message-driven execution
`6.5.2 Complementary spanning trees for multiple reductions
`6.6 Effects of message scheduling
`6.6.1 Preemptive scheduling
`6.6.2 Priority-based scheduling
`
`vi
`
`42
`44
`44
`45
`48
`50
`50
`52
`58
`59
`60
`62
`66
`66
`67
`72
`
`79
`79
`81
`83
`84
`88
`88
`89
`90
`
`92
`93
`94
`96
`101
`110
`114
`114
`117
`118
`124
`124
`132
`133
`139
`140
`141
`141
`145
`
`BECKMAN00000199
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 13 of 185 PageID #: 2436
`
`6.7 Conclusion
`
`7 Conclusion
`7.1 Future work
`
`Bibliography
`
`Vita
`
`152
`
`155
`157
`
`159
`
`168
`
`vii
`
`BECKMAN00000200
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 14 of 185 PageID #: 2437
`
`LIST OF TABLES
`
`5.1 Machine parameters
`5.2 Events in a when-block trace
`
`6.1 CG results on NCUBE/2
`6.2 Communication parameters for the coprocessor experiment
`6.3 Communication parameter settings in the variable latency test
`6.4 Effects of branching factor
`6.5 Effect of complementary spanning trees
`
`86
`89
`
`113
`127
`132
`140
`141
`
`viii
`
`BECKMAN00000201
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 15 of 185 PageID #: 2438
`
`LIST OF FIGURES
`
`1.1 Remote information access latency.
`
`2.1 Simple SPMD codes (a) with message passing primitives (b) with library/module
`calls
`2.2 Rearranging send and receives (a) a sample code (b) rearranged code
`2.3 Processor utilization for (a) naive code (b) rearranged code
`2.4 SPMD modules cannot share the processor time
`2.5 Message-driven modules share the processor time
`2.6 s independent threads
`2.7 Overlapping latency
`2.8 Latency tolerance
`2.9 Effect of overhead
`2.10 Using nested if's to simulate message-driven execution
`2.11 A global while-switch construct to simulate message-driven execution
`2.12 Chare definition
`2.13 A branch office - ring program
`2.14 Multiple modules in Charm (a) module M2 accesses entities in module Ml (b)
`interface module for Ml HI.interface
`
`Incorrect message-driven code
`3.1
`3.2 Correct message-driven code
`3.3 Flow of control (a) SPMD (b) pure message-driven (c) partial order
`
`3
`
`12
`13
`13
`14
`16
`17
`17
`19
`20
`22
`22
`25
`26
`
`28
`
`33
`34
`36
`
`4.1 Divide-and-conquer node definition
`4.2 Node definition to compute Fibonacci numbers
`4.3 A message triggers a computation (a) in pure message-driven (b) in Dagger.
`4.4 Matrix multiplication chare
`4.5 Matrix multiplication dag-chare
`4.6 Dag-chare template
`4.7 Dag-chare illustrating adaptive overlapping
`4.8 Red-black Gauss-Seidel (a) partitions (b) dependences on one processor
`4.9 Dag-chare for Gauss-Seidel red-black relaxation
`4.10 Jacobi (a) partitioning (b) dependences on one processor
`4.11 Partial dag-chare for Jacobi relaxation
`4.12 Out of order messages
`4.13 Loop structure in (a) red-black (b) Jacobi
`4.14 Correct Jacobi relaxation with reference numbers
`
`41
`43
`. . 44
`46
`47
`48
`51
`53
`55
`56
`56
`57
`58
`59
`
`ix
`
`BECKMAN00000202
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 16 of 185 PageID #: 2439
`
`4.15 Pipelining loop iterations
`4.16 A reduction dag-chare illustrating multiple message entries
`4.17 Translation of a dag-chare
`4.18 Library call
`4.19 Using the reduction library
`4.20 Distributed Processes example
`4.21 Ada select statement
`
`5.1 A dag computation
`5.2 The parallel machine
`5.3 Sending a message
`5.4 Message queues
`5.5 Simulation system
`
`6.1 Synthetic benchmark Wave (a) message-driven (b) SPMD
`6.2 Synthetic benchmark Mlib (a) message-driven (b) SPMD
`6.3 Concurrent reductions (a) SPMD (b) message-driven
`6.4 Pipelining and effects of overhead
`6.5 Concurrent reductions: effect of number of partitions on NCUBE/2
`6.6 Tolerating latency: concurrent reductions on NCUBE/2
`6.7 One time step of Harlow-Welch algorithm
`6.8 Decomposition of the computational domain
`6.9 Jacobi message-driven code
`6.10 Red-Blackl message-driven code
`6.11 Dependences of the Stone's method
`6.12 Stone's method: effect of pipelining
`6.13 Stone's method on NCUBE/2 (a) without reductions (b) with reductions
`6.14 Stone's method message-driven code
`6.15 Dependences in the modified CG method
`6.16 Effects of network latency: Synthetic Wave
`6.17 Effects of network latency: Synthetic Mlib
`6.18 Effects of network latency: Synthetic Mlib with varying computation load
`6.19 Concurrent reductions: effect of network latency an e t, and /3nct
`6.20 Harlow Welch with Jacobi: effect of network latency anet, and /3net (a) (Jnet = 0
`(b) pnet = 1(c) &,«, = 10
`6.21 Harlow Welch with Red-Blackl: effect of network latency an e t, and /3net (a)
`Pnet = 0 (b) &,,, = 1 (c) pnet = 10
`6.22 Harlow Welch with Red-Black2: effect of network latency anet, and Pnet (a)
`/Jnet = 0 (b) (inct = 1 (c) 0net = 10
`6.23 Harlow Welch Stone: effect of network latency anct, and P„et (a) Pnet = 0 (b)
`Pnet = 1 (C) Pnet = 10
`6.24 Conjugate Gradient (model):effect of network latency anet,and pnet (a) Pnet = 0
`(b) pnet = 1 (c) pnet = 10
`6.25 Effect of the coprocessor - concurrent reductions. Sum of processor and copro(cid:173)
`cessor delays (a) 1000 (b) 10000 units
`6.26 Effect of the coprocessor - multiplane Jacobi. Sum of processor and coprocessor
`delays (a) 1000 (b) 10000 units
`
`x
`
`60
`62
`63
`68
`71
`74
`76
`
`82
`85
`85
`87
`88
`
`95
`97
`98
`100
`101
`102
`102
`102
`104
`105
`107
`109
`I ll
`112
`113
`115
`116
`117
`119
`
`120
`
`122
`
`123
`
`125
`
`126
`
`129
`
`130
`
`BECKMAN00000203
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 17 of 185 PageID #: 2440
`
`6.27 Effect of the coprocessor - multiplane Stone's method. Sum of processor and
`coprocessor delays (a) 1000 (b) 10000 units
`6.28 Variable network latencies - Synthetic Wave
`6.29 Variable network latencies - Synthetic Mlib
`6.30 Variable network latencies - Concurrent reductions
`6.31 Variable network latencies - Multiplane Jacobi
`6.32 Variable network latencies - Multiplane Stone
`6.33 Variable network latencies - One plane Red-Black2
`6.34 Load balance of spanning trees
`6.35 A scheduling problem
`6.36 Reductions with interrupts
`6.37 Reductions with higher priority
`6.38 Two different schedules
`6.39 Multi-plane Jacobi (a) load index (b) concurrency index
`6.40 Multi-plane Jacobi : FIFO versus priority.
`6.41 Multi-plane Stone (a) load index (b) concurrency index
`6.42 Multi-plane Stone : FIFO versus priority.
`6.43 Greedy scheduling example (a) FIFO (b) forced static scheduling
`
`131
`133
`134
`135
`136
`137
`138
`140
`143
`144
`146
`148
`149
`149
`150
`150
`151
`
`xi
`
`BECKMAN00000204
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 18 of 185 PageID #: 2441
`
`Chapter 1
`
`Introduction
`
`Parallel computers potentially offer much more computational power than uniprocessor ma(cid:173)
`
`chines. Although recent advances in VLSI and microprocessor technology have improved the
`
`performance of uniprocessor systems, there are many problems that require computational
`
`power well beyond what uniprocessor systems can provide. Therefore, substantial effort is
`
`devoted to developing efficient and cost effective parallel machines.
`
`Currently, there are many commercial parallel computers with very high peak performance.
`
`Machines with tens of gigaflops capacity are already available, and teraflops machines are ex(cid:173)
`
`pected in the next decade. However, their performance falls substantially short of peak on many
`
`real life applications. A number of factors contribute to this performance loss. For example, the
`
`sequential performance of individual processors itself may fall short of the peak power because
`
`of cache performance and inability to exploit fully features such as multiple instruction issues,
`
`pipelining, etc. Hopefully, this problem can be solved eventually by better compilers, use of
`
`optimized libraries such as BLAS routines [28], and occasional use of assembly code. In any
`
`case, this factor is not specific to parallel computing, as performance of workstations is also
`
`affected by it.
`
`An important factor that is specific to parallel computing is the loss of performance due to
`
`communication latencies and processor idling due to load imbalances and critical paths. These
`
`factors and methods for overcoming them are the focus of this thesis.
`
`1
`
`BECKMAN00000205
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 19 of 185 PageID #: 2442
`
`1.1 Remote information access latency
`
`A parallel computation is a collection of co-operating processes. The processes and data are
`
`distributed across the parallel machine. Typically, these concurrent processes interact with each
`
`other during the course of computation. This interaction might be due to synchronization or to
`
`data access. In either case, a process needs information that is created or stored by some other
`
`processes. If the needed information is on some other processor, then access to this remote
`
`information will be slower than that to local information. This slowdown generally occurs for
`
`two reasons.
`
`One of the reasons is the delay introduced by the communication network. The communi(cid:173)
`
`cation networks of large private memory machines are usually multi-hop switching networks.
`
`Messages experience delay of transmission, routing, and buffering. This delay, which we will
`
`call communication delay, is defined as the time interval between the moment data enter into
`
`the communication network and the time the data become available at the destination.
`
`The other source that contributes to the remote information access latency is the delay in
`
`the creation of the information at the remote site (Figure 1.1). Process 1 sends a request for
`
`some particular information. When process 2 receives the request, it performs the required
`
`service, creates the information and sends it back. The time interval between the arrival of
`
`the request and the completion of the requested service is the response delay. The response
`
`delay can be longer than the service time itself and is often unpredictable. For example, the
`
`processor might be scheduled for some other task, or cache performance might delay processing.
`
`Response delays take place not only in request-and-respond type of interaction; it equally affects
`
`prearranged communication (i.e., prepare-and-send) where no request is necessary.
`
`These two types of delays together form the remote information access latency, simply
`
`referred to as latency. Minimizing the impact of this latency is a major objective in parallel
`
`programming on parallel machines, and particularly on massively parallel scalable machines.
`
`1.2 Reducing the impact of latency: message-driven execution
`
`The impact of latency can be reduced in several ways. On the hardware side, this is addressed
`
`by designing architectures that attempt to reduce the communication latency to the minimum.
`
`The ALLCACHE architecture of the KSR-1 machine, and the message-processor architecture
`
`2
`
`BECKMAN00000206
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 20 of 185 PageID #: 2443
`
`overhead
`communication latency
`
`response delay
`
`Process 1
`
`Process 2
`
`Figure 1.1: Remote information access latency.
`
`of J-Machine [22] are examples of these attempts as well as the continuous evolution of commu(cid:173)
`
`nication hardware in the traditional architectures of Intel and NCUBE machines [77]. However,
`
`physical reality dictates that remote access will always be significantly slower than local access
`
`(overhead can be reduced by better operating system support such as Active messages [31],
`
`SUNMOS).
`
`Since latency cannot be eliminated completely, one may try to minimize the number of
`
`remote data accesses. This requires the placement of computations and data to be reorganized
`
`so that data accesses are localized as much as possible [43, 92].
`
`A complementary approach to reduce the impact of latency, on the software side, is to over(cid:173)
`
`lap the delay with some useful computation. This idea — doing something else useful while
`
`waiting for data — has been used at different levels in computer systems. Multiple functional
`
`units of a cpu may overlap independent operations. For example, while a floating-point com(cid:173)
`
`putation unit is busy, the next instruction and data could be fetched from the memory. In the
`
`prevalent parallel programming paradigm (the traditional SPMD paradigm, see Section 2.1),
`
`such overlap is enhanced by moving sends earlier and moving receives later. However, this
`
`3
`
`BECKMAN00000207
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 21 of 185 PageID #: 2444
`
`strategy is not sufficiently adaptive to the eventualities that may arise at runtime (see Chap(cid:173)
`
`ter 2).
`
`Message-driven execution is a promising technique in this regard. In the message-driven
`
`execution style (which is distinct from message-passing), there are typically many processes
`
`per processor. A process does not block the processor on which it is running while trying to
`
`receive a message. Instead, the system activates a process when there is a message for it. This
`
`improves latency tolerance in two ways. First, when one process is waiting for data from a
`
`remote process, another ready process may be scheduled for execution. Second, even a single
`
`process may wait for multiple data items simultaneously, and continue execution whenever any
`
`of the expected items arrive. If there are multiple subcomputations, each with f* and ti running
`
`time respectively, the message-driven running time is ti +<2 - tov„iav as explained in Chapter 2.
`
`As we will see later, this leads to adaptive overlap and modularity.
`
`1.3 Thesis objectives
`
`This thesis will explore the message-driven execution technique for tolerating communication
`
`latencies. The major objectives of this thesis are:
`
`Simplification of expression of message-driven execution:
`
`Although message-driven execution imparts the benefits alluded to above, it often extracts
`
`a price in the form of apparent program complexity. The split-phase or continuation-
`
`passing style of programming that it requires is sometimes non-intuitive and obfuscates
`
`the flow of control. As the system may execute messages in the order it receives them (as
`
`opposed to a deterministic order imposed by sequential receive statements), the program(cid:173)
`
`mer must deal with all possible orderings of messages. This involves complex reasoning
`
`about which message-orderings will not arise, which are harmless, and which must be dealt
`
`with by buffering, counters, and flags. Therefore, it is desirable and important to simplify
`
`the expression of message-driven programs, possibly via new languages or annotations.
`
`Performance prediction of message-driven execution:
`
`Predicting the performance of message-driven computations via simulations is necessary
`
`to evaluate the benefits of message driven execution under various machine characteristics.
`
`In such computations, the sequence in which messages are processed is not fully specified.
`
`4
`
`BECKMAN00000208
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 22 of 185 PageID #: 2445
`
`As a result, at runtime messages may be executed in different sequences depending on
`
`runtime conditions. This makes simulation and performance prediction of message-driven
`
`programs a challenging task.
`
`Evaluation of the performance impact of message driven execution:
`
`The ability to overlap computation with communication provides message-driven exe(cid:173)
`
`cution with the potential to improve the performance of a parallel program. However,
`
`this potential benefit needs to be established by performance studies. Such performance
`
`studies will help to determine:
`
`• the effectiveness of message-driven execution in increasing performance,
`
`• the factors that influence the performance of message driven programs,
`
`• the conditions under which these performance benefits are realizable.
`
`1.3.1
`
`Selection of c o m p u t a t i on d o m a in for t he p e r f o r m a n ce s t u dy
`
`The performance studies in this thesis involve simulation and real machine implementation
`
`of selected computations. For dynamic tree-structured computations such as heuristic search,
`
`divide-and-conquer, game tree search, and evaluation of functional and logic languages, the
`
`performance advantages of message-driven execution is intuitively quite clear. In such problems,
`
`the message-driven strategy can adapt to the unpredictable nature of communication latencies
`
`and computational variations better than the conventional message passing style. For example,
`
`while evaluating a node of an and-or tree, one cannot predict if the node will be a simple leaf
`
`node or will lead to a large subtree. The structure of the tree is data dependent and dynamic.
`
`Due to the unpredictable nature of parallelism, one must assign many tree nodes per processor.
`
`As each node has a complex behavior, it is natural to treat each node as an individual object.
`
`The system must then process every incoming message, decide whether it involves creation
`
`of a new object (i.e., creation of a node), or it is a response from one of the subproblems
`
`of an and-or tree node, and activate the corresponding node to process this message. Each
`
`node itself must be willing to accept solutions (or a failure to find solutions) for any one of
`
`the multiple subproblems it may have started and create new subproblems that may depend
`
`on such solutions. All of this naturally leads to message-driven style for implementation. For
`
`5
`
`BECKMAN00000209
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 23 of 185 PageID #: 2446
`
`these reasons, many researchers in these areas have used message driven-execution, implicitly
`
`or explicitly [19, 63, 82].
`
`A more interesting question, therefore, is whether message-driven execution is beneficial for
`
`static computations — computations whose overall structure is known ahead of time. Therefore,
`
`investigating the performance advantages of message-driven execution for such computations
`
`will help answer an interesting and important question; if such advantages are established,
`
`it will lead to a stronger endorsement of the message-driven execution. Despite their static
`
`nature, communication latencies and their associated unpredictability make these computations
`
`potential candidates for message-driven execution.
`
`Computational Fluid Dynamics (CFD) appears to be a proper area to conduct the per(cid:173)
`
`formance studies because (a) many CFD algorithms have a static structure, and (b) it is an
`
`important application area for parallel computing. A review of many CFD applications reveals
`
`that many of them use a few numerical kernel algorithms that have regular computational and
`
`communication structure [4, 7, 10, 11, 24, 42, 75]. We chose a few of these core problems for
`
`our study as described in Chapter 6.
`
`1.4 Main contribution of the thesis
`
`• The thesis develops a set of arguments and lines of reasoning with examples highlighting
`
`the differences between SPMD and message-driven programs and argues for the potential
`
`advantages of message-driven execution.
`
`• The thesis research involves development and implementation of a language, Dagger, for
`
`simplified expression of message-driven programs.
`
`• A methodology is devised for trace-driven simulation of message-driven programs, and a
`
`simulator based upon it is implemented.
`
`• The thesis includes an extensive performance study involving several synthetic and real
`
`benchmarks, and performance measurement on real machines as well as simulations with
`
`varying machine parameters.
`
`The idea of message-driven execution was first proposed by Hewitt [53] and later was elab(cid:173)
`
`orated by Agha [1]. The work on dataflow also relates to the same idea at a finer grain and
`
`6
`
`BECKMAN00000210
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 24 of 185 PageID #: 2447
`
`at the hardware level. However, there has not been an extensive comparison of this approach
`
`with the now dominant message passing model (receive based SPMD model that is elaborated
`
`in Chapter 2). We argue that modularity and efficiency - in the form of overlapping commu(cid:173)
`
`nication with computation - can be achieved much more easily in message-driven execution
`
`than in the traditional message passing paradigm. We also show why it is not adequate to
`
`emulate message-driven execution with the message passing style. Section 2.6 discusses other
`
`approaches related to message-driven execution with the Charm system. Charm was one of the
`
`first message-driven systems and is used as an implementation substrate in this thesis.
`
`Programming in message-driven style is difficult due to the split-phase style of programming
`
`it requires and due to the nondeterministic arrival of messages. Therefore, the synchronization
`
`of messages and subcomputations in a process is necessary to maintain the integrity of the
`
`computation. Dagger expresses the partial orders among subcomputations and messages within
`
`an object, and yet retains the advantages of message-driven execution (adaptive scheduling of
`
`subcomputations based on message arrivals).
`
`The issue of synchronization has been dealt with extensively in past research starting from
`
`the mid 1970's. However, many of the schemes require the caller of a method to block until the
`
`called object finished its service. In most of these approaches, the server is also blocked in a
`
`variety of contexts. None of these features can be used effectively to express adaptive overlap
`
`of different computations based on availability of data, which is an important requirement
`
`for message-driven execution. The distributed process model by Hansen [50] comes closest to
`
`Dagger in many respects. This model as well as others are compared to Dagger in Section 4.4.
`
`More recent work on the Actor model and other concurrent object oriented languages also
`
`address the issue of local synchronization within an object. These studies were also summarized
`
`in Section 4.4.
`
`Although trace-driven simulation is a very efficient method for predicting performance of
`
`parallel programs under varying runtime conditions, such simulation methodology has not been
`
`applied to message-driven programs or to programs that use wild-card receives. The basic
`
`difficulty in such simulations is the fact that if two messages are received in different orders
`
`under different runtime conditions, then the behavior of the program may be altered in a
`
`way that cannot be reconstructed based on the traces obtained with the first sequence. The
`
`thesis presents a methodology for simulation of message-driven programs that exploits the
`
`7
`
`BECKMAN00000211
`
`
`
`Case 2:21-cv-00040-JRG Document 94-4 Filed 10/12/21 Page 25 of 185 PageID #: 2448
`
`dependence information provided by Dagger at compile time combined with a special method
`
`for obtaining runtime traces. We know of no other results that can carry out performance
`
`analysis or simulations in a trace driven manner for such programs. Further, existing systems
`
`do not handle the case of a different message ordering during simulation.
`
`The thesis includes an extensive performance study involving several synthetic and real
`
`benchmarks, and performance measurement