Exhibit D
Part 2 of 2

INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyrighted material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

UMI
University Microfilms International
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700  800/521-0600

Order Number 9503201

Simplified expression of message-driven programs and
quantification of their impact on performance

Gursoy, Attila, Ph.D.

University of Illinois at Urbana-Champaign, 1994

Copyright ©1994 by Gursoy, Attila. All rights reserved.

UMI
300 N. Zeeb Rd.
Ann Arbor, MI 48106

SIMPLIFIED EXPRESSION OF MESSAGE-DRIVEN PROGRAMS AND
QUANTIFICATION OF THEIR IMPACT ON PERFORMANCE

BY

ATTILA GURSOY

B.Sc., Middle East Technical University, 1986
M.S., Bilkent University, 1988

THESIS

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1994

Urbana, Illinois

UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

THE GRADUATE COLLEGE

APRIL 1994

WE HEREBY RECOMMEND THAT THE THESIS BY

ATTILA GURSOY

ENTITLED

SIMPLIFIED EXPRESSION OF MESSAGE-DRIVEN PROGRAMS AND
QUANTIFICATION OF THEIR IMPACT ON PERFORMANCE

BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE DEGREE OF DOCTOR OF PHILOSOPHY

Director of Thesis Research

Head of Department

Committee on Final Examination†

Chairperson

† Required for doctor's degree but not for master's

© Copyright by
Attila Gursoy
1994

ABSTRACT

Communication latency and unpredictable delays in remote response times constitute significant impediments to achieving high performance on massively parallel computers. Message-driven execution is a promising technique to improve the performance of parallel computations by overlapping these delays with useful computation. This thesis explores message-driven execution for improving the performance of parallel programs. Programming in message-driven style is difficult due to the split-phase transactions it requires and due to the nondeterministic arrival of messages. We developed language constructs to express dependences between messages and computations in order to simplify the expression of message-driven programs. Predicting the performance of message-driven programs via simulations is difficult because the arrival order of messages changes as the machine characteristics change. We developed a trace-driven simulation methodology based on those language constructs. We also conducted an extensive performance study of message-driven programs.

ACKNOWLEDGEMENTS

I would first like to thank my advisor, Professor L. V. Kale, for his guidance and support during my doctoral research. He was a constant source of assistance, and his brilliant ideas were always stimulating. I also wish to thank Professor S. P. Vanka for being my co-advisor and for his support. I thank the other members of my committee, Professors M. Heath, D. Reed, and F. Saied, for their helpful recommendations.

Thanks to the members of the Parallel Programming Laboratory, including S. Krishnan, E. Kornkven, and A. Sinha, for providing a wonderful working environment and for listening to my practice presentations.

I also want to express my thanks to the Scientific and Technical Research Council of Turkey for providing me with the opportunity to pursue my doctoral studies at the University of Illinois.

Many of my friends deserve credit for their help. I would especially like to thank Maria Jose Gonzalez for her loving support and encouragement.

Finally, my profound thanks go to my parents, Ali and Aysegül Gursoy, and to my sister, Ayla Gursoy. I cannot express with words how grateful I am for their constant support and love.

TABLE OF CONTENTS

1 Introduction
  1.1 Remote information access latency
  1.2 Reducing the impact of latency: message-driven execution
  1.3 Thesis objectives
    1.3.1 Selection of computation domain for the performance study
  1.4 Main contribution of the thesis
  1.5 Thesis organization

2 Message-driven execution and Charm
  2.1 Traditional SPMD model
    2.1.1 Overlapping communication in SPMD
    2.1.2 Traditional SPMD is inadequate to develop efficient large programs
  2.2 Message-driven execution
    2.2.1 Message-driven execution supports modularity
  2.3 The potential benefits of message-driven execution
  2.4 Emulating message-driven style in SPMD
    2.4.1 Using nested if blocks
    2.4.2 Using a global while-switch loop
  2.5 Charm - a message-driven system
    2.5.1 The Charm language
    2.5.2 The Charm runtime
  2.6 Related work on latency tolerance

3 Two obstacles
  3.1 Programming difficulties of message-driven style
    3.1.1 Nondeterministic message arrival
    3.1.2 Obscure flow of control
  3.2 Performance prediction: how to simulate message-driven computations

4 Controlling the complexity of message-driven programs
  4.1 Divide-And-Conquer: a simpler context
    4.1.1 Language definition
    4.1.2 Data declarations
    4.1.3 Blocks
    4.1.4 Statements
    4.1.5 An example
  4.2 Dagger
    4.2.1 Expecting a message
    4.2.2 An example
    4.2.3 Basic language
    4.2.4 Dag-chare example
    4.2.5 Extended language
    4.2.6 Expressing loops in dag-chare
    4.2.7 Reference numbers
    4.2.8 Pipelining independent iterations of a loop
    4.2.9 Multiple message entry points
    4.2.10 Translation and runtime of Dagger
  4.3 Message-driven libraries
    4.3.1 Problems with libraries in SPMD style
    4.3.2 Message-driven execution and library interface techniques
  4.4 Related work

5 Simulating message-driven programs
  5.1 Introduction
  5.2 A sufficient condition for accurate simulation
  5.3 Dagger programs and automatic trace generation
  5.4 The parallel machine model
  5.5 Simulator
    5.5.1 Preprocessor
    5.5.2 Parallel machine simulator
    5.5.3 Interpreting the traces

6 Performance studies
  6.1 Description of benchmarks
    6.1.1 Synthetic benchmarks
    6.1.2 Concurrent reductions
    6.1.3 Harlow-Welch
    6.1.4 Conjugate Gradient
  6.2 Effects of network latency
    6.2.1 Synthetic benchmarks
    6.2.2 Concurrent reductions
    6.2.3 Harlow-Welch
    6.2.4 Conjugate Gradient
  6.3 Effects of coprocessor
  6.4 Random variations in latencies
  6.5 Load balance versus critical path
    6.5.1 Load balanced spanning trees and message-driven execution
    6.5.2 Complementary spanning trees for multiple reductions
  6.6 Effects of message scheduling
    6.6.1 Preemptive scheduling
    6.6.2 Priority-based scheduling
  6.7 Conclusion

7 Conclusion
  7.1 Future work

Bibliography

Vita

LIST OF TABLES

5.1 Machine parameters
5.2 Events in a when-block trace

6.1 CG results on NCUBE/2
6.2 Communication parameters for the coprocessor experiment
6.3 Communication parameter settings in the variable latency test
6.4 Effects of branching factor
6.5 Effect of complementary spanning trees

LIST OF FIGURES

1.1 Remote information access latency

2.1 Simple SPMD codes (a) with message passing primitives (b) with library/module calls
2.2 Rearranging sends and receives (a) a sample code (b) rearranged code
2.3 Processor utilization for (a) naive code (b) rearranged code
2.4 SPMD modules cannot share the processor time
2.5 Message-driven modules share the processor time
2.6 s independent threads
2.7 Overlapping latency
2.8 Latency tolerance
2.9 Effect of overhead
2.10 Using nested if's to simulate message-driven execution
2.11 A global while-switch construct to simulate message-driven execution
2.12 Chare definition
2.13 A branch office - ring program
2.14 Multiple modules in Charm (a) module M2 accesses entities in module M1 (b) interface module for M1, M1.interface

3.1 Incorrect message-driven code
3.2 Correct message-driven code
3.3 Flow of control (a) SPMD (b) pure message-driven (c) partial order

4.1 Divide-and-conquer node definition
4.2 Node definition to compute Fibonacci numbers
4.3 A message triggers a computation (a) in pure message-driven (b) in Dagger
4.4 Matrix multiplication chare
4.5 Matrix multiplication dag-chare
4.6 Dag-chare template
4.7 Dag-chare illustrating adaptive overlapping
4.8 Red-black Gauss-Seidel (a) partitions (b) dependences on one processor
4.9 Dag-chare for Gauss-Seidel red-black relaxation
4.10 Jacobi (a) partitioning (b) dependences on one processor
4.11 Partial dag-chare for Jacobi relaxation
4.12 Out of order messages
4.13 Loop structure in (a) red-black (b) Jacobi
4.14 Correct Jacobi relaxation with reference numbers
4.15 Pipelining loop iterations
4.16 A reduction dag-chare illustrating multiple message entries
4.17 Translation of a dag-chare
4.18 Library call
4.19 Using the reduction library
4.20 Distributed Processes example
4.21 Ada select statement

5.1 A dag computation
5.2 The parallel machine
5.3 Sending a message
5.4 Message queues
5.5 Simulation system

6.1 Synthetic benchmark Wave (a) message-driven (b) SPMD
6.2 Synthetic benchmark Mlib (a) message-driven (b) SPMD
6.3 Concurrent reductions (a) SPMD (b) message-driven
6.4 Pipelining and effects of overhead
6.5 Concurrent reductions: effect of number of partitions on NCUBE/2
6.6 Tolerating latency: concurrent reductions on NCUBE/2
6.7 One time step of Harlow-Welch algorithm
6.8 Decomposition of the computational domain
6.9 Jacobi message-driven code
6.10 Red-Black1 message-driven code
6.11 Dependences of Stone's method
6.12 Stone's method: effect of pipelining
6.13 Stone's method on NCUBE/2 (a) without reductions (b) with reductions
6.14 Stone's method message-driven code
6.15 Dependences in the modified CG method
6.16 Effects of network latency: Synthetic Wave
6.17 Effects of network latency: Synthetic Mlib
6.18 Effects of network latency: Synthetic Mlib with varying computation load
6.19 Concurrent reductions: effect of network latency αnet and βnet
6.20 Harlow-Welch with Jacobi: effect of network latency αnet and βnet (a) βnet = 0 (b) βnet = 1 (c) βnet = 10
6.21 Harlow-Welch with Red-Black1: effect of network latency αnet and βnet (a) βnet = 0 (b) βnet = 1 (c) βnet = 10
6.22 Harlow-Welch with Red-Black2: effect of network latency αnet and βnet (a) βnet = 0 (b) βnet = 1 (c) βnet = 10
6.23 Harlow-Welch with Stone: effect of network latency αnet and βnet (a) βnet = 0 (b) βnet = 1 (c) βnet = 10
6.24 Conjugate Gradient (model): effect of network latency αnet and βnet (a) βnet = 0 (b) βnet = 1 (c) βnet = 10
6.25 Effect of the coprocessor - concurrent reductions. Sum of processor and coprocessor delays (a) 1000 (b) 10000 units
6.26 Effect of the coprocessor - multiplane Jacobi. Sum of processor and coprocessor delays (a) 1000 (b) 10000 units
6.27 Effect of the coprocessor - multiplane Stone's method. Sum of processor and coprocessor delays (a) 1000 (b) 10000 units
6.28 Variable network latencies - Synthetic Wave
6.29 Variable network latencies - Synthetic Mlib
6.30 Variable network latencies - Concurrent reductions
6.31 Variable network latencies - Multiplane Jacobi
6.32 Variable network latencies - Multiplane Stone
6.33 Variable network latencies - One plane Red-Black2
6.34 Load balance of spanning trees
6.35 A scheduling problem
6.36 Reductions with interrupts
6.37 Reductions with higher priority
6.38 Two different schedules
6.39 Multi-plane Jacobi (a) load index (b) concurrency index
6.40 Multi-plane Jacobi: FIFO versus priority
6.41 Multi-plane Stone (a) load index (b) concurrency index
6.42 Multi-plane Stone: FIFO versus priority
6.43 Greedy scheduling example (a) FIFO (b) forced static scheduling

Chapter 1

Introduction

Parallel computers potentially offer much more computational power than uniprocessor machines. Although recent advances in VLSI and microprocessor technology have improved the performance of uniprocessor systems, there are many problems that require computational power well beyond what uniprocessor systems can provide. Therefore, substantial effort is devoted to developing efficient and cost-effective parallel machines.

Currently, there are many commercial parallel computers with very high peak performance. Machines with tens of gigaflops capacity are already available, and teraflops machines are expected in the next decade. However, their performance falls substantially short of peak on many real-life applications. A number of factors contribute to this performance loss. For example, the sequential performance of individual processors may itself fall short of peak power because of cache performance and the inability to fully exploit features such as multiple instruction issue, pipelining, etc. Hopefully, this problem can eventually be solved by better compilers, use of optimized libraries such as the BLAS routines [28], and occasional use of assembly code. In any case, this factor is not specific to parallel computing, as the performance of workstations is also affected by it.

An important factor that is specific to parallel computing is the loss of performance due to communication latencies and processor idling caused by load imbalances and critical paths. These factors, and methods for overcoming them, are the focus of this thesis.

1.1 Remote information access latency

A parallel computation is a collection of co-operating processes. The processes and data are distributed across the parallel machine. Typically, these concurrent processes interact with each other during the course of computation. This interaction might be due to synchronization or to data access. In either case, a process needs information that is created or stored by some other processes. If the needed information is on some other processor, then access to this remote information will be slower than access to local information. This slowdown generally occurs for two reasons.

One of the reasons is the delay introduced by the communication network. The communication networks of large private-memory machines are usually multi-hop switching networks. Messages experience delays of transmission, routing, and buffering. This delay, which we will call communication delay, is defined as the time interval between the moment data enter the communication network and the time the data become available at the destination.

The other source that contributes to the remote information access latency is the delay in the creation of the information at the remote site (Figure 1.1). Process 1 sends a request for some particular information. When process 2 receives the request, it performs the required service, creates the information, and sends it back. The time interval between the arrival of the request and the completion of the requested service is the response delay. The response delay can be longer than the service time itself and is often unpredictable. For example, the processor might be scheduled for some other task, or cache performance might delay processing. Response delays occur not only in request-and-respond types of interaction; they equally affect prearranged communication (i.e., prepare-and-send), where no request is necessary.

Figure 1.1: Remote information access latency.

These two types of delays together form the remote information access latency, simply referred to as latency. Minimizing the impact of this latency is a major objective in programming parallel machines, particularly massively parallel scalable machines.

1.2 Reducing the impact of latency: message-driven execution

The impact of latency can be reduced in several ways. On the hardware side, it is addressed by designing architectures that attempt to reduce the communication latency to a minimum. The ALLCACHE architecture of the KSR-1 machine and the message-processor architecture of the J-Machine [22] are examples of these attempts, as is the continuous evolution of communication hardware in the traditional architectures of the Intel and NCUBE machines [77]. However, physical reality dictates that remote access will always be significantly slower than local access (although overhead can be reduced by better operating-system support such as Active Messages [31] or SUNMOS).

Since latency cannot be eliminated completely, one may try to minimize the number of remote data accesses. This requires the placement of computations and data to be reorganized so that data accesses are localized as much as possible [43, 92].

A complementary approach to reducing the impact of latency, on the software side, is to overlap the delay with some useful computation. This idea, doing something else useful while waiting for data, has been used at different levels in computer systems. Multiple functional units of a CPU may overlap independent operations. For example, while a floating-point computation unit is busy, the next instruction and data can be fetched from memory. In the prevalent parallel programming paradigm (the traditional SPMD paradigm, see Section 2.1), such overlap is enhanced by moving sends earlier and moving receives later. However, this strategy is not sufficiently adaptive to the eventualities that may arise at runtime (see Chapter 2).

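As a rough illustration of this send-early, receive-late rearrangement (the subject of Figure 2.2), consider the following C sketch. It is not code from the thesis: send(), recv(), and the work routines are hypothetical stand-ins for a machine's message-passing primitives.

    /* Hypothetical sketch of latency hiding in SPMD by rearranging sends
     * and receives.  send()/recv() and the work routines stand in for
     * whatever primitives the machine provides; they are not from the
     * thesis. */
    extern void send(int dest, double *buf, int n);
    extern void recv(int src, double *buf, int n);
    extern void compute_boundary(double *out);  /* data the neighbor needs */
    extern void compute_interior(void);         /* needs no remote data    */
    extern void relax_with(double *ghost);

    #define NEIGHBOR 1
    #define N 128
    static double boundary[N], ghost[N];

    /* Naive version: the processor idles in recv() while data are in flight. */
    void step_naive(void) {
        compute_boundary(boundary);
        compute_interior();
        send(NEIGHBOR, boundary, N);
        recv(NEIGHBOR, ghost, N);      /* latency fully exposed        */
        relax_with(ghost);
    }

    /* Rearranged version: the message travels while interior work runs. */
    void step_overlapped(void) {
        compute_boundary(boundary);
        send(NEIGHBOR, boundary, N);   /* send as early as possible    */
        compute_interior();            /* useful work hides the latency */
        recv(NEIGHBOR, ghost, N);      /* receive as late as possible  */
        relax_with(ghost);
    }

The rearranged version hides latency only as long as compute_interior() happens to take longer than the message is in flight; the overlap is fixed at coding time, which is exactly the lack of adaptivity noted above.
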
Message-driven execution is a promising technique in this regard. In the message-driven execution style (which is distinct from message passing), there are typically many processes per processor. A process does not block the processor on which it is running while trying to receive a message. Instead, the system activates a process when there is a message for it. This improves latency tolerance in two ways. First, when one process is waiting for data from a remote process, another ready process may be scheduled for execution. Second, even a single process may wait for multiple data items simultaneously and continue execution whenever any of the expected items arrives. If there are two subcomputations with running times t1 and t2 respectively, the message-driven running time is t1 + t2 - t_overlap, as explained in Chapter 2. As we will see later, this leads to adaptive overlap and modularity.

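The contrast with blocking receives can be pictured as a scheduler loop. The following C sketch is a hypothetical rendering of the idea; the Message type, handler table, and next_message() queue routine are invented for illustration and are not the Charm runtime's actual interface.

    /* Hypothetical sketch of a message-driven scheduler loop.  Rather than
     * a process blocking in recv() for one specific message, the runtime
     * picks whichever message is available next and activates the process
     * it is addressed to. */
    typedef struct Message {
        int   target;   /* index of the process/handler this message is for */
        void *data;     /* message payload                                  */
    } Message;

    typedef void (*Handler)(void *data);

    extern Message *next_message(void);  /* blocks only when no message exists */
    extern Handler  handler_table[];

    void scheduler_loop(void) {
        for (;;) {
            Message *m = next_message();        /* arrival order may vary      */
            handler_table[m->target](m->data);  /* run a ready process; others */
        }                                       /* wait without idling the CPU */
    }
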
1.3 Thesis objectives

This thesis will explore the message-driven execution technique for tolerating communication latencies. The major objectives of this thesis are:

Simplification of the expression of message-driven execution:

Although message-driven execution imparts the benefits alluded to above, it often extracts a price in the form of apparent program complexity. The split-phase or continuation-passing style of programming that it requires is sometimes non-intuitive and obfuscates the flow of control. As the system may execute messages in the order it receives them (as opposed to a deterministic order imposed by sequential receive statements), the programmer must deal with all possible orderings of messages. This involves complex reasoning about which message orderings will not arise, which are harmless, and which must be dealt with by buffering, counters, and flags. Therefore, it is desirable and important to simplify the expression of message-driven programs, possibly via new languages or annotations.

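For instance, a computation that needs two messages which may arrive in either order forces exactly this bookkeeping. The sketch below is a hypothetical C illustration of the pattern, not code from the thesis:

    /* Hypothetical sketch of the buffering and counting the programmer
     * must write by hand when a computation needs two messages that may
     * arrive in either order.  combine() and the handler names are
     * invented. */
    static void *saved_a, *saved_b;  /* buffers for whichever arrives first */
    static int   arrived;            /* how many of the two have arrived    */

    extern void combine(void *a, void *b);

    void on_message_a(void *data) {
        saved_a = data;
        if (++arrived == 2) combine(saved_a, saved_b);  /* both present: go */
    }

    void on_message_b(void *data) {
        saved_b = data;
        if (++arrived == 2) combine(saved_a, saved_b);
    }

With more messages, iterations, and modules, such counters and flags multiply quickly; the language constructs developed in Chapter 4 aim to express these dependences declaratively instead.
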
Performance prediction of message-driven execution:

Predicting the performance of message-driven computations via simulations is necessary to evaluate the benefits of message-driven execution under various machine characteristics. In such computations, the sequence in which messages are processed is not fully specified. As a result, at runtime messages may be executed in different sequences depending on runtime conditions. This makes simulation and performance prediction of message-driven programs a challenging task.

Evaluation of the performance impact of message-driven execution:

The ability to overlap computation with communication gives message-driven execution the potential to improve the performance of a parallel program. However, this potential benefit needs to be established by performance studies. Such performance studies will help to determine:

• the effectiveness of message-driven execution in increasing performance,
• the factors that influence the performance of message-driven programs,
• the conditions under which these performance benefits are realizable.

1.3.1 Selection of computation domain for the performance study

The performance studies in this thesis involve simulation and real-machine implementation of selected computations. For dynamic tree-structured computations such as heuristic search, divide-and-conquer, game-tree search, and evaluation of functional and logic languages, the performance advantages of message-driven execution are intuitively quite clear. In such problems, the message-driven strategy can adapt to the unpredictable nature of communication latencies and computational variations better than the conventional message-passing style. For example, while evaluating a node of an and-or tree, one cannot predict whether the node will be a simple leaf node or will lead to a large subtree. The structure of the tree is data dependent and dynamic. Due to the unpredictable nature of parallelism, one must assign many tree nodes per processor. As each node has a complex behavior, it is natural to treat each node as an individual object. The system must then process every incoming message, decide whether it involves creation of a new object (i.e., creation of a node) or is a response from one of the subproblems of an and-or tree node, and activate the corresponding node to process this message. Each node itself must be willing to accept solutions (or a failure to find solutions) for any one of the multiple subproblems it may have started, and to create new subproblems that may depend on such solutions. All of this naturally leads to a message-driven style of implementation. For these reasons, many researchers in these areas have used message-driven execution, implicitly or explicitly [19, 63, 82].

A more interesting question, therefore, is whether message-driven execution is beneficial for static computations, that is, computations whose overall structure is known ahead of time. Investigating the performance advantages of message-driven execution for such computations will help answer an interesting and important question; if such advantages are established, it will lead to a stronger endorsement of message-driven execution. Even though these computations are static, communication latencies and their associated unpredictability make them potential candidates for message-driven execution.

Computational Fluid Dynamics (CFD) appears to be an appropriate area in which to conduct the performance studies because (a) many CFD algorithms have a static structure, and (b) it is an important application area for parallel computing. A review of many CFD applications reveals that many of them use a few numerical kernel algorithms that have regular computational and communication structure [4, 7, 10, 11, 24, 42, 75]. We chose a few of these core problems for our study, as described in Chapter 6.

1.4 Main contribution of the thesis

• The thesis develops a set of arguments and lines of reasoning, with examples, highlighting the differences between SPMD and message-driven programs, and argues for the potential advantages of message-driven execution.

• The thesis research involves the development and implementation of a language, Dagger, for simplified expression of message-driven programs.

• A methodology is devised for trace-driven simulation of message-driven programs, and a simulator based upon it is implemented.

• The thesis includes an extensive performance study involving several synthetic and real benchmarks, and performance measurement on real machines as well as simulations with varying machine parameters.

The idea of message-driven execution was first proposed by Hewitt [53] and later elaborated by Agha [1]. The work on dataflow also relates to the same idea, at a finer grain and at the hardware level. However, there has not been an extensive comparison of this approach with the now-dominant message-passing model (the receive-based SPMD model elaborated in Chapter 2). We argue that modularity and efficiency, in the form of overlapping communication with computation, can be achieved much more easily in message-driven execution than in the traditional message-passing paradigm. We also show why it is not adequate to emulate message-driven execution with the message-passing style. Section 2.6 discusses other approaches related to message-driven execution and to the Charm system. Charm was one of the first message-driven systems and is used as an implementation substrate in this thesis.

Programming in message-driven style is difficult due to the split-phase style of programming it requires and due to the nondeterministic arrival of messages. Synchronization of messages and subcomputations within a process is therefore necessary to maintain the integrity of the computation. Dagger expresses the partial orders among subcomputations and messages within an object, and yet retains the advantages of message-driven execution (adaptive scheduling of subcomputations based on message arrivals).

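To give a feel for what expressing a partial order means operationally, the following C sketch encodes the rule "run this computation once all of the messages it depends on have arrived, in any order." It illustrates the underlying mechanism only and is not Dagger syntax, which Chapter 4 defines.

    /* Hypothetical sketch of the mechanism behind a dependence
     * declaration: a computation fires only when every message it
     * depends on has arrived, in any order.  Not Dagger notation. */
    typedef void (*Body)(void **msgs);

    typedef struct DepBlock {
        int   needed;    /* number of messages this computation depends on */
        int   arrived;   /* messages buffered so far                       */
        void *msgs[8];   /* arrivals stored by dependence slot             */
        Body  body;      /* computation to run once all have arrived       */
    } DepBlock;

    /* Called by the runtime for each incoming message; 'slot' says which
     * dependence the message satisfies. */
    void deliver(DepBlock *b, int slot, void *msg) {
        b->msgs[slot] = msg;
        if (++b->arrived == b->needed)
            b->body(b->msgs);        /* all dependences satisfied: fire */
    }

Under such a scheme the programmer declares only the dependences; the buffering and counting shown earlier are generated by the translator rather than written by hand.
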
The issue of synchronization has been dealt with extensively in past research, starting from the mid-1970s. However, many of the schemes require the caller of a method to block until the called object has finished its service. In most of these approaches, the server is also blocked in a variety of contexts. None of these features can be used effectively to express adaptive overlap of different computations based on the availability of data, which is an important requirement for message-driven execution. The Distributed Processes model by Hansen [50] comes closest to Dagger in many respects. This model, as well as others, is compared to Dagger in Section 4.4. More recent work on the Actor model and other concurrent object-oriented languages also addresses the issue of local synchronization within an object. These studies are also summarized in Section 4.4.

Although trace-driven simulation is a very efficient method for predicting the performance of parallel programs under varying runtime conditions, such a simulation methodology has not previously been applied to message-driven programs or to programs that use wild-card receives. The basic difficulty in such simulations is that if two messages are received in different orders under different runtime conditions, the behavior of the program may be altered in a way that cannot be reconstructed from the traces obtained with the first sequence. The thesis presents a methodology for simulation of message-driven programs that exploits the dependence information provided by Dagger at compile time, combined with a special method for obtaining runtime traces. We know of no other results that can carry out performance analysis or simulation in a trace-driven manner for such programs. Further, existing systems do not handle the case of a different message ordering during simulation.

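To make the idea concrete, the sketch below shows what a trace record for one when-block execution might carry; the field names are invented for illustration, and the actual events are listed in Table 5.2. The point is that the trace records which dependences a block consumed and what it produced, rather than the wall-clock arrival order, so the simulator can replay the block correctly even when the simulated machine delivers messages in a different sequence.

    /* Hypothetical trace record for one when-block execution.  Field
     * names are invented; Table 5.2 lists the actual events. */
    typedef struct WhenTrace {
        int  block_id;     /* which when-block fired                     */
        int  ref_number;   /* reference number (iteration) of the firing */
        int  consumed[8];  /* ids of messages that satisfied the block   */
        int  produced[8];  /* ids of messages sent by the block body     */
        long exec_time;    /* measured execution time of the block body  */
    } WhenTrace;
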
The thesis includes an extensive performance study involving several synthetic and real benchmarks, and performance measurement on real machines as well as simulations with varying machine parameters.
