`
`Sebastian Gerlach, Roger D. Hersch
`Ecole Polytechnique Fédérale de Lausanne, Switzerland
`Sebastian.Gerlach@epfl.ch, RD.Hersch@epfl.ch
`http://dps.epfl.ch
`
`
`
`Abstract
`
`
Dynamic Parallel Schedules (DPS) is a high-level framework for developing parallel applications on distributed memory computers (e.g. clusters of PCs). Its model relies on compositional customizable split-compute-merge graphs of operations (directed acyclic flow graphs). The graphs and the mapping of operations to processing nodes are specified dynamically at runtime. DPS applications are pipelined and multithreaded by construction, ensuring a maximal overlap of computations and communications. DPS applications can call parallel services exposed by other DPS applications, enabling the creation of reusable parallel components. The DPS framework relies on a C++ class library. Thanks to its dynamic nature, DPS offers new perspectives for the creation and deployment of parallel applications running on server clusters.
`
Keywords: Parallel computation, parallel schedules, flow graphs, split-merge constructs, overlapping of computations and communications
`
`1. Introduction
`
`The development of high-level tools and languages for
`simplifying the task of creating parallel applications is one of
`today’s challenges. High-level abstractions should support
`desirable parallelization patterns. The fundamental "farming"
`concept of splitting tasks from a master node to worker nodes and
`gathering the results should be generalized so as to express
`explicitly the splitting and the merging functions and to be able to
`map them onto separate computing nodes. Furthermore, since
`computing clusters built with commodity networks (Fast or
`Gigabit Ethernet) incur high communication latencies [1] and
`since an increasing fraction of parallel applications make use of
`large datasets [2], support should be provided for the overlapping
`of communications and computations.
`A further challenge resides in providing “dynamicity”, i.e.
`allowing parallel programs to modify their behavior and release
`or acquire resources at run time. Dynamic reshaping of parallel
`applications at run time is important for parallel server systems
`whose resources must be reassigned according to the needs of
`dynamically scheduled applications. And, in order to facilitate the
`deployment of parallel server systems, parallel programs should
`be able, at run time, to make use of the services offered by other
`parallel programs.
`
`High-level parallel programming frameworks for shared
`memory computers are available mainly for facilitating the
`development of computational applications and for ensuring the
portability of programs [3] [4]. In addition, software distributed shared memory systems exist which provide a global shared address space on top of physically distributed memory computers [5].
`In the present contribution, we focus on distributed memory
`systems and try to create abstractions which rely purely on the
`circulation and distributed processing of data objects. Currently,
`the majority of parallel application developments running on
`distributed memory computers are carried out with libraries such
`as MPI [6] and PVM [7]. These libraries provide low-level
`message-passing functions, but leave most of the other parallel
`programming issues to the programmer.
Writing complex programs that execute without deadlocks and make maximal use of the underlying hardware (bi-processor nodes, shared memory for communication between local threads, overlapping of communications and computations) requires substantial programming effort [8]. Low-level message
`passing libraries also make it difficult to modify parallel programs
`in order to experiment with different parallel program structures.
`Higher-level approaches to parallel application development
`may provide different levels of abstraction [9]. High abstraction
`levels such as functional languages leave many decisions related
`to the program behavior to the parallelization framework and
therefore attain only moderate performance. Intermediate abstraction levels, such as CC++ [10], skeletons [11] [12] [13], and
`Mentat [14], leave the specification of the parallel program
`behavior to the programmer but free him from managing
`communications, threads, synchronization, flow control, and
`pipelining. They have the potential of facilitating parallel
`programming and at the same time of achieving a high utilization
`of the underlying parallel system resources.
`Skeleton languages, such as P3L [15] [16] or Skil [17] provide
`pre-implemented parallel constructs. The programmer can
`combine and customize these constructs by providing his own
code. Skeletons exist for task-parallel and data-parallel constructs [18]. Task-parallel constructs are based on the transfer of data
`items between tasks. Worker tasks process data items as they
`arrive and send off processed data items as soon as they are ready.
`Data-parallel constructs work on a distributed data structure that
`is stored within the worker tasks. Internally, skeletons are usually
`represented as a tree with nested parallel constructs. Skeleton
`languages also allow creating sequences of constructs.
`
`Proc. of 8th Int’l Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2003), 17th International Parallel and
`Distributed Processing Symposium (IPDPS), April 22-26, Nice, France, 2003, IEEE Press, pp. 15-24
`
`
`
`
`
`
`
We present a new approach to parallel application
`development, called Dynamic Parallel Schedules (DPS). Dynamic
`Parallel Schedules have some resemblance with task parallel
`skeletons, since simple constructs are provided that can be
`combined and extended by the programmer. Instead of combining
`the constructs in a tree by nesting them within one another, DPS
`applications are defined as directed acyclic graphs of constructs1.
`This approach allows separating data distribution and collection
`of results by providing distinct split and merge constructs. Figure
`1 illustrates the flow graph of a simple parallel application,
`describing the asynchronous flow of data between operations.
`
[Figure 1: Init → Split → three parallel ComputeData operations → Merge → Final]

`Figure 1. Flow graph describing data distribution (split), parallel
`processing, and collection of results (merge)
`In contrast to previous approaches [15] [17], split and merge
`operations are extensible constructs, i.e. the developer can
`provide his own code to control how data is distributed, and how
`processed sub-results are merged into one result. The data
`elements in a flow graph are complex data objects. Data objects
`may contain any combination of simple types or complex types
`such as arrays or lists. The expressivity of DPS flow graphs is
`detailed in section 2.
`Operations within a flow graph are carried out within threads
`grouped in thread collections. DPS threads are mapped to
`operating system threads. Routing of data objects from one
`operation to the next is accomplished according to user defined
`routing functions. DPS supports full pipelining of operations.
`Data objects are transferred as soon as they are computed.
`Arriving data objects are stored in queues associated with the
`thread that contains the operations that will process them. This
`macro data flow behavior enables automatic overlapping of
`communications and computations. The execution of a flow graph
`within its collection of threads and according to its routing
`functions is referred to as a parallel schedule.
`For solving data-parallel problems, operations can store data
`within their local threads, e.g. a matrix distributed across different
`nodes. Libraries of flow graphs can be created to perform
`operations on distributed data structures. These flow graphs can
`then be used by applications for performing higher level
`computations.
`The present approach to parallel application development was
`first introduced in the context of data-intensive computing
`applications. The first generation parallel schedule system
`successfully performed out-of-core parallel access to 3D volume
`images [20], computation of radio-listening rates [21], and
`streaming real-time slice extraction from a time-varying 3D
volume image [22]. These applications allowed us to validate the approach and to identify the desirable new features of the second generation DPS framework. These new features include facilities for simplifying the creation of parallel schedules, for providing interoperable parallel services, and for allowing the dynamic deployment of applications.

1 Task graphs [19] also describe parallel applications as directed acyclic graphs of sequential operations. However, they do not offer a split-merge construct, a fundamental element facilitating the creation of parallel programs.

Therefore all DPS structures that describe the application, such as its flow graph and thread mapping, are created dynamically at runtime. This dynamic behavior allows applications to reconfigure themselves in order to adapt to changes in the
`problem definition or in the computing environment without
`requiring recompilation or restarting. A DPS application can
`expose its graphs to other DPS applications, thus enabling a
`parallel application to call parallel services exposed by another
`parallel application. Finally, the second generation framework
`adds the stream construct to the model. The stream construct
`enables partial merging and forwarding of data elements in order
`to ensure the pipelining of subsequent sets of parallel operations
`(section 3).
`High-level approaches for parallel programming often rely on
`a new programming language, or add extensions to existing
`languages. In order to avoid requiring program developers to
`learn a new language or language extension, DPS applications are
`written in pure C++. The parallel constructs are handled in DPS
by using C++ classes, templates, macros and operator overloading. DPS programs may be recompiled and run without modification on platforms on which the DPS library has been ported (presently Windows and Linux).
`The remainder of this paper is structured as follows. Section 2
`presents the basic concepts for creating DPS schedules. Section 3
`describes the functionality and interfaces of the DPS C++ library.
Section 4 gives an overview of the runtime system and of some of its associated performance issues. Section 5 presents two applications illustrating the philosophy and the concepts of DPS.
`Section 6 draws the conclusions and presents future research
`directions.
`
`2. Expressing Parallel Schedules
`
`An application using Dynamic Parallel Schedules is expressed
`as a directed acyclic graph of sequential operations. The nodes on
`the graph are user-written functions deriving from the elementary
`DPS operations: leaf operation, split operation, merge operation,
`and stream operation. Pipelining of operations is implicit in the
`construction of the flow graph.
`The basic parallel construct (split, computation, merge) is
illustrated in Figure 1. The split operation takes one data object as input and generates multiple output data objects representing the subtasks to execute. Each ComputeData leaf
`operation processes the data of the incoming data object and
`generates one output data object. The merge operation collects all
`the output data objects to produce one global result. The
`programmer does not have to know how many data objects arrive
`at the merge operation, since DPS keeps track of the number of
data objects generated by the corresponding split operation. An additional operation, the stream operation, is described in section 3.
`
The flow graph shown in Figure 1 is an unfolded view of a DPS graph showing its inherent parallelism. However, DPS graphs are generally represented by only single instances of parallel constructs, e.g. for a split-computation-merge construct.
`As it stands, a flow graph only indicates the processing
`operations and their order of execution. To enable parallelism, the
`operations need to be mapped to different threads, possibly
`located in different processing nodes. Threads may incorporate
`user defined data structures. They are associated to operating
`system threads and are mapped onto the nodes where their
`operations will execute. Developers instantiate collections of
`threads. A user-defined routing function specifies at runtime to
`which instance of the thread in the thread collection a data object
`is directed in order to execute its next operation.
`The execution of a sequence of leaf operations within a flow
`graph is automatically pipelined. This enables overlapping of
`computations, communications and possibly I/O operations.
`The graph elements are compositional: a split-merge construct
`may contain another split-merge construct. A split-merge
construct may incorporate different paths, enabling a data-dependent conditional execution of parts of the flow graph at runtime.
`DPS flow graphs and the mapping of thread collections to
`operating system threads are specified dynamically at run time by
the application. Dynamic flow graphs make it possible to adjust the graph to the size of the problem to be solved. In section 5, we show how
`a dynamic graph is used to parallelize the LU factorization.
`Dynamically created thread collections and mappings of threads
`to nodes also offer the potential for dynamically allocating
`computing and I/O resources according to the requirements of
`multiple concurrently running parallel applications.
`
`3. DPS library constructs
`
`The DPS C++ library provides a class framework for
`developing parallel applications. For maximum ease of use and
`maintainability, it does not extend the C++ language in any way.
`DPS applications can be compiled with a standard C++ compiler
`such as gcc or MS Visual C++. To simplify the development and
`debugging effort, the library takes care of serialization and
`deserialization of the data objects used in the application without
`requiring additional coding. It also detects invalid constructs at
`compile time, such as an attempt to create a sequence of
`operations in a flow graph where the output data object type is
`different from the input data object type of the next operation.
`The library also takes care of releasing memory using smart
`pointers with reference counting.
`The following paragraphs illustrate the syntax of the various
`DPS constructs. The source code originates from a tutorial
`application serving as a first introduction to DPS. It converts in
`parallel a character string from lowercase to uppercase by
`splitting the string into its individual character components.
`
`Expressing data objects
`Data objects are the fundamental data elements used in DPS.
`A data object is a standard C++ class, with additional information
`enabling serialization and deserialization. The following program
`lines describe a simple data object.
class CharToken : public SimpleToken {
public:
 char chr; // a character
 int pos; // its position within a string
 // Constructor for CharToken
 CharToken(char c = 0, int p = 0) { chr=c; pos=p; }
 IDENTIFY(CharToken);
};
`
The element added for DPS is the IDENTIFY macro. This macro provides support for serialization and deserialization, and creates an abstract class factory to instantiate the data object during deserialization [23]. The CharToken simple data object type is serialized and deserialized with simple memory copies. More complex data object types can also be created, as illustrated in the following example.
`class MyComplexToken : public ComplexToken {
`public:
` CT<int> id; // A simple integer
` CT<std::string> name; // A character string
` // A vector of Somethings
` Vector<Something> children;
` // A variable-size array of integers
` Buffer<int> aBuffer;
` IDENTIFY (MyComplexToken);
`};
`
This data object contains complex data types and containers. DPS provides two container templates that can store variable-size arrays of simple elements (Buffer) or complex elements (Vector). Complex data objects can only contain complex elements; therefore a templated class (CT) is provided for inserting simple types or types otherwise known to the serializer, such as std::string. These templates enable programmers to create complicated data structures within their data objects without having to care about serialization. Support for derived classes is also provided. The serialization is performed with pointer arithmetic in order to traverse the elements of the data object. The present approach enables serialization without requiring redundant data declarations.
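
As a rough usage illustration, the following hedged sketch populates such a data object before posting it; the assignment and indexing semantics of CT and Buffer are assumptions (std-like behavior), not API documented here.

// Hedged sketch: filling a complex data object.
MyComplexToken *t = new MyComplexToken();
t->id = 42; // assumed CT<int> assignment
t->name = std::string("demo"); // assumed CT<std::string> assignment
t->aBuffer.resize(4); // assumed Buffer<int> API
for (int i = 0; i < 4; i++)
 t->aBuffer[i] = i * i; // fill the variable-size array
postToken(t); // DPS serializes the object when it is sent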
`
`Expressing operations
`Operations are also expressed as C++ classes deriving from
`DPS provided base classes. The following source code shows the
`three operations of a simple split-compute-merge graph.
`SplitString and MergeString operations are executed by an
`instance of MainThread. The ToUpperCase leaf operations are
`performed by instances of ComputeThread.
`class SplitString :
` public SplitOperation<MainThread, // Thread
` TV1(StringToken), TV1(CharToken)> {
`// Input token types Output token types
`// TV1 indicates one argument type
`public:
` void execute(StringToken *in) {
` // Post one token for each character
` for(int i=0;i<STRLEN;i++)
` postToken(new CharToken(in->str[i],i));
` }
` IDENTIFYOPERATION(SplitString);
`};
`
`class ToUpperCase :
` public LeafOperation<ComputeThread,
` TV1(CharToken),TV1(CharToken)> {
`public:
 void execute(CharToken *in) {
 // Post the uppercase equivalent
 // of the incoming character
 postToken(
 new CharToken(toupper(in->chr),in->pos));
 }
 IDENTIFYOPERATION(ToUpperCase);
};
`
`class MergeString :
` public MergeOperation<MainThread,
` TV1(CharToken),TV1(StringToken)> {
`public:
` void execute(CharToken *in) {
` StringToken *out=new StringToken();
` do {
` // Store incoming characters at
` // the appropriate position of string
` out->str[in->pos]=in->chr;
` }
` // Wait for all chars
` while(in=(CharToken*)(Token*)
` waitForNextToken());
` // Post output string
` postToken(out);
` }
` IDENTIFYOPERATION(MergeString);
`};
`
`As with data objects, the operations have identify macros. The
`template parameters for the base classes indicate the thread class
`with which the operation is associated, and the acceptable input
`and output data object types. These parameters are used for
`verifying the graph coherence at compile time.
`
`Expressing threads and routing functions
`Threads are also expressed as classes. They can contain
`members for storing data elements in order to support the
`construction of distributed data structures. The following source
`code shows a simple thread type containing a single member
`variable.
`class ComputeThread : public Thread {
` int threadMember;
` IDENTIFY(ComputeThread);
`};
`
`Routing functions are also expressed as classes, as shown in
`the following lines of code:
`class RoundRobinRoute :
` public Route<ComputeThread, CharToken> {
` // Target thread type Token type
` int route(CharToken *currentToken) {
` // Return a thread index
` return currentToken->pos%threadCount();
` }
` IDENTIFY (RoundRobinRoute);
`};
`
`The route function contains an expression returning an index
`to one thread in a thread collection. Due to the simplicity of most
`routing functions, a ROUTE macro is provided. The parameters
`are the name of the routing function, the associated thread type,
`the data object type to be routed, and the routing expression
`producing the destination thread index. The following ROUTE
`macro produces the same routing function as the one described
`above:
`
`ROUTE(RoundRobinRoute, ComputeThread,
` CharToken, currentToken->pos%threadCount());
`
`Expressing thread collections and flow graphs
`With all static elements of an application defined, the dynamic
`construction of thread collections and flow graphs can now be
`described. Thread collections are simply instantiated and named:
`Ptr<ThreadCollection<ComputeThread> >
` computeThreads =
` new ThreadCollection<ComputeThread>("proc");
`
`The mapping of the threads of a thread collection onto nodes is
`specified by using a string containing the names of the nodes
`separated by spaces, with an optional multiplier to create multiple
`threads on the same node. This string can be loaded from a
`configuration file, be specified as a constant, or be created at
runtime, according to the application's requirements. The
`following lines show the simplest form, where a constant is
`specified to create three threads, two on node nodeA, and one on
`node nodeB.
`computeThreads->map("nodeA*2 nodeB");
` // The string specifies the mapping
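
A mapping string may also be assembled at run time. The following hedged sketch builds one from a hypothetical host list; only the "name*count" string format is taken from the paper, and passing the string via c_str() is an assumption about map()'s signature.

// Hedged sketch: constructing the mapping string at runtime.
// hosts[] is hypothetical; two threads are created per node.
const char *hosts[] = { "nodeA", "nodeB", "nodeC" };
std::string mapping;
for (int i = 0; i < 3; i++)
 mapping += std::string(hosts[i]) + "*2 ";
computeThreads->map(mapping.c_str()); // assumed to accept const char*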
`
Flow graphs are defined with overloaded C++ operators. The following lines of source code create a flow graph (Figure 2) containing the three previously defined operations (split, computation, and merge). Flow graphs are named so that they can be reused by other applications.
`FlowgraphBuilder theGraphBuilder =
` FlowgraphNode<SplitString, MainRoute>
` ( theMainThread ) >>
` FlowgraphNode<ToUpperCase, RoundRobinRoute>
` ( computeThreads ) >>
` FlowgraphNode<MergeString, MainRoute>
` // Operation Routing func
` ( theMainThread );
` // Thread collection
`
`Ptr<Flowgraph> theGraph=new Flowgraph
` (theGraphBuilder,"graph");
` // Builder name of graph
`
`
`
[Figure 2: SplitString → ToUpperCase → MergeString]

Figure 2. Simple flow graph

The flow graph nodes, represented by FlowgraphNode objects, specify the operation, the route to this operation, and the thread collection on which the operation should execute. The operator >> is used to indicate paths in the graph. The operator >> generates compile time errors when two incompatible operations are linked together (e.g. when their data object types do not match).

More complex graphs are created by declaring FlowgraphNode variables and reusing them to create separate paths. For instance, in order to create a graph with two possible different types of operations between the split and merge operations (Figure 3), we can use the following lines of source code.
`
`
`
`
`
`
`
`
`FlowgraphNode<MySplit,MainRoute>
` nodeSplit(theMainThread);
`FlowgraphNode<MyMerge,MainRoute>
` nodeMerge(theMainThread);
FlowgraphNode<MyOpOne,RoundRobinRoute>
 nodeOp1(computeThreads);
FlowgraphNode<MyOpTwo,RoundRobinRoute>
 nodeOp2(computeThreads);
`
`// create 1st path in graph
`FlowgraphBuilder theGraphBuilder =
` nodeSplit >> nodeOp1 >> nodeMerge;
`// add 2nd path to graph
`theGraphBuilder +=
` nodeSplit >> nodeOp2 >> nodeMerge;
`
`Ptr<Flowgraph> theGraph = new Flowgraph
` (theGraphBuilder,"graph");
`
`
`
[Figure 3: MySplit → MyOpOne or MyOpTwo → MyMerge]

Figure 3. Graph with two possible paths; the selected path depends on the data object type.
`
Note that the += operator allows inserting an additional path into the graph. It can also be used to append pieces of graphs together, e.g. to create a graph of varying length, as illustrated in the LU factorization example (section 5). When multiple paths are available to a given output data object, the input data object types of the destinations are used to determine which path to follow. In the example of Figure 3, MyOpOne and MyOpTwo must therefore have different input data object types. Programmers may create at runtime different types of data objects that will be routed to different operations, as in the sketch below.
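
The paper does not show MySplit's source; the following hedged sketch suggests how it might post two different token types. TokenA, TokenB and the TV2 two-type list macro are assumptions extrapolated from the TV1 macro shown earlier.

// Hedged sketch: a split operation posting two token types that
// follow different paths of the flow graph of Figure 3.
class MySplit :
 public SplitOperation<MainThread,
 TV1(StringToken), TV2(TokenA, TokenB)> { // TV2 assumed
public:
 void execute(StringToken *in) {
 for(int i=0;i<STRLEN;i++) {
 if (isdigit(in->str[i])) // digits take the first path
 postToken(new TokenA(in->str[i],i)); // routed to MyOpOne
 else
 postToken(new TokenB(in->str[i],i)); // routed to MyOpTwo
 }
 }
 IDENTIFYOPERATION(MySplit);
};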
`
`Stream operations
`In the previous paragraphs we presented graphs containing
`split, leaf, and merge operations. DPS offers in addition the
`stream operation.
`In some applications, it may be useful to collect data objects as
`in a merge operation, but to post more than one output data object.
`This may be carried out by using a sequence comprising a merge
`and a split operation, but no output data objects would be posted
`before the merge received all its input data objects. To enable
`pipelining, DPS offers the stream construct. It works like a merge
`and a split operation combined, enabling the programmer to post
`data objects at any appropriate time during the execution of the
`operation. A graph using the stream operation in a simple video
`processing application is illustrated in Figure 4. An uncompressed
`video stream is stored on a disk array as partial frames, which
need to be recomposed before further processing. The use of the stream operation enables complete frames to be processed as soon as they are ready, without waiting until all partial frames have been read. Another application of the stream operation is shown in the LU factorization example (Section 5); a hedged sketch of a stream operation follows.
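
The sketch below illustrates how such an operation might look; the StreamOperation base class name, the FramePartToken/FrameToken types (with an assumed frame index member) and the PARTS_PER_FRAME constant are assumptions, since the paper does not show stream operation source code.

// Hedged sketch: a stream operation that collects partial frames and
// posts each complete frame immediately, keeping the pipeline full.
class CombineFrameParts :
 public StreamOperation<ComputeThread, // assumed base class
 TV1(FramePartToken),TV1(FrameToken)> {
 std::map<int,int> partsSeen; // parts received per frame index
public:
 void execute(FramePartToken *in) {
 do {
 // frame assembly details omitted; count the arriving parts
 if (++partsSeen[in->frame] == PARTS_PER_FRAME)
 postToken(new FrameToken(in->frame)); // post as soon as complete
 } while(in=(FramePartToken*)(Token*)waitForNextToken());
 }
 IDENTIFYOPERATION(CombineFrameParts);
};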
`
`
`
`
`Figure 4. Graph with stream operation for processing video: (1)
`generate frame part read requests; (2) read frame parts from the
`disk array; (3) combine frame parts into complete frames and
`stream them out; (4) process complete frames; (5) merge
`processed frames onto the final stream.
`
`Flow control and load balancing
Since DPS tracks the data objects travelling between split/merge pairs, a feedback mechanism ensures that no more than a given number of data objects is in circulation between a specific pair of split-merge constructs. This prevents the split operation from sending many data objects in a very short time interval, which would possibly induce a very high memory or network load. The split operation is simply stalled until data objects have arrived and been processed by the corresponding merge operation. By incorporating additional information into posted data objects, such as the processing nodes to which they were sent, DPS achieves a simple form of load balancing. After the split operation, the routing function sends data objects to those processing nodes which have previously posted data objects to the merge operation. Such a scheme allows balancing the load within the nodes spanned by a split-merge construct; a hedged sketch of such a routing function follows.
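
The following sketch suggests one way such a routing function might look; the WorkToken type and the pending-count bookkeeping (in particular how completions reported by the merge operation decrement the counters) are assumptions, not DPS API shown in the paper.

// Hedged sketch: route each data object to the thread with the fewest
// objects still in flight; completions reported via the merge would
// decrement pending[] (feedback path omitted).
class LoadBalancedRoute :
 public Route<ComputeThread, WorkToken> {
 std::vector<int> pending; // in-flight count per thread index
 int route(WorkToken *currentToken) {
 if ((int)pending.size() != threadCount())
 pending.resize(threadCount(), 0);
 int best = 0; // index of least loaded thread
 for (int i = 1; i < threadCount(); i++)
 if (pending[i] < pending[best]) best = i;
 pending[best]++;
 return best;
 }
 IDENTIFY(LoadBalancedRoute);
};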
`
`Sequencing of operations
At the heart of the DPS library is the Controller object, instantiated in each node and responsible for sequencing within each node the program execution according to the flow graphs and thread collections instantiated by the application. The controller object establishes all required connections, creates threads, and is responsible for the transmission of the flow graph and thread collection information to newly launched application node instances.
`
`4. Runtime Support
`
`The DPS runtime environment for a typical usage case is
`illustrated in Figure 5. DPS provides a kernel that is running on
`all computers participating in the parallel program execution. This
`kernel is used for launching parallel applications and for initiating
`communications within a parallel application or between two
`distinct parallel applications. A running application may use the
`services provided by another running application by calling its
`flow graphs.
The kernels are named independently of the underlying host names. This allows multiple kernels to be executed on a single host. This feature is mainly useful for debugging purposes: it enforces the use of the networking code (serialization/deserialization) and of the complete runtime system although the application is running within a single computer.
`
`
`
Figure 5. Two parallel applications calling parallel striped file services provided by a third parallel application within the DPS runtime environment

The DPS runtime system was designed to be as dynamic as possible. The kernels can be started or stopped at any point in time to add or remove nodes from the cluster. Kernels locate each other either by using UDP broadcasts or by accessing a simple name server. When an application is started on a given node, it first contacts the local kernel and starts constructing its flow graphs and thread collections. DPS uses a delayed mechanism for starting communications. It neither launches an application on a node nor opens a connection (TCP socket) to another application unless a data object needs to reach that node. When an application thread posts a data object to a thread running on a node where there is no active instance of the application, the kernel on that node starts a new instance of the application. This strategy minimizes resource consumption and enables dynamic mapping of threads to processing nodes at runtime. However, this approach requires a slightly longer startup time (e.g. one second on an 8 node system), especially for applications that need full N-to-N node connectivity.

DPS performs communications using TCP sockets. When a data object is sent between two threads within the same address space, it bypasses the communication layer - the pointer to the data object is transferred directly to the destination thread. Thus messages are transferred at a negligible cost between threads of a shared memory multiprocessor node.

Communication overhead
The communication overhead of DPS was evaluated with several simple experiments. These experiments were executed and timed on a cluster of bi-processor 733 MHz Pentium III PCs with 512 MB of RAM, running Windows 2000. The cluster is composed of 8 computers (nodes), interconnected with a Gigabit Ethernet switch.

In order to evaluate the maximal data throughput when performing simultaneous send and receive operations, the first test transfers 100 MB of data along a ring of 4 PCs. The individual machines forward the data as soon as they receive it. In Figure 6, we compare the steady state data transfer throughput of DPS with the throughput of transfers relying on direct socket accesses.

Data objects transferred over the network incorporate control structures giving information about their state and position within the flow graph. These control structures induce an overhead that is significant only when sending large amounts of small data objects.

[Figure 6: chart plotting data transfer throughput (MB/s, 0 to 40) against single transfer data size (1000 to 1000000 bytes)]

Figure 6. Round-trip data transfer throughput: comparing DPS with transfers relying on direct socket accesses

Benefits of overlapping communications and computations
The second experiment illustrates the benefits of the implicit overlapping of communications and computations obtained with DPS. To evaluate this overlap, we run a program multiplying two square n x n matrices by performing block-based matrix multiplications. Assuming that the n x n matrix is split into s blocks horizontally and vertically, the amount of communication is proportional to n²(2s+1), whereas computation is proportional to n³. By keeping the size of the matrix n constant and varying the splitting factor s, the ratio between communication time and computation time can be modified. For this test, two 1024x1024 element matrices are multiplied on 1 to 4 compute nodes, with block sizes ranging from 256x256 (s=4) to 32x32 (s=32). This enables testing situations where either communications (s=16 and s=32) or computations (s=4 and s=8) are the bottleneck. The reductions in execution time due to overlapping of communications and computations and the corresponding ratios of communication time over computation time are given in Table 1.

Block size       256            128            64             32
Nodes       reduct  ratio  reduct  ratio  reduct  ratio  reduct  ratio
1            6.7%   0.22    9.1%   0.45   17.6%   0.94   25.2%   2.09
2           13.6%   0.33   19.8%   0.66   28.7%   1.28   24.9%   2.76
3           15.8%   0.44   29.5%   0.97   32.1%   1.92   19.5%   4.19
4           23.9%   0.63   35.6%   1.36   27.2%   2.54   15.6%   5.54

Table 1. Reduction in execution time due to overlapping and corresponding ratio of communication over computation time

The potential reduction g in execution time due to pipelining is either g = ratio/(ratio+1) if ratio ≤ 1, or g = 1/(1+ratio) if ratio ≥ 1.
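
As a worked illustration of this bound (our arithmetic, not the paper's): with one compute node and block size 128, Table 1 reports a communication/computation ratio of 0.45, so the potential reduction is g = 0.45/(1+0.45) ≈ 0.31, i.e. an upper bound of about 31%, compared with the measured 9.1% reduction.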