`
`Sebastian Gerlach, Roger D. Hersch
`Ecole Polytechnique Fédérale de Lausanne, Switzerland
`Sebastian.Gerlach@epfl.ch, RD.Hersch@epfl.ch
`http://dps.epfl.ch
`
`
`
`Abstract
`
`
Dynamic Parallel Schedules (DPS) is a high-level framework for developing parallel applications on distributed memory computers (e.g. clusters of PCs). Its model relies on compositional customizable split-compute-merge graphs of operations (directed acyclic flow graphs). The graphs and the mapping of operations to processing nodes are specified dynamically at runtime. DPS applications are pipelined and multithreaded by construction, ensuring a maximal overlap of computations and communications. DPS applications can call parallel services exposed by other DPS applications, enabling the creation of reusable parallel components. The DPS framework relies on a C++ class library. Thanks to its dynamic nature, DPS offers new perspectives for the creation and deployment of parallel applications running on server clusters.
`
Keywords: Parallel computation, parallel schedules, flow graphs, split-merge constructs, overlapping of computations and communications
`
`1. Introduction
`
`The development of high-level tools and languages for
`simplifying the task of creating parallel applications is one of
`today’s challenges. High-level abstractions should support
`desirable parallelization patterns. The fundamental "farming"
`concept of splitting tasks from a master node to worker nodes and
`gathering the results should be generalized so as to express
`explicitly the splitting and the merging functions and to be able to
`map them onto separate computing nodes. Furthermore, since
`computing clusters built with commodity networks (Fast or
`Gigabit Ethernet) incur high communication latencies [1] and
`since an increasing fraction of parallel applications make use of
`large datasets [2], support should be provided for the overlapping
`of communications and computations.
`A further challenge resides in providing “dynamicity”, i.e.
`allowing parallel programs to modify their behavior and release
`or acquire resources at run time. Dynamic reshaping of parallel
`applications at run time is important for parallel server systems
`whose resources must be reassigned according to the needs of
`dynamically scheduled applications. And, in order to facilitate the
`deployment of parallel server systems, parallel programs should
`be able, at run time, to make use of the services offered by other
`parallel programs.
`
`High-level parallel programming frameworks for shared
`memory computers are available mainly for facilitating the
`development of computational applications and for ensuring the
portability of programs [3] [4]. In addition, software distributed shared memory systems exist which provide a global shared address space on top of physically distributed memory computers [5].
`In the present contribution, we focus on distributed memory
`systems and try to create abstractions which rely purely on the
`circulation and distributed processing of data objects. Currently,
`the majority of parallel application developments running on
`distributed memory computers are carried out with libraries such
`as MPI [6] and PVM [7]. These libraries provide low-level
`message-passing functions, but leave most of the other parallel
`programming issues to the programmer.
Writing complex programs that execute without deadlocks and make maximal use of the underlying hardware (bi-processor nodes, shared memory for communication between local threads, overlapping of communications and computations) requires substantial programming effort [8]. Low-level message
`passing libraries also make it difficult to modify parallel programs
`in order to experiment with different parallel program structures.
`Higher-level approaches to parallel application development
`may provide different levels of abstraction [9]. High abstraction
`levels such as functional languages leave many decisions related
`to the program behavior to the parallelization framework and
therefore attain only moderate performance. Intermediate abstraction levels, such as CC++ [10], skeletons [11] [12] [13], and
`Mentat [14], leave the specification of the parallel program
`behavior to the programmer but free him from managing
`communications, threads, synchronization, flow control, and
`pipelining. They have the potential of facilitating parallel
`programming and at the same time of achieving a high utilization
`of the underlying parallel system resources.
`Skeleton languages, such as P3L [15] [16] or Skil [17] provide
`pre-implemented parallel constructs. The programmer can
`combine and customize these constructs by providing his own
code. Skeletons exist for task-parallel and data-parallel constructs [18]. Task-parallel constructs are based on the transfer of data
`items between tasks. Worker tasks process data items as they
`arrive and send off processed data items as soon as they are ready.
`Data-parallel constructs work on a distributed data structure that
`is stored within the worker tasks. Internally, skeletons are usually
`represented as a tree with nested parallel constructs. Skeleton
`languages also allow creating sequences of constructs.
`
`Proc. of 8th Int’l Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2003), 17th International Parallel and
`Distributed Processing Symposium (IPDPS), April 22-26, Nice, France, 2003, IEEE Press, pp. 15-24
`
`
`
`
`
`
`
We present a new approach to parallel application
`development, called Dynamic Parallel Schedules (DPS). Dynamic
`Parallel Schedules have some resemblance with task parallel
`skeletons, since simple constructs are provided that can be
`combined and extended by the programmer. Instead of combining
`the constructs in a tree by nesting them within one another, DPS
`applications are defined as directed acyclic graphs of constructs1.
`This approach allows separating data distribution and collection
`of results by providing distinct split and merge constructs. Figure
`1 illustrates the flow graph of a simple parallel application,
`describing the asynchronous flow of data between operations.
`
[Figure 1: Init → Split → three parallel ComputeData operations → Merge → Final]

`Figure 1. Flow graph describing data distribution (split), parallel
`processing, and collection of results (merge)
`In contrast to previous approaches [15] [17], split and merge
`operations are extensible constructs, i.e. the developer can
`provide his own code to control how data is distributed, and how
`processed sub-results are merged into one result. The data
`elements in a flow graph are complex data objects. Data objects
`may contain any combination of simple types or complex types
`such as arrays or lists. The expressivity of DPS flow graphs is
`detailed in section 2.
`Operations within a flow graph are carried out within threads
`grouped in thread collections. DPS threads are mapped to
`operating system threads. Routing of data objects from one
`operation to the next is accomplished according to user defined
`routing functions. DPS supports full pipelining of operations.
`Data objects are transferred as soon as they are computed.
`Arriving data objects are stored in queues associated with the
`thread that contains the operations that will process them. This
`macro data flow behavior enables automatic overlapping of
`communications and computations. The execution of a flow graph
`within its collection of threads and according to its routing
`functions is referred to as a parallel schedule.
`For solving data-parallel problems, operations can store data
`within their local threads, e.g. a matrix distributed across different
`nodes. Libraries of flow graphs can be created to perform
`operations on distributed data structures. These flow graphs can
`then be used by applications for performing higher level
`computations.
`The present approach to parallel application development was
`first introduced in the context of data-intensive computing
`applications. The first generation parallel schedule system
`successfully performed out-of-core parallel access to 3D volume
`images [20], computation of radio-listening rates [21], and
`streaming real-time slice extraction from a time-varying 3D
volume image [22]. These applications allowed us to validate the approach and to identify the desirable new features of the second generation DPS framework. These new features include facilities for simplifying the creation of parallel schedules, for providing interoperable parallel services, and for allowing the dynamic deployment of applications.

1 Task graphs [19] also describe parallel applications as directed acyclic graphs of sequential operations. However, they do not offer a split-merge construct, a fundamental element facilitating the creation of parallel programs.

Therefore all DPS structures that describe the application, such as its flow graph and thread mapping, are created dynamically at runtime. This dynamic behavior allows applications to reconfigure themselves in order to adapt to changes in the
`problem definition or in the computing environment without
`requiring recompilation or restarting. A DPS application can
`expose its graphs to other DPS applications, thus enabling a
`parallel application to call parallel services exposed by another
`parallel application. Finally, the second generation framework
`adds the stream construct to the model. The stream construct
`enables partial merging and forwarding of data elements in order
`to ensure the pipelining of subsequent sets of parallel operations
`(section 3).
`High-level approaches for parallel programming often rely on
`a new programming language, or add extensions to existing
`languages. In order to avoid requiring program developers to
`learn a new language or language extension, DPS applications are
`written in pure C++. The parallel constructs are handled in DPS
by using C++ classes, templates, macros and operator overloading. DPS programs may be recompiled and run without modification on platforms on which the DPS library has been ported (presently Windows and Linux).
`The remainder of this paper is structured as follows. Section 2
`presents the basic concepts for creating DPS schedules. Section 3
`describes the functionality and interfaces of the DPS C++ library.
Section 4 gives an overview of the runtime system and of some of its associated performance issues. Section 5 presents two applications illustrating the philosophy and the concepts of DPS.
`Section 6 draws the conclusions and presents future research
`directions.
`
`2. Expressing Parallel Schedules
`
`An application using Dynamic Parallel Schedules is expressed
`as a directed acyclic graph of sequential operations. The nodes on
`the graph are user-written functions deriving from the elementary
`DPS operations: leaf operation, split operation, merge operation,
`and stream operation. Pipelining of operations is implicit in the
`construction of the flow graph.
`The basic parallel construct (split, computation, merge) is
illustrated in Figure 1. The split operation takes one data object as input and generates multiple output data objects representing the subtasks to execute. Each ComputeData leaf
`operation processes the data of the incoming data object and
`generates one output data object. The merge operation collects all
`the output data objects to produce one global result. The
`programmer does not have to know how many data objects arrive
`at the merge operation, since DPS keeps track of the number of
data objects generated by the corresponding split operation. An additional operation, the stream operation, is described in section 3.
`
The flow graph shown in Figure 1 is an unfolded view of a DPS graph showing its inherent parallelism. However, DPS graphs are generally represented by only single instances of parallel constructs, e.g. for a split-computation-merge construct.
`As it stands, a flow graph only indicates the processing
`operations and their order of execution. To enable parallelism, the
`operations need to be mapped to different threads, possibly
`located in different processing nodes. Threads may incorporate
`user defined data structures. They are associated to operating
`system threads and are mapped onto the nodes where their
`operations will execute. Developers instantiate collections of
`threads. A user-defined routing function specifies at runtime to
`which instance of the thread in the thread collection a data object
`is directed in order to execute its next operation.
`The execution of a sequence of leaf operations within a flow
`graph is automatically pipelined. This enables overlapping of
`computations, communications and possibly I/O operations.
`The graph elements are compositional: a split-merge construct
`may contain another split-merge construct. A split-merge
construct may incorporate different paths, enabling a data-dependent conditional execution of parts of the flow graph at runtime.
`DPS flow graphs and the mapping of thread collections to
`operating system threads are specified dynamically at run time by
the application. Dynamic flow graphs make it possible to adjust the graph to the size of the problem to be solved. In section 5, we show how
`a dynamic graph is used to parallelize the LU factorization.
`Dynamically created thread collections and mappings of threads
`to nodes also offer the potential for dynamically allocating
`computing and I/O resources according to the requirements of
`multiple concurrently running parallel applications.
`
`3. DPS library constructs
`
`The DPS C++ library provides a class framework for
`developing parallel applications. For maximum ease of use and
`maintainability, it does not extend the C++ language in any way.
`DPS applications can be compiled with a standard C++ compiler
`such as gcc or MS Visual C++. To simplify the development and
`debugging effort, the library takes care of serialization and
`deserialization of the data objects used in the application without
`requiring additional coding. It also detects invalid constructs at
`compile time, such as an attempt to create a sequence of
`operations in a flow graph where the output data object type is
`different from the input data object type of the next operation.
`The library also takes care of releasing memory using smart
`pointers with reference counting.
`The following paragraphs illustrate the syntax of the various
`DPS constructs. The source code originates from a tutorial
`application serving as a first introduction to DPS. It converts in
`parallel a character string from lowercase to uppercase by
`splitting the string into its individual character components.
`
`Expressing data objects
`Data objects are the fundamental data elements used in DPS.
`A data object is a standard C++ class, with additional information
`enabling serialization and deserialization. The following program
`lines describe a simple data object.
class CharToken : public SimpleToken {
public:
 char chr; // a character
 int pos; // its position within a string
 // Constructor for CharToken
 CharToken(char c = 0, int p = 0) { chr=c; pos=p; }
 IDENTIFY(CharToken);
};
`
The element added for DPS is the IDENTIFY macro. This macro provides support for serialization and deserialization, and creates an abstract class factory to instantiate the data object during deserialization [23]. The CharToken simple data object type is serialized and deserialized with simple memory copies. More complex data object types can also be created, as illustrated in the following example.
`class MyComplexToken : public ComplexToken {
`public:
` CT<int> id; // A simple integer
` CT<std::string> name; // A character string
` // A vector of Somethings
` Vector<Something> children;
` // A variable-size array of integers
` Buffer<int> aBuffer;
` IDENTIFY (MyComplexToken);
`};
`
This data object contains complex data types and containers. DPS provides two container templates that can store variable-size arrays of simple elements (Buffer) or complex elements (Vector). Complex data objects can only contain complex elements; therefore a templated class (CT) is provided for inserting simple types or types otherwise known to the serializer, such as std::string. These templates enable programmers to create complicated data structures within their data objects without having to care about serialization. Support for derived classes is also provided. The serialization is performed with pointer arithmetic in order to traverse the elements of the data object. The present approach enables serialization without requiring redundant data declarations.
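
As a rough usage illustration, the following hedged sketch populates such a data object before posting it; the assignment and indexing semantics of CT and Buffer are assumptions (std-like behavior), not API documented here.

// Hedged sketch: filling a complex data object.
MyComplexToken *t = new MyComplexToken();
t->id = 42; // assumed CT<int> assignment
t->name = std::string("demo"); // assumed CT<std::string> assignment
t->aBuffer.resize(4); // assumed Buffer<int> API
for (int i = 0; i < 4; i++)
 t->aBuffer[i] = i * i; // fill the variable-size array
postToken(t); // DPS serializes the object when it is sent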
`
`Expressing operations
`Operations are also expressed as C++ classes deriving from
`DPS provided base classes. The following source code shows the
`three operations of a simple split-compute-merge graph.
`SplitString and MergeString operations are executed by an
`instance of MainThread. The ToUpperCase leaf operations are
`performed by instances of ComputeThread.
`class SplitString :
` public SplitOperation<MainThread, // Thread
` TV1(StringToken), TV1(CharToken)> {
`// Input token types Output token types
`// TV1 indicates one argument type
`public:
` void execute(StringToken *in) {
` // Post one token for each character
` for(int i=0;i<STRLEN;i++)
` postToken(new CharToken(in->str[i],i));
` }
` IDENTIFYOPERATION(SplitString);
`};
`
`class ToUpperCase :
` public LeafOperation<ComputeThread,
` TV1(CharToken),TV1(CharToken)> {
`public:
 void execute(CharToken *in) {
 // Post the uppercase equivalent
 // of the incoming character
 postToken(
 new CharToken(toupper(in->chr),in->pos));
 }
 IDENTIFYOPERATION(ToUpperCase);
};
`
`class MergeString :
` public MergeOperation<MainThread,
` TV1(CharToken),TV1(StringToken)> {
`public:
` void execute(CharToken *in) {
` StringToken *out=new StringToken();
` do {
` // Store incoming characters at
` // the appropriate position of string
` out->str[in->pos]=in->chr;
` }
` // Wait for all chars
` while(in=(CharToken*)(Token*)
` waitForNextToken());
` // Post output string
` postToken(out);
` }
` IDENTIFYOPERATION(MergeString);
`};
`
`As with data objects, the operations have identify macros. The
`template parameters for the base classes indicate the thread class
`with which the operation is associated, and the acceptable input
`and output data object types. These parameters are used for
`verifying the graph coherence at compile time.
`
`Expressing threads and routing functions
`Threads are also expressed as classes. They can contain
`members for storing data elements in order to support the
`construction of distributed data structures. The following source
`code shows a simple thread type containing a single member
`variable.
`class ComputeThread : public Thread {
` int threadMember;
` IDENTIFY(ComputeThread);
`};
`
`Routing functions are also expressed as classes, as shown in
`the following lines of code:
`class RoundRobinRoute :
` public Route<ComputeThread, CharToken> {
` // Target thread type Token type
` int route(CharToken *currentToken) {
` // Return a thread index
` return currentToken->pos%threadCount();
` }
` IDENTIFY (RoundRobinRoute);
`};
`
`The route function contains an expression returning an index
`to one thread in a thread collection. Due to the simplicity of most
`routing functions, a ROUTE macro is provided. The parameters
`are the name of the routing function, the associated thread type,
`the data object type to be routed, and the routing expression
`producing the destination thread index. The following ROUTE
`macro produces the same routing function as the one described
`above:
`
`ROUTE(RoundRobinRoute, ComputeThread,
` CharToken, currentToken->pos%threadCount());
`
`Expressing thread collections and flow graphs
`With all static elements of an application defined, the dynamic
`construction of thread collections and flow graphs can now be
`described. Thread collections are simply instantiated and named:
`Ptr<ThreadCollection<ComputeThread> >
` computeThreads =
` new ThreadCollection<ComputeThread>("proc");
`
`The mapping of the threads of a thread collection onto nodes is
`specified by using a string containing the names of the nodes
`separated by spaces, with an optional multiplier to create multiple
`threads on the same node. This string can be loaded from a
`configuration file, be specified as a constant, or be created at
runtime, according to the application's requirements. The
`following lines show the simplest form, where a constant is
`specified to create three threads, two on node nodeA, and one on
`node nodeB.
`computeThreads->map("nodeA*2 nodeB");
` // The string specifies the mapping
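
A mapping string may also be assembled at run time. The following hedged sketch builds one from a hypothetical host list; only the "name*count" string format is taken from the paper, and passing the string via c_str() is an assumption about map()'s signature.

// Hedged sketch: constructing the mapping string at runtime.
// hosts[] is hypothetical; two threads are created per node.
const char *hosts[] = { "nodeA", "nodeB", "nodeC" };
std::string mapping;
for (int i = 0; i < 3; i++)
 mapping += std::string(hosts[i]) + "*2 ";
computeThreads->map(mapping.c_str()); // assumed to accept const char*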
`
Flow graphs are defined with overloaded C++ operators. The following lines of source code create a flow graph (Figure 2) containing the three previously defined operations (split, computation, and merge). Flow graphs are named so that they can be reused by other applications.
`FlowgraphBuilder theGraphBuilder =
` FlowgraphNode<SplitString, MainRoute>
` ( theMainThread ) >>
` FlowgraphNode<ToUpperCase, RoundRobinRoute>
` ( computeThreads ) >>
` FlowgraphNode<MergeString, MainRoute>
` // Operation Routing func
` ( theMainThread );
` // Thread collection
`
`Ptr<Flowgraph> theGraph=new Flowgraph
` (theGraphBuilder,"graph");
` // Builder name of graph
`
`
`
[Figure 2: SplitString → ToUpperCase → MergeString]

Figure 2. Simple flow graph

The flow graph nodes, represented by FlowgraphNode objects, specify the operation, the route to this operation, and the thread collection on which the operation should execute. The operator >> is used to indicate paths in the graph. The operator >> generates compile time errors when two incompatible operations are linked together (e.g. when their data object types do not match).

More complex graphs are created by declaring FlowgraphNode variables and reusing them to create separate paths. For instance, in order to create a graph with two possible different types of operations between the split and merge operations (Figure 3), we can use the following lines of source code.
`
`
`
`
`
`
`
`
`FlowgraphNode<MySplit,MainRoute>
` nodeSplit(theMainThread);
`FlowgraphNode<MyMerge,MainRoute>
` nodeMerge(theMainThread);
FlowgraphNode<MyOpOne,RoundRobinRoute>
 nodeOp1(computeThreads);
FlowgraphNode<MyOpTwo,RoundRobinRoute>
 nodeOp2(computeThreads);
`
`// create 1st path in graph
`FlowgraphBuilder theGraphBuilder =
` nodeSplit >> nodeOp1 >> nodeMerge;
`// add 2nd path to graph
`theGraphBuilder +=
` nodeSplit >> nodeOp2 >> nodeMerge;
`
`Ptr<Flowgraph> theGraph = new Flowgraph
` (theGraphBuilder,"graph");
`
`
`
[Figure 3: MySplit → MyOpOne or MyOpTwo → MyMerge]

Figure 3. Graph with two possible paths; the selected path depends on the data object type.
`
Note that the += operator allows inserting an additional path into the graph. It can also be used to append pieces of graphs together, e.g. to create a graph of varying length, as illustrated in the LU factorization example (section 5). When multiple paths are available to a given output data object, the input data object types of the destinations are used to determine which path to follow. In the example of Figure 3, MyOpOne and MyOpTwo must therefore have different input data object types. Programmers may create at runtime different types of data objects that will be routed to different operations, as in the sketch below.
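
The paper does not show MySplit's source; the following hedged sketch suggests how it might post two different token types. TokenA, TokenB and the TV2 two-type list macro are assumptions extrapolated from the TV1 macro shown earlier.

// Hedged sketch: a split operation posting two token types that
// follow different paths of the flow graph of Figure 3.
class MySplit :
 public SplitOperation<MainThread,
 TV1(StringToken), TV2(TokenA, TokenB)> { // TV2 assumed
public:
 void execute(StringToken *in) {
 for(int i=0;i<STRLEN;i++) {
 if (isdigit(in->str[i])) // digits take the first path
 postToken(new TokenA(in->str[i],i)); // routed to MyOpOne
 else
 postToken(new TokenB(in->str[i],i)); // routed to MyOpTwo
 }
 }
 IDENTIFYOPERATION(MySplit);
};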
`
`Stream operations
`In the previous paragraphs we presented graphs containing
`split, leaf, and merge operations. DPS offers in addition the
`stream operation.
`In some applications, it may be useful to collect data objects as
`in a merge operation, but to post more than one output data object.
`This may be carried out by using a sequence comprising a merge
`and a split operation, but no output data objects would be posted
`before the merge received all its input data objects. To enable
`pipelining, DPS offers the stream construct. It works like a merge
`and a split operation combined, enabling the programmer to post
`data objects at any appropriate time during the execution of the
`operation. A graph using the stream operation in a simple video
`processing application is illustrated in Figure 4. An uncompressed
`video stream is stored on a disk array as partial frames, which
need to be recomposed before further processing. The use of the stream operation enables complete frames to be processed as soon as they are ready, without waiting until all partial frames have been read. Another application of the stream operation is shown in the LU factorization example (Section 5); a hedged sketch of a stream operation follows.
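
The sketch below illustrates how such an operation might look; the StreamOperation base class name, the FramePartToken/FrameToken types (with an assumed frame index member) and the PARTS_PER_FRAME constant are assumptions, since the paper does not show stream operation source code.

// Hedged sketch: a stream operation that collects partial frames and
// posts each complete frame immediately, keeping the pipeline full.
class CombineFrameParts :
 public StreamOperation<ComputeThread, // assumed base class
 TV1(FramePartToken),TV1(FrameToken)> {
 std::map<int,int> partsSeen; // parts received per frame index
public:
 void execute(FramePartToken *in) {
 do {
 // frame assembly details omitted; count the arriving parts
 if (++partsSeen[in->frame] == PARTS_PER_FRAME)
 postToken(new FrameToken(in->frame)); // post as soon as complete
 } while(in=(FramePartToken*)(Token*)waitForNextToken());
 }
 IDENTIFYOPERATION(CombineFrameParts);
};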
`
`
`
`
`Figure 4. Graph with stream operation for processing video: (1)
`generate frame part read requests; (2) read frame parts from the
`disk array; (3) combine frame parts into complete frames and
`stream them out; (4) process complete frames; (5) merge
`processed frames onto the final stream.
`
`Flow control and load balancing
Since DPS tracks the data objects travelling between split/merge pairs, a feedback mechanism ensures that no more than a given number of data objects is in circulation between a specific pair of split-merge constructs. This prevents the split operation from sending many data objects in a very short time interval, which would possibly induce a very high memory or network load. The split operation is simply stalled until data objects have arrived and been processed by the corresponding merge operation. By incorporating additional information into posted data objects, such as the processing nodes to which they were sent, DPS achieves a simple form of load balancing. After the split operation, the routing function sends data objects to those processing nodes which have previously posted data objects to the merge operation. Such a scheme allows balancing the load within the nodes spanned by a split-merge construct; a hedged sketch of such a routing function follows.
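
The following sketch suggests one way such a routing function might look; the WorkToken type and the pending-count bookkeeping (in particular how completions reported by the merge operation decrement the counters) are assumptions, not DPS API shown in the paper.

// Hedged sketch: route each data object to the thread with the fewest
// objects still in flight; completions reported via the merge would
// decrement pending[] (feedback path omitted).
class LoadBalancedRoute :
 public Route<ComputeThread, WorkToken> {
 std::vector<int> pending; // in-flight count per thread index
 int route(WorkToken *currentToken) {
 if ((int)pending.size() != threadCount())
 pending.resize(threadCount(), 0);
 int best = 0; // index of least loaded thread
 for (int i = 1; i < threadCount(); i++)
 if (pending[i] < pending[best]) best = i;
 pending[best]++;
 return best;
 }
 IDENTIFY(LoadBalancedRoute);
};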
`
`Sequencing of operations
At the heart of the DPS library is the Controller object, instantiated in each node and responsible for sequencing within each node the program execution according to the flow graphs and thread collections instantiated by the application. The controller object establishes all required connections, creates threads, and is responsible for the transmission of the flow graph and thread collection information to newly launched application node instances.
`
`4. Runtime Support
`
`The DPS runtime environment for a typical usage case is
`illustrated in Figure 5. DPS provides a kernel that is running on
`all computers participating in the parallel program execution. This
`kernel is used for launching parallel applications and for initiating
`communications within a parallel application or between two
`distinct parallel applications. A running application may use the
`services provided by another running application by calling its
`flow graphs.
The kernels are named independently of the underlying host names. This allows multiple kernels to be executed on a single host. This feature is mainly useful for debugging purposes: it enforces the use of the networking code (serialization/deserialization) and of the complete runtime system although the application is running within a single computer.
`
`
`
Figure 5. Two parallel applications calling parallel striped file services provided by a third parallel application within the DPS runtime environment

The DPS runtime system was designed to be as dynamic as possible. The kernels can be started or stopped at any point in time to add or remove nodes from the cluster. Kernels locate each other either by using UDP broadcasts or by accessing a simple name server. When an application is started on a given node, it first contacts the local kernel and starts constructing its flow graphs and thread collections. DPS uses a delayed mechanism for starting communications. It neither launches an application on a node nor opens a connection (TCP socket) to another application unless a data object needs to reach that node. When an application thread posts a data object to a thread running on a node where there is no active instance of the application, the kernel on that node starts a new instance of the application. This strategy minimizes resource consumption and enables dynamic mapping of threads to processing nodes at runtime. However, this approach requires a slightly longer startup time (e.g. one second on an 8 node system), especially for applications that need full N-to-N node connectivity.

DPS performs communications using TCP sockets. When a data object is sent between two threads within the same address space, it bypasses the communication layer - the pointer to the data object is transferred directly to the destination thread. Thus messages are transferred at a negligible cost between threads of a shared memory multiprocessor node.

Communication overhead
The communication overhead of DPS was evaluated with several simple experiments. These experiments were executed and timed on a cluster of bi-processor 733 MHz Pentium III PCs with 512 MB of RAM, running Windows 2000. The cluster is composed of 8 computers (nodes), interconnected with a Gigabit Ethernet switch.

In order to evaluate the maximal data throughput when performing simultaneous send and receive operations, the first test transfers 100 MB of data along a ring of 4 PCs. The individual machines forward the data as soon as they receive it. In Figure 6, we compare the steady state data transfer throughput of DPS with the throughput of transfers relying on direct socket accesses.

Data objects transferred over the network incorporate control structures giving information about their state and position within the flow graph. These control structures induce an overhead that is significant only when sending large amounts of small data objects.

[Figure 6: chart plotting data transfer throughput (MB/s, 0 to 40) against single transfer data size (1000 to 1000000 bytes)]

Figure 6. Round-trip data transfer throughput: comparing DPS with transfers relying on direct socket accesses

Benefits of overlapping communications and computations
The second experiment illustrates the benefits of the implicit overlapping of communications and computations obtained with DPS. To evaluate this overlap, we run a program multiplying two square n x n matrices by performing block-based matrix multiplications. Assuming that the n x n matrix is split into s blocks horizontally and vertically, the amount of communication is proportional to n²(2s+1), whereas computation is proportional to n³. By keeping the size of the matrix n constant and varying the splitting factor s, the ratio between communication time and computation time can be modified. For this test, two 1024x1024 element matrices are multiplied on 1 to 4 compute nodes, with block sizes ranging from 256x256 (s=4) to 32x32 (s=32). This enables testing situations where either communications (s=16 and s=32) or computations (s=4 and s=8) are the bottleneck. The reductions in execution time due to overlapping of communications and computations and the corresponding ratios of communication time over computation time are given in Table 1.

Block size       256            128            64             32
Nodes       reduct  ratio  reduct  ratio  reduct  ratio  reduct  ratio
1            6.7%   0.22    9.1%   0.45   17.6%   0.94   25.2%   2.09
2           13.6%   0.33   19.8%   0.66   28.7%   1.28   24.9%   2.76
3           15.8%   0.44   29.5%   0.97   32.1%   1.92   19.5%   4.19
4           23.9%   0.63   35.6%   1.36   27.2%   2.54   15.6%   5.54

Table 1. Reduction in execution time due to overlapping and corresponding ratio of communication over computation time

The potential reduction g in execution time due to pipelining is either g = ratio/(ratio+1) if ratio ≤ 1, or g = 1/(1+ratio) if ratio ≥ 1.
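
As a worked illustration of this bound (our arithmetic, not the paper's): with one compute node and block size 128, Table 1 reports a communication/computation ratio of 0.45, so the potential reduction is g = 0.45/(1+0.45) ≈ 0.31, i.e. an upper bound of about 31%, compared with the measured 9.1% reduction.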