IPR2018-01605, No. 1057-66 Exhibit - Bala, et al, Application of PFortran and Co Array Fortran in theparallelization of the GROMOS96 molecular dynamics module,Scientific Programming, vol 9 2001, (P.T.

61
`
`Application of Pfortran and Co-Array Fortran
`in the parallelization of the GROMOS96
`molecular dynamics module
`
`Piotr Bałaa, Terry Clarkb and L. Ridgway Scottb
`aFaculty of Mathematics and Computer Science, N.
`Copernicus University, Chopina 12/18, 87-100 Toru ´n,
`Poland
`Tel.: +48 56 611 3468; Fax: +48 56 622 8979;
`E-mail: bala@mat.uni.torun.pl
`bDepartment of Computer Science, University of
`Chicago and Computation Institute, 1100 E. 58th
`Street, Chicago, IL 60637, USA
`E-mail: {ridg,twclark}@cs.uchicago.edu
`
`After at least a decade of parallel tool development, paral-
`lelization of scientiﬁc applications remains a signiﬁcant un-
`dertaking. Typically parallelization is a specialized activity
`supported only partially by the programming tool set, with
`the programmer involved with parallel issues in addition to
`sequential ones. The details of concern range from algo-
`rithm design down to low-level data movement details. The
`aim of parallel programming tools is to automate the latter
`without sacriﬁcing performance and portability, allowing the
`programmer to focus on algorithm speciﬁcation and develop-
`ment. We present our use of two similar parallelization tools,
`Pfortran and Cray’s Co-Array Fortran, in the parallelization
`of the GROMOS96 molecular dynamics module. Our paral-
`lelization started from the GROMOS96 distribution’s shared-
`memory implementation of the replicated algorithm, but used
`little of that existing parallel structure. Consequently, our par-
`allelization was close to starting with the sequential version.
`We found the intuitive extensions to Pfortran and Co-Array
`Fortran helpful in the rapid parallelization of the project. We
`present performance ﬁgures for both the Pfortran and Co-
`Array Fortran parallelizations showing linear speedup within
`the range expected by these parallelization methods.
`
`1. Introduction
`
`Molecular dynamics (MD) is widely used to inves-
`tigate function of biomolecular systems with large size
`and long time scales. Biomolecular complexes con-
`
`Scientiﬁc Programming 9 (2001) 61–68
`ISSN 1058-9244 / $8.00 © 2001, IOS Press. All rights reserved
`
`sisting of components such as proteins, lipids, DNA
`and RNA, and solvent are typically large in simulation
`terms. The explosive growth in interest in investigat-
`ing inherently complex biomolecular systems such as
`solvated protein complexes leads to molecular systems
`with tens to hundreds of thousands of atoms, as for
`example in [39]. Parallel algorithms are critical to the
`application and progress of MD in order to 1) improve
`the accuracy of simulation models, 2) extend the length
`of simulations, and 3) simulate large, complex systems.
`Numerous MD parallelizations have been described in
`the literature, ranging from the easy to implement repli-
`cated algorithm [6,20] to the more difﬁcult to imple-
`ment spatial decomposition [9,30], which is generally
`more scalable. The force decomposition algorithm is
`an intermediate approach in that it is generally more
`efﬁcient than the replicated algorithm and easier to im-
`plement than the spatial decomposition [27].
`The ease of implementation of an MD algorithm is
`important given the need for multiple algorithms to ad-
`dress the variability encountered in mapping molecu-
`lar dynamics algorithms onto parallel architectures [8,
`9].
`In addition, experimenting with MD algorithms
`on novel parallel architectures is facilitated by tools
`aiding the parallelization process. Various tools have
`been applied to molecular dynamics simulations with
`varying success. Data parallel approaches have been
`found to be problematic due to the irregularity inherent
`to molecular dynamics [38], which is compounded by
`unstructured legacy applications [7]. Low-level tools
`such as MPI have been successful for performance[27],
`but do compromise readability and consequently main-
`tenance after the development period. Many good tools
`have been developed for problems structured similar-
`ly to molecular dynamics, but often target regularly
`structured applications, for example [12].
`There remains a long way to go in expediting the
`development of robust molecular dynamics algorithms.
`At the moment, there are tools which we have found
`
`Petitioner Microsoft Corporation - Ex. 1057, p. 61
`
`

`62
`
`P. Bała et al. / Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module
`
`mi
`
`(cid:2)
`
`bonds
`
`angles
`
`to ﬁll somewhat the void. We used the tool Pfortran
`to implement the replicated algorithm for the GRO-
`MOS96 MD module [36], followed by a parallelization
`using Co-Array Fortran. Our Pfortran parallelization
`was completed after an aggregate of about 60 hours for
`a team of two over a period of one week. The effort
`started with an SGI parallelization based on SGI direc-
`tives. The parallelization is machine independent and
`performs robustly.
`We brieﬂy review the MD model; the interested read-
`er is referred to [1,21,18] for detailed treatments. The
`MD method provides a numerical solution of classical
`(Newtonian) equations of motion
`d2ri
`= Fi(r1, r2, . . . , rN )
`(1)
`dt2
`where the force Fi(r1, r2, . . . , rN ) acting on particle i
`is deﬁned by the interaction potential U i(r1, r2, . . . , rN ).
`The general functional form of the potential is
`(cid:2)
`U(r1, r2, . . . , rN )
`Kb(rij − r0
`ij)2 +
`=
`(cid:2)
`(cid:3)
`(cid:5)
`
`Ka(ψij − ψ0
`ij)2
`(cid:4)
`
`(2)
`
`qiqj
`rij
`
`reduced to close to a linear dependence with the cutoff
`radius approximation coupled with strategic use of a
`pairlist [16,37]. Other algorithms used to reduce the
`cost of evaluation of nonbonded interactions include
`reaction ﬁeld methods [32] and multipole expansions
`of coulombic interactions [1,5,10,28].
`The shortcomings of parallel paradigm support for
`molecular dynamics stems from the difﬁculties posed
`by the irregularity of the calculation [12,14,40], and a
`general shortage of integrated tools for parallelization.
`Popular parallelization libraries such as PVM [13] and
`MPI [24], while suitable for irregular applications, of-
`fer little abstraction, requiring the programmer to man-
`age low-level details in the communication mechanism
`such as message identiﬁers. Higher-level methods such
`as HPF [17] encounter difﬁculties in dealing with ir-
`regular problems and legacy code [7].
`
`3. Pfortran and Co-Array Fortran
`
`A Fortran implementation of the Planguages, the
`Pfortran compiler extends Fortran with the Planguage
`operators which are designed for specifying off-process
`access [2,3]. In a sequential program the assignment
`statement speciﬁes a move of a value at the memory
`location represented by j to the memory location rep-
`resented by i. Planguages allow the same type of as-
`signment, however, the memory need not be local, as
`in the following example in a two-process system
`i@0 = j@1
`stating the intention to move the value at the memory
`location represented by j at process 1 to the memory
`location represented by i at process 0.
`With the aid of the @ operator one can efﬁciently
`specify broadcast of the value at memory location a
`for logical process 0 to the memory location a on all
`processes:
`a = a@0
`The other Pfortran operator consists of a pair of curly
`braces with a leading function, f{}. This operator
`represents in one fell swoop the common case of a
`reduction operation where the function f is applied to
`data across all processes. For example, to sum an array
`distributed across nP roc processes, with one element
`per process, one can write
`sum = +{a}
`Although a is a scalar at each process, it is logically
`an array across nProc processes. With @ and {}, a
`
`Petitioner Microsoft Corporation - Ex. 1057, p. 62
`
`1 + Ctcos(mtφt − φ0
`t )
`(cid:6)
`(cid:7)
`(cid:8)
`
`(cid:2) i
`
`+
`
`<j
`
`+ Bij
`r12
`ij
`
`Kt
`
`torsions
`
`(cid:2) i
`
`<j
`
`+
`
`+
`
`Aij
`r6
`ij
`+Uspecial
`where rij is the distance between atoms i and j, and
`other constants deﬁne force ﬁeld parameters for dif-
`ferent chemical atom types. Well known algorithms
`such as leap-frog [34] and Verlet [33,22] are used to
`calculate new positions and velocities.
`
`2. Related work
`
`The parallelization of molecular dynamics has been
`explored widely in the literature [4,6,8,9,11,15,19,23,
`27,29–31]. Fortunately, molecular dynamics simula-
`tions of biomolecular systems are well suited for par-
`allel computation since the forces acting on each atom
`can be calculated independently with a small amount
`of boundary information consisting of a neighborhood
`of atomic coordinates and in some cases, velocities.
`The leading computational component of the MD cal-
`culation involves the nonbonded forces, a calculation
`generally quadratic in the number of atoms that can be
`
`

`P. Bała et al. / Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module
`
`63
`
`variety of operations involving off-process data can be
`concisely formulated.
`In the Planguage model, processes interact through
`the same statement. Programmers have access to the
`local process identiﬁer called myProc. With myProc,
`the programmer distributes data and computational
`workload. The Planguage translators transform user-
`supplied expressions into algorithms with generic calls
`to a system-dependent library using MPI, PVM, shared
`memory libraries or other system-speciﬁc libraries.
`Cray Co-Array Fortran is the other parallelization
`tool considered in this study [25,26]. Co-Array For-
`tran introduces an additional array dimension for arrays
`distributed across processes. Co-Array Fortran gen-
`erally requires more changes in the legacy code than
`does Pfortran, however, Co-Array Fortran provides au-
`tomatic distribution of user-deﬁned arrays. Co-Array
`Fortran does not supply intrinsic reduction-operation
`syntax; these algorithms must be built on point-to-point
`exchanges by the programmer.
`
`4. Parallelization strategy
`
`We noted in the introduction that the molecular dy-
`namics parallelization methods of domain decomposi-
`tion and the replicated algorithm are at the extremes in
`implementation difﬁculty, domain decomposition be-
`ing the more difﬁcult.
`In terms of minimizing com-
`munication, domain decomposition can be shown to
`be optimal for various communication topologies and
`switches. With the replicated model, on the other hand,
`the accumulation of forces is a global operation. Both
`algorithms scale with respect to increasing problem size
`while maintaining a suitable workload per process [8,
`9], however, the replicated algorithm reaches a scala-
`bility limit as a function of the number of processes
`as a result of the global force accumulation [8]. The
`replicated algorithm, which was implemented in this
`study, performs more robustly than the spatial decom-
`position for a range of processor and problem conﬁgu-
`rations [9], making it the preferred method under some
`common conditions. In addition, the replicated algo-
`rithm is straightforward to implement.
`
`4.1. Pairlist parallelization
`
`Our parallelization of the replicated algorithm mod-
`iﬁed three parts of the program: 1) force calculation,
`2) pairlist calculation, and 3) I/O. We consider the
`principal components of the overall strategy shown in
`
`Fig. 1. Parallelization Schematic. One cycle through the ﬂowchart
`constitutes a single timestep. The parallelized nonbonded pairlist
`and force calculations are shown as branched regions of the ﬂowchart
`indicating the process-dependent control ﬂow through that part of
`the program. The single-line edges between components represent
`portions of the program executed redundantly at each process. The
`pairlist is calculated at intervals (roughly once every 10 timesteps);
`inter-process data movement (black circle) follows to accumulate
`energies and to exchange pairlist data used to retain an indexing
`scheme compatible with the sequential code. The force calculation is
`performed in parallel and at every timestep; immediately following,
`the forces are accumulated with a reduction operation (second black
`circle) as described in the text. The remainder of the timestep is
`performed redundantly by all processes, with each process holding
`the same system state.
`
`In the pairlist calculation all atom pairs are
`Fig. 1.
`scanned, and for each atom a list of atoms located with-
`in the cut-off radius is tabulated:
`do i = 1, n
`jl = 0
`do j = i+1, n
`if ( | X(i) - X(j) | < R ) then
`jl = jl + 1
`JNBL(jl,i) = j
`endif
`enddo
`JNB(i) = jl
`enddo
`GROMOS96 interactions models non-bonded inter-
`actions collectively between charge groups, rather than
`atoms. This modiﬁcation does not change the algorithm
`
`Petitioner Microsoft Corporation - Ex. 1057, p. 63
`
`

`i
`
`(3)
`
`64
`
`presented above, however its practical implementation
`is more complicated (see [7] for details).
`Based on the pair list, the forces can be evaluated
`efﬁciently as given in this pseudo code:
`do i = 1, n
`F(i) = 0
`do j = start(i), end(i)
`tmpforce = force ( X(i), X(JNB(j)) )
`F(i) = F(i) + tmpforce
`F(JNB(j)) = F(JNB(j)) - tmpforce
`enddo
`enddo
`Both the force and pairlist portions parallelization are
`based on a modulo strategy implementing a cyclic dis-
`tribution of pairs for the nonbonded-force routines [4],
`where in pseudo code
`if (MOD(chargeGroup-1,nProc).EQ.
`myProc)
`perform calculations
`endif.
`Our parallelization began with the GROMOS96 dis-
`tribution’s SGI-speciﬁc shared-memory code where
`each process calculates neighbors and forces for the
`portion of the pairlist assigned to it by the cyclic dis-
`tribution. In a distributed memory implementation is a
`good idea to distribute (rather than replicate) the pairlist
`array since it is the largest data structure in the pro-
`gram. Note that the cyclic distribution approximately,
`but effectively, distributes the load in the calculation.
`
`P. Bała et al. / Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module
`F(1:natoms*3) = +{F(1:natoms*3)}.
`The notation speciﬁes that the summation operator
`be applied to each instance of F in the process group
`with the mathematical meaning
`
`Fi = F (0)i + F (1)i + ··· + F (P−1)
`
`for 1 ≤ i ≤ natoms ∗ 3 and P processes.
`Without reduction operators, the Co-Array Fortran
`implementation of the force accumulation can be per-
`formed through explicit point-to-point exchanges as
`F(1:natoms) = 0.0
`call sync images()
`do iproc = 0, nProc-1
`F(1:natoms*3) = F(1:natoms*3) +
`F dist(1:natoms*3)[iproc]
`enddo.
`The co-array F dist is distributed across images,
`the Co-Array Fortran equivalent to processes, with the
`image speciﬁed by the index within the square brackets.
`The array F is a usual sequential array, local to each
`process and therefore considered replicated. So, in the
`code segment above,each process accesses the co-array
`portion of each other process to perform the sum in
`Eq. 3. sync images is a familiar shared–memory
`construct required to insure the one-sided accesses of
`non-local memory are consistent with the point of ac-
`cess in the program. In the Pfortran implementation,
`the consistency determination is left to the implemen-
`tation of the compiler. In the current Pfortran, synchro-
`nization is achieved through message buffering.
`The force calculation for covalent bonds, angles, di-
`hedrals and torsions may be performed independently
`and in parallel for each component. In the present im-
`plementation this part of the program was not paral-
`lelized due to its meager contribution to the total exe-
`cution time.
`
`4.2. Force calculation parallelization
`
`At the end of the force-calculation loop the replicated
`algorithm leaves processes with incomplete nonbonded
`forces, making it necessary to accumulate the values
`with a global summation. This step requires signiﬁcant
`communication and becomes the barrier to scalability
`with the replicated algorithm. However, the algorithm
`is effective over a wide range of process and problem
`conﬁgurations where the computation costs dominates
`the communication cost.
`The partial results calculated at each process are
`stored in a local copy of the force array (F in Pfortran
`pseudo code and F dist where Co-Array Fortran is
`used). Upon completion of the force calculation in
`each timestep, the partial forces are summed into the
`the force array at each process.
`The global accumulation of the force array is ex-
`pressed concisely by the Pfortran reduction operator
`as
`
`4.3. I/O
`
`A typical I/O strategy for SPMD codes is to use a
`designated process to open and read data, then to com-
`municate the data obtained from ﬁles to all other pro-
`cesses over a network. Similarly, non-replicated data is
`sent to, and then output from, a designated process. In
`that way, sequential semantics are retained for ﬁle I/O.
`In the Pfortran model, I/O from the sequential program
`must be modiﬁed to retain the sequential semantics.
`With the Cray Co-Array Fortran, however, the compil-
`er allows for synchronous ﬁle operations; that is, the
`
`Petitioner Microsoft Corporation - Ex. 1057, p. 64
`
`

`P. Bała et al. / Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module
`
`65
`
`disk operations are performed by all nodes and data is
`read by all images. Thus with Co-Array Fortran, I/O
`modiﬁcations are not required in general.
`A typical read operation is written using Pfortran as
`follows
`if ( myProc.eq.0 ) then
`read(unit,*) temperature
`endif
`temperature = temperature@0
`with the designated process broadcasting the value
`read. For the typical write operation, the designated
`process outputs the values. In the following example,
`partial energy terms are summed to the total energy
`for the system, and output by the designated process:
`energy = +{energy}
`if ( myProc.eq.0 ) then
`write(unit,*) energy
`endif
`More complicated “gathers” of data to the designated
`process may be required, however for the replicated
`algorithm, the resulting replication of state simpliﬁes
`this step. Performed manually, the I/O modiﬁcations
`were the most tedious aspect of our replicated algorithm
`implementation using Pfortran.
`
`4.4. Parallelization details
`
`GROMOS96 is written in FORTRAN77 for which
`Co-Array Fortran and Pfortran are supersets. The Pfor-
`tran implementation can run on systems where Pfortran
`is ported, independent of the underlying communica-
`tion paradigm; at present, MPI, PVM and a parallel
`simulator are targeted by Pfortran. 1 A port to a new
`communication library requires only changes in the
`Pfortran communication library. The Co-Array Fortran
`version is dependent on Cray systems, thus limiting the
`portability.
`The roughly 40,000 lines of GROMOS molecular
`dynamics code required the introduction of 65 declara-
`tions of co-arrays, about 300 lines containing co-array
`syntax, and almost the same number of calls to the Co-
`Array Fortran image-synchronization procedure. With
`Pfortran, various reduction operations were required
`33 times, with another 290 off-process data-access op-
`erations.
`In both the Co-Array Fortran and Pfortran
`parallelizations, most of modiﬁcations were associat-
`
`1Other Pfortran ports exist, but for machines that are no longer
`marketed.
`
`ed with the I/O subroutines, a feature of the replicated
`algorithm and I/O in general. The source code mod-
`iﬁcations to the code were reduced with the abstrac-
`tions provided by Co-Array Fortran and Pfortran, com-
`pared to an implementation using standard communi-
`cation libraries such as MPI or PVM. Co-Array Fortran
`and especially Pfortran requires just one additional line
`for each point-to-point communication compared to at
`least several lines of code using MPI or PVM libraries.
`Moreover, the co-array syntax and Pfortran operators
`provide an intuitive notation aiding the reasoning about
`the program in a way not dissimilar to the “+” operator
`in sequential languages.
`
`5. Program performance
`
`The performance of the parallelized GROMOS MD
`codes was measured using HIV-1 protease in water.
`The total system of 18,700 atoms consists of 1,970 pro-
`tein atoms, 14 ions and 5,572 water molecules. Peri-
`odic boundary conditions were used and a nonbonded-
`interaction cut-off radius of 8 Å. The principal features
`of the three multiprocessor systems used in this study
`are summarized in Table 1.
`We found close to linear speedup for the systems
`tested (Fig. 2). On the Cray T3E the program scales
`up to 32 processors (Fig. 3). With more processors we
`expect the communication costs to dominate (Figs 2
`and 3) for the HIV-1 system and parameters. Note that
`the one-time cost of data inputting was not removed
`from the total time. In practice this cost will be amor-
`tized by runs longer than our short, 100-step run with
`the I/O costs effectively going to zero and improving
`prospects for scalability.
`The communication costs are dominated by the re-
`duction of the force array during each timestep. This
`cost depends on the algorithm and the underlying
`communication layer. From Fig. 4, as expected, the
`O(N +logP ) algorithm underlyingPfortran reductions
`outperforms the Co-Array Fortran reduction algorithm
`we implemented in this study, a collection of point-to-
`point exchanges (see Fig. 3).
`
`6. Concluding remarks
`
`We completed an adaptation of the replicated algo-
`rithm implemented in the GROMOS96 MD module (as
`an SGI-speciﬁc implementation for shared-memory) to
`a portable distributed-memory version in about 60 pro-
`
`Petitioner Microsoft Corporation - Ex. 1057, p. 65
`
`

`66
`
`P. Bała et al. / Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module
`
`Table 1
`Features of the parallel computer systems and environments
`Nodes
`Processor
`Network
`OS
`Unicos
`mk 2.0.5
`MPT 1.3
`
`4 Gbit/sec
`3D torus
`
`CRAY
`T3E
`
`36
`
`300MHz
`
`8
`
`8
`
`194 MHz
`
`300–600 MHz
`
`8 Gbit/sec
`custom
`interconnect
`0.1 Gbit/sec
`Ethernet
`
`IRIX 6.2
`
`Linux 6.2
`MPICH 1.1.2
`
`T3E
`CAF
`SGI
`Cluster
`
`SGI
`Power
`Challenge
`R10K
`Pentium
`cluster
`
`1000
`
`100
`
`Time [s]
`
`1
`
`2
`
`8
`4
`Number of nodes
`
`16
`
`32
`
`Fig. 2. Total execution time for several processor conﬁgurations on SGI, Cray T3E and cluster of workstations. “CAF” denotes results for
`Co-Array Fortran on the Cray T3E, whereas the Pfortran implementation using MPI is labeled “T3E”.
`
`grammer hours. The parallel version of the code re-
`tains full functionality of the numerous options in the
`GROMOS96 MD module. The adaptation of the SGI-
`speciﬁc code was tantamount to parallelizing from the
`sequential version given the few changes made to the
`sequential coded to arrive at the SGI-speciﬁc shared-
`memory. The abstractions provided by the Planguages
`allowed us to focus our attention on details of logic
`unhampered by message-passing details, such as pro-
`viding a message-passing API with message sizes and
`tags, proper communicator and so on. What is more,
`the maintainability of the Pfortran code after parallel
`development is improved over an MPI version and the
`source code is portable to systems where the Pfortran
`compiler has been ported.
`We next adapted the Pfortran code to Co-Array For-
`tran, ﬁnding both implementations to scale with num-
`ber of processors as expected using the replicated al-
`gorithm on a constant-size problem. Pfortran has per-
`
`formance comparable to Co-Array Fortran, without the
`portability limitations. Pfortran also provides built in
`reduction operations and the facility for user-deﬁned
`ones. On the T3E, the O(N + logP ) algorithm un-
`derlying Pfortran reductions outperforms the Co-Array
`Fortran reduction algorithm that we implemented as a
`collection of point-to-point exchanges. Development
`of efﬁcient Co-Array Fortran reductions is left to the
`programmer. Having a library of such routines would
`be very useful.
`The small number of extensions and intuitive appli-
`cation of Pfortran and Co-Array Fortran contribute to
`making them effective tools for developing explicitly-
`parallel scientiﬁc applications. Key advantages in us-
`ing both approaches are 1) code with performanceclose
`to if not at the underlying libraries, 2) machine inde-
`pendence, 3) compiler optimizations, and 4) improved
`code readability. The abstractions allowing the pro-
`grammer to manipulate off-process data accesses fa-
`
`Petitioner Microsoft Corporation - Ex. 1057, p. 66
`
`

`P. Bała et al. / Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module
`
`67
`
`Total time
`
`I/O
`
`Force
`
`Force reduction
`
`1
`
`2
`
`8
`4
`Number of nodes
`
`16
`
`32
`
`100
`
`10
`
`1
`
`Time [s]
`
`Fig. 3. Execution time for different sections of the code for as a function of the number of Cray T3E computing nodes.
`
`T3E
`CAF
`SGI
`Cluster
`
`100
`
`10
`
`1
`
`Time [s]
`
`1
`
`2
`
`8
`4
`Number of nodes
`
`16
`
`32
`
`Fig. 4. Communication time for the reduction of the long-range force array for different numbers of computing nodes. CAF denotes results for
`Co-Array fortran on Cray T3E.
`
`cilitate one’s ability to reason about a code. That the
`off-process data access syntax are part of the language
`also permits compiler optimizations, which would be
`difﬁcult to perform with an API such as that provided
`by MPI (unpublished work in progress). Our use of
`Co-Array Fortran does not take advantage of its shared-
`memory paradigm, but it does highlight the advantage
`of the abstractions provided by Co-Array Fortran and
`Pfortran. The parallel programming models provided
`by the two paradigms have orthogonal and potentially
`complementary aspects. We are exploring the use of
`the paradigms together in a programming language.
`
`Acknowledgements
`
`Piotr Bala was supported by the Polish State Com-
`mittee for Scientiﬁc Research. Terry Clark and L. Ridg-
`way Scott were supported by NSF with the Nation-
`al Partnership for Advanced Computational Infrastruc-
`ture, NPACI. The computations were performed at the
`ICM, Warsaw University and the High Performance
`Computing Laboratory at the University of Chicago.
`
`References
`
`[1] M.J. Allen and D.J. Tildesley, Computer Simulation of Liquids,
`Oxford University Press, 1987.
`
`Petitioner Microsoft Corporation - Ex. 1057, p. 67
`
`

`68
`
`P. Bała et al. / Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module
`
`[2] B. Bagheri, Parallel Programming with Guarded Objects,
`Pennsylvania State University, Department of Computer Sci-
`ence, PhD dissertation, 1994.
`[3] B. Bagheri, T. Clark and L.R. Scott, Pfortran: a parallel dialect
`of Fortran, ACM Fortran Forum 11 (1992), 20–31.
`[4] B.R. Brooks and M. Hodoscek, Parallelization of CHARMM
`for MIMD machines. Chemical Design Automation News 7
`(1992), 16–22.
`[5] T.C. Bishop, R.D. Skeel and K. Schulten, Difﬁculties with
`multiple time stepping and fast multipole algorithm in molec-
`ular dynamics. Journal of Computational Chemistry 18(14)
`(1997), 1785–1791.
`[6] T. Clark and J.A. McCammon, Parallelization of a molecu-
`lar dynamics non-bonded force algorithm for MIMD architec-
`tures, Computers & Chemistry 14 (1990), 219–224.
`[7] T. Clark, R. von Hanxleden and K. Kennedy, Experiences
`in data-parallel programming. Journal of Scientiﬁc Program-
`ming 6 (1997), 153–138.
`[8] T. Clark, R. von Hanxleden, K. Kennedy, C. Koelbel and
`L.R. Scott, Evaluating parallel languages for molecular dy-
`namics computations. in: Scalable High Performance Com-
`puting Conference, IEEE Comput. Soc. Press, 1992, pp. 98–
`105.
`[9] T. Clark, R. von Hanxleden, J.A. McCammon and L.R. Scott,
`Parallelizing molecular dynamics using spatial decomposition.
`in: Scalable High Performance Computing Conference, IEEE,
`Knoxville, 1994, pp. 95–102.
`[10] H.-Q. Ding, N. Karasawa and W.A. Goddard III, Atomic level
`simulations on a million particles: The cell multipole method
`for Coulomb and London nonbond interactions. Journal of
`Computational Physics 97 (1992), 4309–4315.
`[11] D. Fincham, Parallel computers and molecular simulation.
`Molecular Simulation 1 (1987), 1–45.
`[12] S.J.Fink, S.R. Kohn and S.B. Baden, Efﬁcient run-time sup-
`port for irregular block-structured applications, J. Parallel and
`Distributed Computing 50 (1998), 61–82.
`[13] G.A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek
`and V. Sunderam, PVM: Parallel Virtual Machine, The MIT
`Press, 1994.
`[14] R. von Hanxleden, K. Kennedy, C. Koelbel, R. Das and J. Saltz,
`Compiler analysis for irregular problems in Fortran D, in: Pro-
`ceedings of the Fifth Workshop on Languages and Compilers
`for Parallel Computing, New Haven, 1992.
`[15] H. Heller, H. Grubmuller and K. Schulten, Molecular dynam-
`ics simulation on a parallel computer. Molecular Simulation
`5 (1990), 133–1650.
`J.W. Eastwood R.W. Hockney, Computer Simulation Using
`Particles, Cambridge University Press, Cambridge, 1987.
`[17] Proceedings of the High Performance Fortran Forum, Hous-
`ton, 1992.
`[18] R.W. Hockney and J.W. Eastwood, Computer simulation us-
`ing particles,
`Institute of Physics Publishing, Bristol and
`Philadelphia, 1994.
`[19] L.V. Kal´e, R. Skeel, M. Bhandarkar, R. Brunner, A. Gursoy,
`N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan and
`K. Schulten. NAMD2: Greater scalability for parallel molec-
`ular dynamics, Journal of Computational Physics 151 (1999),
`283–312.
`[20] S.L. Lin, J. Mellor-Crummey, B.M. Pettitt and G.N. Jr.
`Phillips, Molecular dynamics on a distributed-memory multi-
`processor, Journal of Computational Chemistry 13(8) (1992),
`1022–1035.
`
`[16]
`
`[28]
`
`[32]
`
`[21]
`
`[22]
`
`J.A. McCammon and S.C. Harvey, Dynamics of proteins and
`nucleic acids, Cambridge University Press, Cambridge, 1987.
`J.A. McCammon, B.M. Pettitt and L.R. Scott, Ordinary dif-
`ferential equations of molecular dynamics. Computers Math.
`Applications 28(10–12) (1994), 319–326.
`[23] F. M¨uller-Plathe and D. Brown, Multi-colour algorithms in
`molecular simulation: Vectorisation and parallelisation of in-
`ternal forces and constraints, Computer Physics Communica-
`tions 64 (1991), 7–14.
`[24] Message Passing Interface Forum, MPI: a message-passing
`interface standard. Technical report, 1994.
`[25] R.W. Numrich, F−−: a parallel extension to Cray Fortran,
`Scientiﬁc Programming 6 (1997), 275–284.
`[26] R.W. Numrich, J. Reid and K. Kim, Writing a multigrid
`solver using co-array Fortran, in: Recent Advances in Applied
`Parallel Computing, Lecture Notes in Computer Science 1541,
`B. Kågstr¨om, J. Dongarra, E. Elmroth and J. Wa´sniewski, eds,
`Springer-Verlag Berlin, 1998, pp. 390–399.
`[27] S. Plimpton, Fast parallel algorithms for short-range molecular
`dynamics. Journal of Computational Physics 117 (1995), 1–
`19.
`J. Singer, The Parallel Fast Multipole Method in Molecular
`Dynamics. PhD thesis, University of Houston, August 1995.
`[29] W. Smith, Molecular dynamics on hypercube parallel comput-
`ers. Computer Physics Communications 62 (1991), 229–248.
`[30] T.P. Straatsma, M. Philippopoulos and J.A. McCammon,
`NWChem: Exploiting parallelism in molecular simulations.
`Computer Physics Communications 128 (2000), 377–385.
`[31] V.E. Taylor, R.L. Stevens and K.E. Arnold, Parallel molecu-
`lar dynamics: Implications for massively parallel machines.
`Journal of Parallel and Distributed Computing 45 (1997),
`166–175.
`I.G. Tironi, R. Sperb, P.E. Smith and W.F. Gunsteren, A gen-
`eralized reaction ﬁeld method for molecular dynamics simu-
`lations, Journal of Chemical Physics 102 (1995), 5451–5459.
`[33] W.F. van Gunsteren and H.J.C. Berendsen, Algorithms for
`Brownian dynamics, Molecular Physics 45 (1982), 637–647.
`[34] W.F. van Gunsteren and H.J.C. Berendsen, A leap-frog algo-
`rithm for stochastic dynamics. Molecular Simulation 1 (1988),
`173–182.
`[35] W.F. van Gunsteren, H.J.C. Berendsen, F. Colonna, D. Perahia,
`J.P. Hollenberg and D. Lellouch, On searching neighbors in
`computer simulations of macromolecular systems, Journal of
`Computational Chemistry 5 (1984), 272–279.
`[36] W.F. van Gunsteren, S.R. Billeter, A.A. Eising, P.H. Hunen-
`berger, P. Kruger, A.E. Mark, W.R.P. Scott and I.G. Tironi,
`Biomolecular Simulation: The GROMOS96 Manual and User
`Guide, ETH Zurich and BIOMOS b.v., Zurich, Groningen,
`1996.
`[37] L. Verlet, Computer experiments on classical ﬂuids. I. Ther-
`modynamical properties of Lennard-Jones molecules, Physi-
`cal Review 159 (1967), 98.
`[38] R. von Hanxleden,
`for Machine-
`Compiler Support
`Independent Parallelization of Irregular Problems, PhD the-
`sis, Rice University, 1994.
`[39] S.T. Wlodek, T. Clark, L.R. Scott and J.A. McCammon,
`Molecular dynamics of acetylcholinesterase dimer complexed
`with tacrine, Journal of the American Chemical Society 119
`(1997), 9513–9522.
`[40] Z. Zhang and J. Torrellas, Speeding up irregular applications in
`shared-memory multiprocessors: Memory binding and group
`prefetching, Comp. Arch. News 23 (1995), 188–199.
`
`Petitioner Microsoft Corporation - Ex. 1057, p. 68
`
`

` Advances in
`
`Multimedia
`
`
`JouoJ rnnaal ol o of
`InnInduduuststs rial Engineering
`
`The Scientiﬁc
`World Journal
`
`International Journal of
`Distributed
` Sensor Networks
`
` Applied
`Computational
`Intelligence and Soft
`Computing
`
`Hindawi Publishing Corporation
`http://www.hindawi.com
`
`Volume 2014
`
`Hindawi Publishing Corpo

This document is available on Docket Alarm but you must sign up to view it.

Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

Up-to-date information for this case.
Email alerts whenever there is an update.
Full text search for other cases.
Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.

Access Government Site

We are redirecting you
to a mobile optimized page.

Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket

Supplemental Search

Search for PTAB Motions

PTAB Analytics

TTAB Analytics

Basic Search

Filters

Party Search

Advanced

Selected Courts

Recently Selected Courts

Find PTAB Decisions

PTAB Analytics

Special PTAB Alerts

Orange Book

Directly Search Federal Courts

Search Trademark ...

This document is available on Docket Alarm but you must sign up to view it.

Accessing this document will incur an additional charge of $.

Still Working On It

A few More Minutes ... Still Working

This document could not be displayed.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

One Moment Please

Your document is on its way!

Sealed Document

We are redirecting youto a mobile optimized page.

Document Unreadable or Corrupt

We are unable to display this document.

STEP 2 of 2

Choose your membership type

Flat-Fee

Pay-As-You-Go

Add your payment information

Login or Join

Enter your corporate Email

Thousands of your peers are saving time and gaining a competitive advantage with Docket Alarm.

Join Docket Alarm to perform smarter legal research.

Download this document and millions of others instantly with a Docket Alarm membership.

Join Docket Alarm and start performing smarter legal research.

Start tracking this docket instantly with a Docket Alarm membership.

Join thousands of your peers and start performing smarter legal research.

STEP 1 of 2

Millions of Documents | 15 Seconds to Signup

Hi !

Welcome to Docket Alarm

Welcome to Docket Alarm!

Explore Litigation Insights andManage Your Cases

Reset Password

What is PACER?

Why do I need it?

What will I be charged?

Do other courts have fees?

Basic Free Access

Welcome

Thank you

Check Firm Account

We are redirecting you
to a mobile optimized page.

Explore Litigation Insights and
Manage Your Cases