throbber
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998
`
`97
`
`Hyper-Systolic Parallel Computing
`
`Thomas Lippert, Armin Seyfried, Achim Bode, and Klaus Schilling
`
`Abstract—We introduce a new class of parallel algorithms for the exact computation of systems with pairwise mutual interactions of
`2
`n elements, so called n
`-problems. Hitherto, practical conventional parallelization strategies could achieve a complexity of O(np)
`with respect to the inter-processor communication, p being the number of processors. Our new approach can reduce the inter-
`)1
`processor communication complexity to a number Onp(
`. In the framework of Additive Number Theory, the determination of the
`optimal communication pattern can be formulated as h-range minimization problem that can be solved numerically. Based on a
`complexity model, the scaling behavior of the new algorithm is numerically tested on the connection machine CM5. As a real life
`2
`-problem, on the CRAY T3D, with
`example, we have implemented a fast code for globular cluster n-body simulations, a generic n
`striking success. Our parallel method promises to be useful in various scientific and engineering fields like polymer chain
`computations, protein folding, signal processing, and, in particular, for parallel level-3 BLAS.
`
`2
`
`Index Terms—Systolic algorithm, hyper-systolic algorithm, n-body computation, n2-loop computation, parallel computer, connection
`machine CM5 and Cray T3D, novel complexity class.
`
`—————————— ✦ ——————————
`
`moreover, on massively parallel computers with distributed
`memory, they imply a substantial amount of interprocessor
`communication as the elements xi of the array x are spread
`out over many processing elements, i.e., local memories.
`An example taken from astrophysics will illustrate the
`computational and communicational efforts required. In
`globular cluster simulations, molecular dynamics techniques
`with gravitational forces are applied to track the time evolu-
`tion of the cluster. A wide range of cluster sizes must be
`treated in order to extrapolate to large n. At the upper edge,
`exact state-of-the-art calculations are, at present, restricted to
`ensemble sizes n < O(50,000). The reason is that exact com-
`putations are needed, rather than hierarchical approxima-
`tions that would fail in modeling the relevant physics. In
`such applications, computations of all elements f(xi, xj) have
`to be performed at each time step, with the number of time
`steps along the system trajectory in the range of 103 to 105 [6],
`[7]. Thus, each integration step amounts to up to O(1010)
`floating point operations and O(105 p) elements to be com-
`municated between the processors of the parallel machine.1
`Various methods have been devised in the past to ex-
`actly solve the n-body problem. We mention two generic
`parallel approaches:
`1) The replicated data method [1] deals with p identical
`copies of the entire array x that contains n particles.
`These copies are to be placed within each processor.
`On these data, the computation is performed such
`that each processor k calculates the elements f(xi, xj)
`for j = 1, …, n and i
`=
`-
`+
`(
`
`
`)1
`,1  , i.e., each,
`
`k
`k
`
`np
`
`np
`
`1. We have to emphasize that this situation must be distinguished from
`simulations with short-range forces (where multi-million-particle simula-
`tions are today’s standard). In that case, parallel “linked-cells” algorithms
`are the method of choice [8].
`
`1 INTRODUCTION
`W
`ITHIN many scientific and engineering applications,
`one is faced with the intermediate computation of
`bilocal objects or functions, f(xi, xj), on a given set of num-
`bers xi, i = 1, …, n. Think, for example, of the exact treat-
`ment of two-body forces in n-body molecular dynamics [1]
`as employed in astrophysics or thermodynamics of instable
`systems, convolutions in signal processing [2], autocorrela-
`tions in time series [3], n-point polymer chains with long-
`range interactions, protein-folding [4], or genome analysis.
`In general, computation of all f(xi, xj) bilocals requires n2
`compute steps. With f(xi, xj) being symmetric in i and j, the
`calculation needs n(n - 1)/2 different pairings. Thus, the
`computational cost increases quadratically with the number
`of particles and, therefore, n2-calculations are classified as
`computationally hard problems [5].
`As one would wish to increase the computational speed,
`massively parallel computers become very attractive, as long
`as an appropriate speedup can be attained according to the
`number of processors, p. On parallel machines, the O(n2)
`)2
`complexity, in principle, can be segmented to O n
` com-
`p(
`putations of f(xi, xj) per processor. This would suggest
`= small.
`aiming for the low granularity limit, g
`Computations of n2-problems are not only extremely de-
`manding on memory-to-register (and cache) data movement,
`
`np
`
`
`
`(cid:129) T. Lippert and K. Schilling are with HLRZ, c/o FZ-Jülich, D-52425 Jülich,
`Germany. E-mail: lippert@theorie.physik.uni-wuppertal.de.
`(cid:129) A. Seyfried is with the Department of Physics, University of Wuppertal,
`D-42097 Wuppertal, Germany.
`(cid:129) A. Bode is with the Department of Physics, Humboldt University, D-10115
`Berlin, Germany.
`Manuscript received 28 July 1995; revised 22 Mar. 1997.
`For information on obtaining reprints of this article, please send e-mail to:
`tpds@computer.org, and reference IEEECS Log Number 104615.
`
`1045-9219/98/$10.00 © 1998 IEEE
`Petitioner Microsoft Corporation - Ex. 1056, p. 97
`
`

`

`98
`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998
`
`structure, is worked out in Section 4. In Section 5, the
`communication of the hyper-systolic algorithm is bench-
`marked in comparison with standard-systolic computa-
`tions on the Connection Machine CM5. In the last section,
`we consider a real life example as taken from astrophys-
`ics, namely, an n-body simulation code as implemented
`on the Cray T3D.
`
`
`
`=Â
`
`
`
`,1 2 3,
`
`
`
`,
`
`
`
`,
`
`p
`
`,
`
`i
`

`
`j
`
`,
`
` (3)
`
`
`
`
`
`,
`
`'
`
`j
`
`i
`
`=
`
`=Â
`
`=
`
`2 THE COMPUTATIONAL PROBLEM
`Our task is to compute the n2-problem
`n
`,
`,
`,
`:
`,
`,
`
`
`,,1 2 3
`,
` (1)
`h
`k
`n
`y
`f x x
`h
`
`%
`=
`=

`h
`h
`k
`1
`k
`with f(xh, xk) being any long-range pair-interaction, e.g.,
`gravitational or Coulomb forces, as well as more general
`types of interactions as occur in neural nets or protein
`folding. We further suppose that the forces are symmetric
`in h and k (up to a minus sign). The number of combina-
`tions {xh, xk} required for the computation of all f(xh, xk) is
`1
`n n
`-
`# ,
`n
` (2)
`=
`2
`2
`i.e., the computational complexity inherent in (1) is a number
`O(n2).
`On a parallel computer equipped with p compute nodes,
`this problem can be split into n
`p subtasks represented by
`the generic problem,
`p
`:
`,
`y
`f x x
`"
`i
`i
`1
`j
`with the input data
`x = (x1, x2, x3, …, xp) and y = (y1, y2, y3, …, yp), (4)
`as the resulting set written as one-dimensional arrays. In
`molecular dynamics terms, the sequence x can, e.g., be
`thought of as the coordinates of a number of n particles,
`whereas the sequence y describes the potential (or equiva-
`lently a component of the force) by which particle # i is
`influenced in the presence of all the other particles. In
`the following, we will concentrate on the generic prob-
`lem (3).
`The number of processors p of a given parallel machine and
`the number of elements n will differ in practical applications.
`Here, we will partition the array of n elements into n
`p subar-
`p data elements.
`rays, such that each processor deals with n
`Each subarray of p elements is distributed across the p proc-
`essors as in the generic case. The implementation of (1) with
`n > p, i.e., granularity g
`> 1, is a straightforward generali-
`=
`zation of the generic case, because (1) can be rewritten as:
`np
`p
`ÂÂ
`,
`,
`y
`f x
`x
`"
`(
`(
`(
`1
`1
`m
`j
`,
`,
`i
`i
`+ π
`
`l
`
`
`
`)1
`
`p i
`+
`
`-
`
`
`)1
`m p j
`-
`+
`
`'
`
`,
`,
`,
`1 2 3
`=
`(
`)
`1
`m
`p
`-
`
`,
`
`,
`p
`.
`j
`
`
`
`+
`
`( )
`5
`
`np
`
`=
`
`np
`
`l
`
`
`
`)1
`
`p i
`+
`
`-
`
`=
`
`=
`
`,
`1
`
`$
`
`p
`
`,
`,
`1 2 3
`l
`=
`with
`l
`
`-
`
` compute operations, the
` )2processor performs O np(
`
`
`total amount of communicated data is np elements.
`2) The parallel organization of the evaluation in form of a
`systolic array computation [1], [11], [12], [13], on ma-
`chines that allow for a one-dimensional ring connectiv-
`ity, proceeds with one moving and one fixed array. The
`n elements are distributed across the processing nodes,
`p elements are assigned to a given processor. In
`i.e., n
`each systolic step, the moving array is shifted by one
`position along the processing elements and subse-
`quently, the computations are performed. After p - 1
`“pulse” and “move” operations, all pairs f(xi, xj) have
`been generated. Again, the total amount of communi-
`cated data is np elements.
`In astrophysics, such systolic array computations are
`known as Orrery-algorithms [14], [15], [16]. Various pro-
`cedures which combine features of both methods are
`known in the literature. Common to such approaches is
`that the complexity of the interprocessor communication
`is O(np).
`In this work, we introduce a parallel computing concept
`that, to a certain extent, can be regarded as a generalization of
`systolic array computations with “pulse” and “circular
`movement.” We will point out that a formulation of the
`computational problem in the context of Additive Number
`Theory—leading to an h-range optimization problem [17]—
`defines a new class of algorithms. With the base of the h-
`range problem suitably chosen, one can recover the systolic
`realization of the computational problem. In addition, vari-
`ous other bases can be contrived that allow the complexity of
`the interprocessor communication to diminish. We will pres-
`ent a selection of shortest bases and a “regular” base that
`both reduce the complexity of the interprocessor communi-
`)12
`cation to O np(
`. The new method, in the following denoted
`as “hyper-systolic,” has in common with the standard-
`systolic array computation that, during each step, only one
`array is moving in a regular communication pattern. There-
`fore, it applies equally to both SIMD and MIMD machines,
`just as standard-systolic computation. The distinctive feature
`as compared to the standard-systolic concept, however, lies
`in the fact that storing of shifted arrays is required, similar to
`the replicated data method. But, in contrast to the latter,
`where the storage increases like n ¥ p, a significant reduction
`in interprocessor communication is achieved with only a
`moderate amount of additional storage. To reach the optimal
`hyper-systolic speedup, one needs storage space of size
`)12
`O np(
`. We will demonstrate that, in view of the state-of-the-
`art problem sizes mentioned above, this is not an obstacle for
`today’s parallel supercomputers.
`In Section 2, we present the generic form of the com-
`putational problem to be solved. In Section 3, we review
`the standard-systolic concept and introduce a complexity
`model to set the stage for further comparisons. In order to
`provide a graphical illustration, we will phrase the issue
`in systolic automata models [12], [18]. A suitable generali-
`zation of the latter, which gives us the means to graphi-
`cally represent the hyper-systolic parallel computing
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 98
`
`

`

`LIPPERT ET AL.: HYPER-SYSTOLIC PARALLEL COMPUTING
`
`99
`
`Fig. 1. A systolic automaton cell (large dotted square) is an arrangement of a processing element (open triangle) and a delay element (black
`square). The data element xi resides stationary in the processing element i. In each clock step, the delay element delivers another data element
`xj of the moving array. The function f(xi, xj) is computed and successively added to yi that is resident in the processing element as well. The plot
`shows the data location on the systolic array after the third clock step.
`
`3 SYSTOLIC ALGORITHM
`First, we describe the one-dimensional systolic implemen-
`tation of (3), with n = p, to set the stage for our further con-
`siderations. We use the concept of systolic automaton mod-
`els to illustrate parallel computing structures [12], [18].
`A systolic automaton consists of cells with data transfer
`and processing units. The cells are coupled to each other in
`a uniform next-neighbor wiring pattern, as depicted in Fig. 1.
`The processing elements (PE) are drawn as open triangles.
`They realize functions of equal load between consecutive
`communication events. The data transfer elements are
`drawn as black squares. They are delay elements (DE) that
`represent abstractions for memory locations in which data
`is shifted in and out in regular clock steps, which means the
`clock step of the abstract automaton, rather than a physical
`computer clock step. Data processing and transfer are com-
`pletely pipelined. More precise definitions of systolic arrays
`and algorithms along with many examples and applications
`can be found in two monographs by Petkov [12], [18].
`The graphical structural counterpart of (3) is given in
`Fig. 1. The cells are consecutively arranged in a linear order.
`Each cell is connected with its left and right next neighbors.
`Note that the systolic computation of (3) requires a toroidal
`topology of the linear array i.e., a ring connectivity. At clock
`step 0, a sequence x of p data elements is distributed over p
`PEs. This sequence will stay fixed in the following process.
`Initially, a copy x of the array x is made. The array x is
`shifted and the processing on all cells is performed subse-
`quently, clock step by clock step. Fig. 1 illustrates the state
`of the systolic computation after three clock steps.
`The small black boxes stand for the DEs, the function of
`which is to shift the data element xi in one clock step from
`cell # i to cell # (i + 1). Subsequently, the PEs perform the
`numerical computation of the function f, i.e., they combine
`the elements of the fixed array x with the elements of the
`shifted array x . The result in the ith cell is added to the
`resulting data element yi of the array y such that, after p - 1
`steps, the resulting data elements are completely computed
`and are distributed over all processing elements. In terms of
`the molecular dynamics example, each particle coordinate
`is located together with its respective force value in the
`same processing element.
`
`
`
`
`
`+
`
`
`
`The mapping of the abstract automaton model onto a
`real parallel computer appears straightforward: The PEs are
`identified with the processors of the parallel machine, the
`DEs symbolize registers or memory locations exchanging
`data via the interprocessor communication network. A
`connectivity of the parallel system is required that allows
`the embedding of a logical ring to realize the communica-
`tion pattern of the systolic automaton.
`The explicit systolic realization of (3) is given in Algo-
`rithm 1 for the generic case n = p.
`Algorithm 1. Conventional systolic algorithm.
` foreach processor i = 1 : p Πsystolic ring
`x
`x
`0 =
`i
`i
` for j = 1 : p - 1
`x
`x
`1
`j=
`j
`-
` mod
`1
`i
`i
`p
`(  ,  )f x x
`
`y
`y
`0
`j
`+
`=
`i
`i
`i
`i
` end for
`end foreach
`3.1 Communication Model
`For further evaluation and comparison of complexities be-
`tween systolic and various hyper-systolic methods, we intro-
`duce a communication complexity model.
`The time required for a processor to communicate a
`message of m words to a neighbor processor node is mod-
`eled as tl + mtt, where tl is the start-up time that is assumed
`to be independent of the message length and tt is the data
`transmission time per word.
`For the systolic computation of (3), p - 1 systolic steps are
`performed, where, in the generic case, each processor sends
`exactly one word to its neighbor. Thus, the total number of
`words to send is a number order p2, one word per processor
`in one systolic step, and a total communication time of
`T = (p - 1)(tl + tt)
`
` (6)
`
`is found.2
`2. One can improve on the scaling of the systolic realization by taking into
`account the symmetries. The computations then can be reduced to p(p - 1)/2
`processing operations, but p(p + 1) communication events have to be per-
`formed, since a moving result array has to be shifted p/2 + 1 times. This
`method goes under the name “Half-Orrery” algorithm [15] and originally
`was introduced by Seitz [16].
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 99
`
`

`

`100
`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998
`
`For the case n > p, the array will be partitioned into np parts.
`
`One can organize the computation such that, successively, all
`required computations between all n data elements can be
`constructed, according to (5). Each inner loop for each array
`is carried out as described for the generic case n = p; addi-
`tionally, combinations between rows must be computed. In
`p words to
`each systolic step, a given processor has to send n
`its neighbor. As communication is pipelined, the total
`communication time thus turns out to be
`(
`)1
`
`.
`T
`p
`t
`t
`=
`-
`+
`t
`l
`
`
`
`np
`
`
`
` (7)
`
`We call this reduced matrix “hyper-systolic” matrix . From
`this matrix, it is easy to see that, indeed, all required pairs
`come together within the processors: Let us, e.g., consider
`combinations with element # 1: In the first column, the
`combinations {1, 2}, {1, 4}, {1, 6}, and {1, 10}, in the 16th col-
`umn, {1, 3}, {1, 5}, {1, 9}, and {1, 16}, in the 14th column, we
`have {1, 15}, {1, 14}, and {1, 7}, in the 12th column, {1, 12} and
`{1, 13}, and, in the eighth column, the combinations {1, 11}
`and {1, 8} can be found. As the system is toroidally closed, it
`is evident that all pairing are present. As a result, we need
`only four shifts and five arrays to be stored, to get all pairs
`done. In the following, it will be important that the structure
`presented implies a homogeneous load of all processors in
`each step of the hyper-systolic computation, as was the case
`in the systolic computation.
`We note, further, that in order to realize (3), we cannot
`just add the outcome of the computations to one resulting
`array: For every row xt to be stored, a separate, intermedi-
`ate array yt is required. In a process that involves the in-
`verse sequence of shifts, a moving “collector” array is
`shifted back, while step by step, the results in the arrays yt
`are accumulated. Finally, y contains the desired results.
`Therefore, for this small example, in total, eight shift-
`operations are required. In the standard systolic process, 15
`shifts of the moving array must be carried out, while, for
`Half-Orrery, 17 shifts are to be performed [15].
`4.2 Hyper-Systolic Breviary
`Next, we give a general prescription to implement (3) accord-
`ing to the ideas exposed in the last section. The new algorithm
`is called hyper-systolic, the name referring to the particular
`features of the underlying communicational structure.
`1) For a given array x of length p, k replicas are gener-
`ated by shifting the array x k times by a stride at and
`storing the resulting arrays as xt, 1 £ t £ k. The vari-
`able strides at of the shifts are chosen such that
`a) all pairs of data elements (taken within columns
`of ) occur at least once and the number of equal
`pairings is minimized,
`b) the number of shifts k is minimized, under the con-
`straint that the strides at can be realized efficiently
`by the given parallel computer’s connectivity.
`2) The p(p - 1)/2 results yi, according to (3), are succes-
`sively computed and are added to k arrays yt.
`3) Finally, applying the inverse shift sequence, a collec-
`tor array y is shifted back by the inverse sequence of
`shifts, while the intermediate results in the array yt is
`added to y in step k - t.
`4.3 Graphical Representation
`The concept of traditional systolic automata models can be
`extended to the hyper-systolic parallel computing structure.
`For the graphical representation of the latter, we use the
`objects introduced in Section 3. Fig. 2 shows the ith hyper-
`systolic cell unit. The DEs again represent input and output.
`Additionally, we have drawn open rectangles to symbolize
`the memory locations for the stored data elements xt and
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 100
`
`4 HYPER-SYSTOLIC ALGORITHM
`The main focus of this work is on the reduction of the total
`time expense for the interprocessor communication in the
`parallel computation of (3). We will show for the generic
`case, (3), that for the communicational part a complexity of
`)32
`O p(
` can be achieved by reorganization of the data-
`movement. Going to the case n > p, the conventional com-
`plexity can be reduced from O(np) to O np(
`)12
` for the hyper-
`systolic algorithm. The amount of processing operations is
`not altered compared to conventional methods.
`4.1 Motivation
`Consider the matrix $, called systolic matrix, which explic-
`itly shows the data elements of the shift-array xt, (4), for the
`clock steps t = 0, 1, 2, …, p - 1, as delivered successively in
`the systolic implementation, Section 3:
`1 2 3
`p
`
`2 3 4
`1
`p
`3 4 5
`2
`1
`$ = ◊
`◊
`◊
`◊
`◊
`◊
`1
`.
`p
`1
`2
`1 2
`p
`p
`p
`-
`-
`-
`
`
`We observe in (8) that all combinations of two different
`elements within the columns of $ actually occur p times, if
`each shifted array xt is stored for t = 0, 1, 2, …, p - 1 and
`combinations between all the rows are allowed. Note that
`storing the shifted arrays is not required in the systolic con-
`cept with only two arrays, one fixed and one moving array.
`Here, as time proceeds, an increasing number of “resident”
`arrays is stored, with only one array in movement.
`The matrix $ shows that a p-fold redundancy of pairings
`is encountered if all shifted arrays are stored. It appears
`natural to reduce this redundancy of pairings and, thus, to
`save on interprocessor communication by storing some of
`the arrays xt only, for a suitably selected subset of time steps t.
`To illustrate the idea, let us consider a simple example
`for p = 16; from matrix $ in (8) we extract only five rows:
`1
`2
`3
`4
`5
`6
`7
`8
`9
`10 11 12 13 14 15 16
` (9)
`
`2
`3
`4
`5
`6
`7
`8
`9
`10 11 12 13 14 15 16
`1
`4
`5
`6
`7
`8
`9
`10 11 12 13 14 15 16
`1
`2
`3
`6
`7
`8
`9
`10 11 12 13 14 15 16
`1
`2
`3
`4
`5
`9 .
`10 11 12 13 14 15 16
`1
`2
`3
`4
`5
`6
`7
`8
`
` (8)
`
`0123
`
`==== =
`
`tttt t
`
`
`
`
`
`  
`
`
`
`
`
`
`
`
`
`
`
` =
`
`

`

`LIPPERT ET AL.: HYPER-SYSTOLIC PARALLEL COMPUTING
`
`101
`
`p
`
`Fig. 2. Processing element (PE) (open triangle), delay element (DE) (black square), and store elements (SE) (open squares), arranged in a hyper-
`systolic automaton cell (large dotted square). The element xi resides stationary. In each clock step t, the DE delivers a new data element from loca-
`- 1),modp + 1] to be stored in the local SE and transmitted further to the next DE. Subsequently, the computation is performed on elements
`tion [(i + at
`yi at
`stored in the SE. The results are added to memory locations [
`]
`. Note that each DE is equipped with r connectors to account for r different
`+
`-
`+
`(
`1)mod
`1
`strides at. Second, the results yi at
`+ are step by step added to the array y that is shifted by the inverse sequence of strides.
`1]
`
`+
`
`[
`
`-
`
`1mod
`
`p
`
`Fig. 3. Three abstract automaton cells as part of a hyper-systolic array with r= 2 The data location on a part of the hyper-systolic array is shown
`after the second systolic cycle.
`
`yt, with 1 £ t £ k. The DEs now have one input and r output
`connectors, one separate connector for each required shift
`width at.
`Again, the cells are arranged as a uniform grid; in addi-
`tion to the next-neighbor connectivity, the r new nonlocal
`regular connections between the processors are indicated in
`Fig. 3. They correspond to the shifts of data elements by
`strides at. According to the number r of different strides at, r
`direct connections would be optimal. From the point of
`view of the original one-dimensional systolic array struc-
`ture—due to these additional communication connec-
`tions—the (one-dimensional) space of processing elements
`has acquired topologically nontrivial interconnections. Fast
`regular interprocessor data transfer to distant cells can be
`achieved in one clock step via these interconnections that
`are used as shortcuts. Thus, the characterization of our new
`algorithm as hyper-systolic. We emphasize that hyper-
`systolic data movement and storing extends the standard-
`
`systolic concept. From the uniform structure of the hyper-
`systolic automaton model and the regularity of the inter-
`processor communication, it is evident that the PEs can per-
`form functions of equal load in each hyper-systolic step.
`The amount of processing operations that is equal for each
`processor increases with an increasing number t of stored
`1
`.
`
`arrays  ,y ¢
`t
`t
`£ ¢ £
`t
`Fig. 3 represents the graphical structural counterpart of
`(3) for the hyper-systolic algorithm. At clock step t = 0, a
`sequence x of p data elements is distributed over p proc-
`essing elements. In every clock step t = 0, 1, 2, …, k, a copy
`xt of the array xt-1 is generated, and, in the next clock step,
`the array is shifted by a stride at. Subsequently, the new
`array xt is stored in the SEs. Then, the processing can be
`done by the PEs involving ever more of the stored arrays
`for increasing t. Fig. 3 shows the situation after two clock
`steps. Note, however, that only one array is moving just as
`in standard-systolic computation.
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 101
`
`

`

`102
`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998
`
`The formulation in terms of an h-range problem appears
`to provide quite a general framework covering a whole
`class of algorithms: Any chosen base A induces its specific
`new algorithm; the realization of Ak with k = p - 1 brings us
`back to the standard-systolic algorithm with the base
` (14)
`
`0{ ,
`,
`,
`},
`1
`
`1,
`
`,
`1,
`A
`a
`a
`a
`i
`p
`=
`=
`" =
`-
`1
`1
`1
`p
`p
`i
`the realization
`p
`1
`0
`},
`,
`,
`{ ,
`,, (15)
`a
`a
`a
`A
`=
`=
`2
`1
`i
`leads to the Half-Orrery algorithm.
`A lower bound on the minimal k, with optimal complex-
`ity, can be derived by the following consideration:
`The total number of combinations required is p(p - 1)/2.
`Let the matrix  of (10) be realized by k shifts, i.e., by k + 1
`arrays. The number of possible combinations between the
`12
`elements of a given column of  follows as k +
`. Thus,
`
`the following inequality holds:
`(
`)
`1
`p p
`12
`k
`-
`+

`2
`
`
` (16)
`
`
`
`p
`
`,
`
`
`
`therefore,
`
`34
`
`12
`
` (17)
`.
`p
`k
`≥ -
`+
`-
`with k integer. Obviously, a sequence Ak with less than kmin
`elements cannot solve the h-range problem presented above.
`In other words, the complexity for the inter-processor com-
`munication of any hyper-systolic algorithm to solve (3) is
`bounded by p p from below for n = p and by n p for n > p.
`4.5 Shortest Bases
`In absence of analytical solutions, for various numbers of
`processors p ranging from four to 64, we could construct
`shortest bases by straightforward computation according to
`the above minimization prescription. We tried to prefer bases
`with elements being a power of two, as these strides might
`lead to fastest interprocessor communication on most parallel
`machine’s networks. Computing these bases, we assumed
`that shifts with different stride require equal communication
`time. The general case, treating different communication
`times for different strides, will be discussed elsewhere [22].
`Table 1 collects our selection of extremal bases. For large
`p, it becomes exponentially difficult to find extremal bases.
`For illustration of the hyper-systolic shift pattern, let us
`present the hyper-systolic matrix , as constructed by the
`extremal base for p = 32, see Fig. 4.
`The realization of the hyper-systolic computation is shown
`in Algorithm 2: We emphasize that, in general, one cannot just
`combine all the rows; rather, we must take care for the fact that
`a redundancy can occur. Since the minimality condition, (17),
`, more than p2
`is unlikely to be fulfilled in general, i.e., k
`p≥
`elements can be computed. Therefore, we define a
`“multiplicity table” m(i, j), a k ¥ k-matrix (since  has k rows),
`to ensure any pair (i, j) to be computed once and only once.
`The matrix element m(i, j) = 1 if pair (i, j) was not yet treated
`and 0 if (i, j) or (j, i) have been computed before. As we only
`
`-
`
`p
`2
`
`
`
`-
`
`
`
`
`
`p
`2
`
`i
`" =
`
`1
`,
`
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 102
`
`In systolic cycle t, the results as computed in the ith cell
`are added to the corresponding elements yt¢, 1 £ t¢ £ t.
`Finally, when p(p - 1)/2 processing operations have been
`performed, the collector array y is shifted back in k steps by
`strides inverse to the above sequence, while the intermedi-
`ate results stored on the arrays yt are step-wise added to y.
`4.4 Optimization of Hyper-Systolic Computing
`Our task is to construct a sequence of strides Ak = {a0 = 0, a1,
`a2, a3, …, ak} that—in the course of the hyper-systolic proc-
`ess—provides us with k + 1 sequences xt, t = 0, 1, 2, …, k,
`represented as rectangular matrix . Note that we have
`introduced the zero-element.
`x
`x
`x
`x
`
`1
`2
`3
`a
`a
`p a
`+
`+
`x
`x
`x
`x
`0
`0
`0
`1
`2
`3
`a
`a
`p a
`+
`+
`1
`1
`1
`◊
`◊
`◊
`◊
`◊
`◊
`.
`10
`)
`(
`x
`x
`x
`x
`
`
`1
`2
`3
`a
`a
`a
`p a
`+
`+
`In (10), all indices are understood as [(i + at - 1), modp + 1].
`It is required that all combinations of elements from one to
`p can be constructed within the columns of . k should be
`chosen as small as possible.
`This recipe can be cast in terms of Additive Number
`Theory.
`.
`Let I be the set of integers m
`
`
`,{ ,0 1 2
`,
`
`}1
`,
`p
`pp

`- Œ
`=
`0
`
`
`Find the ordered set A
`,
`(
`,
` of
`,
`0
`,
`,
`)
`a
`a a
`a
`a
`1
`k
`Π+
`=
`=
`0
`3
`0
`1
`2
`k
`k
`k + 1 integers, with k minimal, where each m Œ I, (0 £ m £
`p - 1), can be represented at least once as the ordered par-
`tial sum
` (11)
`m a
`a
`a
`=
`i
`i
`i
`
`
`
`
`
`k
`
`
`
` 
`
`+
`
`+
`
`a
`0
`a
`1
`
`+
`
`+
`
`k
`
`+
`
`k
`
`
`
`+
`
`k
`
`
`
` =
`
`+
`
`+
`1 
`
`+
`
`+
`
`+
`
`j
`
`or as
`
`m p
`=
`
`-
`
`(
`
`a
`i
`
`+
`
`a
`i
`
`+
`1 
`
`+
`
`+
`
`a
`i
`
`+
`
`j
`
`)
`
`,
`
` (12)
`
`with
`
` (13)
` .
`0
`,
`,
`j
`i
`k
`i
`j
`£ + £

`0
`This optimization problem, with the extremal base Ak as so-
`lution, is reminiscent of the “postage stamp problem,” a fa-
`mous problem in Additive Number Theory [17], [20]. In con-
`trast to the original postage stamp problem, here we deal
`with sequences Ak instead of sets and with ordered partial
`sums. In these terms, we are looking for the h-range p(h, Ah)
`of the extremal basis Ah with only ordered partial sums of the
`elements of Ah allowed, as given by (11), (12), and (13).
`Unfortunately, no analytical solution to this optimization
`task exists. Therefore, we have to recourse to numerical
`methods to find optimal (with least overall communica-
`tional expense) hyper-systolic bases. In this paper, we em-
`ploy direct computation to find these bases. Furthermore,
`we will propose an approximate solution (a so-called regu-
`lar base) that carries optimal asymptotic scaling properties
`and is well-suited for the hyper-systolic realization on par-
`allel SIMD and MIMD machines. Elsewhere, we will dis-
`cuss Simulated Annealing methods [22].
`
`

`

`LIPPERT ET AL.: HYPER-SYSTOLIC PARALLEL COMPUTING
`
`103
`
`32123711
`
`19
`
`IK??AIIELAIDEBJI
`
`
`
`
`
`1
`
`mod
`
`1
`
`+
`
`p
`
`?FKJ=JE
`
`il
`
`il
`
`
`
`
`
`il
`
`
`
`il
`
`
`
`
`
`y
`1
`j
`-
`(
`i a
`+ -
`j
`
`
`
`+
`
`p
`
`>=?IDEBJ
`j
`+
`i
`1
`1
`)mod
`
`+
`
`p
`
`y
`
`
`
`TABLE 1
`SHORTEST BASES FOR VARIOUS MACHINE SIZES
`p
`k Ak
`k Ak
`p
`2
`1 1
`5
`2
`1 1
`4
`2
`1 2
`7
`2
`1 2
`6
`3
`1 1 2
`9
`3
`1 1 2
`8
`3
`1 2 2
`11
`3
`1 2 2
`10
`3
`1 2 4
`13
`3
`2 1 4
`12
`4
`1 1 1 4
`15
`4
`1 1 1 4
`14
`4
`1 2 2 4
`17
`4
`1 1 2 8
`16
`4
`2 1 8 4
`19
`4
`3 6 8 4
`18
`5
`1 1 1 4 4
`21
`4
`3 10 2 5
`20
`5
`1 1 1 4 4
`23
`5
`1 1 1 4 4
`22
`5
`1 1 1 4 8
`25
`5
`1 1 2 8 8
`24
`5
`1 2 3 4 4
`27
`5
`1 2 2 8 8
`26
`6
`1 1 1 4 4 4
`29
`6
`1 1 1 4 4 4
`28
`6
`1 1 1 4 4 4
`31
`6
`1 1 1 4 4 4
`30
`6
`1 1 1 4 4 8
`36
`6
`3 6 1 8 4 12
`32
`7
`3 1 11 7 16 4 4
`64
`8
`1 1 12 3 10 8 20 4
`48
`10 11 12 13 14 15 16 17 18 19 20
`21 22 23 24 25 26 27 28 29 30 31
`9
`8
`7
`6
`5
`4
`3
`2
`1
`10 11 12 13 14 15 16 17 18 19 20
`9
`21 22 23 24 25 26 27
`28 29 30 31 32
`8
`7
`6
`5
`4
`3
`2
`11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
`10
`28 29 30 31 32
`1
`9
`8
`7
`6
`5
`4
`3
`11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
`28 29 30
`31 32
`1
`2
`10
`9
`8
`7
`6
`5
`4
`10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
`31 32
`1
`2
`3
`4
`5
`6
`9
`8
`12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
`31 32
`1
`2
`3
`4
`5
`6
`7
`8
`9
`10
`20 21 22 23 24 25 26 27 28 29 30 31 32
`1
`2
`3
`4
`5
`6
`7
`8
`9
`10
`11 12 13 14 15 16 17 18
`Fig. 4. Matrix  for n= 32 using the shortest base A6 = (1 1 1 4 4 8).
`want to loop over i with j > i to spare operations in Algo-
`rithm 2, we further define a “symmetry table” s(i, j) that
`tells us whether one and the same combination will occur
`twice.
`Algorithm 2. Hyper-systolic algorithm for shortest base.
` foreach processor i = 1 : p Πsystolic ring
`x
`x
`0 =
`i
`i
`for j = 1 : k
`x
`x
`1
`j
`j
`-
`=
`i
`i a
`+ -
`j
` end for
` for j = 0 : k
` for l = j + 1 : k
` if m(j, l) == true
`
`(  ,  )f x x
`y
`y
`j
`j
`j
`=
`+
`i
`i
`i
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket