`
`97
`
`Hyper-Systolic Parallel Computing
`
`Thomas Lippert, Armin Seyfried, Achim Bode, and Klaus Schilling
`
`Abstract—We introduce a new class of parallel algorithms for the exact computation of systems with pairwise mutual interactions of
`2
`n elements, so called n
`-problems. Hitherto, practical conventional parallelization strategies could achieve a complexity of O(np)
`with respect to the inter-processor communication, p being the number of processors. Our new approach can reduce the inter-
`)1
`processor communication complexity to a number Onp(
`. In the framework of Additive Number Theory, the determination of the
`optimal communication pattern can be formulated as h-range minimization problem that can be solved numerically. Based on a
`complexity model, the scaling behavior of the new algorithm is numerically tested on the connection machine CM5. As a real life
`2
`-problem, on the CRAY T3D, with
`example, we have implemented a fast code for globular cluster n-body simulations, a generic n
`striking success. Our parallel method promises to be useful in various scientific and engineering fields like polymer chain
`computations, protein folding, signal processing, and, in particular, for parallel level-3 BLAS.
`
`2
`
`Index Terms—Systolic algorithm, hyper-systolic algorithm, n-body computation, n2-loop computation, parallel computer, connection
`machine CM5 and Cray T3D, novel complexity class.
`
`—————————— ✦ ——————————
`
`moreover, on massively parallel computers with distributed
`memory, they imply a substantial amount of interprocessor
`communication as the elements xi of the array x are spread
`out over many processing elements, i.e., local memories.
`An example taken from astrophysics will illustrate the
`computational and communicational efforts required. In
`globular cluster simulations, molecular dynamics techniques
`with gravitational forces are applied to track the time evolu-
`tion of the cluster. A wide range of cluster sizes must be
`treated in order to extrapolate to large n. At the upper edge,
`exact state-of-the-art calculations are, at present, restricted to
`ensemble sizes n < O(50,000). The reason is that exact com-
`putations are needed, rather than hierarchical approxima-
`tions that would fail in modeling the relevant physics. In
`such applications, computations of all elements f(xi, xj) have
`to be performed at each time step, with the number of time
`steps along the system trajectory in the range of 103 to 105 [6],
`[7]. Thus, each integration step amounts to up to O(1010)
`floating point operations and O(105 p) elements to be com-
`municated between the processors of the parallel machine.1
`Various methods have been devised in the past to ex-
`actly solve the n-body problem. We mention two generic
`parallel approaches:
`1) The replicated data method [1] deals with p identical
`copies of the entire array x that contains n particles.
`These copies are to be placed within each processor.
`On these data, the computation is performed such
`that each processor k calculates the elements f(xi, xj)
`for j = 1, …, n and i
`=
`-
`+
`(
`
`
`)1
`,1 , i.e., each,
`
`k
`k
`
`np
`
`np
`
`1. We have to emphasize that this situation must be distinguished from
`simulations with short-range forces (where multi-million-particle simula-
`tions are today’s standard). In that case, parallel “linked-cells” algorithms
`are the method of choice [8].
`
`1 INTRODUCTION
`W
`ITHIN many scientific and engineering applications,
`one is faced with the intermediate computation of
`bilocal objects or functions, f(xi, xj), on a given set of num-
`bers xi, i = 1, …, n. Think, for example, of the exact treat-
`ment of two-body forces in n-body molecular dynamics [1]
`as employed in astrophysics or thermodynamics of instable
`systems, convolutions in signal processing [2], autocorrela-
`tions in time series [3], n-point polymer chains with long-
`range interactions, protein-folding [4], or genome analysis.
`In general, computation of all f(xi, xj) bilocals requires n2
`compute steps. With f(xi, xj) being symmetric in i and j, the
`calculation needs n(n - 1)/2 different pairings. Thus, the
`computational cost increases quadratically with the number
`of particles and, therefore, n2-calculations are classified as
`computationally hard problems [5].
`As one would wish to increase the computational speed,
`massively parallel computers become very attractive, as long
`as an appropriate speedup can be attained according to the
`number of processors, p. On parallel machines, the O(n2)
`)2
`complexity, in principle, can be segmented to O n
` com-
`p(
`putations of f(xi, xj) per processor. This would suggest
`= small.
`aiming for the low granularity limit, g
`Computations of n2-problems are not only extremely de-
`manding on memory-to-register (and cache) data movement,
`
`np
`
`
`
`(cid:129) T. Lippert and K. Schilling are with HLRZ, c/o FZ-Jülich, D-52425 Jülich,
`Germany. E-mail: lippert@theorie.physik.uni-wuppertal.de.
`(cid:129) A. Seyfried is with the Department of Physics, University of Wuppertal,
`D-42097 Wuppertal, Germany.
`(cid:129) A. Bode is with the Department of Physics, Humboldt University, D-10115
`Berlin, Germany.
`Manuscript received 28 July 1995; revised 22 Mar. 1997.
`For information on obtaining reprints of this article, please send e-mail to:
`tpds@computer.org, and reference IEEECS Log Number 104615.
`
`1045-9219/98/$10.00 © 1998 IEEE
`Petitioner Microsoft Corporation - Ex. 1056, p. 97
`
`
`
`98
`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998
`
`structure, is worked out in Section 4. In Section 5, the
`communication of the hyper-systolic algorithm is bench-
`marked in comparison with standard-systolic computa-
`tions on the Connection Machine CM5. In the last section,
`we consider a real life example as taken from astrophys-
`ics, namely, an n-body simulation code as implemented
`on the Cray T3D.
`
`
`
`=Â
`
`
`
`,1 2 3,
`
`
`
`,
`
`
`
`,
`
`p
`
`,
`
`i
`
`π
`
`j
`
`,
`
` (3)
`
`
`
`
`
`,
`
`'
`
`j
`
`i
`
`=
`
`=Â
`
`=
`
`2 THE COMPUTATIONAL PROBLEM
`Our task is to compute the n2-problem
`n
`,
`,
`,
`:
`,
`,
`
`
`,,1 2 3
`,
` (1)
`h
`k
`n
`y
`f x x
`h
`
`%
`=
`=
`π
`h
`h
`k
`1
`k
`with f(xh, xk) being any long-range pair-interaction, e.g.,
`gravitational or Coulomb forces, as well as more general
`types of interactions as occur in neural nets or protein
`folding. We further suppose that the forces are symmetric
`in h and k (up to a minus sign). The number of combina-
`tions {xh, xk} required for the computation of all f(xh, xk) is
`1
`n n
`-
`# ,
`n
` (2)
`=
`2
`2
`i.e., the computational complexity inherent in (1) is a number
`O(n2).
`On a parallel computer equipped with p compute nodes,
`this problem can be split into n
`p subtasks represented by
`the generic problem,
`p
`:
`,
`y
`f x x
`"
`i
`i
`1
`j
`with the input data
`x = (x1, x2, x3, …, xp) and y = (y1, y2, y3, …, yp), (4)
`as the resulting set written as one-dimensional arrays. In
`molecular dynamics terms, the sequence x can, e.g., be
`thought of as the coordinates of a number of n particles,
`whereas the sequence y describes the potential (or equiva-
`lently a component of the force) by which particle # i is
`influenced in the presence of all the other particles. In
`the following, we will concentrate on the generic prob-
`lem (3).
`The number of processors p of a given parallel machine and
`the number of elements n will differ in practical applications.
`Here, we will partition the array of n elements into n
`p subar-
`p data elements.
`rays, such that each processor deals with n
`Each subarray of p elements is distributed across the p proc-
`essors as in the generic case. The implementation of (1) with
`n > p, i.e., granularity g
`> 1, is a straightforward generali-
`=
`zation of the generic case, because (1) can be rewritten as:
`np
`p
`ÂÂ
`,
`,
`y
`f x
`x
`"
`(
`(
`(
`1
`1
`m
`j
`,
`,
`i
`i
`+ π
`
`l
`
`
`
`)1
`
`p i
`+
`
`-
`
`
`)1
`m p j
`-
`+
`
`'
`
`,
`,
`,
`1 2 3
`=
`(
`)
`1
`m
`p
`-
`
`,
`
`,
`p
`.
`j
`
`
`
`+
`
`( )
`5
`
`np
`
`=
`
`np
`
`l
`
`
`
`)1
`
`p i
`+
`
`-
`
`=
`
`=
`
`,
`1
`
`$
`
`p
`
`,
`,
`1 2 3
`l
`=
`with
`l
`
`-
`
` compute operations, the
` )2processor performs O np(
`
`
`total amount of communicated data is np elements.
`2) The parallel organization of the evaluation in form of a
`systolic array computation [1], [11], [12], [13], on ma-
`chines that allow for a one-dimensional ring connectiv-
`ity, proceeds with one moving and one fixed array. The
`n elements are distributed across the processing nodes,
`p elements are assigned to a given processor. In
`i.e., n
`each systolic step, the moving array is shifted by one
`position along the processing elements and subse-
`quently, the computations are performed. After p - 1
`“pulse” and “move” operations, all pairs f(xi, xj) have
`been generated. Again, the total amount of communi-
`cated data is np elements.
`In astrophysics, such systolic array computations are
`known as Orrery-algorithms [14], [15], [16]. Various pro-
`cedures which combine features of both methods are
`known in the literature. Common to such approaches is
`that the complexity of the interprocessor communication
`is O(np).
`In this work, we introduce a parallel computing concept
`that, to a certain extent, can be regarded as a generalization of
`systolic array computations with “pulse” and “circular
`movement.” We will point out that a formulation of the
`computational problem in the context of Additive Number
`Theory—leading to an h-range optimization problem [17]—
`defines a new class of algorithms. With the base of the h-
`range problem suitably chosen, one can recover the systolic
`realization of the computational problem. In addition, vari-
`ous other bases can be contrived that allow the complexity of
`the interprocessor communication to diminish. We will pres-
`ent a selection of shortest bases and a “regular” base that
`both reduce the complexity of the interprocessor communi-
`)12
`cation to O np(
`. The new method, in the following denoted
`as “hyper-systolic,” has in common with the standard-
`systolic array computation that, during each step, only one
`array is moving in a regular communication pattern. There-
`fore, it applies equally to both SIMD and MIMD machines,
`just as standard-systolic computation. The distinctive feature
`as compared to the standard-systolic concept, however, lies
`in the fact that storing of shifted arrays is required, similar to
`the replicated data method. But, in contrast to the latter,
`where the storage increases like n ¥ p, a significant reduction
`in interprocessor communication is achieved with only a
`moderate amount of additional storage. To reach the optimal
`hyper-systolic speedup, one needs storage space of size
`)12
`O np(
`. We will demonstrate that, in view of the state-of-the-
`art problem sizes mentioned above, this is not an obstacle for
`today’s parallel supercomputers.
`In Section 2, we present the generic form of the com-
`putational problem to be solved. In Section 3, we review
`the standard-systolic concept and introduce a complexity
`model to set the stage for further comparisons. In order to
`provide a graphical illustration, we will phrase the issue
`in systolic automata models [12], [18]. A suitable generali-
`zation of the latter, which gives us the means to graphi-
`cally represent the hyper-systolic parallel computing
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 98
`
`
`
`LIPPERT ET AL.: HYPER-SYSTOLIC PARALLEL COMPUTING
`
`99
`
`Fig. 1. A systolic automaton cell (large dotted square) is an arrangement of a processing element (open triangle) and a delay element (black
`square). The data element xi resides stationary in the processing element i. In each clock step, the delay element delivers another data element
`xj of the moving array. The function f(xi, xj) is computed and successively added to yi that is resident in the processing element as well. The plot
`shows the data location on the systolic array after the third clock step.
`
`3 SYSTOLIC ALGORITHM
`First, we describe the one-dimensional systolic implemen-
`tation of (3), with n = p, to set the stage for our further con-
`siderations. We use the concept of systolic automaton mod-
`els to illustrate parallel computing structures [12], [18].
`A systolic automaton consists of cells with data transfer
`and processing units. The cells are coupled to each other in
`a uniform next-neighbor wiring pattern, as depicted in Fig. 1.
`The processing elements (PE) are drawn as open triangles.
`They realize functions of equal load between consecutive
`communication events. The data transfer elements are
`drawn as black squares. They are delay elements (DE) that
`represent abstractions for memory locations in which data
`is shifted in and out in regular clock steps, which means the
`clock step of the abstract automaton, rather than a physical
`computer clock step. Data processing and transfer are com-
`pletely pipelined. More precise definitions of systolic arrays
`and algorithms along with many examples and applications
`can be found in two monographs by Petkov [12], [18].
`The graphical structural counterpart of (3) is given in
`Fig. 1. The cells are consecutively arranged in a linear order.
`Each cell is connected with its left and right next neighbors.
`Note that the systolic computation of (3) requires a toroidal
`topology of the linear array i.e., a ring connectivity. At clock
`step 0, a sequence x of p data elements is distributed over p
`PEs. This sequence will stay fixed in the following process.
`Initially, a copy x of the array x is made. The array x is
`shifted and the processing on all cells is performed subse-
`quently, clock step by clock step. Fig. 1 illustrates the state
`of the systolic computation after three clock steps.
`The small black boxes stand for the DEs, the function of
`which is to shift the data element xi in one clock step from
`cell # i to cell # (i + 1). Subsequently, the PEs perform the
`numerical computation of the function f, i.e., they combine
`the elements of the fixed array x with the elements of the
`shifted array x . The result in the ith cell is added to the
`resulting data element yi of the array y such that, after p - 1
`steps, the resulting data elements are completely computed
`and are distributed over all processing elements. In terms of
`the molecular dynamics example, each particle coordinate
`is located together with its respective force value in the
`same processing element.
`
`
`
`
`
`+
`
`
`
`The mapping of the abstract automaton model onto a
`real parallel computer appears straightforward: The PEs are
`identified with the processors of the parallel machine, the
`DEs symbolize registers or memory locations exchanging
`data via the interprocessor communication network. A
`connectivity of the parallel system is required that allows
`the embedding of a logical ring to realize the communica-
`tion pattern of the systolic automaton.
`The explicit systolic realization of (3) is given in Algo-
`rithm 1 for the generic case n = p.
`Algorithm 1. Conventional systolic algorithm.
` foreach processor i = 1 : p Œ systolic ring
`x
`x
`0 =
`i
`i
` for j = 1 : p - 1
`x
`x
`1
`j=
`j
`-
` mod
`1
`i
`i
`p
`( , )f x x
`
`y
`y
`0
`j
`+
`=
`i
`i
`i
`i
` end for
`end foreach
`3.1 Communication Model
`For further evaluation and comparison of complexities be-
`tween systolic and various hyper-systolic methods, we intro-
`duce a communication complexity model.
`The time required for a processor to communicate a
`message of m words to a neighbor processor node is mod-
`eled as tl + mtt, where tl is the start-up time that is assumed
`to be independent of the message length and tt is the data
`transmission time per word.
`For the systolic computation of (3), p - 1 systolic steps are
`performed, where, in the generic case, each processor sends
`exactly one word to its neighbor. Thus, the total number of
`words to send is a number order p2, one word per processor
`in one systolic step, and a total communication time of
`T = (p - 1)(tl + tt)
`
` (6)
`
`is found.2
`2. One can improve on the scaling of the systolic realization by taking into
`account the symmetries. The computations then can be reduced to p(p - 1)/2
`processing operations, but p(p + 1) communication events have to be per-
`formed, since a moving result array has to be shifted p/2 + 1 times. This
`method goes under the name “Half-Orrery” algorithm [15] and originally
`was introduced by Seitz [16].
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 99
`
`
`
`100
`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998
`
`For the case n > p, the array will be partitioned into np parts.
`
`One can organize the computation such that, successively, all
`required computations between all n data elements can be
`constructed, according to (5). Each inner loop for each array
`is carried out as described for the generic case n = p; addi-
`tionally, combinations between rows must be computed. In
`p words to
`each systolic step, a given processor has to send n
`its neighbor. As communication is pipelined, the total
`communication time thus turns out to be
`(
`)1
`
`.
`T
`p
`t
`t
`=
`-
`+
`t
`l
`
`
`
`np
`
`
`
` (7)
`
`We call this reduced matrix “hyper-systolic” matrix . From
`this matrix, it is easy to see that, indeed, all required pairs
`come together within the processors: Let us, e.g., consider
`combinations with element # 1: In the first column, the
`combinations {1, 2}, {1, 4}, {1, 6}, and {1, 10}, in the 16th col-
`umn, {1, 3}, {1, 5}, {1, 9}, and {1, 16}, in the 14th column, we
`have {1, 15}, {1, 14}, and {1, 7}, in the 12th column, {1, 12} and
`{1, 13}, and, in the eighth column, the combinations {1, 11}
`and {1, 8} can be found. As the system is toroidally closed, it
`is evident that all pairing are present. As a result, we need
`only four shifts and five arrays to be stored, to get all pairs
`done. In the following, it will be important that the structure
`presented implies a homogeneous load of all processors in
`each step of the hyper-systolic computation, as was the case
`in the systolic computation.
`We note, further, that in order to realize (3), we cannot
`just add the outcome of the computations to one resulting
`array: For every row xt to be stored, a separate, intermedi-
`ate array yt is required. In a process that involves the in-
`verse sequence of shifts, a moving “collector” array is
`shifted back, while step by step, the results in the arrays yt
`are accumulated. Finally, y contains the desired results.
`Therefore, for this small example, in total, eight shift-
`operations are required. In the standard systolic process, 15
`shifts of the moving array must be carried out, while, for
`Half-Orrery, 17 shifts are to be performed [15].
`4.2 Hyper-Systolic Breviary
`Next, we give a general prescription to implement (3) accord-
`ing to the ideas exposed in the last section. The new algorithm
`is called hyper-systolic, the name referring to the particular
`features of the underlying communicational structure.
`1) For a given array x of length p, k replicas are gener-
`ated by shifting the array x k times by a stride at and
`storing the resulting arrays as xt, 1 £ t £ k. The vari-
`able strides at of the shifts are chosen such that
`a) all pairs of data elements (taken within columns
`of ) occur at least once and the number of equal
`pairings is minimized,
`b) the number of shifts k is minimized, under the con-
`straint that the strides at can be realized efficiently
`by the given parallel computer’s connectivity.
`2) The p(p - 1)/2 results yi, according to (3), are succes-
`sively computed and are added to k arrays yt.
`3) Finally, applying the inverse shift sequence, a collec-
`tor array y is shifted back by the inverse sequence of
`shifts, while the intermediate results in the array yt is
`added to y in step k - t.
`4.3 Graphical Representation
`The concept of traditional systolic automata models can be
`extended to the hyper-systolic parallel computing structure.
`For the graphical representation of the latter, we use the
`objects introduced in Section 3. Fig. 2 shows the ith hyper-
`systolic cell unit. The DEs again represent input and output.
`Additionally, we have drawn open rectangles to symbolize
`the memory locations for the stored data elements xt and
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 100
`
`4 HYPER-SYSTOLIC ALGORITHM
`The main focus of this work is on the reduction of the total
`time expense for the interprocessor communication in the
`parallel computation of (3). We will show for the generic
`case, (3), that for the communicational part a complexity of
`)32
`O p(
` can be achieved by reorganization of the data-
`movement. Going to the case n > p, the conventional com-
`plexity can be reduced from O(np) to O np(
`)12
` for the hyper-
`systolic algorithm. The amount of processing operations is
`not altered compared to conventional methods.
`4.1 Motivation
`Consider the matrix $, called systolic matrix, which explic-
`itly shows the data elements of the shift-array xt, (4), for the
`clock steps t = 0, 1, 2, …, p - 1, as delivered successively in
`the systolic implementation, Section 3:
`1 2 3
`p
`
`2 3 4
`1
`p
`3 4 5
`2
`1
`$ = ◊
`◊
`◊
`◊
`◊
`◊
`1
`.
`p
`1
`2
`1 2
`p
`p
`p
`-
`-
`-
`
`
`We observe in (8) that all combinations of two different
`elements within the columns of $ actually occur p times, if
`each shifted array xt is stored for t = 0, 1, 2, …, p - 1 and
`combinations between all the rows are allowed. Note that
`storing the shifted arrays is not required in the systolic con-
`cept with only two arrays, one fixed and one moving array.
`Here, as time proceeds, an increasing number of “resident”
`arrays is stored, with only one array in movement.
`The matrix $ shows that a p-fold redundancy of pairings
`is encountered if all shifted arrays are stored. It appears
`natural to reduce this redundancy of pairings and, thus, to
`save on interprocessor communication by storing some of
`the arrays xt only, for a suitably selected subset of time steps t.
`To illustrate the idea, let us consider a simple example
`for p = 16; from matrix $ in (8) we extract only five rows:
`1
`2
`3
`4
`5
`6
`7
`8
`9
`10 11 12 13 14 15 16
` (9)
`
`2
`3
`4
`5
`6
`7
`8
`9
`10 11 12 13 14 15 16
`1
`4
`5
`6
`7
`8
`9
`10 11 12 13 14 15 16
`1
`2
`3
`6
`7
`8
`9
`10 11 12 13 14 15 16
`1
`2
`3
`4
`5
`9 .
`10 11 12 13 14 15 16
`1
`2
`3
`4
`5
`6
`7
`8
`
` (8)
`
`0123
`
`==== =
`
`tttt t
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
` =
`
`
`
`LIPPERT ET AL.: HYPER-SYSTOLIC PARALLEL COMPUTING
`
`101
`
`p
`
`Fig. 2. Processing element (PE) (open triangle), delay element (DE) (black square), and store elements (SE) (open squares), arranged in a hyper-
`systolic automaton cell (large dotted square). The element xi resides stationary. In each clock step t, the DE delivers a new data element from loca-
`- 1),modp + 1] to be stored in the local SE and transmitted further to the next DE. Subsequently, the computation is performed on elements
`tion [(i + at
`yi at
`stored in the SE. The results are added to memory locations [
`]
`. Note that each DE is equipped with r connectors to account for r different
`+
`-
`+
`(
`1)mod
`1
`strides at. Second, the results yi at
`+ are step by step added to the array y that is shifted by the inverse sequence of strides.
`1]
`
`+
`
`[
`
`-
`
`1mod
`
`p
`
`Fig. 3. Three abstract automaton cells as part of a hyper-systolic array with r= 2 The data location on a part of the hyper-systolic array is shown
`after the second systolic cycle.
`
`yt, with 1 £ t £ k. The DEs now have one input and r output
`connectors, one separate connector for each required shift
`width at.
`Again, the cells are arranged as a uniform grid; in addi-
`tion to the next-neighbor connectivity, the r new nonlocal
`regular connections between the processors are indicated in
`Fig. 3. They correspond to the shifts of data elements by
`strides at. According to the number r of different strides at, r
`direct connections would be optimal. From the point of
`view of the original one-dimensional systolic array struc-
`ture—due to these additional communication connec-
`tions—the (one-dimensional) space of processing elements
`has acquired topologically nontrivial interconnections. Fast
`regular interprocessor data transfer to distant cells can be
`achieved in one clock step via these interconnections that
`are used as shortcuts. Thus, the characterization of our new
`algorithm as hyper-systolic. We emphasize that hyper-
`systolic data movement and storing extends the standard-
`
`systolic concept. From the uniform structure of the hyper-
`systolic automaton model and the regularity of the inter-
`processor communication, it is evident that the PEs can per-
`form functions of equal load in each hyper-systolic step.
`The amount of processing operations that is equal for each
`processor increases with an increasing number t of stored
`1
`.
`
`arrays ,y ¢
`t
`t
`£ ¢ £
`t
`Fig. 3 represents the graphical structural counterpart of
`(3) for the hyper-systolic algorithm. At clock step t = 0, a
`sequence x of p data elements is distributed over p proc-
`essing elements. In every clock step t = 0, 1, 2, …, k, a copy
`xt of the array xt-1 is generated, and, in the next clock step,
`the array is shifted by a stride at. Subsequently, the new
`array xt is stored in the SEs. Then, the processing can be
`done by the PEs involving ever more of the stored arrays
`for increasing t. Fig. 3 shows the situation after two clock
`steps. Note, however, that only one array is moving just as
`in standard-systolic computation.
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 101
`
`
`
`102
`
`IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998
`
`The formulation in terms of an h-range problem appears
`to provide quite a general framework covering a whole
`class of algorithms: Any chosen base A induces its specific
`new algorithm; the realization of Ak with k = p - 1 brings us
`back to the standard-systolic algorithm with the base
` (14)
`
`0{ ,
`,
`,
`},
`1
`
`1,
`
`,
`1,
`A
`a
`a
`a
`i
`p
`=
`=
`" =
`-
`1
`1
`1
`p
`p
`i
`the realization
`p
`1
`0
`},
`,
`,
`{ ,
`,, (15)
`a
`a
`a
`A
`=
`=
`2
`1
`i
`leads to the Half-Orrery algorithm.
`A lower bound on the minimal k, with optimal complex-
`ity, can be derived by the following consideration:
`The total number of combinations required is p(p - 1)/2.
`Let the matrix of (10) be realized by k shifts, i.e., by k + 1
`arrays. The number of possible combinations between the
`12
`elements of a given column of follows as k +
`. Thus,
`
`the following inequality holds:
`(
`)
`1
`p p
`12
`k
`-
`+
`£
`2
`
`
` (16)
`
`
`
`p
`
`,
`
`
`
`therefore,
`
`34
`
`12
`
` (17)
`.
`p
`k
`≥ -
`+
`-
`with k integer. Obviously, a sequence Ak with less than kmin
`elements cannot solve the h-range problem presented above.
`In other words, the complexity for the inter-processor com-
`munication of any hyper-systolic algorithm to solve (3) is
`bounded by p p from below for n = p and by n p for n > p.
`4.5 Shortest Bases
`In absence of analytical solutions, for various numbers of
`processors p ranging from four to 64, we could construct
`shortest bases by straightforward computation according to
`the above minimization prescription. We tried to prefer bases
`with elements being a power of two, as these strides might
`lead to fastest interprocessor communication on most parallel
`machine’s networks. Computing these bases, we assumed
`that shifts with different stride require equal communication
`time. The general case, treating different communication
`times for different strides, will be discussed elsewhere [22].
`Table 1 collects our selection of extremal bases. For large
`p, it becomes exponentially difficult to find extremal bases.
`For illustration of the hyper-systolic shift pattern, let us
`present the hyper-systolic matrix , as constructed by the
`extremal base for p = 32, see Fig. 4.
`The realization of the hyper-systolic computation is shown
`in Algorithm 2: We emphasize that, in general, one cannot just
`combine all the rows; rather, we must take care for the fact that
`a redundancy can occur. Since the minimality condition, (17),
`, more than p2
`is unlikely to be fulfilled in general, i.e., k
`p≥
`elements can be computed. Therefore, we define a
`“multiplicity table” m(i, j), a k ¥ k-matrix (since has k rows),
`to ensure any pair (i, j) to be computed once and only once.
`The matrix element m(i, j) = 1 if pair (i, j) was not yet treated
`and 0 if (i, j) or (j, i) have been computed before. As we only
`
`-
`
`p
`2
`
`
`
`-
`
`
`
`
`
`p
`2
`
`i
`" =
`
`1
`,
`
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1056, p. 102
`
`In systolic cycle t, the results as computed in the ith cell
`are added to the corresponding elements yt¢, 1 £ t¢ £ t.
`Finally, when p(p - 1)/2 processing operations have been
`performed, the collector array y is shifted back in k steps by
`strides inverse to the above sequence, while the intermedi-
`ate results stored on the arrays yt are step-wise added to y.
`4.4 Optimization of Hyper-Systolic Computing
`Our task is to construct a sequence of strides Ak = {a0 = 0, a1,
`a2, a3, …, ak} that—in the course of the hyper-systolic proc-
`ess—provides us with k + 1 sequences xt, t = 0, 1, 2, …, k,
`represented as rectangular matrix . Note that we have
`introduced the zero-element.
`x
`x
`x
`x
`
`1
`2
`3
`a
`a
`p a
`+
`+
`x
`x
`x
`x
`0
`0
`0
`1
`2
`3
`a
`a
`p a
`+
`+
`1
`1
`1
`◊
`◊
`◊
`◊
`◊
`◊
`.
`10
`)
`(
`x
`x
`x
`x
`
`
`1
`2
`3
`a
`a
`a
`p a
`+
`+
`In (10), all indices are understood as [(i + at - 1), modp + 1].
`It is required that all combinations of elements from one to
`p can be constructed within the columns of . k should be
`chosen as small as possible.
`This recipe can be cast in terms of Additive Number
`Theory.
`.
`Let I be the set of integers m
`
`
`,{ ,0 1 2
`,
`
`}1
`,
`p
`pp
`Œ
`- Œ
`=
`0
`
`
`Find the ordered set A
`,
`(
`,
` of
`,
`0
`,
`,
`)
`a
`a a
`a
`a
`1
`k
`Œ +
`=
`=
`0
`3
`0
`1
`2
`k
`k
`k + 1 integers, with k minimal, where each m Œ I, (0 £ m £
`p - 1), can be represented at least once as the ordered par-
`tial sum
` (11)
`m a
`a
`a
`=
`i
`i
`i
`
`
`
`
`
`k
`
`
`
`
`
`+
`
`+
`
`a
`0
`a
`1
`
`+
`
`+
`
`k
`
`+
`
`k
`
`
`
`+
`
`k
`
`
`
` =
`
`+
`
`+
`1
`
`+
`
`+
`
`+
`
`j
`
`or as
`
`m p
`=
`
`-
`
`(
`
`a
`i
`
`+
`
`a
`i
`
`+
`1
`
`+
`
`+
`
`a
`i
`
`+
`
`j
`
`)
`
`,
`
` (12)
`
`with
`
` (13)
` .
`0
`,
`,
`j
`i
`k
`i
`j
`£ + £
`Œ
`0
`This optimization problem, with the extremal base Ak as so-
`lution, is reminiscent of the “postage stamp problem,” a fa-
`mous problem in Additive Number Theory [17], [20]. In con-
`trast to the original postage stamp problem, here we deal
`with sequences Ak instead of sets and with ordered partial
`sums. In these terms, we are looking for the h-range p(h, Ah)
`of the extremal basis Ah with only ordered partial sums of the
`elements of Ah allowed, as given by (11), (12), and (13).
`Unfortunately, no analytical solution to this optimization
`task exists. Therefore, we have to recourse to numerical
`methods to find optimal (with least overall communica-
`tional expense) hyper-systolic bases. In this paper, we em-
`ploy direct computation to find these bases. Furthermore,
`we will propose an approximate solution (a so-called regu-
`lar base) that carries optimal asymptotic scaling properties
`and is well-suited for the hyper-systolic realization on par-
`allel SIMD and MIMD machines. Elsewhere, we will dis-
`cuss Simulated Annealing methods [22].
`
`
`
`LIPPERT ET AL.: HYPER-SYSTOLIC PARALLEL COMPUTING
`
`103
`
`32123711
`
`19
`
`IK??AIIELAIDEBJI
`
`
`
`
`
`1
`
`mod
`
`1
`
`+
`
`p
`
`?FKJ=JE
`
`il
`
`il
`
`
`
`
`
`il
`
`
`
`il
`
`
`
`
`
`y
`1
`j
`-
`(
`i a
`+ -
`j
`
`
`
`+
`
`p
`
`>=?IDEBJ
`j
`+
`i
`1
`1
`)mod
`
`+
`
`p
`
`y
`
`
`
`TABLE 1
`SHORTEST BASES FOR VARIOUS MACHINE SIZES
`p
`k Ak
`k Ak
`p
`2
`1 1
`5
`2
`1 1
`4
`2
`1 2
`7
`2
`1 2
`6
`3
`1 1 2
`9
`3
`1 1 2
`8
`3
`1 2 2
`11
`3
`1 2 2
`10
`3
`1 2 4
`13
`3
`2 1 4
`12
`4
`1 1 1 4
`15
`4
`1 1 1 4
`14
`4
`1 2 2 4
`17
`4
`1 1 2 8
`16
`4
`2 1 8 4
`19
`4
`3 6 8 4
`18
`5
`1 1 1 4 4
`21
`4
`3 10 2 5
`20
`5
`1 1 1 4 4
`23
`5
`1 1 1 4 4
`22
`5
`1 1 1 4 8
`25
`5
`1 1 2 8 8
`24
`5
`1 2 3 4 4
`27
`5
`1 2 2 8 8
`26
`6
`1 1 1 4 4 4
`29
`6
`1 1 1 4 4 4
`28
`6
`1 1 1 4 4 4
`31
`6
`1 1 1 4 4 4
`30
`6
`1 1 1 4 4 8
`36
`6
`3 6 1 8 4 12
`32
`7
`3 1 11 7 16 4 4
`64
`8
`1 1 12 3 10 8 20 4
`48
`10 11 12 13 14 15 16 17 18 19 20
`21 22 23 24 25 26 27 28 29 30 31
`9
`8
`7
`6
`5
`4
`3
`2
`1
`10 11 12 13 14 15 16 17 18 19 20
`9
`21 22 23 24 25 26 27
`28 29 30 31 32
`8
`7
`6
`5
`4
`3
`2
`11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
`10
`28 29 30 31 32
`1
`9
`8
`7
`6
`5
`4
`3
`11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
`28 29 30
`31 32
`1
`2
`10
`9
`8
`7
`6
`5
`4
`10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
`31 32
`1
`2
`3
`4
`5
`6
`9
`8
`12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
`31 32
`1
`2
`3
`4
`5
`6
`7
`8
`9
`10
`20 21 22 23 24 25 26 27 28 29 30 31 32
`1
`2
`3
`4
`5
`6
`7
`8
`9
`10
`11 12 13 14 15 16 17 18
`Fig. 4. Matrix for n= 32 using the shortest base A6 = (1 1 1 4 4 8).
`want to loop over i with j > i to spare operations in Algo-
`rithm 2, we further define a “symmetry table” s(i, j) that
`tells us whether one and the same combination will occur
`twice.
`Algorithm 2. Hyper-systolic algorithm for shortest base.
` foreach processor i = 1 : p Œ systolic ring
`x
`x
`0 =
`i
`i
`for j = 1 : k
`x
`x
`1
`j
`j
`-
`=
`i
`i a
`+ -
`j
` end for
` for j = 0 : k
` for l = j + 1 : k
` if m(j, l) == true
`
`( , )f x x
`y
`y
`j
`j
`j
`=
`+
`i
`i
`i
`