United States Patent [19]
Chen et al.

[54] METHOD AND APPARATUS FOR A UNIFIED PARALLEL PROCESSING ARCHITECTURE

[75] Inventors: Steve S. Chen, Eau Claire; Douglas R. Beard, Eleva; George A. Spix, Eau Claire; Edward C. Priest, Eau Claire; John M. Wastlick, Eau Claire; James M. VanDyke, Eau Claire, all of Wis.

[73] Assignee: Cray Research, Inc., Chippewa Falls, Wis.

[21] Appl. No.: 912,964

[22] Filed: Jul. 10, 1992

[51] Int. Cl.6 ................................ G06F 3/60
[52] U.S. Cl. ..................... 395/800; 395/200; 395/425; 364/DIG. 1; 364/229; 364/243
[58] Field of Search ................ 395/200, 800, 425
`
[56] References Cited

U.S. PATENT DOCUMENTS

4,130,865 12/1978 Heart et al. ..................... 395/200
4,365,292 12/1982 Barnes et al. .................... 395/800
4,400,768 8/1983 Tomlinson ....................... 395/800
4,445,171 4/1984 Neches .......................... 395/325
4,636,942 1/1987 Chen et al. ..................... 395/725
4,707,781 11/1987 Sullivan et al. ................. 395/425
4,720,780 1/1988 Dolecek ......................... 395/800
4,745,545 5/1988 Schiffleger ..................... 395/325
4,754,398 6/1988 Pribnow ......................... 395/200
4,827,403 5/1989 Steele, Jr. et al. .............. 395/800
4,834,483 5/1989 Arthurs et al. .................. 385/46
4,873,626 10/1989 Gifford ......................... 395/325
4,891,751 1/1990 Call et al. ..................... 395/800
4,901,230 2/1990 Chen et al. ..................... 395/325
5,055,999 10/1991 Frank et al. .................... 395/425
5,056,000 10/1991 Chang ........................... 395/325
5,072,371 12/1991 Benner et al. ................... 395/200
5,081,575 1/1992 Hiller et al. ................... 395/325
5,113,523 5/1992 Colley et al. ................... 395/800
5,165,038 11/1992 Beard et al. .................... 395/800
5,179,702 1/1993 Spix et al. ..................... 395/650
5,197,130 3/1993 Chen et al. ..................... 395/325
5,208,914 5/1993 Wilson et al. ................... 395/275
`
US005428803A
[11] Patent Number: 5,428,803
[45] Date of Patent: Jun. 27, 1995
`
OTHER PUBLICATIONS
Fast Interrupt Mechanism for a Multiprocessor System, Ser. No. 07/536,199, filed on Jun. 11, 1990.

Primary Examiner-Alyssa H. Bowler
Assistant Examiner-G. Donaghue
Attorney, Agent, or Firm-Schwegman, Lundberg & Woessner
`
[57] ABSTRACT

A unified parallel processing architecture connects together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. Multiple processors are grouped together into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster; however, only some of the clusters are adjacently interconnected. Clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in the floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters. The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory and distributed shared memory, into a unified parallel processing architecture.

24 Claims, 12 Drawing Sheets
`
`
`
`
Almasi, G. and Gottlieb, A., Highly Parallel Computing, Benjamin Cummings (1989), Chpt. 1, "Overview," pp. 2-29; Chpt. 8, "Interconnection Networks," pp. 278-299; Chpt. 10, "MIMD Parallel Architectures," pp. 354-475.
Gajski, D., Milutinovic, V., Siegel, H., and Furht, B., Computer Architecture, The Computer Society of the IEEE (1987), Chpt. 2, "Topics in Parallel Processing and Multiprocessing," pp. 81-171.
Hennessy, J. and Patterson, D., Computer Architecture: A Quantitative Approach, Morgan Kaufmann (1990), Chpt. 10, "Future Directions," pp. 570-592.
Hwang, K. and DeGroot, D., Parallel Processing for Supercomputers and Artificial Intelligence, McGraw-Hill Publ. (1989), Chpt. 2, pp. 31-67.
Kain, R., Computer Architecture, Prentice Hall (1989), vol. 1, Chpt. 3, "Shared Resource Synchronization," pp. 178-250.
ETA10 System Overview: EOS, Tech. Note, Publ. 1006, ETA Systems, Sep. 30, 1988.
Clementi, E., Logan, D., Saarinen, J., "ICAP/3090: Parallel Processing For Large Scale Scientific and Engineering Problems," IBM Systems Journal, vol. 27, No. 4 (1988), pp. 475-509.
Pfister, G., "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture," Int'l Conf. on Parallel Processing, Aug. 1985, pp. 764-771.
Murakami, K., Fukuda, A., Sueyoshi, T. and Tomita, S., "An Overview of the Kyushu University Reconfigurable Parallel Processor," Aug. 1988, pp. 130-137.
Kuck, D., Davidson, E., Lawrie, D. and Sameh, A., "Parallel Supercomputing Today and the Cedar Approach," Science, vol. 231, Feb. 1986, pp. 967-974.
Goodman, J. and Woest, P., "The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor," Proc. of the 1988 Int'l Conf. on Parallel Processing, IEEE, Feb. 1988, pp. 422-431.
`
`
`
`
[Sheet 1 of 12 — FIG. 1a: PRIOR ART DISTRIBUTED MEMORY MODEL (processors P, each with a local memory LM, joined by a message passing connection; elements 10, 12, 14); FIG. 1b: PRIOR ART SHARED MEMORY MODEL (processors P with a direct connection 22 to a common shared memory 20)]
`
`
`
[Sheet 2 of 12 — FIG. 2a: PRIOR ART EXTENDED SHARED MEMORY SWITCHING NETWORK MODEL (processors P with local memories LM joined by a network switching connection; elements 30, 32, 34); FIG. 2b: PRIOR ART EXTENDED SHARED MEMORY HIERARCHICAL RING MODEL (segment and domain connections; elements 40, 42, 44)]
`
`
`
[Sheet 3 of 12 — FIG. 2c: PRIOR ART EXTENDED SHARED MEMORY RECONFIGURABLE MODEL (processors P with paired global/local memories GM/LM and a network control connection; elements 50, 52, 54); FIG. 2d: PRIOR ART EXTENDED SHARED MEMORY CLUSTER/GLOBAL MEMORY INTERCONNECT (processors P connected to cluster shared memories, which connect to a global memory; elements 60, 62, 64, 66)]
`
`
`
[Sheet 4 of 12 — FIG. 2e: PRIOR ART EXTENDED SHARED MEMORY CLUSTER MODEL, DIRECT INTERCONNECT (processors P connected through arbitration nodes to cluster shared memories, with the clusters directly interconnected; elements 70, 72, 74, 76)]
`
`
`
[Sheet 5 of 12 — FIG. 3a: graph of parallel processing efficiency (0 to 1) vs. memory relative latency (1, 0.1, 0.01, 0.001); FIG. 3b: graph of parallel processing efficiency (0 to 1) vs. memory relative locality (1, 0.1, 0.01, 0.001)]
`
`
`
[Sheet 6 of 12 — FIG. 4: simplified block diagram of the unified parallel processing architecture (clusters 100c, 100d; connection 102d)]
`
`
`
[Sheet 7 of 12 — FIG. 5a: cluster connection topology]
`
`
`
[Sheet 8 of 12 — FIG. 5c: cluster connection topology (element 100)]
`
`
`
[Sheet 9 of 12 — FIGS. 5b and 5d: cluster connection topologies (elements 100, 110, H)]
`
`
`
[Sheet 10 of 12 — FIG. 6a: detail of the single dimension ring-type cluster interconnection (cluster 100a with elements 102a, 106a, 120a, 122a, 130a; word-by-word and block-by-block transfer paths 140)]
`
`
`
[Sheet 11 of 12 — FIG. 6b: continuation of the ring-type cluster interconnection (clusters 100b, 100c with elements 104b, 104c, 106b, 106c, 120c, 122b, 122c, 130b, 130c, 132b, 134b; transfer paths 140)]
`
`
`
[Sheet 12 of 12 — FIG. 7: schematic of the adjacent cluster connection and extendible clock mechanism (Xmit Latch, Xmit Clock, cluster boundary with Disable, Receive Latch, Buffer RAM with Write Pointer and Read Pointer, Clock Gen, Common Clock Oscillator, Freq Locked Local Clock; reference numerals 132, 134, 136, 138, 141, 142, 144, 150, 152, 154, 156, 158)]
`
`
`
METHOD AND APPARATUS FOR A UNIFIED PARALLEL PROCESSING ARCHITECTURE

RELATED APPLICATIONS
This application is related to the following co-pending applications, all of which are assigned to the assignee of the present invention, the disclosures of all of which are hereby incorporated by reference in the present application:
CLUSTER ARCHITECTURE FOR A HIGHLY PARALLEL SCALAR/VECTOR MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,197,130, issued Mar. 23, 1993;
METHOD AND APPARATUS FOR NON-SEQUENTIAL RESOURCE ACCESS, U.S. Pat. No. 5,208,914, issued on May 4, 1993;
GLOBAL REGISTERS FOR A MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,165,038, issued Nov. 17, 1992;
FAST INTERRUPT MECHANISM FOR A MULTIPROCESSOR SYSTEM, Ser. No. 07/536,199, filed on Jun. 11, 1990; and
INTEGRATED SOFTWARE ARCHITECTURE FOR A HIGHLY PARALLEL MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,179,702, issued Jan. 12, 1993.

TECHNICAL FIELD
The present invention relates generally to parallel processing computer systems for performing multiple-instruction-multiple-data (MIMD) parallel processing. More particularly, the present invention relates to a method and apparatus for a unified parallel processing architecture for high performance MIMD multiprocessors that organizes the multiprocessors into four or more physically separable clusters, only some of which are adjacently interconnected, and provides for a shared memory model to be used with programs executed in the floating shared memory space that is defined across any adjacently interconnected clusters, and a distributed memory model to be used with any programs executed across non-adjacently interconnected clusters.

PRIOR ART
The field of parallel computing has received increasing attention in recent years as computer designers seek to increase the effective processing capabilities of computer processing systems. Most of the current work relating to high performance parallel processing has focused on multiple-instruction-multiple-data (MIMD) parallel processing systems. Current architectures for MIMD parallel processing systems use one of two different memory models for sharing data within a program that is to be executed in parallel: distributed memory vs. shared memory. In a distributed memory model, data is stored in the private local memory of each processor and is communicated among processors by some type of message passing scheme. In a shared memory model, data is stored in a common shared memory that is equally accessible to all processors in the parallel processing system. An excellent summary of current MIMD parallel architectures is set forth in Almasi and Gottlieb, Highly Parallel Computing, (1989) Chpt. 10, pgs. 354-475.
From the programmer's perspective, the most attractive parallel processing system is one that uses a shared memory model having a globally shared physical address space.
A shared address space leaves the choice of the parallel programming model up to the programmer and provides a more flexible problem solution space. In addition, most present software programs are written for a shared memory model, as opposed to a distributed or private memory model. To effectively port an existing software program from a shared memory model to a private memory model can require significant reprogramming effort. One of the primary reasons why it is easier to program a shared memory model is that a distributed memory model necessarily requires that the processes and data which comprise a parallel programming problem be partitioned into separate tasks that are each small enough to fit into the private local memory of a processor and at the same time are large enough to recover the overhead required to establish a parallel task in a message passing environment. In a shared memory model, there are fewer restrictions on the size of the parallel tasks and the overhead required to establish a parallel task can be significantly less than the overhead associated with the more complicated message passing scheme required by a distributed memory model. Thus, it would be desirable to have a parallel processing system which could provide a large and flexible problem solution space for parallel processing wherein a large number of processors all have equal and symmetric access to any and all memory locations in a common shared memory.
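As a rough illustration of the contrast drawn above, the following sketch (not from the patent; the worker count and array size are hypothetical) shows how a shared memory model lets every parallel task index one common array directly, with no partitioning of the data into per-processor messages:

    /* Shared memory model: workers read a common array in place.
     * Illustrative only; compile with: cc sum.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4
    #define N 1024

    static double data[N];           /* the common shared memory   */
    static double partial[NWORKERS]; /* one result slot per worker */

    static void *worker(void *arg) {
        long id = (long)arg;
        double sum = 0.0;
        /* Any worker may touch any element directly. */
        for (long i = id; i < N; i += NWORKERS)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t t[NWORKERS];
        for (long i = 0; i < N; i++) data[i] = 1.0;
        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        double total = 0.0;
        for (long i = 0; i < NWORKERS; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];
        }
        printf("total = %g\n", total); /* prints 1024 */
        return 0;
    }

Under a distributed memory model the same array would first have to be partitioned into per-processor pieces, and every non-local element fetched by an explicit message, which is precisely the reprogramming burden described above.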
Unfortunately, the hardware required to implement such a true shared memory model for a parallel processing system having a large number of processors can be expensive and complicated. While it may be practically possible to directly wire each processor to each bank of memory to allow for equal and symmetric access to the shared memory in a minimally parallel processing system having two, four or eight processors, it is not practically possible to directly wire each processor to each bank of memory in parallel processing systems when the number of processors is greater than about eight. As a result, several types of extended shared memory systems have been proposed when the number of processors in the parallel processing system is greater than about eight.
Currently, most extended shared memory systems use a variety of multi-stage interconnection techniques to connect a local memory that is closely associated with each processor with the local memories of all other processors such that all of the local memories together comprise the globally shared address space of the system, e.g., the BBN Butterfly+, the NYU Ultracomputer and the Thinking Machines, Inc. CM-1. Some extended shared memory systems combine a local memory for each processor with a global memory that operates as a cache write-through or duplicate address space, e.g., the Carnegie-Mellon Cm*. Still other systems provide a reconfigurable memory model where each processor has a memory that can be accessed either locally or globally, and where the boundary between local and global memory can be dynamically adjusted, e.g., the IBM RP3 and U.S. Pat. No. 5,056,000.
The problem with such extended shared memory systems is that memory access is not equal and symmetric from all processors to all memory locations in the extended shared memory. Instead, each processor may encounter different memory latencies when accessing different memory locations in different parts of the extended shared memory address space. This is especially true when comparing the access of a processor to its local memory with the access of that processor to a remote local memory or to a global memory. In essence, only those accesses which are to the local memory of the processor are equal and symmetric; all other memory accesses are variable and, in some cases, even indeterminate. As a result, these systems perform parallel processing best when nearly all of the memory requests for the related data to be operated on by each parallel task are made to the individual local memory of the processor executing that parallel task. When even a small percentage of the memory requests for the related data required by a parallel task must access memory outside the individual local memory, significant degradation of parallel processing performance can occur.
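The degradation described above can be made concrete with a one-line effective-latency estimate; the cycle counts below are assumptions for the sketch, not figures from the patent:

    /* Effective latency when a fraction f of a parallel task's
     * memory references fall outside its local memory. */
    #include <stdio.h>

    int main(void) {
        const double t_local = 1.0, t_remote = 50.0; /* assumed cycles */
        const double fractions[] = { 0.0, 0.01, 0.05, 0.10 };
        for (int i = 0; i < 4; i++) {
            double f = fractions[i];
            double t_eff = (1.0 - f) * t_local + f * t_remote;
            printf("f = %4.2f -> t_eff = %5.2f cycles (%.2fx local)\n",
                   f, t_eff, t_eff / t_local);
        }
        return 0;
    }

With these assumed numbers, even 5% non-local references already make the average reference almost three and a half times slower than a purely local one, which is why such systems depend on keeping nearly all references local.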
In an effort to increase the portion of an extended shared memory that is not affected by the problems of different latencies in accessing remotely shared memory, at least three different parallel processing systems have incorporated the concept of grouping the shared memory for a multiple number of processors together such that all of the processors in a group have similar access to a shared memory for that group. These groups of processors are then all connected to each other to provide the total extended shared memory space for the parallel processing system.
In the Kendall Square Research KSR-1 supercomputer, the local memories of a group of processors are connected together via a ring bus interconnection to form one of a plurality of segments. The various segments are then connected to each other by a pyramidal hierarchy of "higher" segments, with each higher-level segment having fewer segments than the preceding lower-level segment. Each of the "higher" segments of the domains uses a plurality of transfer units that are also arranged into a ring bus interconnection to transfer memory requests on a distributive hierarchical basis. Although there is a more uniform memory latency among requests to the local memories of other processors in that segment for the processors in the KSR-1 supercomputer, there are still significant differences in memory latencies between a local memory access, a memory access within that segment, and a memory access that is outside of that segment. As a result, it is still necessary to provide for some memory hierarchy optimization of memory requests for each parallel task in order to have a computer program run most efficiently on the KSR-1 supercomputer.
In the University of Illinois Cedar supercomputer, each cluster has eight processors linked by an 8 X 8 crossbar switch to a 4-way interleaved write-back cache, and on through to a local cluster memory having 32 Mbytes per cluster. Eight clusters are then linked by another 8 X 8 crossbar switch to a global memory having 8 Mbytes per processor. Transfers from global memory to a cluster memory are accomplished by a block transfer of 32 byte blocks from the global memory through the cache and on to the cluster memory. Each processor may also directly access a portion of the global memory on a byte-by-byte basis. In the case of the Cedar supercomputer, inter-cluster communication is part of a memory-hierarchy model and can be accomplished either by having processors in both clusters write to a common section of the global shared memory, or by a block transfer of the cluster memory, through the global memory, to the other cluster-shared memory. As with the KSR-1 supercomputer, the use of memory hierarchy optimization is a prerequisite to successfully programming a computer program to run most efficiently on the Cedar supercomputer.
In the original supercomputer developed by the assignee of the present invention that is the subject of the previously identified patent application entitled CLUSTER ARCHITECTURE FOR A HIGHLY PARALLEL SCALAR/VECTOR MULTIPROCESSOR, a four cluster parallel processing supercomputer is disclosed. Each cluster includes between 4 and 256 high performance scalar/vector processors which are connected via a unique arbitration node network to a set of interleaved sections of memory that form the cluster shared memory. The processors in each cluster are also directly connected to the cluster shared memory of all three of the other clusters by an additional port in the arbitration node network. Using the cluster architecture of the original supercomputer developed by the assignee of the present invention, a single common address space is defined across the cluster memories of all four clusters, with the only difference between intra-cluster memory requests and inter-cluster memory requests being the different memory latency primarily due to the physical distance which separates the clusters. Consequently, it is possible to provide for parallel processing of programs throughout the entire address space of the four clusters without the need for a memory hierarchy optimization.
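A minimal sketch of what such a single common address space implies, assuming a hypothetical encoding in which the high-order bits of an address select the cluster (the patent does not specify any particular encoding):

    /* Decode a flat address into (cluster, offset); whether the
     * reference is intra- or inter-cluster changes only its latency,
     * not how it is addressed. */
    #include <stdint.h>
    #include <stdio.h>

    #define CLUSTER_BITS 2   /* four clusters                    */
    #define OFFSET_BITS  30  /* hypothetical per-cluster address */

    typedef struct {
        unsigned cluster;    /* which cluster shared memory */
        uint32_t offset;     /* word offset within it        */
    } decoded_addr;

    static decoded_addr decode(uint32_t addr) {
        decoded_addr d;
        d.cluster = addr >> OFFSET_BITS;
        d.offset  = addr & ((1u << OFFSET_BITS) - 1u);
        return d;
    }

    int main(void) {
        decoded_addr d = decode((2u << OFFSET_BITS) | 0x1234u);
        printf("cluster %u, offset 0x%lx\n",
               d.cluster, (unsigned long)d.offset);
        return 0; /* prints: cluster 2, offset 0x1234 */
    }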
The problem with these cluster-type parallel processing architectures is that all of the clusters must be directly connected together as part of the same single shared address space. Unfortunately, the number of clusters that can be directly interconnected is limited by the cost and complexity required for such interconnections. In the case of the Cedar supercomputer, the complexity and costs of interconnecting additional clusters increase exponentially as more memory is required for the global memory with the addition of each processor, and as larger and larger crossbar switches are required to directly interconnect each cluster to the global memory. The original supercomputer developed by the assignee of the present invention does not require ever increasing amounts of global memory as the number of processors increases; however, the physical limitations imposed by directly interconnecting each cluster to every other cluster in such a way so as to not significantly increase the memory latency for inter-cluster memory requests make the problem of directly interconnecting a large number of clusters very difficult. An additional physical limitation that restricts the number of clusters which can be interconnected in either of these parallel processing supercomputers is that all of the clusters in either of these cluster-type supercomputers must be physically close enough to one another such that the clock signals for all of the clusters can be synchronized, or suffer the corresponding performance penalties associated with asynchronous communication schemes.
Although cluster-type architectures offer a partial solution to the problem of varying memory latencies, cluster-type architectures are themselves limited by the physical problems associated with directly connecting together all of the clusters in order to provide a single common address space for the parallel processing system. Thus, while cluster-type parallel processing architectures, such as the original supercomputer developed by the assignee of the present invention, represent a significant improvement in the design and architecture of high performance parallel processors, it would be advantageous to provide for a unified parallel processing architecture that can allow an effectively unlimited number of clusters of multiprocessors to be interconnected together, and that can overcome the present physical limitations imposed on interconnecting clustered multiprocessors.
`
SUMMARY OF THE INVENTION
The present invention provides a unified parallel processing architecture for connecting together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. The present invention organizes multiple processors into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster. Unlike present extended shared memory cluster-type architectures, only some of the clusters are adjacently interconnected in the present invention. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in a floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters.
In the present invention, clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. By adjacently connecting together only those clusters which can satisfy these conditions, the floating shared memory created by the present invention does not have to sacrifice performance for the increased interconnection necessary to increase the size of the problem solution space provided by an extended shared memory model where every processor can access the single shared address space of the floating shared memory. The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory and distributed shared memory, into a unified parallel processing architecture. Thus, the present invention accommodates parallel processing performance first and flexibility of the interconnection of processors and memory second, with the tradeoff between these two considerations balanced at the point at which the performance of inter-cluster memory accesses starts to significantly degrade.
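The interconnection rule just described can be pictured as a simple predicate; the numeric thresholds below are illustrative assumptions standing in for the knees of the efficiency curves of FIGS. 3a and 3b, which the patent characterizes qualitatively rather than numerically:

    /* Decide whether two clusters may be adjacently interconnected,
     * i.e., whether they can join the same floating shared memory. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool can_interconnect_adjacently(double relative_latency,
                                            double relative_locality) {
        /* relative_latency:  inter-cluster latency as a multiple of
         *                    an intra-cluster reference              */
        /* relative_locality: fraction of references expected to stay
         *                    within the requesting cluster           */
        const double MAX_RELATIVE_LATENCY  = 10.0; /* assumed */
        const double MIN_RELATIVE_LOCALITY = 0.90; /* assumed */
        return relative_latency <= MAX_RELATIVE_LATENCY &&
               relative_locality >= MIN_RELATIVE_LOCALITY;
    }

    int main(void) {
        printf("%d\n", can_interconnect_adjacently(4.0, 0.95));  /* 1 */
        printf("%d\n", can_interconnect_adjacently(40.0, 0.95)); /* 0 */
        return 0;
    }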
The four or more clusters of multiprocessors may be adjacently interconnected using a variety of hardware interconnection topologies. In the preferred embodiment, each of the clusters is a physically separable unit that includes two or more processors, and one or more input/output ports, all of which are symmetrically connected by a node switching mechanism to a homogeneous cluster shared memory. For each of the clusters, two or more external cluster connections provide a word-by-word communication path for adjacently connecting the processors in any one cluster to the cluster shared memory of two or more adjacently interconnected clusters. A distributed memory communication mechanism is also used for providing a message passing communication among any non-adjacently interconnected clusters so as to define an extendible address space encompassing the cluster shared memory of all clusters.
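The resulting routing decision can be sketched as follows; the identifiers are hypothetical, and the ring adjacency shown is only one of the topologies illustrated in FIGS. 5a through 5d:

    /* Each cluster records which clusters it is adjacently
     * interconnected to; adjacent references use the word-by-word
     * external connection, all others fall back to message passing. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NCLUSTERS 4

    static const bool adjacent[NCLUSTERS][NCLUSTERS] = {
        /* a ring: each cluster adjacent to its two neighbors */
        {0,1,0,1}, {1,0,1,0}, {0,1,0,1}, {1,0,1,0},
    };

    static const char *path(int from, int to) {
        if (from == to)         return "local cluster shared memory";
        if (adjacent[from][to]) return "word-by-word external connection";
        return "distributed memory message passing";
    }

    int main(void) {
        printf("0 -> 1: %s\n", path(0, 1)); /* adjacent     */
        printf("0 -> 2: %s\n", path(0, 2)); /* non-adjacent */
        return 0;
    }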
To further enhance the parallel processing capabilities of the preferred embodiment of the present invention, each multiprocessor cluster is provided with an extendible clock mechanism and an extendible control mechanism to coordinate parallel processing among clusters. The extendible clock mechanism solves the problem of differing phases and duty cycles in clock signals among physically separable components of the parallel processing system. The extendible control mechanism provides control mechanisms that can be used by a distributively-based operating system to coordinate and communicate among processes executing in parallel in the computer processing system.
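FIG. 7 implements the clock mechanism in hardware; purely as a software analogy (all names and sizes here are assumptions of the sketch), the boundary buffer behaves like a small ring buffer whose write side is advanced by the transmitting cluster's clock and whose read side is advanced by the frequency-locked local clock, absorbing the phase difference between the two:

    /* Software model of a cluster-boundary buffer in the spirit
     * of FIG. 7; the real mechanism is circuitry, not code. */
    #include <stdint.h>
    #include <stdio.h>

    #define BUF_WORDS 8

    typedef struct {
        uint64_t ram[BUF_WORDS]; /* buffer RAM between the latches */
        unsigned write_ptr;      /* stepped by the transmit clock  */
        unsigned read_ptr;       /* stepped by the local clock     */
    } boundary_buffer;

    static void xmit(boundary_buffer *b, uint64_t word) {
        b->ram[b->write_ptr] = word;             /* transmit latch */
        b->write_ptr = (b->write_ptr + 1) % BUF_WORDS;
    }

    static uint64_t receive(boundary_buffer *b) {
        uint64_t word = b->ram[b->read_ptr];     /* receive latch  */
        b->read_ptr = (b->read_ptr + 1) % BUF_WORDS;
        return word;
    }

    int main(void) {
        boundary_buffer b = { {0}, 0, 0 };
        xmit(&b, 42);
        printf("%llu\n", (unsigned long long)receive(&b)); /* 42 */
        return 0;
    }

Because the two clocks are frequency locked, the pointers advance at the same average rate, so the buffer need only be deep enough to cover their phase and duty-cycle differences.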
Accordingly, it is a primary objective of the present invention to provide a unified parallel processing architecture for connecting together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system.
Another primary objective of the present invention is to provide a unified parallel processing architecture that can overcome the present physical limitations imposed on interconnecting clustered multiprocessors.
Still another primary objective of the present invention is to provide a unified memory model that uniquely combines the prior art shared memory, extended shared memory and distributed memory models.
A further objective of the present invention is to provide a parallel processing computer system that allows for a shared memory model to be used with programs executed in the floating shared memory space that is defined across any adjacently interconnected clusters, and a distributed memory model to be used with any programs executed across non-adjacently interconnected clusters.
Still another objective of the present invention is to provide a clustered multiprocessor system where the physically separable clusters are all interconnected with an extendible clock mechanism that solves the physical problems posed by a synchronous clock system in a multiprocessor system with physically separable components.
These and other objectives of the present invention will become apparent with reference to the drawings, the detailed description of the preferred embodiment and the appended claims.
`
BRIEF DESCRIPTION OF THE DRAWINGS
FIGS. 1a and 1b are simplified block diagrams showing the two different basic memory models for parallel processing systems in the prior art, an ideal distributed memory model and an ideal shared memory model, respectively.
FIGS. 2a, 2b, 2c, 2d and 2e are simplified block diagrams showing several different extended shared memory models for parallel processing systems in the prior art.
FIGS. 3a and 3b are graphs showing the relative memory latency and relative data locality of an extended shared memory system, respectively, as a function of parallel processing efficiency.
FIG. 4 is a simplified block diagram of the preferred embodiment of the unified parallel processing architecture of the present invention.
`
`
`
`
FIGS. 5a, 5b, 5c and 5d are block diagrams showing four different cluster connection topologies for adjacently connecting multiple clusters together in accordance with the present invention.
FIG. 6 is a more detailed block diagram of a single dimension ring-type cluster interconnection of the preferred embodiment of the present invention.
FIG. 7 is a schematic diagram of the preferred embodiment of the adjacent cluster connection mechanism of the present invention, including the circuitry for implementing the extendible clock mechanism.
`
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
To better understand how the present invention provides for a novel unified parallel processing architecture,