United States Patent [19]
Chen et al.

[54] METHOD AND APPARATUS FOR A UNIFIED PARALLEL PROCESSING ARCHITECTURE

[75] Inventors: Steve S. Chen, Eau Claire; Douglas R. Beard, Eleva; George A. Spix, Eau Claire; Edward C. Priest, Eau Claire; John M. Wastlick, Eau Claire; James M. VanDyke, Eau Claire, all of Wis.

[73] Assignee: Cray Research, Inc., Chippewa Falls, Wis.

[21] Appl. No.: 912,964

[22] Filed: Jul. 10, 1992

[51] Int. Cl.6 ................................ G06F 3/60
[52] U.S. Cl. .................... 395/800; 395/200; 395/425; 364/DIG. 1; 364/229; 364/243
[58] Field of Search ................ 395/200, 800, 425

[56] References Cited

U.S. PATENT DOCUMENTS

4,130,865 12/1978 Heart et al. ................ 395/200
4,365,292 12/1982 Barnes et al. ............... 395/800
4,400,768  8/1983 Tomlinson ................... 395/800
4,445,171  4/1984 Neches ...................... 395/325
4,636,942  1/1987 Chen et al. ................. 395/725
4,707,781 11/1987 Sullivan et al. ............. 395/425
4,720,780  1/1988 Dolecek ..................... 395/800
4,745,545  5/1988 Schiffleger ................. 395/325
4,754,398  6/1988 Pribnow ..................... 395/200
4,827,403  5/1989 Steele, Jr. et al. .......... 395/800
4,834,483  5/1989 Arthurs et al. .............. 385/46
4,873,626 10/1989 Gifford ..................... 395/325
4,891,751  1/1990 Call et al. ................. 395/800
4,901,230  2/1990 Chen et al. ................. 395/325
5,055,999 10/1991 Frank et al. ................ 395/425
5,056,000 10/1991 Chang ....................... 395/325
5,072,371 12/1991 Benner et al. ............... 395/200
5,081,575  1/1992 Hiller et al. ............... 395/325
5,113,523  5/1992 Colley et al. ............... 395/800
5,165,038 11/1992 Beard et al. ................ 395/800
5,179,702  1/1993 Spix et al. ................. 395/650
5,197,130  3/1993 Chen et al. ................. 395/325
5,208,914  5/1993 Wilson et al. ............... 395/275

US005428803A

[11] Patent Number: 5,428,803
[45] Date of Patent: Jun. 27, 1995

OTHER PUBLICATIONS

Fast Interrupt Mechanism for a Multiprocessor System, Ser. No. 07/536,199, filed on Jun. 11, 1990.

Primary Examiner-Alyssa H. Bowler
Assistant Examiner-G. Donaghue
Attorney, Agent, or Firm-Schwegman, Lundberg & Woessner

[57] ABSTRACT

A unified parallel processing architecture connects together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. Multiple processors are grouped together into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster; however, only some of the clusters are adjacently interconnected. Clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in the floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters. The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory and distributed shared memory, into a unified parallel processing architecture.

24 Claims, 12 Drawing Sheets

OTHER PUBLICATIONS (continued)

Almasi, G. and Gottlieb, A., Highly Parallel Computing, Benjamin Cummings (1989), Chap. 1, "Overview", pp. 2-29; Chap. 8, "Interconnection Networks", pp. 278-299; Chap. 10, "MIMD Parallel Architectures", pp. 354-475.
Gajski, D., Milutinovic, V., Siegel, H., and Furht, B., Computer Architecture, The Computer Society of the IEEE (1987), Chap. 2, "Topics in Parallel Processing and Multiprocessing", pp. 81-171.
Hennessy, J. and Patterson, D., Computer Architecture: A Quantitative Approach, Morgan Kaufmann (1990), Chap. 10, "Future Directions", pp. 570-592.
Hwang, K. and DeGroot, D., Parallel Processing for Supercomputers and Artificial Intelligence, McGraw-Hill (1989), Chap. 2, pp. 31-67.
Kain, R., Computer Architecture, Prentice Hall (1989), vol. 1, Chap. 3, "Shared Resource Synchronization", pp. 178-250.
ETA10 System Overview: EOS, Tech. Note, Publ. 1006, ETA Systems, Sep. 30, 1988.
Clementi, E., Logan, D., Saarinen, J., "ICAP/3090: Parallel Processing for Large Scale Scientific and Engineering Problems", IBM Systems Journal, vol. 27, no. 4 (1988), pp. 475-509.
Pfister, G., "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture", Int'l Conf. on Parallel Processing, Aug. 1985, pp. 764-771.
Murakami, K., Akira, F., Sueyoshi, T., and Tomita, S., "An Overview of the Kyushu University Reconfigurable Parallel Processor", Aug. 1988, pp. 130-137.
Kuck, D., Davidson, E., Lawrie, D., and Sameh, A., "Parallel Supercomputing Today and the Cedar Approach", Science, vol. 231, Feb. 1986, pp. 967-974.
Goodman, J. and Woest, P., "The Wisconsin Multicube: A New Large Scale Cache-Coherent Multiprocessor", Proc. of the 1988 Int'l Conf. on Parallel Processing, IEEE, Feb. 1988, pp. 422-431.
[Sheet 1 of 12]
FIG. 1a (PRIOR ART): distributed memory model; processors (P) with private local memories (LM) joined by a message passing connection (reference numerals 10, 12, 14).
FIG. 1b (PRIOR ART): shared memory model; processors (P) with a direct connection to a common shared memory (reference numerals 20, 22).

[Sheet 2 of 12]
FIG. 2a (PRIOR ART): extended shared memory, switching network model; processors (P) and local memories (LM) joined through a network switching connection (reference numerals 30, 32, 34).
FIG. 2b (PRIOR ART): extended shared memory, hierarchical ring model; processors (P) and local memories (LM) on segment connections (Segment 0', Segment 0'') gathered by a domain connection (Domain 1) (reference numerals 40, 42, 44).

[Sheet 3 of 12]
FIG. 2c (PRIOR ART): extended shared memory, reconfigurable model; processors (P) with paired global (GM) and local (LM) memories under a network control connection (reference numerals 50, 52, 54).
FIG. 2d (PRIOR ART): extended shared memory, cluster/global memory interconnect; clusters of processors (P) with cluster shared memories connected to a global memory (reference numerals 60, 62, 64, 66).

[Sheet 4 of 12]
FIG. 2e (PRIOR ART): extended shared memory, cluster model with direct interconnect; in each cluster, processors (P) reach a cluster shared memory through an arbitration node, and the clusters are directly interconnected (reference numerals 70, 72, 74, 76).

[Sheet 5 of 12]
FIG. 3a: graph of parallel processing efficiency (0 to 1) against memory relative latency (axis ticks 1, 0.1, 0.01, 0.001).
FIG. 3b: graph of parallel processing efficiency (0 to 1) against memory relative locality (axis ticks 1, 0.1, 0.01, 0.001).

[Sheet 6 of 12]
FIG. 4: block diagram of the unified parallel processing architecture of the present invention (reference numerals 100c, 100d, 102d).

[Sheet 7 of 12]
FIG. 5a: cluster connection topology.

[Sheet 8 of 12]
FIG. 5c: cluster connection topology (reference numeral 100).

[Sheet 9 of 12]
FIG. 5b: cluster connection topology (reference numeral 110).
FIG. 5d: cluster connection topology (reference numeral 100).


`CH
`0
`00
`...
`00
`"' ~
`
`01
`
`N
`~
`
`Q e.
`(!) a
`00 =(cid:173)
`
`~
`
`01
`1.0
`1.0
`~
`

`
`~-..l
`N
`(I)
`
`"""'
`~
`t-c a
`•
`rJJ.
`d •
`
`140
`
`Xfer
`by-Word
`Word(cid:173)
`
`140
`
`Xfer
`by-Block
`Block(cid:173)
`
`Fig. 6 a
`
`by-W ord
`
`Xfer
`
`Word-
`
`130a~
`
`I
`
`7
`
`i.1,,S i
`
`e"-f
`
`e"e
`
`Vl\1f\11'----.••
`
`'-9._ e"e._
`
`I 107
`
`I
`
`✓, ~ <
`
`',
`
`I <
`
`I
`
`I
`
`1020
`
`I ~ ~ r A iN/1
`I -----=
`
`I
`
`106a-,.
`
`1200 ---
`
`122a
`
`IOOo
`
`IPR2023-00037
`Apple EX1005 Page 12
`
`

`

`0 w
`00
`...
`00
`N
`.J::a,.
`....
`U1
`
`e. ....
`....
`a ....
`g'
`00
`
`N
`
`....
`(I) --~

`
`OJ
`\C
`\C
`
`~ ~
`~
`I-cl
`•
`C/1
`•
`c::
`
`I04c
`
`122c
`
`122b
`
`104b
`
`IOOc
`
`106c
`
`~120c
`
`eee
`
`I ,_
`I
`I
`I
`
`I32b~
`
`I30c
`
`•••
`
`I34b
`
`130b
`
`140
`
`140
`
`Fig. 6b
`
`IOOb
`
`106b
`
`IPR2023-00037
`Apple EX1005 Page 13
`
`

`

[Sheet 12 of 12]
FIG. 7: schematic of the adjacent cluster connection and extendible clock mechanism; a transmit latch and transmit clock drive words across the cluster boundary (with a disable input) into a receive latch and a buffer RAM with write and read pointers, clocked by a clock generator, a common clock oscillator, and a frequency locked local clock (reference numerals 132, 134, 136, 138, 141, 142, 144, 150, 152, 154, 156, 158).

METHOD AND APPARATUS FOR A UNIFIED PARALLEL PROCESSING ARCHITECTURE

RELATED APPLICATIONS

This application is related to the following co-pending applications, all of which are assigned to the assignee of the present invention, the disclosures of all of which are hereby incorporated by reference in the present application:

CLUSTER ARCHITECTURE FOR A HIGHLY PARALLEL SCALAR/VECTOR MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,197,130, issued Mar. 23, 1993;

METHOD AND APPARATUS FOR NON-SEQUENTIAL RESOURCE ACCESS, U.S. Pat. No. 5,208,914, issued May 4, 1993;

GLOBAL REGISTERS FOR A MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,165,038, issued Nov. 17, 1992;

FAST INTERRUPT MECHANISM FOR A MULTIPROCESSOR SYSTEM, Ser. No. 07/536,199, filed on Jun. 11, 1990; and

INTEGRATED SOFTWARE ARCHITECTURE FOR A HIGHLY PARALLEL MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,179,702, issued Jan. 12, 1993.

TECHNICAL FIELD

The present invention relates generally to parallel processing computer systems for performing multiple-instruction-multiple-data (MIMD) parallel processing. More particularly, the present invention relates to a method and apparatus for a unified parallel processing architecture for high performance MIMD multiprocessors that organizes the multiprocessors into four or more physically separable clusters, only some of which are adjacently interconnected, and provides for a shared memory model to be used with programs executed in the floating shared memory space that is defined across any adjacently interconnected clusters, and a distributed memory model to be used with any programs executed across non-adjacently interconnected clusters.

PRIOR ART

The field of parallel computing has received increasing attention in recent years as computer designers seek to increase the effective processing capabilities of computer processing systems. Most of the current work relating to high performance parallel processing has focused on multiple-instruction-multiple-data (MIMD) parallel processing systems. Current architectures for MIMD parallel processing systems use one of two different memory models for sharing data within a program that is to be executed in parallel: distributed memory vs. shared memory. In a distributed memory model, data is stored in the private local memory of each processor and is communicated among processors by some type of message passing scheme. In a shared memory model, data is stored in a common shared memory that is equally accessible to all processors in the parallel processing system. An excellent summary of current MIMD parallel architectures is set forth in Almasi and Gottlieb, Highly Parallel Computing (1989), Chap. 10, pp. 354-475.
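
The contrast between the two models can be made concrete with a small sketch, added here for illustration and not part of the patent text: the same reduction is computed once under a shared memory model, where worker threads simply address a common array, and once under a distributed memory model, where each partition must first be copied through an explicit message channel. A POSIX pipe stands in for the message passing interconnect and pthreads stand in for symmetric access to a common memory.

```c
/* Illustrative sketch (not from the patent): one reduction, two memory models. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define N 1024
#define WORKERS 4

static double data[N];            /* commonly addressable: shared model */
static double partial[WORKERS];

static void *shared_worker(void *arg) {
    long id = (long)arg;          /* each worker reads its slice in place */
    double s = 0.0;
    for (int i = id * (N / WORKERS); i < (id + 1) * (N / WORKERS); i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    /* Shared memory model: no data movement, only thread start/join. */
    pthread_t t[WORKERS];
    for (long id = 0; id < WORKERS; id++)
        pthread_create(&t[id], NULL, shared_worker, (void *)id);
    double sum = 0.0;
    for (int id = 0; id < WORKERS; id++) {
        pthread_join(t[id], NULL);
        sum += partial[id];
    }
    printf("shared-memory sum = %g\n", sum);

    /* Distributed memory model: the data must be partitioned and each
     * partition copied through a message channel (a pipe stands in for
     * the interconnect); this copying is the overhead the text refers to. */
    int ch[2];
    pipe(ch);
    for (int id = 0; id < WORKERS; id++)
        write(ch[1], &data[id * (N / WORKERS)],
              (N / WORKERS) * sizeof(double));
    double msg[N / WORKERS], dsum = 0.0;
    for (int id = 0; id < WORKERS; id++) {
        read(ch[0], msg, sizeof(msg));       /* receive one partition */
        for (int i = 0; i < N / WORKERS; i++) dsum += msg[i];
    }
    printf("message-passing sum = %g\n", dsum);
    return 0;
}
```

The second half of the sketch is where the partitioning and transfer overhead shows up; in the shared memory half there is no data movement at all.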
From the programmer's perspective, the most attractive parallel processing system is one that uses a shared memory model having a globally shared physical address space. A shared address space leaves the choice of the parallel programming model up to the programmer and provides a more flexible problem solution space. In addition, most present software programs are written for a shared memory model, as opposed to a distributed or private memory model. To effectively port an existing software program from a shared memory model to a private memory model can require significant reprogramming effort. One of the primary reasons why it is easier to program a shared memory model is that a distributed memory model necessarily requires that the processes and data which comprise a parallel programming problem be partitioned into separate tasks that are each small enough to fit into the private local memory of a processor and at the same time are large enough to recover the overhead required to establish a parallel task in a message passing environment. In a shared memory model, there are fewer restrictions on the size of the parallel tasks, and the overhead required to establish a parallel task can be significantly less than the overhead associated with the more complicated message passing scheme required by a distributed memory model. Thus, it would be desirable to have a parallel processing system which could provide a large and flexible problem solution space for parallel processing wherein a large number of processors all have equal and symmetric access to any and all memory locations in a common shared memory.
Unfortunately, the hardware required to implement such a true shared memory model for a parallel processing system having a large number of processors can be expensive and complicated. While it may be practically possible to directly wire each processor to each bank of memory to allow for equal and symmetric access to the shared memory in a minimally parallel processing system having two, four or eight processors, it is not practically possible to directly wire each processor to each bank of memory in parallel processing systems when the number of processors is greater than about eight. As a result, several types of extended shared memory systems have been proposed when the number of processors in the parallel processing system is greater than about eight.

Currently, most extended shared memory systems use a variety of multi-stage interconnection techniques to connect a local memory that is closely associated with each processor with the local memories of all other processors, such that all of the local memories together comprise the globally shared address space of the system, e.g., the BBN Butterfly+, the NYU Ultracomputer and the Thinking Machines, Inc. CM-1. Some extended shared memory systems combine a local memory for each processor with a global memory that operates as a cache write-through or duplicate address space, e.g., the Carnegie-Mellon Cm*. Still other systems provide a reconfigurable memory model where each processor has a memory that can be accessed either locally or globally, and where the boundary between local and global memory can be dynamically adjusted, e.g., the IBM RP3 and U.S. Pat. No. 5,056,000.
The problem with such extended shared memory systems is that memory access is not equal and symmetric from all processors to all memory locations in the extended shared memory. Instead, each processor may encounter different memory latencies when accessing different memory locations in different parts of the extended shared memory address space. This is especially true when comparing the access of a processor to its local memory with the access of that processor to a remote local memory or to a global memory. In essence, only those accesses which are to the local memory of the processor are equal and symmetric; all other memory accesses are variable and, in some cases, even indeterminate. As a result, these systems perform parallel processing best when nearly all of the memory requests for the related data to be operated on by each parallel task are made to the individual local memory of the processor executing that parallel task. When even a small percentage of the memory requests for the related data required by a parallel task must access memory outside the individual local memory, significant degradation of parallel processing performance can occur.
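
The scale of this effect can be seen with a simple first-order model, added here for illustration (it is not from the patent): if a fraction f of a task's memory references must leave local memory, and a remote reference costs r times a local one, the task runs at 1/(1 - f + f*r) of its all-local speed.

```c
/* First-order model (our own, not the patent's) of remote-access
 * degradation: with a remote/local latency ratio of 100, even 1% of
 * references going remote roughly halves the effective speed. */
#include <stdio.h>

static double relative_speed(double f_remote, double r_latency) {
    return 1.0 / ((1.0 - f_remote) + f_remote * r_latency);
}

int main(void) {
    double r = 100.0;   /* assumed remote/local latency ratio */
    for (double f = 0.0; f <= 0.05; f += 0.01)
        printf("f_remote = %.2f -> relative speed %.2f\n",
               f, relative_speed(f, r));
    return 0;
}
```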
In an effort to increase the portion of an extended shared memory that is not affected by the problems of different latencies in accessing remotely shared memory, at least three different parallel processing systems have incorporated the concept of grouping the shared memory for a multiple number of processors together, such that all of the processors in a group have similar access to a shared memory for that group. These groups of processors are then all connected to each other to provide the total extended shared memory space for the parallel processing system.

In the Kendall Square Research KSR-1 supercomputer, the local memories of a group of processors are connected together via a ring bus interconnection to form one of a plurality of segments. The various segments are then connected to each other by a pyramidal hierarchy of "higher" segments, with each higher-level segment having fewer segments than the preceding lower-level segment. Each of the "higher" segments of the domains uses a plurality of transfer units that are also arranged into a ring bus interconnection to transfer memory requests on a distributive hierarchical basis. Although there is a more uniform memory latency among requests to the local memories of other processors in that segment for the processors in the KSR-1 supercomputer, there are still significant differences in memory latencies between a local memory access, a memory access within that segment, and a memory access that is outside of that segment. As a result, it is still necessary to provide for some memory hierarchy optimization of memory requests for each parallel task in order to have a computer program run most efficiently on the KSR-1 supercomputer.
In the University of Illinois Cedar supercomputer, each cluster has eight processors linked by an 8x8 crossbar switch to a 4-way interleaved write-back cache, and on through to a local cluster memory having 32 Mbytes per cluster. Eight clusters are then linked by another 8x8 crossbar switch to a global memory having 8 Mbytes per processor. Transfers from global memory to a cluster memory are accomplished by a block transfer of 32-byte blocks from the global memory through the cache and on to the cluster memory. Each processor may also directly access a portion of the global memory on a byte-by-byte basis. In the case of the Cedar supercomputer, inter-cluster communication is part of a memory-hierarchy model and can be accomplished either by having processors in both clusters write to a common section of the global shared memory, or by a block transfer of the cluster memory, through the global memory, to the other cluster shared memory. As with the KSR-1 supercomputer, the use of memory hierarchy optimization is a prerequisite to successfully programming a computer program to run most efficiently on the Cedar supercomputer.
In the original supercomputer developed by the assignee of the present invention, which is the subject of the previously identified patent entitled CLUSTER ARCHITECTURE FOR A HIGHLY PARALLEL SCALAR/VECTOR MULTIPROCESSOR, a four cluster parallel processing supercomputer is disclosed. Each cluster includes between 4 and 256 high performance scalar/vector processors which are connected via a unique arbitration node network to a set of interleaved sections of memory that form the cluster shared memory. The processors in each cluster are also directly connected to the cluster shared memory of all three of the other clusters by an additional port in the arbitration node network. Using the cluster architecture of the original supercomputer developed by the assignee of the present invention, a single common address space is defined across the cluster memories of all four clusters, with the only difference between intra-cluster memory requests and inter-cluster memory requests being the different memory latency, primarily due to the physical distance which separates the clusters. Consequently, it is possible to provide for parallel processing of programs throughout the entire address space of the four clusters without the need for a memory hierarchy optimization.
The problem with these cluster-type parallel processing architectures is that all of the clusters must be directly connected together as part of the same single shared address space. Unfortunately, the number of clusters that can be directly interconnected is limited by the cost and complexity required for such interconnections. In the case of the Cedar supercomputer, the complexity and costs of interconnecting additional clusters increase exponentially, as more memory is required for the global memory with the addition of each processor, and as larger and larger crossbar switches are required to directly interconnect each cluster to the global memory. The original supercomputer developed by the assignee of the present invention does not require ever increasing amounts of global memory as the number of processors increases; however, the physical limitations imposed by directly interconnecting each cluster to every other cluster, in such a way as to not significantly increase the memory latency for inter-cluster memory requests, make the problem of directly interconnecting a large number of clusters very difficult. An additional physical limitation that restricts the number of clusters which can be interconnected in either of these parallel processing supercomputers is that all of the clusters in either of these cluster-type supercomputers must be physically close enough to one another that the clock signals for all of the clusters can be synchronized, or else suffer the corresponding performance penalties associated with asynchronous communication schemes.

Although cluster-type architectures offer a partial solution to the problem of varying memory latencies, cluster-type architectures are themselves limited by the physical problems associated with directly connecting together all of the clusters in order to provide a single common address space for the parallel processing system. Thus, while cluster-type parallel processing architectures, such as the original supercomputer developed by the assignee of the present invention, represent a significant improvement in the design and architecture of high performance parallel processors, it would be advantageous to provide for a unified parallel processing architecture that can allow an effectively unlimited number of clusters of multiprocessors to be interconnected together, and that can overcome the present physical limitations imposed on interconnecting clustered multiprocessors.

SUMMARY OF THE INVENTION

The present invention provides a unified parallel processing architecture for connecting together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. The present invention organizes multiple processors into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster. Unlike present extended shared memory cluster-type architectures, only some of the clusters are adjacently interconnected in the present invention. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in a floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters.

In the present invention, clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. By adjacently connecting together only those clusters which can satisfy these conditions, the floating shared memory created by the present invention does not have to sacrifice performance for the increased interconnection necessary to increase the size of the problem solution space provided by an extended shared memory model where every processor can access the single shared address space of the floating shared memory. The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory and distributed shared memory, into a unified parallel processing architecture. Thus, the present invention accommodates parallel processing performance first and flexibility of the interconnection of processors and memory second, with the tradeoff between these two considerations balanced at the point at which the performance of inter-cluster memory accesses starts to degrade significantly.
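
One way to picture the resulting unified model is as a scheduling decision over an adjacency relation among clusters. The sketch below is our own reading of the architecture, not code from the patent; in particular it assumes, purely for illustration, that a floating shared memory requires every pair of clusters spanned by a program to be adjacently interconnected.

```c
/* Minimal sketch (an interpretation, not the patent's implementation):
 * record which clusters are adjacently interconnected and pick the
 * memory model for a program from the set of clusters it will span. */
#include <stdbool.h>
#include <stdio.h>

#define NCLUSTERS 4

/* adjacency of physically separable clusters; here a ring 0-1-2-3-0 */
static const bool adjacent[NCLUSTERS][NCLUSTERS] = {
    { false, true,  false, true  },
    { true,  false, true,  false },
    { false, true,  false, true  },
    { true,  false, true,  false },
};

typedef enum { PURE_SHARED, FLOATING_SHARED, DISTRIBUTED } mem_model;

static mem_model choose_model(const int *set, int n) {
    if (n == 1)
        return PURE_SHARED;          /* fits in one cluster shared memory */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (!adjacent[set[i]][set[j]])
                return DISTRIBUTED;  /* must fall back to message passing */
    return FLOATING_SHARED;          /* adjacently interconnected set */
}

int main(void) {
    int pair[] = {0, 1}, far[] = {0, 2};
    printf("clusters {0,1}: model %d\n", choose_model(pair, 2));
    printf("clusters {0,2}: model %d\n", choose_model(far, 2));
    return 0;
}
```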
The four or more clusters of multiprocessors may be adjacently interconnected using a variety of hardware interconnection topologies. In the preferred embodiment, each of the clusters is a physically separable unit that includes two or more processors and one or more input/output ports, all of which are symmetrically connected by a node switching mechanism to a homogeneous cluster shared memory. For each of the clusters, two or more external cluster connections provide a word-by-word communication path for adjacently connecting the processors in any one cluster to the cluster shared memory of two or more adjacently interconnected clusters. A distributed memory communication mechanism is also used for providing a message passing communication among any non-adjacently interconnected clusters, so as to define an extendible address space encompassing the cluster shared memory of all clusters.
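
A hypothetical decode of such an extendible address space, with the cluster number carried in the high-order address bits, might look like the following. The field position, names and routing categories are our assumptions for illustration; the patent does not specify this encoding.

```c
/* Hypothetical sketch: route a memory reference by owning cluster. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NCLUSTERS 4
#define CLUSTER_SHIFT 32   /* assumed field position, for illustration */

static const bool adjacent[NCLUSTERS][NCLUSTERS] = {
    { false, true,  false, true  },
    { true,  false, true,  false },
    { false, true,  false, true  },
    { true,  false, true,  false },
};

typedef enum { LOCAL_ACCESS, ADJACENT_WORD_XFER, MESSAGE_PASS } route;

/* Decide how a reference issued in cluster `home` reaches `addr`. */
static route route_reference(int home, uint64_t addr) {
    int owner = (int)(addr >> CLUSTER_SHIFT);
    if (owner == home)
        return LOCAL_ACCESS;        /* symmetric cluster shared memory */
    if (adjacent[home][owner])
        return ADJACENT_WORD_XFER;  /* external cluster connection */
    return MESSAGE_PASS;            /* distributed memory mechanism */
}

int main(void) {
    uint64_t a1 = ((uint64_t)1 << CLUSTER_SHIFT) | 0x100;
    uint64_t a2 = ((uint64_t)2 << CLUSTER_SHIFT) | 0x100;
    printf("cluster 0 -> cluster 1: route %d\n", route_reference(0, a1));
    printf("cluster 0 -> cluster 2: route %d\n", route_reference(0, a2));
    return 0;
}
```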
To further enhance the parallel processing capabilities of the preferred embodiment of the present invention, each multiprocessor cluster is provided with an extendible clock mechanism and an extendible control mechanism to coordinate parallel processing among clusters. The extendible clock mechanism solves the problem of differing phases and duty cycles in clock signals among physically separable components of the parallel processing system. The extendible control mechanism provides control mechanisms that can be used by a distributively-based operating system to coordinate and communicate among processes executing in parallel in the computer processing system.
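
FIG. 7 suggests the flavor of the extendible clock mechanism: clusters derive their frequency from a common clock oscillator but may differ in phase, so a word crossing a cluster boundary is captured by a transmit latch, written into a buffer RAM under the sender's clock via a write pointer, and drained under the receiver's frequency-locked local clock via a read pointer. The following is a plain software model of that pointer discipline, our sketch of the idea rather than the hardware.

```c
/* Software model (an illustration, not the circuit of FIG. 7) of a
 * boundary buffer crossing between two same-frequency clock domains. */
#include <stdint.h>
#include <stdio.h>

#define BUF_WORDS 8            /* assumed depth of the buffer RAM */

typedef struct {
    uint64_t ram[BUF_WORDS];
    unsigned wr, rd;           /* write pointer / read pointer */
} boundary_buf;

/* transmit side: advance the write pointer on the sender's clock */
static void xmit(boundary_buf *b, uint64_t word) {
    b->ram[b->wr % BUF_WORDS] = word;
    b->wr++;
}

/* receive side: the local clock drains at the same average rate, so
 * once a fixed phase guard band (here, half the buffer) is established
 * the read pointer never overruns the writer. */
static int receive(boundary_buf *b, uint64_t *word) {
    if (b->rd + BUF_WORDS / 2 > b->wr)
        return 0;              /* still inside the guard band */
    *word = b->ram[b->rd % BUF_WORDS];
    b->rd++;
    return 1;
}

int main(void) {
    boundary_buf b = {0};
    uint64_t w;
    for (uint64_t i = 0; i < 16; i++) {
        xmit(&b, i);                 /* one sender clock tick */
        if (receive(&b, &w))         /* one receiver clock tick */
            printf("received %llu\n", (unsigned long long)w);
    }
    return 0;
}
```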
Accordingly, it is a primary objective of the present invention to provide a unified parallel processing architecture for connecting together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system.

Another primary objective of the present invention is to provide a unified parallel processing architecture that can overcome the present physical limitations imposed on interconnecting clustered multiprocessors.

Still another primary objective of the present invention is to provide a unified memory model that uniquely combines the prior art shared memory, extended shared memory and distributed memory models.

A further objective of the present invention is to provide a parallel processing computer system that allows for a shared memory model to be used with programs executed in the floating shared memory space that is defined across any adjacently interconnected clusters, and a distributed memory model to be used with any programs executed across non-adjacently interconnected clusters.

Still another objective of the present invention is to provide a clustered multiprocessor system where the physically separable clusters are all interconnected with an extendible clock mechanism that solves the physical problems posed by a synchronous clock system in a multiprocessor system with physically separable components.

These and other objectives of the present invention will become apparent with reference to the drawings, the detailed description of the preferred embodiment and the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1a and 1b are simplified block diagrams showing the two different basic memory models for parallel processing systems in the prior art, an ideal distributed memory model and an ideal shared memory model, respectively.

FIGS. 2a, 2b, 2c, 2d and 2e are simplified block diagrams showing several different extended shared memory models for parallel processing systems in the prior art.

FIGS. 3a and 3b are graphs showing the relative memory latency and relative data locality of an extended shared memory system, respectively, as a function of parallel processing efficiency.

FIG. 4 is a simplified block diagram of the preferred embodiment of the unified parallel processing architecture of the present invention.

FIGS. 5a, 5b, 5c and 5d are block diagrams showing four different cluster connection topologies for adjacently connecting multiple clusters together in accordance with the present invention.

FIG. 6 is a more detailed block diagram of a single dimension ring-type cluster interconnection of the preferred embodiment of the present invention.

FIG. 7 is a schematic diagram of the preferred embodiment of the adjacent cluster connection mechanism of the present invention, including the circuitry for implementing the extendible clock mechanism.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

To better understand how the present invention provides for a novel unified parallel processing architecture,
