`
United States Patent [19]
Chen et al.

[11] Patent Number: 5,428,803
[45] Date of Patent: Jun. 27, 1995

US005428803A

[54] METHOD AND APPARATUS FOR A UNIFIED PARALLEL PROCESSING ARCHITECTURE

[75] Inventors: Steve S. Chen, Eau Claire; Douglas R. Beard, Eleva; George A. Spix, Eau Claire; Edward C. Priest, Eau Claire; John M. Wastlick, Eau Claire; James M. VanDyke, Eau Claire, all of Wis.

[73] Assignee: Cray Research, Inc., Chippewa Falls, Wis.

[21] Appl. No.: 912,964

[22] Filed: Jul. 10, 1992

[51] Int. Cl.6 ..........
[52] U.S. Cl. .......... 395/800; 395/200; 395/425; 364/DIG. 1; 364/229; 364/243
[58] Field of Search .......... 395/200, 800, 425

[56] References Cited

U.S. PATENT DOCUMENTS

4,130,865 12/1978 Heart et al. .......... 395/200
4,365,292 12/1982 Barnes et al. .......... 395/800
4,400,768 8/1983 Tomlinson .......... 395/800
4,445,171 4/1984 Neches .......... 395/325
4,636,942 1/1987 Chen et al. .......... 395/725
4,707,781 11/1987 Sullivan et al. .......... 395/425
4,720,780 1/1988 Dolecek .......... 395/800
4,745,545 5/1988 Schiffleger .......... 395/325
4,754,398 6/1988 Pribnow .......... 395/200
4,827,403 5/1989 Steele, Jr. et al. .......... 395/325
4,834,483 5/1989 Arthurs et al. ..........
4,873,626 10/1989 Gifford .......... 395/325
4,891,751 1/1990 Call et al. .......... 395/800
4,901,230 2/1990 Chen et al. .......... 395/325
5,055,999 10/1991 Frank et al. ..........
5,056,000 10/1991 Chang .......... 395/325
5,072,371 12/1991 Benner et al. .......... 395/200
5,081,575 1/1992 Hiller et al. ..........
5,113,523 5/1992 Colley et al. .......... 395/800
5,165,038 11/1992 Beard et al. .......... 395/800
5,179,702 1/1993 Spix et al. .......... 395/650
5,197,130 3/1993 Chen et al. .......... 395/325
5,208,914 5/1993 Wilson et al. .......... 395/275

OTHER PUBLICATIONS

Fast Interrupt Mechanism for a Multiprocessor System, Ser. No. 07/536,199, filed on Jun. 11, 1990.

(List continued on next page.)

Primary Examiner-Alyssa H. Bowler
Assistant Examiner-G. Donaghue
Attorney, Agent, or Firm-Schwegman, Lundberg & Woessner

[57] ABSTRACT

A unified parallel processing architecture connects together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. Multiple processors are grouped together into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster; however, only some of the clusters are adjacently interconnected. Clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in the floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters. The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory and distributed shared memory, into a unified parallel processing architecture.

24 Claims, 12 Drawing Sheets
`5,428,803
`Page 2
`
`OTHER PUBLICATIONS
`Almasi, G. and Gottlieb, A., Highly parallel Computing,
`Benjamin Cummings 1989, Chpt. 1, “Overview", pp.
`2–29, Chap. 8, Interconnection Networks pp. 278-299,
`Chapt. 10 MIMD Parallel Architectures pp. 354-475.
`Gajski, D., Milutinovic, V., Siegel, H., and Furht, B.
`Computer Architecture, The Computer Society of the
`IEEE, (1987), Chpt. 2, “Topics in Parallel Processing
`and Multiprocessing', pp. 81-171.
`Hennesy, J. and Patterson, D., Computer Architecture:
`A Quantative Approach, Morgan Kaufman (1990), Chap.
`10, "Future Directions”, pp. 570-592.
`Hwang, K. and DeGroot, D., Parallel Processing for
`Supercomputers and Artificial Intelligence, McGraw
`Hill Publ., (1989) Chpt. 2, pp. 31-67.
`Kain, R., Computer Architecture, Prentice Hall, (1989),
`vol. 1, Chpt. 3, "Shared Resource Synchronization',
`pp. 178-250.
`
`ETA 10 System Overview: EOS, Tech. Note, Publ.
`1006, ETA Systems, Sep. 30, 1988.
`Clementi, E., Logan, D., Saarninen, J., “ICAP/3090:
`Parallel Processing For Large Scale Scientific and En
`gineering Problems', IBM Systems Journal, vol. 27, No.
`4 (1988) pp. 475-509.
`Pfister, G., “The IBM Research Parallel Processor
`Prototype (RP3): Introduction and Architecture”, Int’l
`Conf. on Parallel Processing, Aug. 1985, pp. 764-771.
`Murakami, K., Akira, F., Sueyoshi, T. and Tomita, S.,
`“An Overview of the Kyushi University Reconfigura
`ble Parallel Processor”, Aug. 1988, pp. 130-137.
`Kuck, D., Davidson, E., Lawrie, D. and Sameh, A.,
`“Parallel Supercomputing Today and the Cedar Ap
`proach”, Science, vol. 231, Feb. 1986, pp. 967–974.
`Goodman, J. and Woest, P., “The Wisconsin Mul
`ticube: A New Large Scale Cache-Coherent Multipro
`cessor', Proc. of the 1988 Int'l Conf. on Parallel Pro
`cessing, IEEE, Feb. 1988, pp. 422-431.
`
`Amazon / Zentian Limited
`Exhibit 1005
`Page 2
`
`
`
[Drawing sheets 1 through 12; the figures themselves are not reproducible here. Captions and recoverable labels:]

FIG. 1a (PRIOR ART): distributed memory model; processors (P) with local memories (LM) joined by a message passing connection.
FIG. 1b (PRIOR ART): shared memory model; processors joined to a common shared memory (24).
FIG. 2a (PRIOR ART): extended shared memory, switching network model.
FIG. 2b (PRIOR ART): extended shared memory, hierarchical ring model (segment and domain connections).
FIG. 2c (PRIOR ART): extended shared memory, reconfigurable model (control network connection; processors P with local memories LM and global memories GM).
FIG. 2d (PRIOR ART): extended shared memory, cluster/global memory interconnect (clusters with cluster shared memories and a global memory).
FIG. 2e (PRIOR ART): extended shared memory, cluster model, direct interconnect (clusters with cluster shared memories and arbitration).
FIG. 3a: graph of parallel processing efficiency versus relative memory latency.
FIG. 3b: graph of parallel processing efficiency versus relative memory locality.
FIG. 4: the preferred embodiment of the unified parallel processing architecture (clusters with cluster shared memories).
FIGS. 5a-5d: four cluster connection topologies for adjacently connecting multiple clusters.
FIG. 6: a single dimension ring-type cluster interconnection.
FIG. 7: the adjacent cluster connection mechanism, including the extendible clock circuitry (transmit latch, transmit clock, clock boundary, common clock oscillator).
`
`
`
METHOD AND APPARATUS FOR A UNIFIED PARALLEL PROCESSING ARCHITECTURE

RELATED APPLICATIONS

This application is related to the following co-pending applications, all of which are assigned to the assignee of the present invention, the disclosure of all of which are hereby incorporated by reference in the present application:

CLUSTER ARCHITECTURE FOR A HIGHLY PARALLEL SCALAR/VECTOR MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,197,130, issued Mar. 23, 1993;

METHOD AND APPARATUS FOR NON-SEQUENTIAL RESOURCE ACCESS, U.S. Pat. No. 5,208,914, issued May 4, 1993;

GLOBAL REGISTERS FOR A MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,165,038, issued Nov. 17, 1992;

FAST INTERRUPT MECHANISM FOR A MULTIPROCESSOR SYSTEM, Ser. No. 07/536,199, filed on Jun. 11, 1990; and

INTEGRATED SOFTWARE ARCHITECTURE FOR A HIGHLY PARALLEL MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,179,702, issued Jan. 12, 1993.

TECHNICAL FIELD

The present invention relates generally to parallel processing computer systems for performing multiple-instruction-multiple-data (MIMD) parallel processing. More particularly, the present invention relates to a method and apparatus for a unified parallel processing architecture for high performance MIMD multiprocessors that organizes the multiprocessors into four or more physically separable clusters, only some of which are adjacently interconnected, and provides for a shared memory model to be used with programs executed in the floating shared memory space that is defined across any adjacently interconnected clusters, and a distributed memory model to be used with any programs executed across non-adjacently interconnected clusters.

PRIOR ART

The field of parallel computing has received increasing attention in recent years as computer designers seek to increase the effective processing capabilities of computer processing systems. Most of the current work relating to high performance parallel processing has focused on multiple-instruction-multiple-data (MIMD) parallel processing systems. Current architectures for MIMD parallel processing systems use one of two different memory models for sharing data within a program that is to be executed in parallel: distributed memory vs. shared memory. In a distributed memory model, data is stored in the private local memory of each processor and is communicated among processors by some type of message passing scheme. In a shared memory model, data is stored in a common shared memory that is equally accessible to all processors in the parallel processing system. An excellent summary of current MIMD parallel architectures is set forth in Almasi and Gottlieb, Highly Parallel Computing, (1989) Chpt. 10, pgs. 354-475.

From the programmer's perspective, the most attractive parallel processing system is one that uses a shared memory model having a globally shared physical address space. A shared address space leaves the choice of the parallel programming model up to the programmer and provides a more flexible problem solution space. In addition, most present software programs are written for a shared memory model, as opposed to a distributed or private memory model. To effectively port an existing software program from a shared memory model to a private memory model can require significant reprogramming effort. One of the primary reasons why it is easier to program a shared memory model is that a distributed memory model necessarily requires that the processes and data which comprise a parallel programming problem be partitioned into separate tasks that are each small enough to fit into the private local memory of a processor and at the same time are large enough to recover the overhead required to establish a parallel task in a message passing environment. In a shared memory model, there are fewer restrictions on the size of the parallel tasks and the overhead required to establish a parallel task can be significantly less than the overhead associated with the more complicated message passing scheme required by a distributed memory model. Thus, it would be desirable to have a parallel processing system which could provide a large and flexible problem solution space for parallel processing wherein a large number of processors all have equal and symmetric access to any and all memory locations in a common shared memory.
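To make the size-of-task tradeoff concrete, consider the following short C sketch (illustrative only; the patent contains no source code, and all identifiers are invented). In the shared memory model a parallel task is little more than a loop over part of a commonly accessible array; the comments note the partitioning, message packing and gathering steps that the same computation would require under a distributed memory model.

/*
 * Illustrative sketch only. In a shared memory model every worker
 * simply indexes the common data[] array. In a distributed memory
 * model, data[] would first have to be partitioned into per-processor
 * chunks, each chunk packed into a message and sent to its owning
 * processor, and the partial results gathered back by further
 * messages; that overhead is what each task must be large enough
 * to amortize, as the text above explains.
 */
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)
#define NWORKERS 4

static double data[N];           /* lives in the common shared memory  */
static double partial[NWORKERS]; /* one slot per worker, no contention */

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NWORKERS), hi = lo + (N / NWORKERS);
    double sum = 0.0;
    for (long i = lo; i < hi; i++) /* equal, symmetric access to data[] */
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long id = 0; id < NWORKERS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    double total = 0.0;
    for (long id = 0; id < NWORKERS; id++) {
        pthread_join(t[id], NULL);
        total += partial[id];
    }
    printf("sum = %f\n", total); /* prints 1048576.000000 */
    return 0;
}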
Unfortunately, the hardware required to implement such a true shared memory model for a parallel processing system having a large number of processors can be expensive and complicated. While it may be practically possible to directly wire each processor to each bank of memory to allow for equal and symmetric access to the shared memory in a minimally parallel processing system having two, four or eight processors, it is not practically possible to directly wire each processor to each bank of memory in parallel processing systems when the number of processors is greater than about eight. As a result, several types of extended shared memory systems have been proposed when the number of processors in the parallel processing system is greater than about eight.
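By way of a rough worked example (the figures here are illustrative, not from the patent): a full crossbar between P processors and B memory banks needs on the order of P x B switch points, so an 8-processor, 8-bank machine needs about 64 crosspoints, while a 64-processor, 64-bank machine needs about 4,096. Growing the processor count eightfold thus grows the direct-wiring hardware sixty-four-fold, which is why direct wiring ceases to be practical beyond roughly eight processors.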
Currently, most extended shared memory systems use a variety of multi-stage interconnection techniques to connect a local memory that is closely associated with each processor with the local memories of all other processors such that all of the local memories together comprise the globally shared address space of the system, e.g., the BBN Butterfly, the NYU Ultracomputer and the Thinking Machines, Inc. CM-1. Some extended shared memory systems combine a local memory for each processor with a global memory that operates as a cache write-through or duplicate address space, e.g., the Carnegie-Mellon Cm*. Still other systems provide a reconfigurable memory model where each processor has a memory that can be accessed either locally or globally, and where the boundary between local and global memory can be dynamically adjusted, e.g., the IBM RP3 and U.S. Pat. No. 5,056,000.
The problem with such extended shared memory systems is that memory access is not equal and symmetric from all processors to all memory locations in the extended shared memory. Instead, each processor may encounter different memory latencies when accessing different memory locations in different parts of the extended shared memory address space.
`5,428,803
`3.
`4.
`memory hierarchy optimization is a prerequisite to suc
`extended shared memory address space. This is espe
`cially true when comparing the access of a processor to
`cessfully programming a computer program to run
`its local memory with the access of that processor to a
`most efficiently on the Cedar supercomputer.
`In the original supercomputer developed by the as
`remote local memory or to a global memory. In essence,
`only those accesses which are to the local memory of 5
`signee of the present invention that is the subject of the
`the processor are equal and symmetric; all other mem
`previously identified patent application entitled CLUS
`ory accesses are variable and, in some cases, even inde
`TER ARCHITECTURE FOR A HIGHLY PARAL
`terminate. As a result, these systems perform parallel
`LEL SCALAR/VECTOR MULTIPROCESSOR, a
`processing best when nearly all of the memory requests
`four cluster parallel processing supercomputer is dis
`for the related data to be operated on by each parallel 10
`closed. Each cluster includes between 4 and 256 high
`performance scalar/vector processors which are con
`task are made to the individual local memory of the
`processor executing that parallel task. When even a
`nected via a unique arbitration node network to a set of
`small percentage of the memory requests for the related
`interleaved sections of memory that form the cluster
`data required by a parallel task must access memory
`shared memory. The processors in each cluster are also
`outside the individual local memory, significant degra- 15
`directly connected to the clustered shared memory of
`dation of parallel processing performance can occur.
`all three of the other clusters by an additional port in the
`In an effort to increase the portion of an extended
`arbitration node network. Using the cluster architecture
`of the original supercomputer developed by the as
`shared memory that is not affected by the problems of
`different latencies in accessing remotely shared mem
`signee of the present invention, a single common ad
`ory, at least three different parallel processing systems 20
`dress space is defined across the cluster memories of all
`have incorporated the concept of grouping the shared
`four clusters, with the only difference between intra
`memory for a multiple number of processors together
`cluster memory requests and inter-cluster memory re
`quest being the different memory latency primarily due
`such that all of the processors in a group have similar
`access to a shared memory for that group. These groups
`to the physical distance which separates the clusters.
`Consequently, it is possible to provide for parallel pro
`of processors are then all connected to each other to 25
`cessing of program throughout the entire address space
`provide the total extended shared memory space for the
`parallel processing system.
`of the four clusters without the need for a memory
`hierarchy optimization.
`In the Kendall Square Research KSR-1 supercom
`puter, the local memories of a group of processors are
`The problem with these cluster-type parallel process
`connected together via a ring bus interconnection to 30
`ing architectures is that all of the clusters must be di
`form one of a plurality of segments. The various seg
`rectly connected together as part of the same single
`ments are then connected to each other by a pyramidal
`shared address space. Unfortunately, the number of
`hierarchy of "higher' segments, with each higher-level
`clusters that can be directly interconnected is limited by
`segment having fewer segments than the preceding
`the cost and complexity required for such interconnec
`lower-level segment. Each of the "higher' segments of 35
`tions. In the case of the Cedar supercomputer, the com
`the domains uses a plurality of transfer units that are
`plexity and costs of interconnecting additional clusters
`also arranged into a ring bus interconnection to transfer
`increases exponentially as more memory is required for
`memory requests in a distributive hierarchical basis.
`the global memory with the addition of each processor,
`Although there is a more uniform memory latency
`and as larger and larger crossbar switches are required
`among requests to the local memories of other proces- 40
`to directly interconnect each cluster to the global. The
`original supercomputer developed by the assignee of
`sors in that segment for the processors in the KSR-1
`supercomputer, there are still significant differences in
`the present invention does not require ever increasing
`memory latencies between a local memory access, a
`amounts of global memory as the number of processors
`memory access within that segment, and a memory
`increases; however, the physical limitations imposed by
`directly interconnecting each cluster to every other
`access that is outside of that segment. As a result, it is 45
`still necessary to provide for some memory hierarchy
`cluster in such a way so as to not significantly increase
`optimization of memory requests for each parallel task
`the memory latency for inter-cluster memory requests
`in order to have a computer program run most effi
`make the problem of directly interconnecting a large
`ciently on the KSR-1 supercomputer.
`number of clusters very difficult. An additional physical
`In the University of Illinois Cedar supercomputer, 50
`limitation that restricts the number of clusters which
`each cluster has eight processors linked by an 8X8
`can be interconnected in either of these parallel process
`crossbar switch to a 4-way interleaved write-back
`ing supercomputers is that all of the clusters in either of
`these cluster-type supercomputers must be physically
`cache, and on through to a local cluster memory having
`32 Mbytes per cluster. Eight clusters are then linked by
`dose enough to one another such that the clock signals
`another 8x8 crossbar switch to a global memory hav- 55
`for all of the clusters can be synchronized or suffer the
`ing 8 Mbytes per processor. Transfers from global
`corresponding performance penalties associated with
`memory to a cluster memory are accomplished by a
`asychronous communication schemes.
`block transfer of 32 byte blocks from the global memory
`Although cluster-type architectures offer a partial
`through the cache and on to the cluster memory. Each
`solution to the problem of varying memory latencies,
`processor may also directly access a portion of the 60
`cluster-type architectures are themselves limited by the
`global memory on a byte-by-byte basis. In the case of
`physical problems associated with directly connecting
`the Cedar supercomputer, inter-cluster communication
`together all of clusters in order to provide a single com
`is part of a memory-hierarchy model and can be accom
`mon address space for the parallel processing system.
`plished either by having processors in both clusters
`Thus, while cluster-type parallel processing architec
`tures, such as the original supercomputer developed by
`write to a common section of the global shared mem- 65
`ory, or by a block transfer of the cluster memory,
`the assignee of the present invention, represent a signifi
`through the global memory, to the other cluster-shared
`cant improvement in the design and architecture of high
`memory. As with the KSR-1 supercomputer, the use
`performance parallel processors, it would be advanta
`
`Amazon / Zentian Limited
`Exhibit 1005
`Page 16
`
`
`
SUMMARY OF THE INVENTION

The present invention provides a unified parallel processing architecture for connecting together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. The present invention organizes multiple processors into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster. Unlike present extended shared memory cluster-type architectures, only some of the clusters are adjacently interconnected in the present invention. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in a floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters.

In the present invention, clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. By adjacently connecting together only those clusters which can satisfy these conditions, the floating shared memory created by the present invention does not have to sacrifice performance for the increased interconnection necessary to increase the size of the problem solution space provided by an extended shared memory model where every processor can access the single shared address space of the floating shared memory. The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory and distributed shared memory, into a unified parallel processing architecture. Thus, the present invention accommodates parallel processing performance first and flexibility of the interconnection of processors and memory second, with the tradeoff between these two considerations balanced at the point at which the performance of inter-cluster memory accesses starts to significantly degrade.
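The latency and locality conditions are stated qualitatively here and graphed in FIGS. 3a and 3b. As a hedged illustration only (the thresholds and every name below are invented, and this is not the patent's method), the test for whether a set of clusters may be adjacently interconnected into a floating shared memory can be pictured as a pair of threshold checks in C:

/* Hypothetical test for adjacent interconnection, loosely following
 * the conditions named in the text. Thresholds are invented solely
 * for illustration. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_RELATIVE_LATENCY  2.0 /* assumed: remote/local latency ratio   */
#define MIN_RELATIVE_LOCALITY 0.9 /* assumed: fraction of local references */

/* Clusters may be adjacently interconnected into a floating shared
 * memory only if inter-cluster accesses remain cheap enough (latency)
 * and rare enough (locality) that shared memory programs still run
 * efficiently; otherwise the clusters stay non-adjacent and programs
 * spanning them must use the distributed (message passing) model. */
static bool may_interconnect(double relative_latency, double relative_locality)
{
    return relative_latency  <= MAX_RELATIVE_LATENCY &&
           relative_locality >= MIN_RELATIVE_LOCALITY;
}

int main(void)
{
    printf("%d\n", may_interconnect(1.5, 0.95)); /* 1: form floating shared memory */
    printf("%d\n", may_interconnect(4.0, 0.50)); /* 0: use distributed model       */
    return 0;
}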
The four or more clusters of multiprocessors may be adjacently interconnected using a variety of hardware interconnection topologies. In the preferred embodiment, each of the clusters is a physically separable unit that includes two or more processors, and one or more input/output ports, all of which are symmetrically connected by a node switching mechanism to a homogeneous cluster shared memory. For each of the clusters, two or more external cluster connections provide a word-by-word communication path for adjacently connecting the processors in any one cluster to the cluster shared memory of two or more adjacently interconnected clusters. A distributed memory communication mechanism is also used for providing a message passing communication among any non-adjacently interconnected clusters so as to define an extendible address space encompassing the cluster shared memory of all clusters.
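As a sketch of how the two inter-cluster paths just described divide the work (illustrative only; the patent specifies these mechanisms at the block diagram level, and every identifier below is hypothetical), a reference to another cluster's memory would use the direct word-by-word cluster connection when the target is adjacent, and the message passing mechanism otherwise:

/* Hypothetical dispatch between the two inter-cluster paths described
 * above. adjacency[][] would be fixed by the chosen interconnection
 * topology (see FIGS. 5a-5d); all names here are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define NCLUSTERS 4

/* Example topology: a ring, so each cluster is adjacent to two others. */
static const bool adjacency[NCLUSTERS][NCLUSTERS] = {
    {0, 1, 0, 1},
    {1, 0, 1, 0},
    {0, 1, 0, 1},
    {1, 0, 1, 0},
};

static void access_remote_word(int from, int to)
{
    if (adjacency[from][to])
        /* Floating shared memory: direct word-by-word load/store over
         * the external cluster connection. */
        printf("cluster %d -> %d: direct cluster connection\n", from, to);
    else
        /* Non-adjacent clusters: distributed memory model, message
         * passing via the distributed memory communication mechanism. */
        printf("cluster %d -> %d: message passing\n", from, to);
}

int main(void)
{
    access_remote_word(0, 1); /* adjacent: direct path       */
    access_remote_word(0, 2); /* non-adjacent: message pass  */
    return 0;
}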
To further enhance the parallel processing capabilities of the preferred embodiment of the present invention, each multiprocessor cluster is provided with an extendible clock mechanism and an extendible control mechanism to coordinate parallel processing among clusters. The extendible clock mechanism solves the problem of differing phases and duty cycles in clock signals among physically separable components of the parallel processing system. The extendible control mechanism provides control mechanisms that can be used by a distributively-based operating system to coordinate and communicate among processes executing in parallel in the computer processing system.

Accordingly, it is a primary objective of the present invention to provide a unified parallel processing architecture for connecting together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system.

Another primary objective of the present invention is to provide a unified parallel processing architecture that can overcome the present physical limitations imposed on interconnecting clustered multiprocessors.

Still another primary objective of the present invention is to provide a unified memory model that uniquely combines the prior art shared memory, extended shared memory and distributed memory models.

A further objective of the present invention is to provide a parallel processing computer system that allows for a shared memory model to be used with programs executed in the floating shared memory space that is defined across any adjacently interconnected clusters, and a distributed memory model to be used with any programs executed across non-adjacently interconnected clusters.

Still another objective of the present invention is to provide a clustered multiprocessor system where the physically separable clusters are all interconnected with an extendible clock mechanism that solves the physical problems required by a synchronous clock system in a multiprocessor system with physically separable components.

These and other objectives of the present invention will become apparent with reference to the drawings, the detailed description of the preferred embodiment and the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1a and 1b are simplified block diagrams showing the two different basic memory models for parallel processing systems in the prior art, an ideal distributed memory model and an ideal shared memory model, respectively.

FIGS. 2a, 2b, 2c, 2d and 2e are simplified block diagrams showing several different extended shared memory models for parallel processing systems in the prior art.

FIGS. 3a and 3b are graphs showing the relative memory latency and relative data locality of an extended shared memory system, respectively, as a function of parallel processing efficiency.

FIG. 4 is a simplified block diagram of the preferred embodiment of the unified parallel processing architecture of the present invention.
`
`
`
FIGS. 5a, 5b, 5c and 5d are block diagrams showing four different cluster connection topologies for adjacently connecting multiple clusters together in accordance with the present invention.

FIG. 6 is a more detailed block diagram of a single dimension ring-type cluster interconnection of the preferred embodiment of the present invention.

FIG. 7 is a schematic diagram of the preferred embodiment of the adjacent cluster connection mechanism of the present invention, including the circuitry for implementing the extendible clock mechanism.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

To better understand how the present invention provides for a novel unified parallel processing architecture, the various architectures of the prior art parallel processing systems will be briefly described. Basically, the memory models for prior art parallel processing systems can be divided into either distributed memory models, shared memory models or extended shared memory models. Each of these models has certain advantages and disadvantages.

FIGS. 1a and 1b show a simplified block diagram of the interconnection of processors and memory in the two basic memory models for parallel processing systems in the prior art. FIG. 1a shows a message passing connection 10 for interconnecting a plurality of processors 12, each having a local memory 14, to form an ideal distributed memory model parallel processing system. An example of a distributed memory model parallel processing system is the Intel Corp. iPSC supercomputer. Although the distributed memory model can link [...]

[...] network 30 to interconnect individual processors 32 that each have a local memory 34 such that all of the local memories 34 are interconnected together to form an extended shared memory. FIG. 2b shows the Kendall Square Research KSR-1 supercomputer as an example of an extended shared memory system that uses a plurality of segments 40 that are all connected together according to a hierarchical ring structure. Each segment 40 groups together the local memories 44 of a set of processors 42. The segments 40 are themselves grouped together by one or more higher domains 46 that operate only as information transfer processors. FIG. 2c shows the IBM RP3 parallel processing system as an example of a reconfigurable memory system that uses a control network 50 to selectively configure the memory components 54 of each of thirty-two processors 52 as either local memory or global memory, or some combination thereof. FIG. 2d shows the University of Illinois Cedar supercomputer as an example of an extended shared memory system that directly connects all eight clusters 60 of processors 62 together via a global memory 64. Each cluster 60 includes eight processors 62 that are directly connected to a cluster shared memory 66, and also to the global memory 64. Finally, FIG. 2e shows the original supercomputer of the assignee of the present invention that directly connects all four clusters 70 together. Each cluster 70 includes sixteen processors 72 that symmetrically access a cluster shared memory 74 via a unique arbitration node network 76.

While different extended shared memory parallel processing architectures work well for different kinds of parallel programming problems, none has provided a generally workable solution for most