US005428803A

United States Patent (19)          (11) Patent Number:  5,428,803
Chen et al.                        (45) Date of Patent:  Jun. 27, 1995

(54) METHOD AND APPARATUS FOR A UNIFIED PARALLEL PROCESSING ARCHITECTURE

(75) Inventors: Steve S. Chen, Eau Claire; Douglas R. Beard, Eleva; George A. Spix, Eau Claire; Edward C. Priest, Eau Claire; John M. Wastlick, Eau Claire; James M. VanDyke, Eau Claire, all of Wis.

(73) Assignee: Cray Research, Inc., Chippewa Falls, Wis.

(21) Appl. No.: 912,964

(22) Filed: Jul. 10, 1992

(51) Int. Cl.6 ..........
(52) U.S. Cl. .................... 395/800; 395/200; 395/425; 364/DIG. 1; 364/229; 364/243
(58) Field of Search .................... 395/200, 800, 425

(56) References Cited

U.S. PATENT DOCUMENTS

4,130,865  12/1978  Heart et al. .............. 395/200
4,365,292  12/1982  Barnes et al. ............. 395/800
4,400,768   8/1983  Tomlinson ................. 395/800
4,445,171   4/1984  Neches .................... 395/325
4,636,942   1/1987  Chen et al. ............... 395/725
4,707,781  11/1987  Sullivan et al. ........... 395/425
4,720,780   1/1988  Dolecek ................... 395/800
4,745,545   5/1988  Schiffleger
4,754,398   6/1988  Pribnow ................... 395/200
4,827,403   5/1989  Steele, Jr. et al.
4,834,483   5/1989  Arthurs et al.
4,873,626  10/1989  Gifford ................... 395/325
4,891,751   1/1990  Call et al. ............... 395/800
4,901,230   2/1990  Chen et al. ............... 395/325
5,055,999  10/1991  Frank et al.
5,056,000  10/1991  Chang ..................... 395/325
5,072,371  12/1991  Benner et al. ............. 395/200
5,081,575   1/1992  Hiller et al.
5,113,523   5/1992  Colley et al. ............. 395/800
5,165,038  11/1992  Beard et al. .............. 395/800
5,179,702   1/1993  Spix et al. ............... 395/650
5,197,130   3/1993  Chen et al. ............... 395/325
5,208,914   5/1993  Wilson et al. ............. 395/275

OTHER PUBLICATIONS

Fast Interrupt Mechanism for a Multiprocessor System, Ser. No. 07/536,199, filed on Jun. 11, 1990.
(List continued on next page.)

Primary Examiner-Alyssa H. Bowler
Assistant Examiner-G. Donaghue
Attorney, Agent, or Firm-Schwegman, Lundberg & Woessner

(57) ABSTRACT

A unified parallel processing architecture connects together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. Multiple processors are grouped together into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster; however, only some of the clusters are adjacently interconnected. Clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in the floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters. The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory and distributed shared memory, into a unified parallel processing architecture.

24 Claims, 12 Drawing Sheets
OTHER PUBLICATIONS (continued)

Almasi, G. and Gottlieb, A., Highly Parallel Computing, Benjamin Cummings (1989), Chpt. 1, "Overview", pp. 2-29; Chpt. 8, "Interconnection Networks", pp. 278-299; Chpt. 10, "MIMD Parallel Architectures", pp. 354-475.
Gajski, D., Milutinovic, V., Siegel, H., and Furht, B., Computer Architecture, The Computer Society of the IEEE (1987), Chpt. 2, "Topics in Parallel Processing and Multiprocessing", pp. 81-171.
Hennessy, J. and Patterson, D., Computer Architecture: A Quantitative Approach, Morgan Kaufmann (1990), Chpt. 10, "Future Directions", pp. 570-592.
Hwang, K. and DeGroot, D., Parallel Processing for Supercomputers and Artificial Intelligence, McGraw-Hill Publ. (1989), Chpt. 2, pp. 31-67.
Kain, R., Computer Architecture, Prentice Hall (1989), vol. 1, Chpt. 3, "Shared Resource Synchronization", pp. 178-250.
ETA10 System Overview: EOS, Tech. Note, Publ. 1006, ETA Systems, Sep. 30, 1988.
Clementi, E., Logan, D., Saarinen, J., "ICAP/3090: Parallel Processing For Large Scale Scientific and Engineering Problems", IBM Systems Journal, vol. 27, No. 4 (1988), pp. 475-509.
Pfister, G., "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture", Int'l Conf. on Parallel Processing, Aug. 1985, pp. 764-771.
Murakami, K., Akira, F., Sueyoshi, T. and Tomita, S., "An Overview of the Kyushu University Reconfigurable Parallel Processor", Aug. 1988, pp. 130-137.
Kuck, D., Davidson, E., Lawrie, D. and Sameh, A., "Parallel Supercomputing Today and the Cedar Approach", Science, vol. 231, Feb. 1986, pp. 967-974.
Goodman, J. and Woest, P., "The Wisconsin Multicube: A New Large Scale Cache-Coherent Multiprocessor", Proc. of the 1988 Int'l Conf. on Parallel Processing, IEEE, Feb. 1988, pp. 422-431.
[Drawing Sheets 1-12, Jun. 27, 1995. Only the captions and a few labels are recoverable from this copy.]

Sheet 1 - Fig. 1a: PRIOR ART DISTRIBUTED MEMORY MODEL (processors P with local memories LM joined by a message passing connection). Fig. 1b: PRIOR ART SHARED MEMORY MODEL (processors connected to a common shared memory 24).
Sheet 2 - Fig. 2a: PRIOR ART EXTENDED SHARED MEMORY SWITCHING NETWORK MODEL. Fig. 2b: PRIOR ART EXTENDED SHARED MEMORY HIERARCHICAL RING MODEL (segment and domain connections).
Sheet 3 - Fig. 2c: PRIOR ART EXTENDED SHARED MEMORY RECONFIGURABLE MODEL (control network connection 50, processors 52 with local and global memories LM/GM). Fig. 2d: PRIOR ART EXTENDED SHARED MEMORY CLUSTER/GLOBAL MEMORY INTERCONNECT (clusters 62 with cluster shared memories connected through a global memory).
Sheet 4 - Fig. 2e: PRIOR ART EXTENDED SHARED MEMORY CLUSTER MODEL, DIRECT INTERCONNECT (arbitration networks connecting processors to cluster shared memories 74).
Sheet 5 - Figs. 3a and 3b: graphs of parallel processing efficiency as a function of relative memory latency and of relative memory locality, respectively.
Sheet 6 - Fig. 4: the preferred embodiment of the unified parallel processing architecture (cluster shared memories; remainder of the sheet is unreadable in this copy).
Sheet 7 - unreadable in this copy (presumably Figs. 5a and 5b, cluster connection topologies).
Sheet 8 - Fig. 5c: cluster connection topology (drawing unreadable in this copy).
Sheet 9 - Fig. 5d: cluster connection topology (drawing unreadable in this copy).
Sheet 10 - unreadable in this copy (presumably Fig. 6, the single dimension ring-type cluster interconnection).
Sheet 11 - unreadable in this copy (presumably part of Fig. 7, the adjacent cluster connection mechanism).
Sheet 12 - Fig. 7 (clock circuitry): transmit latch and transmit clock at a clock boundary, driven by a common clock oscillator (reference numerals 32-42).
METHOD AND APPARATUS FOR A UNIFIED PARALLEL PROCESSING ARCHITECTURE

RELATED APPLICATIONS

This application is related to the following co-pending applications, all of which are assigned to the assignee of the present invention, the disclosure of all of which are hereby incorporated by reference in the present application:

CLUSTER ARCHITECTURE FOR A HIGHLY PARALLEL SCALAR/VECTOR MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,197,130, issued Mar. 23, 1993;

METHOD AND APPARATUS FOR NON-SEQUENTIAL RESOURCE ACCESS, U.S. Pat. No. 5,208,914, issued on May 4, 1993;

GLOBAL REGISTERS FOR A MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,165,038, issued Nov. 17, 1992;

FAST INTERRUPT MECHANISM FOR A MULTIPROCESSOR SYSTEM, Ser. No. 07/536,199, filed on Jun. 11, 1990; and

INTEGRATED SOFTWARE ARCHITECTURE FOR A HIGHLY PARALLEL MULTIPROCESSOR SYSTEM, U.S. Pat. No. 5,179,702, issued Jan. 12, 1993.

TECHNICAL FIELD

The present invention relates generally to parallel processing computer systems for performing multiple-instruction-multiple-data (MIMD) parallel processing. More particularly, the present invention relates to a method and apparatus for a unified parallel processing architecture for high performance MIMD multiprocessors that organizes the multiprocessors into four or more physically separable clusters, only some of which are adjacently interconnected, and provides for a shared memory model to be used with programs executed in the floating shared memory space that is defined across any adjacently interconnected clusters, and a distributed memory model to be used with any programs executed across non-adjacently interconnected clusters.
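The organization just described can be made concrete with a minimal sketch. The six-cluster ring, the adjacency list and the program placements below are hypothetical illustrations, not part of the disclosure, and the sketch assumes one simple reading of "adjacently interconnected", namely that every pair of clusters used by a program must be directly adjacent.

```python
# Illustrative sketch only: choosing a memory model from program placement.
# The adjacency below is a hypothetical six-cluster ring, not the patent's topology.
ADJACENT = {0: {1, 5}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 0}}

def memory_model(clusters_used):
    """Return the programming model available to a program placed on the given clusters."""
    clusters_used = set(clusters_used)
    if len(clusters_used) == 1:
        return "shared memory (single cluster shared memory)"
    pairwise_adjacent = all(
        b in ADJACENT[a] for a in clusters_used for b in clusters_used if a != b)
    if pairwise_adjacent:
        return "shared memory (floating shared memory across adjacent clusters)"
    return "distributed memory (message passing between non-adjacent clusters)"

if __name__ == "__main__":
    print(memory_model({2}))      # single cluster shared memory
    print(memory_model({2, 3}))   # adjacent pair -> floating shared memory
    print(memory_model({0, 3}))   # non-adjacent  -> distributed memory model
```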
PRIOR ART

The field of parallel computing has received increasing attention in recent years as computer designers seek to increase the effective processing capabilities of computer processing systems. Most of the current work relating to high performance parallel processing has focused on multiple-instruction-multiple-data (MIMD) parallel processing systems. Current architectures for MIMD parallel processing systems use one of two different memory models for sharing data within a program that is to be executed in parallel: distributed memory vs. shared memory. In a distributed memory model, data is stored in the private local memory of each processor and is communicated among processors by some type of message passing scheme. In a shared memory model, data is stored in a common shared memory that is equally accessible to all processors in the parallel processing system. An excellent summary of current MIMD parallel architectures is set forth in Almasi and Gottlieb, Highly Parallel Computing (1989), Chpt. 10, pp. 354-475.
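A minimal sketch, in ordinary multiprocessing terms rather than anything taken from the patent, of the difference between the two models: one worker updates a datum held in common shared memory, while the other keeps its data private and communicates its result by message passing.

```python
# Illustrative contrast of the two prior-art memory models (not from the patent).
from multiprocessing import Process, Queue, Value

def shared_memory_worker(total, chunk):
    # Shared memory model: all workers update the same memory location;
    # the lock serializes access to the shared datum.
    with total.get_lock():
        total.value += sum(chunk)

def message_passing_worker(outbox, chunk):
    # Distributed memory model: the worker computes on private data and
    # communicates only by sending a message.
    outbox.put(sum(chunk))

if __name__ == "__main__":
    data = list(range(1000))
    chunks = [data[i::4] for i in range(4)]

    total = Value("q", 0)  # shared 64-bit integer
    workers = [Process(target=shared_memory_worker, args=(total, c)) for c in chunks]
    for w in workers: w.start()
    for w in workers: w.join()
    print("shared memory result:", total.value)

    outbox = Queue()
    workers = [Process(target=message_passing_worker, args=(outbox, c)) for c in chunks]
    for w in workers: w.start()
    print("message passing result:", sum(outbox.get() for _ in workers))
    for w in workers: w.join()
```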
From the programmer's perspective, the most attractive parallel processing system is one that uses a shared memory model having a globally shared physical address space. A shared address space leaves the choice of the parallel programming model up to the programmer and provides a more flexible problem solution space. In addition, most present software programs are written for a shared memory model, as opposed to a distributed or private memory model. To effectively port an existing software program from a shared memory model to a private memory model can require significant reprogramming effort. One of the primary reasons why it is easier to program a shared memory model is that a distributed memory model necessarily requires that the processes and data which comprise a parallel programming problem be partitioned into separate tasks that are each small enough to fit into the private local memory of a processor and at the same time are large enough to recover the overhead required to establish a parallel task in a message passing environment. In a shared memory model, there are fewer restrictions on the size of the parallel tasks and the overhead required to establish a parallel task can be significantly less than the overhead associated with the more complicated message passing scheme required by a distributed memory model. Thus, it would be desirable to have a parallel processing system which could provide a large and flexible problem solution space for parallel processing wherein a large number of processors all have equal and symmetric access to any and all memory locations in a common shared memory.

Unfortunately, the hardware required to implement such a true shared memory model for a parallel processing system having a large number of processors can be expensive and complicated. While it may be practically possible to directly wire each processor to each bank of memory to allow for equal and symmetric access to the shared memory in a minimally parallel processing system having two, four or eight processors, it is not practically possible to directly wire each processor to each bank of memory in parallel processing systems when the number of processors is greater than about eight. As a result, several types of extended shared memory systems have been proposed when the number of processors in the parallel processing system is greater than about eight.

Currently, most extended shared memory systems use a variety of multi-stage interconnection techniques to connect a local memory that is closely associated with each processor with the local memories of all other processors such that all of the local memories together comprise the globally shared address space of the system, e.g., the BBN Butterfly, the NYU Ultracomputer and the Thinking Machines, Inc. CM-1. Some extended shared memory systems combine a local memory for each processor with a global memory that operates as a cache write-through or duplicate address space, e.g., the Carnegie-Mellon Cm*. Still other systems provide a reconfigurable memory model where each processor has a memory that can be accessed either locally or globally, and where the boundary between local and global memory can be dynamically adjusted, e.g., the IBM RP3 and U.S. Pat. No. 5,056,000.

The problem with such extended shared memory systems is that memory access is not equal and symmetric from all processors to all memory locations in the extended shared memory. Instead, each processor may encounter different memory latencies when accessing different memory locations in different parts of the extended shared memory address space.
This is especially true when comparing the access of a processor to its local memory with the access of that processor to a remote local memory or to a global memory. In essence, only those accesses which are to the local memory of the processor are equal and symmetric; all other memory accesses are variable and, in some cases, even indeterminate. As a result, these systems perform parallel processing best when nearly all of the memory requests for the related data to be operated on by each parallel task are made to the individual local memory of the processor executing that parallel task. When even a small percentage of the memory requests for the related data required by a parallel task must access memory outside the individual local memory, significant degradation of parallel processing performance can occur.

In an effort to increase the portion of an extended shared memory that is not affected by the problems of different latencies in accessing remotely shared memory, at least three different parallel processing systems have incorporated the concept of grouping the shared memory for a multiple number of processors together such that all of the processors in a group have similar access to a shared memory for that group. These groups of processors are then all connected to each other to provide the total extended shared memory space for the parallel processing system.

In the Kendall Square Research KSR-1 supercomputer, the local memories of a group of processors are connected together via a ring bus interconnection to form one of a plurality of segments. The various segments are then connected to each other by a pyramidal hierarchy of "higher" segments, with each higher-level segment having fewer segments than the preceding lower-level segment. Each of the "higher" segments of the domains uses a plurality of transfer units that are also arranged into a ring bus interconnection to transfer memory requests in a distributive hierarchical basis. Although there is a more uniform memory latency among requests to the local memories of other processors in that segment for the processors in the KSR-1 supercomputer, there are still significant differences in memory latencies between a local memory access, a memory access within that segment, and a memory access that is outside of that segment. As a result, it is still necessary to provide for some memory hierarchy optimization of memory requests for each parallel task in order to have a computer program run most efficiently on the KSR-1 supercomputer.

In the University of Illinois Cedar supercomputer, each cluster has eight processors linked by an 8x8 crossbar switch to a 4-way interleaved write-back cache, and on through to a local cluster memory having 32 Mbytes per cluster. Eight clusters are then linked by another 8x8 crossbar switch to a global memory having 8 Mbytes per processor. Transfers from global memory to a cluster memory are accomplished by a block transfer of 32 byte blocks from the global memory through the cache and on to the cluster memory. Each processor may also directly access a portion of the global memory on a byte-by-byte basis. In the case of the Cedar supercomputer, inter-cluster communication is part of a memory-hierarchy model and can be accomplished either by having processors in both clusters write to a common section of the global shared memory, or by a block transfer of the cluster memory, through the global memory, to the other cluster shared memory. As with the KSR-1 supercomputer, the use of memory hierarchy optimization is a prerequisite to successfully programming a computer program to run most efficiently on the Cedar supercomputer.
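Taken at face value, the figures quoted above for Cedar imply the following capacities and transfer counts; the short calculation below only restates those numbers and adds nothing about the machine itself.

```python
# Worked numbers implied by the Cedar description above (illustrative only).
clusters = 8
processors_per_cluster = 8
cluster_memory_mb = 32        # local cluster memory per cluster
global_mb_per_processor = 8   # global memory is sized per processor
block_bytes = 32              # global <-> cluster transfers move 32-byte blocks

total_processors = clusters * processors_per_cluster            # 64 processors
global_memory_mb = total_processors * global_mb_per_processor   # 512 Mbytes
blocks_for_one_cluster_memory = cluster_memory_mb * 1024 * 1024 // block_bytes

print(total_processors, "processors,", global_memory_mb, "Mbytes of global memory")
print("copying one 32 Mbyte cluster memory takes",
      blocks_for_one_cluster_memory, "block transfers")
```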
In the original supercomputer developed by the assignee of the present invention that is the subject of the previously identified patent application entitled CLUSTER ARCHITECTURE FOR A HIGHLY PARALLEL SCALAR/VECTOR MULTIPROCESSOR, a four cluster parallel processing supercomputer is disclosed. Each cluster includes between 4 and 256 high performance scalar/vector processors which are connected via a unique arbitration node network to a set of interleaved sections of memory that form the cluster shared memory. The processors in each cluster are also directly connected to the cluster shared memory of all three of the other clusters by an additional port in the arbitration node network. Using the cluster architecture of the original supercomputer developed by the assignee of the present invention, a single common address space is defined across the cluster memories of all four clusters, with the only difference between intra-cluster memory requests and inter-cluster memory requests being the different memory latency primarily due to the physical distance which separates the clusters. Consequently, it is possible to provide for parallel processing of a program throughout the entire address space of the four clusters without the need for memory hierarchy optimization.
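A minimal sketch of the single common address space described above, assuming a hypothetical partitioning of addresses among the four cluster memories and illustrative latency figures; neither the partitioning nor the cycle counts come from the patent.

```python
# Illustrative sketch: one flat address space across four cluster shared memories.
CLUSTERS = 4
CLUSTER_MEMORY_WORDS = 1 << 20   # hypothetical size of each cluster shared memory

def home_cluster(address):
    """Map a global address to the cluster whose shared memory holds it."""
    return address // CLUSTER_MEMORY_WORDS

def access_latency(requesting_cluster, address, intra_cycles=10, inter_cycles=30):
    """Every processor can reach every address; only the latency differs,
    primarily because of the physical distance between clusters."""
    return intra_cycles if home_cluster(address) == requesting_cluster else inter_cycles

if __name__ == "__main__":
    addr = 3 * CLUSTER_MEMORY_WORDS + 42   # an address held by cluster 3
    print(access_latency(3, addr))          # intra-cluster request
    print(access_latency(0, addr))          # inter-cluster request, higher latency
```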
The problem with these cluster-type parallel processing architectures is that all of the clusters must be directly connected together as part of the same single shared address space. Unfortunately, the number of clusters that can be directly interconnected is limited by the cost and complexity required for such interconnections. In the case of the Cedar supercomputer, the complexity and costs of interconnecting additional clusters increase exponentially as more memory is required for the global memory with the addition of each processor, and as larger and larger crossbar switches are required to directly interconnect each cluster to the global memory. The original supercomputer developed by the assignee of the present invention does not require ever increasing amounts of global memory as the number of processors increases; however, the physical limitations imposed by directly interconnecting each cluster to every other cluster in such a way so as to not significantly increase the memory latency for inter-cluster memory requests make the problem of directly interconnecting a large number of clusters very difficult. An additional physical limitation that restricts the number of clusters which can be interconnected in either of these parallel processing supercomputers is that all of the clusters in either of these cluster-type supercomputers must be physically close enough to one another such that the clock signals for all of the clusters can be synchronized, or suffer the corresponding performance penalties associated with asynchronous communication schemes.

Although cluster-type architectures offer a partial solution to the problem of varying memory latencies, cluster-type architectures are themselves limited by the physical problems associated with directly connecting together all of the clusters in order to provide a single common address space for the parallel processing system. Thus, while cluster-type parallel processing architectures, such as the original supercomputer developed by the assignee of the present invention, represent a significant improvement in the design and architecture of high performance parallel processors, it would be advantageous to provide for a unified parallel processing architecture that can allow an effectively unlimited number of clusters of multiprocessors to be interconnected together, and that can overcome the present physical limitations imposed on interconnecting clustered multiprocessors.
SUMMARY OF THE INVENTION

The present invention provides a unified parallel processing architecture for connecting together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system. The present invention organizes multiple processors into four or more physically separable clusters, each cluster having a common cluster shared memory that is symmetrically accessible by all of the processors in that cluster. Unlike present extended shared memory cluster-type architectures, only some of the clusters are adjacently interconnected in the present invention. A shared memory model can be used with programs that can be executed in the cluster shared memory of a single cluster, or in a floating shared memory that is defined across an extended shared memory space comprised of the cluster shared memories of any set of adjacently interconnected clusters. A distributed memory model can be used with any programs that are to be executed in the cluster shared memories of any non-adjacently interconnected clusters.

In the present invention, clusters are adjacently interconnected to form a floating shared memory if certain memory access conditions relating to relative memory latency and relative data locality can create an effective shared memory parallel programming environment. By adjacently connecting together only those clusters which can satisfy these conditions, the floating shared memory created by the present invention does not have to sacrifice performance for the increased interconnection necessary to increase the size of the problem solution space provided by an extended shared memory model where every processor can access the single shared address space of the floating shared memory.

The adjacent interconnection of multiple clusters of processors to create a floating shared memory effectively combines all three types of memory models, pure shared memory, extended shared memory and distributed shared memory, into a unified parallel processing architecture. Thus, the present invention accommodates parallel processing performance first and flexibility of the interconnection of processors and memory second, with the tradeoff between these two considerations balanced at the point at which the performance of inter-cluster memory accesses starts to significantly degrade.
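FIGS. 3a and 3b relate parallel processing efficiency to relative memory latency and relative data locality. The sketch below shows the kind of adjacency test such conditions imply; the threshold values and the way the two measures are combined are assumptions made only for illustration, not values taken from the disclosure.

```python
# Illustrative adjacency test (thresholds assumed, not taken from the patent):
# two clusters are adjacently interconnected only if the resulting floating
# shared memory still supports an effective shared memory programming environment.

def effective_shared_memory(relative_latency, relative_locality,
                            max_latency=2.0, min_locality=0.5):
    """relative_latency: inter-cluster latency divided by intra-cluster latency.
    relative_locality: fraction of a task's references expected to stay in its
    own cluster shared memory."""
    return relative_latency <= max_latency and relative_locality >= min_locality

def should_interconnect_adjacently(cluster_a, cluster_b, profile):
    latency, locality = profile(cluster_a, cluster_b)
    return effective_shared_memory(latency, locality)

if __name__ == "__main__":
    close_pair = lambda a, b: (1.6, 0.8)     # hypothetical measurements
    distant_pair = lambda a, b: (4.0, 0.3)
    print(should_interconnect_adjacently("C0", "C1", close_pair))    # True
    print(should_interconnect_adjacently("C0", "C3", distant_pair))  # False
```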
The four or more clusters of multiprocessors may be adjacently interconnected using a variety of hardware interconnection topologies. In the preferred embodiment, each of the clusters is a physically separable unit that includes two or more processors, and one or more input/output ports, all of which are symmetrically connected by a node switching mechanism to a homogeneous cluster shared memory. For each of the clusters, two or more external cluster connections provide a word-by-word communication path for adjacently connecting the processors in any one cluster to the cluster shared memory of two or more adjacently interconnected clusters. A distributed memory communication mechanism is also used for providing a message passing communication among any non-adjacently interconnected clusters so as to define an extendible address space encompassing the cluster shared memory of all clusters.

To further enhance the parallel processing capabilities of the preferred embodiment of the present invention, each multiprocessor cluster is provided with an extendible clock mechanism and an extendible control mechanism to coordinate parallel processing among clusters. The extendible clock mechanism solves the problem of differing phases and duty cycles in clock signals among physically separable components of the parallel processing system. The extendible control mechanism provides control mechanisms that can be used by a distributively-based operating system to coordinate and communicate among processes executing in parallel in the computer processing system.

Accordingly, it is a primary objective of the present invention to provide a unified parallel processing architecture for connecting together an extendible number of clusters of multiple numbers of processors to create a high performance parallel processing computer system.

Another primary objective of the present invention is to provide a unified parallel processing architecture that can overcome the present physical limitations imposed on interconnecting clustered multiprocessors.

Still another primary objective of the present invention is to provide a unified memory model that uniquely combines the prior art shared memory, extended shared memory and distributed memory models.

A further objective of the present invention is to provide a parallel processing computer system that allows for a shared memory model to be used with programs executed in the floating shared memory space that is defined across any adjacently interconnected clusters, and a distributed memory model to be used with any programs executed across non-adjacently interconnected clusters.

Still another objective of the present invention is to provide a clustered multiprocessor system where the physically separable clusters are all interconnected with an extendible clock mechanism that solves the physical problems required by a synchronous clock system in a multiprocessor system with physically separable components.

These and other objectives of the present invention will become apparent with reference to the drawings, the detailed description of the preferred embodiment and the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1a and 1b are simplified block diagrams showing the two different basic memory models for parallel processing systems in the prior art, an ideal distributed memory model and an ideal shared memory model, respectively.

FIGS. 2a, 2b, 2c, 2d and 2e are simplified block diagrams showing several different extended shared memory models for parallel processing systems in the prior art.

FIGS. 3a and 3b are graphs showing the relative memory latency and relative data locality of an extended shared memory system, respectively, as a function of parallel processing efficiency.

FIG. 4 is a simplified block diagram of the preferred embodiment of the unified parallel processing architecture of the present invention.
FIGS. 5a, 5b, 5c and 5d are block diagrams showing four different cluster connection topologies for adjacently connecting multiple clusters together in accordance with the present invention.

FIG. 6 is a more detailed block diagram of a single dimension ring-type cluster interconnection of the preferred embodiment of the present invention.

FIG. 7 is a schematic diagram of the preferred embodiment of the adjacent cluster connection mechanism of the present invention, including the circuitry for implementing the extendible clock mechanism.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

To better understand how the present invention provides for a novel unified parallel processing architecture, the various architectures of the prior art parallel processing systems will be briefly described. Basically, the memory models for prior art parallel processing systems can be divided into either distributed memory models, shared memory models or extended shared memory models. Each of these models has certain advantages and disadvantages.

FIGS. 1a and 1b show a simplified block diagram of the interconnection of processors and memory in the two basic memory models for parallel processing systems in the prior art. FIG. 1a shows a message passing connection 10 for interconnecting a plurality of processors 12, each having a local memory 14, to form an ideal distributed memory model parallel processing system. An example of a distributed memory model parallel processing system is the Intel Corp. iPSC supercomputer. Although the distributed memory model can link [...]

[...]work 30 to interconnect individual processors 32 that each have a local memory 34 such that all of the local memories 34 are interconnected together to form an extended shared memory. FIG. 2b shows the Kendall Square Research KSR-1 supercomputer as an example of an extended shared memory system that uses a plurality of segments 40 that are all connected together according to a hierarchical ring structure. Each segment 40 groups together the local memories 44 of a set of processors 42. The segments 40 are themselves grouped together by one or more higher domains 46 that operate only as information transfer processors. FIG. 2c shows the IBM RP3 parallel processing system as an example of a reconfigurable memory system that uses a control network 50 to selectively configure the memory components 54 of each of thirty-two processors 52 as either local memory or global memory, or some combination thereof. FIG. 2d shows the University of Illinois Cedar supercomputer as an example of an extended shared memory system that directly connects all eight clusters 60 of processors 62 together via a global memory 64. Each cluster 60 includes eight processors 62 that are directly connected to a cluster shared memory 66, and also to the global memory 64. Finally, FIG. 2e shows the original supercomputer of the assignee of the present invention that directly connects all four clusters 70 together. Each cluster 70 includes sixteen processors 72 that symmetrically access a cluster shared memory 74 via a unique arbitration node network 76.

While different extended shared memory parallel processing architectures work well for different kinds of parallel programming problems, none has provided a generally workable solution for most
