`Luick et al.
`
`[54] MULTIPROCESSOR CACHE COHERENCE
`DIRECTED BY COMBINED LOCAL AND
`GLOBAL TABLES
`
[75] Inventors: David Arnold Luick; John Christopher Willis; Philip Braun Winterfield, all of Rochester, Minn.
`
`[73] Assignee: International Business Machines
`Corporation, Armonk, N.Y.
`
[*] Notice: This patent issued on a continued prosecution application filed under 37 CFR 1.53(d), and is subject to the twenty year patent term provisions of 35 U.S.C. 154(a)(2).
`
`[21] Appl. No.: 08/724,628
`
[22] Filed: Oct. 1, 1996
`
[51] Int. Cl.7 ...................................................... G06F 12/00
[52] U.S. Cl. ........................... 711/141; 711/118; 711/146; 711/144; 711/124
[58] Field of Search ..................................... 711/118, 130, 711/165, 141, 146-148, 124, 152, 144

[56] References Cited
`
`U.S. PATENT DOCUMENTS
`
4,747,043   5/1988  Rodman ............................ 395/925
4,843,542   6/1989  Dashiell et al. .................... 364/200
4,928,225   5/1990  McCarthy et al. .................... 364/200
5,025,365   6/1991  Mathur et al. ...................... 711/121
5,303,362   4/1994  Butts, Jr. et al. .................. 711/121
5,490,261   2/1996  Bean et al. ........................ 711/121
5,537,569   7/1996  Masubuchi .......................... 395/448
5,577,204  11/1996  Brewer et al. ...................... 710/132
5,590,308  12/1996  Shih ............................... 395/463
5,604,882   2/1997  Hoover et al. ...................... 395/448
5,623,628   4/1997  Brayton et al. ..................... 395/468
5,710,907   1/1998  Hagersten et al. ................... 711/148
5,749,095   5/1998  Hagersten .......................... 711/141
5,751,995   5/1998  Sarangdhar ......................... 395/472
5,778,429   7/1998  Sukegawa et al. .................... 711/129
5,822,763  10/1998  Baylor et al. ...................... 711/118
5,864,854   1/1999  Boyle .............................. 707/10
5,892,970   4/1999  Hagersten .......................... 711/141
5,895,487   4/1999  Boyd et al. ........................ 711/122
5,897,664   4/1999  Nesheim et al. ..................... 711/206
`
US006088769A

[11] Patent Number: 6,088,769
[45] Date of Patent: *Jul. 11, 2000
`
`OTHER PUBLICATIONS
`
Leonidas I. Kontothanassis et al., "Lazy Release Consistency for Hardware-Coherent Multiprocessors", University of Rochester, 1994.

Sandra Johnson Baylor et al., "An Evaluation of Cache Coherence Protocols for MIN-Based Multiprocessors", International Symposium on Shared Memory Multiprocessing, pp. 230-241, Apr. 1991.

Daniel Lenoski et al., "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor", Proceedings of the 17th International Symposium on Computer Architecture, Jun. 1990, pp. 148-159.

Daniel Lenoski et al., "The Stanford Dash Multiprocessor", IEEE Computer, vol. 25, no. 3, Mar. 1992, pp. 63-79.
`
`Primary Examiner-John W. Cabeca
`Assistant Examiner-Pierre-Michel Bataille
`Attorney, Agent, or Firm-Terrance A. Meador; David A.
`Hall; Karuna Ojanen
`
[57] ABSTRACT
`
A method and apparatus for maintaining coherence between shared data stored within a plurality of memory devices, each memory device residing in a different node within a tightly coupled multiprocessor system. Each node includes a "local coherence unit" and an associated processor. A cache unit is associated with each memory/processor pair. Each local coherence unit maintains a table which indicates whether the most current copy of data stored within the node resides in the local memory, in the local cache, or in a non-local cache. The present invention includes a "global coherence" unit coupled to each node via the logical interconnect. The global coherence unit includes an interconnect monitoring device and a global coherence table. When data which resides within the memory of a first node is transferred to a second node, the interconnect monitoring device updates the global coherence table to indicate that the data is being shared. The global coherence table also preferably indicates in which node a copy of the most current data resides.
`
`6 Claims, 7 Drawing Sheets
`
`
`
`
[Drawing Sheet 1 of 7, FIG. 1: functional block diagram of the multiprocessor system 100. Each node 101 contains a processor 103, a cache controller 113, a first level cache 115, a second level (L2) cache 117, a memory 105, a local coherence unit (LCU) 109 with LCU control 119 and LCU table 121, and a bus interface 111. The nodes are coupled by the interconnect bus 124 to the global coherence unit (GCU) 123, which contains a bus interface 125, a bus monitor 127, and a GCU table 129.]
`
`
`
[Drawing Sheet 2 of 7, FIGS. 2A-2C: FIG. 2A shows a local coherence table with one bit 201 per cache line (cache line 1 through cache line n); FIG. 2B shows a table with a two-bit entry 203 per cache line; FIG. 2C shows a global coherence table indexed by virtual address (e.g., 000000, 010100, 010101, 011110) with one sharing bit 205 per node (node 1 through node n).]
`
`
`
[Drawing Sheet 3 of 7, FIG. 3 (read method): read the MCD from the local cache (307) or from local memory with an LCU update (311); otherwise send a request for the MCD over the global interconnect (313), receive the request in the GCU (315), determine the responding node (317), send the request from the GCU to that node (319), and respond by sending the MCD to the requesting node (321).]
`
`
`
[Drawing Sheet 4 of 7, FIG. 4 (alternative read method): steps 301-315 as in FIG. 3, then determine the node containing the MCD (401), send a request to transfer the MCD (403), and transfer the data (405).]
`
`
`
[Drawing Sheet 5 of 7, FIG. 5 (another alternative read method): steps 301-317 as in FIG. 3, followed by steps 501 and 503, in which the GCU is directed to transfer the data to the requesting node.]
`
`
`
[Drawing Sheet 6 of 7, FIG. 6 (write method): update the LCU (611); copy data from the cache to local memory (607); copy data to the non-local cache in node 2 (621); copy data to each other such node (617, 623); copy data to memory in node 2 (619).]
`
`
`
[Drawing Sheet 7 of 7, FIG. 7 (alternative write method): write to the local cache (701), update the LCU (703), update the GCU (705), and update the LCUs in shared nodes (707, 709).]
`
`
`
MULTIPROCESSOR CACHE COHERENCE DIRECTED BY COMBINED LOCAL AND GLOBAL TABLES

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to computer cache schemes, and more particularly to a method and apparatus for maintaining coherence between memories within a system having distributed shared memory.

2. Description of the Related Art

Due to the demand for greater processing power, and due to the desire to have relatively small processors work cooperatively as a multi-processing system, there have been many attempts over the last several years to solve the problems inherent in maintaining coherence between memory devices which are accessible to more than one processing device. The coherence problem is exemplified by a system in which a single logical interconnect channel is shared by a plurality of memory devices, each paired with an associated processor to form a "tightly-coupled" multiprocessor system. References from a processor may be to addresses that are within the memory associated with the requesting processor or within a memory associated with another of the processors within the system.

Each processor is also associated with a local cache. The cache is a relatively small high speed memory device which allows data to be accessed rapidly. The cache is typically relatively small because the cost of such high speed memory is relatively high. However, in systems in which the processors are very long instruction word processors (VLIWs), the caches are unusually large, increasing the likelihood that information stored in a cache associated with a first processor may also be stored in the cache associated with at least a second processor. It should be understood that there is a tradeoff between the size and the speed of the cache.

Consider a two level cache in which the first level of the cache is smaller and faster than the second level. The data which is most likely to be required is maintained in the first level of the cache, and data which is less likely to be required is maintained in the second level. If data is required which is not present in either the first or the second level of the cache, then the data must be retrieved from the slower memory device. A number of different algorithms have been devised for determining what data to maintain in each level of the cache (i.e., anticipating what data is most likely to be required soon). However, a detailed understanding of such algorithms is not necessary for the purposes of the present discussion.

Since each processor has an associated cache, and each cache may have more than one level, care must be taken to ensure the coherence of the data that is maintained throughout the system. That is, if more than one cache may contain a copy of the same interval of shared, writable memory, a reasonable approach to parallel programming requires some provision for ensuring that all copies reflect a coherent value. One way to ensure that coherence is maintained is to maintain a directory (or table) which points to each non-local cache in which the data resides. By knowing the location of each copy of the data, each copy can either be updated, or a notation can be made within the table to indicate that the data at one or more locations is out-of-date. Such tables require pointers to multiple nodes which are caching data. However, maintaining one table which points to each copy within each node increases the complexity and the width of directory entries within such tables, making the table relatively large and complex.

Accordingly, it is an object of the present invention to provide a system and method for maintaining coherence of data stored in multiple caches within a multiprocessor system by storing relatively small and simple entries in coherence tables.
`
`SUMMARY OF THE INVENTION
`
The present invention is a method and apparatus for maintaining coherence between shared data stored within a plurality of memory devices, each memory device residing in a different node within a tightly coupled multiprocessor system. Each of the nodes is coupled via a logical interconnect (such as a bus). Each node further includes a "local coherence unit", and a processor associated with the memory device and forming a memory/processor pair therewith. In accordance with one embodiment of the present invention, a cache unit is associated with each memory/processor pair to aid in reading and writing to each memory device. Each processor is capable of sharing data with each other processor, and of reading and writing to addresses within any of the memory devices within the system. Data is considered to be "shared" when that data is stored in more than one node. For example, shared data may be stored in a memory device within a first node and within a cache within a second node. In addition, shared data may be stored in a memory device within a first node and within a memory device within a second node. Reading and writing to memory devices is preferably accomplished through the cache. However, in an alternative embodiment, no such cache is provided.

Each local coherence unit maintains a table which indicates whether the most current copy of data stored within the node resides in the local memory (i.e., the memory within the node in which the local coherence table resides), in the local cache (i.e., a cache within the node in which the local coherence table resides), or in a non-local cache (i.e., a cache within a node other than the node in which the local coherence table resides).

In addition to the plurality of nodes, the multiprocessor system of the present invention includes a "global coherence" unit coupled to each node via the logical interconnect. The global coherence unit includes an interconnect monitoring device and a global coherence table. The interconnect monitoring device monitors communications between the nodes and detects transfers of data from one node to another. When data which resides within the memory of a first node is transferred to a second node, the interconnect monitoring device updates the global coherence table to indicate that the data is being shared. The global coherence table also preferably indicates that a copy of the most current data resides within the second node. Data which resides within the memory of a first node and within the cache of a second node is referred to generally as shared data, and more particularly as "non-locally cached data". In accordance with one embodiment of the present invention, when data is transferred from a first node to a second, the local coherence table within the first node is updated to reflect that the most current copy resides in a non-local cache. If an attempt is made to modify the data, the modification will always be directed to only one copy of the data (the most current copy), and that copy will be well identified.

In an alternative embodiment of the present invention, the local coherence table is only updated if data shared by the node in which the table resides is modified. Accordingly, in
`accordance with such an embodiment, a processor which
`modifies shared data must send a message to each node with
`which the data is shared to indicate that the data stored
`therein is no longer current. The local coherence table within
each such node is then updated to reflect the fact that the
`shared data is no longer current.
`The use of local coherence tables and a global coherence
`table reduces the size and complexity of the coherence table
`that would otherwise be required. In addition, the present
`invention provides a means by which direct cache to cache
`transfers can be accomplished. Still further, by maintaining
`a local coherence table to indicate when shared data is
`current, use of the logical interconnect is reduced. That is,
`there is no need to inquire outside the node when the most
`current copy of requested data presently resides within the
`node.
`The details of the present invention, both as to its structure
`and operation, can best be understood in reference to the
`accompanying drawings, in which like reference numerals
`refer to like parts, and in which:
`
BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a functional block diagram of a tightly coupled multiprocessor system in accordance with the present invention;

FIG. 2a is an illustration of the organization of a local coherence table in accordance with one embodiment of the present invention;

FIG. 2b is an illustration of the organization of a local coherence table in accordance with an alternative embodiment of the present invention;

FIG. 2c is an illustration of the organization of a global coherence table in accordance with another alternative embodiment of the present invention;

FIG. 3 is a flow chart of one embodiment of the inventive method for reading data;

FIG. 4 is a flow chart of an alternative embodiment of the inventive method for reading data;

FIG. 5 is a flow chart of yet another alternative embodiment of the inventive method for reading data;

FIG. 6 is a flow chart of one embodiment of the inventive method for writing data; and

FIG. 7 is a flow chart of an alternative embodiment of the inventive method for writing data.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is a method and apparatus that uses local coherence tables and a global coherence table to maintain coherence between data that is distributed throughout a tightly coupled computer system. The use of local coherence tables and a global coherence table reduces the size and complexity of the coherence table that would otherwise be required. In addition, the present invention provides a means by which direct cache to cache transfers can be accomplished. Still further, by maintaining a local coherence table to indicate when shared data is current, use of the logical interconnect is reduced. That is, there is no need to inquire outside the node when the most current copy of requested data presently resides within the node.

Functional Configuration of the Present Invention

FIG. 1 is a functional block diagram of a tightly coupled multiprocessor system 100 configured in accordance with the present invention.
In accordance with the present invention, a plurality of nodes 101 are provided, each of which preferably includes a processor 103, a memory 105, a cache 107, a local coherence unit (LCU) 109, and a bus interface 111. However, it should be understood that nodes in accordance with an alternative embodiment of the present invention may be provided without a cache. The processor 103 and memory 105 are collectively referred to as a memory/processor pair. The cache 107 is preferably a conventional cache including a cache controller 113, a first level cache memory 115, and a second level cache memory 117. However, it should be understood that the cache may be any system or device which increases the speed at which information can be read from and/or written to memory by storing information in a faster and smaller memory device. For example, the cache 107 may be a single level cache, or may have more than two levels. Furthermore, the cache may be implemented using any type of memory device which has a read and write cycle that is fast with respect to the memory 105. Still further, any size "cache line" may be used. A cache line is defined as the minimum amount of data that is transferred into or out of the cache in a single transfer operation.
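By way of illustration only, the following C sketch models the per-node components just described; the type names, field names, and sizes are hypothetical and are not taken from the patent:

    #include <stdint.h>

    #define CACHE_LINE_BYTES 64    /* assumed cache line size; the patent permits any */
    #define LINES_PER_NODE 1024    /* assumed local memory capacity, in cache lines   */

    /* Local coherence unit 109: control processor plus the LCT 121, with one
     * entry per cache line of local memory (encoding per FIG. 2a or FIG. 2b). */
    typedef struct {
        uint8_t lct[LINES_PER_NODE];
    } lcu_t;

    /* One node 101 of the system 100 of FIG. 1. */
    typedef struct {
        uint8_t memory[LINES_PER_NODE * CACHE_LINE_BYTES]; /* local memory 105      */
        uint8_t l1_cache[8 * 1024];                        /* first level cache 115 */
        uint8_t l2_cache[64 * 1024];                       /* second level cache 117 */
        lcu_t   lcu;                                       /* LCU 109               */
        int     node_id;   /* identity on the logical interconnect 124              */
    } node_t;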
The LCU 109 includes a control processor 119 and a Local Coherence Table (LCT) 121. The LCT 121 preferably has one entry associated with each cache line stored within the memory 105. Alternatively, the size of each portion of memory that is associated with an entry in the LCT 121 may be smaller or larger than a cache line. The address of the portion of memory is used as an index into the LCT 121. This index may be either the virtual address or the effective address of the particular portion of data. Virtual addresses are uniquely assigned to each unique individually addressable portion of data within the system. Effective addresses are the physical addresses within each local memory 105 in each node. Therefore, the same effective address may be used in several different nodes 101 throughout the system to access different portions of data. Nonetheless, each portion of memory within a particular node has a unique effective address. Effective addresses from different nodes can each be mapped to one virtual address so that the data can be shared by different nodes. For example, data stored at effective address "101101" in a first node and effective address "10001" within a second node can each be mapped to virtual address "100000111101". In this way, data can be shared by the two processors within the two nodes, and yet may be stored locally within each node. The coherence tables of the present invention provide the means by which such shared data remains coherent. Accordingly, it will be understood that each virtual address is mapped to a unique effective address within only one node 101 unless two or more nodes share the same data, in which case each effective address at which such shared data is stored will be mapped to the same virtual address.
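The mapping described in this paragraph can be pictured with a small, purely illustrative C example; the structure and its names are hypothetical, and only the address values come from the text above (101101 = 0x2D, 10001 = 0x11, 100000111101 = 0x83D):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical record: one shared virtual address backed by (possibly
     * different) effective addresses in each sharing node. A real system
     * would use its address translation tables; this only shows the idea. */
    typedef struct {
        uint32_t virtual_addr;
        uint32_t effective_addr[4];  /* indexed by node id; 4 nodes assumed */
        uint8_t  valid[4];           /* does this node map the address?     */
    } vmap_entry_t;

    int main(void) {
        vmap_entry_t e = { .virtual_addr = 0x83D,
                           .effective_addr = { 0x2D, 0x11 },
                           .valid = { 1, 1 } };
        for (int n = 0; n < 4; n++)
            if (e.valid[n])
                printf("node %d: effective 0x%X -> virtual 0x%X\n",
                       n, (unsigned)e.effective_addr[n], (unsigned)e.virtual_addr);
        return 0;
    }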
Each entry in the LCT 121 preferably also has one bit which indicates whether the contents of the memory 105 at the address associated with the entry are the most current copy of that data. The cache controller 113 maintains a record as to whether a copy of the contents of the memory 105 at each particular location is present within the cache 107. In accordance with the preferred embodiment of the present invention, the cache controller 113 is always consulted when data is requested by the processor 103 from memory 105. Therefore, if the LCT 121 indicates that the most current copy of the data stored at a particular address is locally available, the cache controller 113 will determine whether the most current copy of that data is present in the cache 107 or the memory 105.
`
In accordance with one embodiment of the present invention, the table maintained by the cache controller 113 and the LCT 121 may be combined. Accordingly, a single table may be maintained which indicates both whether the contents of a particular memory location are cached and whether the local copy stored in either the memory 105 or the cache 107 is the most current copy. Thus, the functions of the cache controller 113 and the control processor 119 may be combined, such that each entry in the LCT 121 indicates whether the most current copy of a particular cache line is available from the cache 107, from the memory 105, or from a source outside the node (i.e., either storage or a non-local cache within another node).
The bus interface 111 provides an interface between the node 101, other nodes 101, and a Global Coherence Unit (GCU) 123 via a logical interconnect 124 (such as a bus). The GCU 123 includes a bus interface 125, an interconnect monitor 127, and a Global Coherence Table (GCT) 129. The interconnect monitor 127 passively monitors the logical interconnect 124 to detect attempts by a first node to read data from, or write data to, a second node. When the interconnect monitor 127 detects that data is being written or read between two nodes, it updates the GCT 129 to indicate that the data is being shared by the two nodes. In accordance with one embodiment of the present invention, the GCT 129 is indexed by the virtual address of the shared data. The GCT 129 may be configured as a database in which the key is the virtual address. When data is shared for the first time, a new entry in the GCT 129 is generated. Each entry indicates which nodes share the data (i.e., which nodes have an effective address that is mapped to the same virtual address). Likewise, when a node reads or writes to a virtual address for the first time, the node that reads, or the node to which data is written, is then added to the list of nodes that share that virtual address.
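As a purely illustrative sketch of this bookkeeping (the names, the fixed table size, and the linear search are all hypothetical; the patent does not specify an implementation):

    #include <stdint.h>

    #define MAX_SHARED 256   /* assumed capacity of the GCT */

    /* Hypothetical GCT 129 entry: keyed by virtual address, with one
     * sharing bit per node (the FIG. 2c organization).               */
    typedef struct {
        uint32_t virtual_addr;
        uint32_t sharers;     /* bit n set => node n shares this line */
        int      in_use;
    } gct_entry_t;

    static gct_entry_t gct[MAX_SHARED];

    /* Called when the interconnect monitor 127 observes the line at
     * 'vaddr' being read or written by node 'dst_node' across the
     * interconnect: record that node as a sharer.                    */
    void gct_note_transfer(uint32_t vaddr, int dst_node)
    {
        gct_entry_t *free_slot = 0;
        for (int i = 0; i < MAX_SHARED; i++) {
            if (gct[i].in_use && gct[i].virtual_addr == vaddr) {
                gct[i].sharers |= 1u << dst_node;  /* already shared: add the node */
                return;
            }
            if (!gct[i].in_use && !free_slot)
                free_slot = &gct[i];
        }
        if (free_slot) {                           /* shared for the first time */
            free_slot->virtual_addr = vaddr;
            free_slot->sharers = 1u << dst_node;
            free_slot->in_use = 1;
        }
    }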
FIG. 2a is an illustration of the organization of an LCT 121 in accordance with one embodiment of the present invention. The LCT 121 in FIG. 2a has a single bit 201 associated with each cache line of data within the memory 105. In accordance with one embodiment, the bit is set to indicate that the current locally stored data at the address associated with that bit is not the most current copy of that data (i.e., that the data was read by another node and may have been modified in that node). Addresses associated with a bit are referred to herein as "cache line x", where x is a number from 1 to n, n being the total number of cache lines in memory 105. Accordingly, a first cache line of data (cache line 1) stored locally within the node 101 is the most current copy of that data, as indicated by the fact that the bit 201 is not asserted. However, the second cache line of data stored locally (cache line 2) may not be the most current copy of that data, since the bit associated with that cache line is asserted. In an alternative embodiment of the present invention, the bit is only asserted when a copy of the data associated with that bit has been modified in another node. In accordance with such an embodiment of the invention, the node which modifies the cache line must report that the modification has occurred to each of the nodes that share that data. While the overhead required to maintain the LCT 121 is greater in such an embodiment, the cache operates more efficiently, since the local cache in a first node can still be used after a second node reads the cache line from the first node, as long as no other node has modified the information. In the embodiment in which the LCT 121 bit 201 is asserted whenever the associated cache line has been read, the overhead is lower, since it is relatively easy for a local control processor 119 to determine when data has been read from the node in which that control processor 119 resides. In contrast, it is more complex for the control processor 119 to determine when another node has modified locally stored data.
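The two update policies just contrasted might be sketched as follows; this is illustrative only, and the function names and table size are hypothetical:

    /* Two hypothetical policies for maintaining the single-bit LCT of FIG. 2a.
     * Policy A (low overhead): assert the bit as soon as another node reads
     * the line; the local copy is thereafter assumed stale. Policy B (higher
     * overhead, better cache reuse): assert the bit only when a remote node
     * reports that it has modified the line.                                 */

    typedef struct { unsigned char stale[1024]; } lct_bits_t;  /* bit 201 per line */

    void on_remote_read(lct_bits_t *lct, int line)          /* policy A */
    {
        lct->stale[line] = 1;   /* easy to detect locally: this node sent the data */
    }

    void on_remote_modify_report(lct_bits_t *lct, int line) /* policy B */
    {
        lct->stale[line] = 1;   /* reached only when the modifying node reports */
    }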
The LCT 121 of FIG. 2a may be either directly or associatively mapped to the memory 105. For example, if the cache is organized in accordance with a "hashing" algorithm, then the LCT 121 may be organized in accordance with the same "hashing" algorithm used to determine the manner in which cache lines are organized within the cache. Hashing algorithms for determining the order in which cache lines are stored in caches are well known. Alternatively, the order of the entries may be either arbitrary, or may be the same as the order of the cache lines within the memory 105.
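For instance, a directly mapped organization might derive the LCT index from the address with the same simple function the cache uses; a hypothetical sketch, with a 64-byte line and a 1024-entry table assumed:

    /* Hypothetical index function shared by the cache and the LCT, so that
     * entry i of the LCT describes the line that maps to cache set i.      */
    static inline unsigned lct_index(unsigned address)
    {
        return (address / 64) % 1024;   /* simple modulo "hash" */
    }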
In an alternative embodiment of the present invention shown in FIG. 2b, entries in the LCT 121 indicate whether the most current copy of a cache line is presently maintained in the memory 105, the cache 107, or non-locally (i.e., in another node). For example, two bits 203 may be used to indicate where the most current copy of the cache line resides. In accordance with one embodiment, "00" indicates that the most current copy of the cache line resides within the local cache 107, "01" indicates that the most current copy of the cache line resides within the local memory 105, and "10" indicates that the most current copy of the cache line resides within another node.
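In C, the stated two-bit encoding could be written as the following illustrative enumeration; the identifier names are hypothetical, but the encodings are exactly those given above:

    /* Two-bit LCT entry 203 of FIG. 2b. */
    enum lct_entry {
        LCT_MCD_IN_LOCAL_CACHE  = 0, /* "00": most current copy in local cache 107  */
        LCT_MCD_IN_LOCAL_MEMORY = 1, /* "01": most current copy in local memory 105 */
        LCT_MCD_IN_OTHER_NODE   = 2  /* "10": most current copy in another node     */
    };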
One embodiment of the organization of the contents of the GCT 129 in accordance with the present invention is shown in FIG. 2c. The virtual address of each cache line which is currently being shared by more than one node 101 is maintained in the GCT 129. Alternatively, the table may be organized such that an entry is provided for each cache line which may possibly be shared, regardless of whether that virtual address is currently being shared. Preferably, associated with each entry are a plurality of bits 205 which indicate which nodes are currently sharing the associated virtual address. For example, in the table shown in FIG. 2c, the virtual address "000000" is shared by nodes 1 and n. The virtual address "010100" is shared by nodes 1 and 2, etc. In an alternative embodiment of the present invention, a bit pattern or code may be uniquely assigned to each node. This bit pattern is then stored in an entry associated with a virtual address to indicate that the node to which the code or pattern is assigned is one of the nodes that share that virtual address.
`Read Operations in Accordance With the Present Invention
FIG. 3 is a flow chart illustrating the method for reading data in a manner which ensures coherence in accordance with the present invention. Initially, the processor 103 within a first of the nodes 101 requests data from the cache controller 113 (STEP 301). As part of the request, the processor 103 provides the cache controller 113 with the virtual address of the data to be read. In an alternative embodiment of the present invention, the processor 103 provides the cache controller 113 with the effective address of the data to be read. In either case, the cache controller 113 determines whether the data to be read is present in a first level cache (L1) (STEP 303). Preferably, concurrent with STEP 303, the cache controller 113 provides the address to the control processor 119 to determine whether the most current copy of the data to be read resides locally (i.e., within the same node 101 as the requesting processor 103) (STEP 305). If the most current copy of the data is present within the local cache 107, then the data read request is serviced from the local cache (STEP 307) and the process ends.

If the most current copy of the data to be read is not in the local cache 107, then the cache controller 113 checks whether the data to be read resides within the memory 105 (STEP 309). If the cache controller 113 indicates that the data to be read resides within the memory 105, and the control processor 119 reads the LCT 121 and determines that the most current copy of the data to be read resides locally, then the data read request is serviced by the cache controller 113 copying the data from the memory 105 into the cache 107. The cache controller 113 then makes the data available to the processor 103 (STEP 311) and the process ends.

If the cache controller 113 determines that the data to be read is not stored within the memory 105 (STEP 309), or the control processor 119 determines that the copy that is stored in memory 105 is not the most current copy of the data (STEP 309), then the cache controller 113 requests the data over the logical interconnect 124 (STEP 313). In forming the request, the address of the data being requested must be a virtual address. The request is received by the GCU 123 (STEP 315). The GCU 123 then determines which node has the most current copy of the data to be read. It may be possible for more than one node to have the most current copy of the data to be read. In either case, the GCU 123 determines which node is to respond to the request (STEP 317). In accordance with the embodiment illustrated in FIG. 3, a request to service the read operation is then sent from the GCU 123 to the node which the GCU 123 has determined should respond (STEP 319). The request is received by the bus interface 111 over the logical interconnect 124 and coupled to the cache controller 113. The cache controller 113 and the LCU 109 determine where the most current copy of the data to be read resides. The cache controller 113 then transmits the data to be read over the logical interconnect 124 (STEP 321). The requesting node 101 receives the data in the bus interface 111. The data is coupled to the cache controller 113, which makes the data available to the requesting processor 103. In accordance with the preferred embodiment of the present invention, the data is transferred from cache to cache by the cache controllers 113 within the requesting node 101 and the node 101 from which the data is to be read. If the data to be read is not within the cache 107, then the cache controller 113 retrieves the data from local memory 105 and transmits the data directly to the cache controller 113 in the requesting node 101.
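Condensed into C, the read path of FIG. 3 proceeds roughly as follows; every function below is a hypothetical stand-in for the hardware behavior described in the text, with concurrency and error handling omitted:

    /* Sketch of the FIG. 3 read path (STEPs 301-321), under the stated
     * assumption that the LCT and cache-controller lookups can be treated
     * as simple predicates.                                               */

    typedef unsigned long vaddr_t;

    extern int  lct_most_current_is_local(vaddr_t va);   /* control processor 119  */
    extern int  cache_has(vaddr_t va);                   /* cache controller 113   */
    extern void read_from_cache(vaddr_t va);             /* STEP 307               */
    extern void read_from_memory_and_update_lcu(vaddr_t va);  /* STEP 311          */
    extern void request_over_interconnect(vaddr_t va);   /* STEPs 313-321: the GCU
                                                            picks the owning node,
                                                            which replies, ideally
                                                            cache to cache         */

    void read_most_current_data(vaddr_t va)
    {
        if (lct_most_current_is_local(va)) {
            if (cache_has(va))
                read_from_cache(va);                 /* serviced locally; done      */
            else
                read_from_memory_and_update_lcu(va); /* copy memory 105 -> cache 107 */
        } else {
            request_over_interconnect(va);           /* most current copy non-local */
        }
    }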
In an alternative embodiment of the present invention shown in FIG. 4, each of steps 301-317 is performed as described above. However, once the GCU 12