(10) Patent No.: US 6,662,277 B2
(45) Date of Patent: Dec. 9, 2003

Gaither

(54) CACHE SYSTEM WITH GROUPS OF LINES
     AND WITH COHERENCY FOR BOTH
     SINGLE LINES AND GROUPS OF LINES

(75) Inventor: Blaine D. Gaither, Fort Collins, CO (US)

(73) Assignee: Hewlett-Packard Development Company, L.P., Houston, TX (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/919,309

(22) Filed: Jul. 31, 2001

(65) Prior Publication Data
     US 2003/0028730 A1, Feb. 6, 2003

(51) Int. Cl.7 ................................ G06F 12/00
(52) U.S. Cl. .............. 711/145; 711/141; 711/144; 711/146; 711/147; 711/148
(58) Field of Search .............. 711/141, 145, 146, 147, 148

(56) References Cited

U.S. PATENT DOCUMENTS

5,666,514 A *   9/1997  Cheriton ............... 711/144
5,787,475 A *   7/1998  Pawlowski ............. 711/137
5,832,232 A *  11/1998  Danneels .............. 709/235
6,035,376 A *   3/2000  James ................. 711/145
6,049,845 A *   4/2000  Bauman et al. ......... 710/113
6,189,078 B1 *  2/2001  Bauman et al. ......... 711/156
6,266,744 B1 *  7/2001  Hughes et al. ......... 711/146

OTHER PUBLICATIONS

“Cache Coherence Protocol For A Multiple Bus Multiprocessor System”; 09/704,176; Blaine D. Gaither et al.; filed Oct. 31, 2000.
“On Effectiveness Of Sectored Caches In Reducing False Sharing Misses”; 1997 IEEE; Kuang-Chih Liu & Chung-Ta King; Department of Computer Science; National Tsing Hua University; pp. 352–359.
“Two Techniques For Improving Performance On Bus-based Multiprocessors”; 1995 IEEE; Craig Anderson and Jean-Loup Baer; Department of Computer Science and Engineering; University of Washington; pp. 264–275.

* Cited by examiner

Primary Examiner—Donald Sparks
Assistant Examiner—Ngoc Dinh
(74) Attorney, Agent, or Firm—Augustus W. Winfield

(57) ABSTRACT

`In a computer system with caching, memory transactions
`can retrieve and store groups of lines. Coherency states are
`maintained for groups of lines, and for individual lines. A
`single coherency transaction, and a single address
`transaction, can then result in the transfer of multiple lines
`of data, reducing overall latency. Even though lines may be
`transferred as a group, the lines can subsequently be treated
`separately. This avoids many of the problems caused by long
`lines, such as increased cache-to-cache copy activity. In one
`alternative, when a cache memory requests a group of lines,
`and when the group of lines is partially owned by another
`cache memory,
`then the requesting cache receives fewer
`than all the lines in the requested group.
`
`9 Claims, 5 Drawing Sheets
`
`300
`GROUP UNOWNED
`?
`
`r304
`
`REQUESTED LINE
`OWNED
`?
`
`YES
`
`YES
`
`r302
`_TTT___1
`GET UP TO ALL LINES IN GROUP,
`INVALIDATE ALL EXISTING COPIES
`(IN OTHER CACHES) OF RETRIEVED
`LINES, MARK REQUESTED LINE AS
`EXCLUSIVE, MARK OTHER
`RETRIEVED LINES AS SHARED
`306
`NO———1
`GET UP TO ALL UNOWNED LINES IN
`COPIES (IN OTHER CACHES) OF
`RETRIEVED LINES, MARK
`REQUESTED LINE AS EXCLUSIVE,
`MARK OTHER RETRIEVED LINES AS
`SHARED
`
`r310
`No_l
`INVALIDATE ALL EXISTING
`COPIES (IN OTHER CACHES)
`OF REQUESTED LINE, MOST
`CURRENT COPY TO
`MEMORY, MOST CURRENT
`COPY TO REQUESTOR, MARK
`REQUESTED LINE AS
`EXCLUSIVE
`
`
` GROUP, INVALIDATE ALL EXISTING
`YES
`
`
`
`308
`
`LINES OWNED
`
`Y SAME OWNE
`
`
`
`
`
`
`v
`312
`INVALIDATE ALL EXISTING
`COPIES (IN OTHER CACHES)
`OF GROUP, MOST CURRENT
`COPY
`TO MEMORY, MOST
`CU
`RRENT COPY TO
`REQUESTOR, MARK GROUP
`AS EXCLUSIVE
`
`
`
`
[Sheet 1 of 5 — FIG. 1: block diagram of an example computer system with nodes 100, 102, and 104]
`
[Sheet 2 of 5 — FIG. 2: flow chart for maintaining coherence for a line request without a request for ownership (steps 200–212)]
`
`
[Sheet 3 of 5 — FIG. 3: flow chart for maintaining coherence for a line request with a request for ownership (steps 300–312)]
`
`
`
[Sheet 4 of 5 — FIG. 4: flow chart of an example alternative method for part of FIG. 3]
`
`
`
[Sheet 5 of 5 — FIG. 5: flow chart of an example alternative method using a GOTL (steps 500–510)]
`
`
`
`CACHE SYSTEM WITH GROUPS OF LINES
`AND WITH COHERENCY FOR BOTH
`SINGLE LINES AND GROUPS OF LINES
`
`FIELD OF INVENTION
`
`This invention relates generally to computer systems and
`more specifically to cache memory systems.
`BACKGROUND OF THE INVENTION
`
`Most computer systems employ a multilevel hierarchy of
`memory systems, with relatively fast, expensive, limited-
`capacity memory at the highest level of the hierarchy and
`proceeding to relatively slower, lower cost, higher-capacity
`memory at the lowest level of the hierarchy. Typically, the
`hierarchy includes a small fast memory called a cache, either
`physically integrated within a processor integrated circuit, or
`mounted physically close to the processor for speed. There
`may be separate instruction caches and data caches. There
`may be multiple levels of caches.
`Caches are commonly organized around an amount of
`memory called a line, or block, or page. The present patent
`document uses the term “line,” but the invention is equally
`applicable to systems employing blocks or pages.
`Many computer systems employ multiple processors,
`each of which may have multiple levels of caches. Some
`caches may be shared by multiple processors. All processors
and caches may share a common main memory. A particular
`line may simultaneously exist in memory and in the cache
`hierarchies for multiple processors. All copies of a line in the
`caches must be identical, a property called coherency. The
`protocols for maintaining coherence for multiple processors
`are called cache coherence protocols.
`A cache “owns” a line if the cache has permission to
`modify the line without issuing any further coherency trans-
`actions. There can only be one “owner” of a line. For any
`cache coherence protocol, the most current copy of a cache
`line must be retrieved from the current owner, if any, and a
`copy of the data must be delivered to the requestor. If the line
`is to be modified, ownership must be acquired by the
`requester, and any shared copies must be invalidated.
`There are three common approaches to determine the
`location of the owner of a line, with many variations and
`hybrids. In one approach, called a snooping protocol, or
`snoop-based protocol, the owner is unknown, and all caches
`must be interrogated (snooped) to determine the location of
`the most current copy of the requested line. All requests for
`access to a cache line, by any device in the system, are
`forwarded to all caches in the system. Eventually, the most
`current copy of a line is located and a copy is provided to the
`requestor.
`In a single-bus system, coherence (snooping)
`traffic, addresses, and often data all share a common bus.
`In a second approach, called a directory-based protocol,
`memory is provided to maintain information about the state
`of every line in the memory system. For example, for every
`line in memory, a directory may include a bit for each cache
`hierarchy to indicate whether that cache hierarchy has a
`copy of the line, and a bit to indicate whether that cache
`hierarchy has ownership. For every request for access to a
`cache line, the directory must be consulted to determine the
`owner, and then the most current copy of the line is retrieved
`and delivered to the requester. Typically, tags and status bits
`for a directory are stored in main memory, so that a request
`for state information cycles main memory and has the
`latency of main memory. In a multiple bus system, directory
`traffic may be on a separate bus.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`A third approach is a global coherency filter, which has a
`tag for every valid line in the cache system. A coherency
`filter is a snoop system with a second set of tags, stored
`centrally, for all caches in the system. A request for a cache
`line is forwarded to the central filter, rather than to all the
`caches. The tags for a coherency filter are typically stored in
`a small high-speed memory. Some coherency filters may
`only track owned lines, and may not be inclusive of all
`shared lines in the system. In a multiple bus system, coher-
`ency filter traffic may be on a separate bus.
`For relatively small systems, with one bus or with only a
`few buses, snoop-based protocols provide the best perfor-
`mance. However, snoop-based systems with one bus
`increase bus traffic, and for large systems with one bus or
`with only a few buses, snoop traffic can limit overall
`performance. Directory-based systems increase the time
`required to retrieve a line (latency) relative to snooping on
`a single bus, but
`in a multiple-bus system a directory
`requires less coherency traffic on the system buses than
`snoop-based systems. For large multiple-bus systems, where
`bus traffic may be more important than latency, directory-
`based systems typically provide the best overall perfor-
`mance. Many computer systems use some sort of hybrid of
`snoop-based and directory-based protocols. For example,
`for a multiple bus system, snoop-based protocols may be
`used for coherency on each local bus, and directory-based
`protocols may be used for coherency across buses.
`If a processor requests a line, the overall time required to
`retrieve the line (overall
`latency) includes (1) the time
`required to acquire access rights using a cache coherency
`protocol, (2) the time required to process an address, and (3)
`the time required to retrieve and transfer the data. As
`discussed above, bus traffic for coherency requests can limit
`overall performance.
`One way to decrease bus traffic for coherency requests is
`to increase the line size. For example, if contiguous lines are
`requested, each line requires a separate coherency request. If
`line size is doubled, twice as much data is read for each
`coherency request. In addition, a substantial part of overall
`latency is the time required to route a memory request to the
`various memory components and to get the data from those
`components. Larger
`lines provide more data for each
`request. However, as lines become even larger, much of the
`data transferred is not needed, and much of the cache space
`is filled with data that is not needed. This increases the bus
`traffic for data transfer, and increases the cache miss rate,
`both of which negatively impact overall performance. In
`addition, some fraction of a line may be needed exclusively
`by more than one processor or node. This can cause exces-
`sive cache-to-cache copy activity as the two processors or
`nodes fight for ownership, and the resulting number of
`coherency requests may increase.
`As an alternative, it is known to permit partial line (or
`partial block) invalidation. It is also known to prefetch extra
`sub-lines. For example, see C. K. Liu and T. C. King, A
`Performance Study on Bounteous Transfer in Multiproces-
`sor Sectored Caches, The Journal of Supercomputing, 11,
`405—420 (1997). Liu and King describe a coherence proto-
`col for invalidating sub-lines, and for prefetching of multiple
`sub-lines.
`
`There is an ongoing need to reduce overall latency while
`maintaining coherency, particularly for large multiple-bus
`systems.
`
`SUMMARY OF THE INVENTION
`
`A computer system retrieves and stores groups of lines.
`Coherency states are maintained for groups of lines and for
`
`7
`
`
`
`US 6,662,277 B2
`
`3
`individual lines. Alternatively, the coherency state of a group
`of lines can be deduced by the coherency state of all of its
`sublines. A single coherency transaction, and a single
`address transaction, can then result in the transfer of multiple
`lines of data, reducing overall latency. Even though lines
`may be retrieved as a group, the lines can subsequently be
`treated separately. This avoids many of the problems caused
`by long lines, such as increased cache-to-cache copy activ-
`ity. There may be multiple owners of lines within a group of
`lines. Special instructions may be implemented that request
`up to a group of lines. That is, depending on ownership, the
`instruction may result in only one line being transferred, or
`up to an entire group of lines being transferred. For multiple-
`bus systems, latency may be further reduced by preferably
`retrieving unowned lines from memory rather than from
`caches.
`
`10
`
`15
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram of an example computer system
`suitable for use with the invention.
`
`20
`
`FIG. 2 is a flow chart of an example method for main-
`taining coherence for a line request without a request for
`ownership.
`FIG. 3 is a flow chart of an example method for main-
`taining coherence for a line request with a request for
`ownership.
`FIG. 4 is a flow chart of an example alternative method for
`part of FIG. 3.
`FIG. 5 is a flow chart of an example alternative to the
`methods of FIGS. 2—4.
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENT OF THE
`INVENTION
`
`FIG. 1 illustrates an example computer system suitable
`for use with the invention. In FIG. 1, multiple nodes (100,
`102, and 104) are interconnected through a switch 106 or
`other interconnect. Each node includes multiple processors
`(P). Each processor includes a first level cache (C1). Each
`node includes a shared second level cache (C2), and shared
`memory (M). Each shared memory includes a directory (D).
`The number of nodes, number of processors per node,
`number of caches,
`location of memory, and the use of
`directories, are all for purposes of illustration only.
`For a cache miss, assume that the time (latency) for a
`processor to retrieve a line from a first level cache from
`another processor within its own node (for example,
`the
`latency for processor 108 to retrieve a line from cache 110)
`is T nanoseconds. The time for the processor to retrieve a
`line from the memory of its own node (for example, the
`latency for processor 108 to retrieve a line from memory
`112) may be about 3T nanoseconds. The time for the
`processor to retrieve a line from the memory of a remote
`node (for example, the latency for processor 108 to retrieve
`a line from memory 114) may be about 6T nanoseconds. The
`time for the processor to retrieve a line from a first level
`cache in a remote node (for example, the latency for the
`processor to retrieve a line from cache 116) may be about 9T
`nanoseconds. Within a typical single-bus system (for
`example, bus 120), a cache-to-cache transfer is typically
`faster than a transfer from memory. In a typical multiple-bus
`system, as in FIG. 1, a worst-case transfer from memory (6T
`nanoseconds) is faster than a worst case cache-to-cache
`transfer (9T nanoseconds). Therefore, for a multiple-bus
system, as compared to a single-bus system, a different
`strategy is needed for retrieving lines in case of a cache miss
`in order to optimize overall memory latency.
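The relative latencies above can be tabulated in a short sketch. The values are the illustrative figures from the text (in units of T nanoseconds), and the dictionary and function names are hypothetical, not part of the patent:

```python
# Example latencies from the FIG. 1 discussion, in units of T nanoseconds.
# The values are illustrative; only their relative order matters.
LATENCY_T = {
    "local cache-to-cache": 1,    # e.g., processor 108 reading from cache 110
    "local memory": 3,            # e.g., processor 108 reading from memory 112
    "remote memory": 6,           # e.g., processor 108 reading from memory 114
    "remote cache-to-cache": 9,   # e.g., processor 108 reading from cache 116
}

def preferred_remote_source() -> str:
    """In a multiple-bus system a worst-case memory read (6T) beats a
    worst-case cache-to-cache transfer (9T), so unowned remote lines
    are preferably fetched from memory."""
    return min(("remote memory", "remote cache-to-cache"), key=LATENCY_T.get)
```

Note that within a single-bus system the ordering reverses: a local cache-to-cache transfer (T) is faster than a read from local memory (3T), which is why a different retrieval strategy is appropriate in each case.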
`In a typical computer system, one memory address cor-
`responds to one line of memory. When an address is pre-
`sented to a memory system,
`the memory system may
`retrieve the requested data, possibly from multiple inter-
`leaved sources, and then place the resulting data into a
`memory register. To increase parallelism, it is known for
`memory systems (for example, memory 112) to actually
`retrieve multiple lines and place multiple lines into a
`memory register. For example,
`in one example commer-
`cially available computer system, a line is 128 bytes, and
`when one line is requested, the memory system retrieves 512
`bytes (4 lines) and places the lines in a register. The
`requesting processor only receives the one requested line.
`The remaining lines are available for reading with reduced
`latency, assuming that one of the other lines may be needed
`soon. Knowledge of this arrangement may be used to
`optimize compilers, which may then take advantage of the
`reduced latency for the other lines.
`It is also known for caches to have an associated prefetch
`buffer, so that in case of a cache miss, the requested line and
`N successive lines are retrieved from memory. Typically, the
`transfer of each of the N lines is the result of a separate bus
`transaction. The requesting cache only receives the one
`requested line, and the remaining lines are available for
`reading with reduced latency assuming that one of the other
`lines may be needed soon.
`In contrast to the above, in a computer system in accor-
`dance with the invention, a group of lines may be retrieved
`from memory, or copied from a cache, and the entire group
`of lines may be cached by the requester, with a single read
`instruction. In addition, cache coherency is maintained for
`groups of lines, in addition to individual lines. In particular,
`individual lines within a group of lines may have different
`owners. Several advantages result. First, consider the rela-
`tive latency times discussed above. If processor 108 requests
`a line that is in memory 114, much of the latency involves
`locating the line. If it is likely that adjacent lines will be
`needed (spatial locality), a system in accordance with the
`invention provides multiple lines for very little incremental
`latency. Individual ownership of lines within groups of lines
`reduces cache-to-cache copy activity. In addition, special
`memory instructions permit transfer of a variable number of
`lines, depending on ownership, further reducing cache-to-
`cache copy activity. In addition, as will be discussed in more
`detail below, a system in accordance with the invention will
`typically transfer a group of lines from memory rather than
`from a cache, thereby reducing the average latency, reducing
`local bus traffic, and reducing link (for example, 118) traffic,
`for a multiple bus system. For example, if a processor in
`node 100 requests a line that is memory 114 in node 104, and
`is also cached in a cache in node 102, the system will find
`the entry in the directory for node 104, and retrieve the lines
`directly from memory 114 rather than cause additional bus
`traffic on node 102. Accordingly, a system using groups of
`lines in accordance with the invention reduces latency and
`takes advantage of spatial
`locality even beyond that of
`sectored caches and prefetch buffers, while avoiding excess
`cache-to-cache transfers.
`
Typical caches have a line size L, and the unit of coher-
ence and unit of transfer is L. Sectored caches
may have sub-lines (or sub-blocks) of size S, where S<L, the
`unit of coherence may be S, and the unit of transfer may be
`S or L. In a cache in accordance with the invention, the line
`size is L, the unit of transfer may be L or integer multiples
of L, and the unit of coherence is L and integer multiples of
`L. Groups of lines may complement use of sub-lines. That
`is, sectored caches with sub-lines, and snooping, may be
`used within each node, and groups of lines, with a directory
`or coherence filter, may be used across multiple nodes.
`If a memory system retrieves N lines and places them into
`a register, it may be convenient to use the same N lines as
`a group of lines. However, it is not necessary for a group of
`lines to be the same size as a memory register. In some of
`the following discussion, a group of lines is assumed to be
`four lines for purposes of illustration only. Given an M-bit
`address, and N lines in a group of lines, a group of lines is
`defined by the most significant M minus log2(N) bits of the
`address. For example, assuming that a group of lines is four
`lines, and assuming for simplicity a 16-bit address, there is
`a four-line group of lines defined by the 14 most significant
`bits of the address in conjunction with all combinations of
`the two least significant bits of the address.
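This addressing arithmetic can be sketched in a few lines. The function names are hypothetical; the bit manipulation follows the M minus log2(N) rule described above:

```python
import math

def group_of(line_address: int, lines_per_group: int = 4) -> int:
    """A group of lines is identified by the most significant
    M - log2(N) bits of an M-bit line address (here N = 4)."""
    return line_address >> int(math.log2(lines_per_group))

def lines_in_group(group_id: int, lines_per_group: int = 4) -> list:
    """The N lines of a group: the group bits combined with every
    combination of the log2(N) least significant address bits."""
    shift = int(math.log2(lines_per_group))
    return [(group_id << shift) | i for i in range(lines_per_group)]
```

For the 16-bit example in the text, `group_of` keeps the 14 most significant bits, and `lines_in_group` enumerates the four addresses that share them.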
`It is known for a cache to be organized into sets, with
`some of the address bits used as an index to determine which
`set is to be used for the address. Within each set, other
`address bits, called a tag, are used to determine which line
`within a set corresponds to the address. With groups of lines
`in accordance with the invention, the index may correspond
`to a group of lines instead of a single line. For example,
given an I-bit index, the most significant I-2 bits of the index
`may be used as an index for a four-line group of lines. One
`tag may apply to the entire group of lines. Alternatively, each
`line preferably has a separate addressable entry, as deter-
`mined by the tag for the line. Each line preferably has its
`own coherency state, and the overall group may optionally
`have a separate coherency state for convenience. The sepa-
`rate coherency state for the overall group is discussed in
`more detail later.
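The set-selection change can be sketched as follows, under the assumption of a conventional low-order index; the function names are hypothetical:

```python
import math

def line_set_index(line_address: int, index_bits: int) -> int:
    """Conventional per-line set selection: the low index_bits of
    the line address form the index."""
    return line_address & ((1 << index_bits) - 1)

def group_set_index(line_address: int, index_bits: int,
                    lines_per_group: int = 4) -> int:
    """With four-line groups, the most significant I-2 bits of the
    I-bit index select the set, so all four lines of a group map to
    the same set."""
    shift = int(math.log2(lines_per_group))
    return line_set_index(line_address, index_bits) >> shift
```

Because lines of one group differ only in their two least significant address bits, `group_set_index` is identical for all four lines, which is what allows one index lookup per group while each line keeps its own tag entry and coherency state.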
`
`From the above discussion, given an address, N lines may
`be retrieved from memory, or from another cache, as a
`group, and N lines may be placed into a cache as a group.
`Once the group of lines is placed into the cache, the lines
`may be treated as a group, or as separate lines, as discussed
`below.
`
`Cache coherence protocols commonly place each cached
`line into one of multiple states. One common approach uses
`three possible states for each line in a cache. Before any lines
`are placed into the cache, all entries are at a default state
`called “Invalid”. When a previously uncached physical line
`is placed into the cache, the state of the entry in the cache
`is changed from Invalid to “Shared”. If a line is modified in
`a cache, it may also be immediately modified in memory
`(called write through). Alternatively, a cache may write a
`modified line to memory only when the modified line in the
`cache is invalidated or replaced (called write back). For a
`write-back cache, when a line in the cache is modified or will
`be modified, the state of the entry in the cache is changed to
`“Modified”. The three-state assignment just described is
`sometimes called a MSI protocol, referring to the first letter
`of each of the three states.
A common variation adds one additional state. In the
variation, when a physical line is copied into the cache, if no
`copy of the line exists in any other cache, the line is placed
in an “Exclusive” state. The word “Exclusive” means that
exactly one cache hierarchy has a copy of the line. If a line
`is in an “Exclusive” state in a cache hierarchy for a first
`processor, and if a second processor requests the same line,
`the line will then be copied into two cache hierarchies, and
`the state of the entry in each cache is set to “Shared”. This
`four-state assignment just described is sometimes called a
`MESI protocol, referring to the first letter of each of the four
`states. There are many other variations.
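The MESI states just described, together with the ownership rule used later in the text (a line is owned if it is Modified or Exclusive), can be sketched as a small enumeration. This is an illustrative model, not the patent's implementation:

```python
from enum import Enum

class MesiState(Enum):
    """The four MESI states described in the text; MSI omits EXCLUSIVE."""
    MODIFIED = "M"   # a write-back cache holds a modified copy
    EXCLUSIVE = "E"  # exactly one cache hierarchy holds an unmodified copy
    SHARED = "S"     # one or more caches hold a copy
    INVALID = "I"    # default state; the entry holds no valid line

def is_owned(state: MesiState) -> bool:
    """A cache owns a line if it may modify the line without further
    coherency transactions, i.e., the line is Modified or Exclusive."""
    return state in (MesiState.MODIFIED, MesiState.EXCLUSIVE)
```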
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`6
`A line in the Exclusive state may be modified at any time
`by its owner without further coherency operations. Once
`modified, the line is marked as Modified in the local cache.
`The Exclusive state allows a line to be owned (the only copy
`in the system), without being modified. This scheme reduces
`the number of extra coherence requests needed to modify a
`line.
`
`In the following discussion, the MESI protocol is used to
`illustrate a first example embodiment of the invention. The
`invention is equally applicable to other cache coherency
`systems and methods. In addition, some further assumptions
`are made to facilitate discussion of a first example
`embodiment, as follows. A line is owned in the MESI
`protocol if it is in the Modified or Exclusive state. For
`purposes of illustration only, if a line has an owner (state
`Modified or Exclusive), and the line is requested by a
`requestor that may modify the line, the most current copy of
`the line will be provided to the requester, and in addition will
`be written to memory. The previous owner will Invalidate its
`copy of the line, and the new owner will set the initial state
`of the line to Exclusive. If a line is not owned (not cached,
`or cached with state Shared), the requestor will set the initial
`state of the line to Shared.
`
`Additional assumptions may be useful for groups of lines.
`If a requested line is in a group of lines that has mixed
`ownership (that is, at least two lines in the group have
`different owners), then the requester preferably should get
`ownership of only the one requested line to avoid excessive
`cache-to-cache transfers. In contrast, if a requested line is in
`a group of lines that is entirely owned by a single owner,
`then it is likely that ownership of an entire data structure is
`changing, and the requester preferably should get ownership
`of the entire group of lines. These rules reduce latency for
`groups of lines by transferring a group of lines and owner-
`ship of the group if the lines all used to be owned by the
`same previous owner, while preventing increased cache-to
`cache-transfers by not combining lines that have a history of
`being owned by different owners at the same time.
`Still another optional feature of a system in accordance
`with the invention is to give the requestor control over
`whether multiple lines should be transferred when only one
`line is requested. Specifically, new memory system com-
`mands may be implemented that optionally limit a request to
`a single line, or optionally permit multiple lines to be
`transferred.
`
`Given the above assumptions and features, FIG. 2 illus-
`trates an example method for maintaining coherency for
`groups of lines for a line request in which ownership of the
`requested line is not requested. FIGS. 3 and 4 illustrate an
`example method for maintaining coherency for a group of
`lines for a line request in which ownership of the requested
`line is requested. In FIG. 1, and in the following discussion
`of FIGS. 2—4, a directory is assumed for purposes of an
`example system, but coherency filters or other coherency
`methods are equally suitable.
`In FIG. 2, a line is requested, but ownership of the line is
`not requested. That is, the requester needs to read the data,
`and does not need to modify the data. Coherency states for
`the corresponding group of lines are checked in the direc-
`tory. At step 200, if all lines in the corresponding group of
`lines are unowned (for MESI, no lines are in the Modified
`or Exclusive state in any cache),
`then at step 202 the
`requester gets a copy of the requested line, and copies of up
`to all of the other lines in the corresponding group of lines.
`The requestor will mark its copy of retrieved lines as Shared
(both locally and within the appropriate directory), and the
`requestor will mark its entry locations for any non-retrieved
`lines as Invalid. At step 204, if some lines in the correspond-
`ing group of lines are owned, and the requested line is not
`owned, then at step 206 the requestor gets a copy of the
`requested line, and copies of up to all of the other unowned
`lines in the corresponding group of lines. The requester will
`mark its copy of retrieved lines as Shared (both locally and
`within the appropriate directory), and the requestor will
`mark its entry locations for any non-retrieved lines as
`Invalid.
`
`Note that at steps 202 and 206, the requested line, and
`perhaps other lines in a group of lines, are not owned.
`Therefore,
`the unowned lines may be retrieved from
`memory, and as discussed above, for a multiple-bus system
`the unowned lines at a remote node are preferably retrieved
`from memory. The system may use snooping within a node,
`with a resulting priority placed on retrieving lines from
`cache if available, and may use a directory or coherency
`filter across nodes, with a priority placed on retrieving lines
`from memory. As a result,
`latency per line is reduced
`because multiple lines are transferred with a single request,
`and for multiple-bus systems, latency is further reduced, and
`local bus traffic is reduced, because the lines are retrieved
`from memory instead of from caches.
`At step 208, the requested line is owned, and if ownership
`within the group of lines is mixed (that is, some lines in the
`corresponding group of lines are owned by different
`owners),
`then at step 210 the requester gets just
`the
`requested line from the owner. As discussed above,
`the
`requestor preferably should get ownership of only the one
`requested line to avoid excessive cache-to-cache transfers.
`The requestor will mark its copy of the retrieved line as
`Exclusive or Shared (both locally and within the appropriate
`directory), and the requester will mark its entry locations for
`the non-retrieved lines as Invalid. At step 212, all lines in the
`corresponding group of lines are owned by the same owner.
`The requester can get a copy of the requested line and copies
`of up to all the other lines in the group from the owner. As
`discussed above, it is likely that ownership of an entire data
`structure is changing, and the requester preferably should get
`a copy of the entire group of lines. The requester will mark
`its copy of retrieved lines as Exclusive or Shared (both
`locally and within the appropriate directory), and the
`requestor will mark its entry locations for any non-retrieved
`lines as Invalid.
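The decision logic of FIG. 2 (steps 200 through 212) can be summarized in a short sketch. The function signature and the `owner_of` mapping are hypothetical conveniences for illustration, not the patent's interfaces:

```python
def read_without_ownership(group, requested_line, owner_of):
    """Sketch of the FIG. 2 decisions for a read that does not request
    ownership. `owner_of` maps a line to its owning cache, or None if
    the line is unowned. Returns (lines to retrieve, state in which the
    requestor marks the retrieved lines); non-retrieved entry locations
    are marked Invalid."""
    owners = {line: owner_of(line) for line in group}
    owned = [l for l in group if owners[l] is not None]

    if not owned:                                   # step 200 -> 202
        return list(group), "Shared"                # copies of up to all lines
    if owners[requested_line] is None:              # step 204 -> 206
        unowned = [l for l in group if owners[l] is None]
        return unowned, "Shared"                    # copies of unowned lines
    if len(set(owners[l] for l in owned)) > 1:      # step 208 -> 210 (mixed)
        return [requested_line], "Exclusive or Shared"
    # step 212: group owned by a single owner -> copies of up to all lines
    return list(group), "Exclusive or Shared"
```

The mixed-ownership branch returns only the requested line, matching the rule that a requestor should not acquire lines with a history of different simultaneous owners.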
`
`In FIG. 3, a line is requested, and ownership of the line is
`requested. At step 300, if all lines in the corresponding group
`of lines are unowned, then at step 302 the requestor can get
`a copy of the requested line, and up to all of the other lines
`in the corresponding group of lines. All existing copies of
`the requested lines, other than the requestor’s copy, will be
`marked as Invalid. The requestor will mark its copy of the
`requested line as Exclusive, and its copies of other retrieved
`lines as Shared (both locally and within the appropriate
`directory), and the requestor will mark its entry locations for
`any non-retrieved lines as Invalid. At step 304, if some lines
`in the corresponding group of lines are owned, and the
`requested line is not owned, then at step 306 the requestor
`gets a copy of the requested line, and copies of up to all of
`the other unowned lines in the corresponding group of lines.
`All existing copies of the requested lines, other than the
`requestor’s copy, will be marked as Invalid. The requestor
`will mark its copy of the requested line as Exclusive (both
`locally and within the appropriate directory), and the
`requestor will mark its copies of any other retrieved lines as
`Shared, and the requester will mark its entry locations for
any non-retrieved lines as Invalid. As discussed above in
conjunction with FIG. 2, for a multiple-bus system,
unowned lines are preferabl