`
R. E. Kessler†, Richard Jooss, Alvin Lebeck and Mark D. Hill*
`
`University of Wisconsin
`Computer Sciences Department
`Madison, Wisconsin 53706
`
ABSTRACT

The traditional approach to implementing wide set-associativity is expensive, requiring a wide tag memory (directory) and many comparators. Here we examine alternative implementations of associativity that use hardware similar to that used to implement a direct-mapped cache. One approach scans tags serially from most-recently used to least-recently used. Another uses a partial compare of a few bits from each tag to reduce the number of tags that must be examined serially. The drawback of both approaches is that they increase cache access time by a factor of two or more over the traditional implementation of set-associativity, making them inappropriate for cache designs in which a fast access time is crucial (e.g., level one caches, caches directly servicing processor requests).

These schemes are useful, however, if (1) the low miss ratio of wide set-associative caches is desired, (2) the low cost of a direct-mapped implementation is preferred, and (3) the slower access time of these approaches can be tolerated. We expect these conditions to be true for caches in multiprocessors designed to reduce memory interconnection traffic, caches implemented with large, narrow memory chips, and level two (or higher) caches in a cache hierarchy.
`
`1. Introduction
`
The selection of associativity has significant impact on cache performance and cost [Smit86] [Smit82] [Hill87] [Przy88a]. The associativity (degree of associativity, set size) of a cache is the number of places (block frames) in the cache where a block may reside. Increasing associativity reduces the probability that a block is not found in the cache (the miss ratio) by decreasing the chance that recently referenced blocks map to the same place [Smit78]. However, increased associativity may nonetheless result in longer effective access times since it can increase the latency to retrieve data on a cache hit [Hill88, Przy88a]. When it is important to minimize hit times, direct-mapped (associativity of one) caches
`
† This work has been supported by graduate fellowships from the National Science Foundation and the University of Wisconsin-Madison.
* This work was sponsored in part by research initiation grants from the graduate school of the University of Wisconsin-Madison.
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
`
`© 1989 ACM 0884-7495/89/0000/0131$01.50
`
may be preferred over caches with higher associativity.

Wide associativity is important when: (1) miss times are very long or (2) memory and memory interconnect contention delay is significant or sensitive to cache miss ratio. These points are likely to be true for shared memory multiprocessors. Multiprocessor caches typically service misses via a multistage interconnect or bus. When a multistage interconnect is used the miss latency can be large whether or not contention exists. Bus miss times with low utilizations may be small, but delays due to contention among processors can become large and are sensitive to cache miss ratio. As the cost of a miss increases, the reduced miss ratio of wider associativity will result in better performance when compared to direct-mapped caches.
Associativity is even more useful for level two caches in a two-level multiprocessor cache hierarchy. While the level one cache must service references from the processor at the speed of the processor, the level two cache can be slower since it services only processor references that miss in the level one cache. The additional hit time delay incurred by associativity in the level two cache is not as important [Przy88b]. Reducing memory and memory interconnect traffic is a larger concern. Wide associativity also simplifies the maintenance of multi-level inclusion [Baer88]. This is the property that all data contained in lower level caches is contained in their corresponding higher level caches. Multi-level inclusion is useful for reducing coherency invalidations to level one caches. Finally, preliminary models indicate that increasing associativity reduces the average number of empty cache block frames when coherency invalidations are frequent¹. This implies that wider associativity will result in better utilization of the cache.
Unfortunately, increasing associativity is likely to increase the board area and cost of the cache relative to a direct-mapped cache. Traditional implementations of a-way set-associative caches read and compare all a tags of a set in parallel to determine where (and whether) a given block resides in the cache. With t-bit tags, this requires a tag memory that can provide a×t bits in parallel. A direct-mapped cache can use fewer, narrower, deeper chips since it requires only a t-bit wide tag memory. Traditional implementations of associativity also use a comparators (each t bits wide) rather than one, wider data memory, more buffers, and more multiplexors as compared to a direct-mapped cache. This adds to the board area needed for wider associativity. As the size of memory chips increases, it becomes more expensive to consume board area with multiplexors and other logic since the same area could hold more cache memory.
While numerous papers have examined associativity [Lipt68] [Kapl73] [Bell74] [Stre76] [Smit78] [Smit82] [Clar83] [Agar88], most have assumed the traditional implementation. One of the few papers describing a cache with a non-traditional implementation of

¹ A miss to a set-associative cache can fill any empty block frame in the set, whereas a miss to a direct-mapped cache can fill only a single frame. Increasing associativity increases the chance that an invalidated block frame will be quickly used again by making more empty frames available for reuse on a miss.
`
`UNIFIED 1004
[Figure 1 diagrams omitted in this scan.]

Figure 1. Implementing Set-Associativity.
Part (a) of this figure (top) shows the traditional implementation of the logic to determine hit/miss in an a-way set-associative cache. This logic uses the "SET" field of the reference to select one t-bit tag from each of a banks. Each stored tag is compared to the incoming tag ("TAG"). A hit is declared if a stored tag matches the incoming tag, a miss otherwise.
Part (b) (bottom) shows a serial implementation of the same cache architecture. Here the a stored tags in a set are read from one bank and compared serially (the tags are addressed with "SET" concatenated with 0 through a−1).
`
associativity is [Chan87]. It discusses a cache implemented for a System/370 CPU that has a one-cycle hit time to the most-recently-used (MRU) block in each set and a longer access time for other blocks in the set, similar to the Cray-1 instruction buffers [Cray76] and the biased set-associative translation buffer described in [Alex86].
This paper is about lower cost implementations of associativity, implementations other than the traditional. We introduce cache designs which combine the lower miss ratio of associativity and the lower cost of direct-mapped caches. In the new implementations the width of the comparison circuitry and tag memory is t, the width of one tag, instead of the a×t required by the traditional implementation. Implementations using tag widths of b×t (1 < b < a) are possible and can result in intermediate costs and performance, but are not considered here. This paper is not about level two caches per se, but we expect these low cost schemes to be applicable to level two caches. We organize this paper as follows. Section 2 introduces the new approaches to implementing associativity, shows how they cost less than the traditional implementation of associativity, and predicts how they will perform. Section 3 analyzes the approaches of Section 2 with trace-driven simulation.
`
2. Alternative Approaches to Implementing Set-Associativity

Let a, a power of two, be a cache's associativity and let t be the number of bits in each address tag. During a cache reference, an implementation must determine whether any of the a stored tags in the set of a reference match the incoming tag. Since at most one stored tag can match, the search can be terminated when a match is found (a cache hit). All a stored tags, however, must be examined on a cache miss.
`
Figure 1a illustrates the traditional implementation of the tag memory and comparators for an a-way set-associative cache, which reads and probes all tags in parallel. We define a probe as a comparison of the incoming tag and the tag memory. If any one of the stored tags matches, a hit is declared. We concentrate only on cache tag memory and comparators, because they are what we propose to implement differently. Additional memory (not shown) is required by any implementation of associativity with a cache replacement policy other than random. A direct-mapped cache does not require this memory. The memory for the cache data (also not shown) is traditionally accessed in parallel with the tag memory.
Figure 1b shows a naive way to do an inexpensive set-associative lookup. It uses hardware similar to a direct-mapped cache, but serially accesses the stored tags of a set until a match is found (a hit) or the tags of the set are exhausted (a miss). Note how it requires only a single comparator and a t-bit wide tag memory, whereas the traditional implementation requires a comparators and an a×t wide tag memory.

Unfortunately, the naive approach is slow in comparison to the traditional implementation. For hits, each stored tag is equally likely to hold the data. Half the non-matching tags are examined before finding the tag that matches, making the average number of probes (a−1)/2 + 1. For a miss, all a stored tags must be examined in series, resulting in a probes. The traditional implementation requires only a single probe in both cases.
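The probe counts above can be checked with a short simulation. The following Python sketch (ours, not from the paper) models the serial search of a single set:

```python
def naive_probes(a, hit_way=None):
    """Serial probes used by the naive scheme in one a-way set.
    hit_way is the index of the matching frame, or None for a miss."""
    if hit_way is None:
        return a               # a miss must examine all a stored tags
    return hit_way + 1         # frames 0..hit_way are examined in turn

# With every frame equally likely to hold the data on a hit,
# the average matches the (a-1)/2 + 1 formula:
a = 4
avg_hit = sum(naive_probes(a, w) for w in range(a)) / a
assert avg_hit == (a - 1) / 2 + 1   # 2.5 probes for a 4-way set on a hit
assert naive_probes(a) == a         # 4 probes on a miss
```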
`
2.1. The MRU Approach

The average number of probes needed for a hit may be reduced from that needed by the naive approach by ordering the stored tags so that the tags most likely to match are examined first. One proposed order [So88] [Matt70] is from most-recently-used (MRU) to least-recently-used (LRU). This order is effective for level one caches because of the temporal locality of processor reference streams [So88] [Chan87]. We find (in Section 3) that it is also effective for level two caches due to the temporal locality in streams of level one cache misses.
`
One way to enforce an MRU comparison order is to swap blocks to keep the most-recently-used block in block frame 0, the second most-recently-used block in block frame 1, etc. Since tags (and data) would have to be swapped between consecutive cache accesses in order to maintain the MRU order, this is not a viable implementation option for most set-associative caches². A better way to manage an MRU comparison order, illustrated in Figure 2a, is to store information for each set indicating its ordering. Fortunately, information similar to an MRU list per set is likely to be maintained anyway in a set-associative cache implementing a true LRU replacement policy. In this case there is no extra memory requirement to store the MRU information. We will also analyze (in Section 3) reducing the length of the MRU list, using approximate rather than full MRU searches, to further decrease memory requirements. Unfortunately, the lookup of MRU information must precede the probes of the tags³. This will lead to longer cache lookup times than would the swapping scheme.
If we assume that the initial MRU list lookup takes about the same time as one probe, the average number of probes required on a cache lookup resulting in a hit using the MRU approach is 1 + Σᵢ₌₁ᵃ i·fᵢ, where fᵢ is the probability the i-th MRU tag matches, given that one of them will match⁴. The MRU scheme performs particularly poorly on cache misses, requiring 1 + a probes. This
² While maintaining MRU order using swapping may be feasible for a 2-way set-associative cache, Agarwal's hash-rehash cache [Agar87] can be superior to MRU in this 2-way case.
³ While it is possible to look up the MRU information in parallel with the level-one-cache access, it is also possible to start level-two-cache accesses early for any of the other implementation approaches [Bren84].
⁴ Each fᵢ is equal to the probability of a reference to LRU distance i divided by the hit ratio, for a given number of sets [Smit78].
`
`
`
`
`
`
[Figure 2 diagrams omitted in this scan: (a) Using MRU Order; (b) Using Partial Compares.]
`
Figure 2. Improved Implementations of Serial Set-Associativity.
Part (a) of this figure (top) shows an implementation of serial set-associativity using ordering information. This approach first reads MRU ordering information (left) and then probes the stored tags from the one most likely to match to the one least likely to match (right). Note "+" represents concatenation.
Part (b) (bottom) shows an implementation of serial set-associativity using partial compares. This approach first reads k (k = ⌊t/a⌋) bits from each stored tag and compares them with the corresponding bits of the incoming tag. The second step of this approach serially compares all stored tags that partially matched ("PM") with the incoming tag until a match is found or the tags are exhausted (right).
`
`is one more than the naive implementation on misses since the
`MRU list is uselessly consulted.
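The hit and miss costs of the MRU approach can be sketched directly from these expressions (a Python sketch of ours; the fᵢ distribution shown is hypothetical, chosen only to illustrate the calculation):

```python
def mru_expected_hit_probes(f):
    """Expected probes on a hit: one for the MRU-list lookup plus
    i probes when the i-th most-recently-used tag matches.
    f[i-1] = probability the i-th MRU tag matches, given a hit."""
    return 1 + sum(i * fi for i, fi in enumerate(f, start=1))

def mru_miss_probes(a):
    # the MRU list is consulted uselessly, then all a tags are probed
    return 1 + a

# Hypothetical (illustrative) match-probability distribution, 4-way set.
# A skewed distribution rewards the MRU ordering:
f = [0.7, 0.15, 0.10, 0.05]
assert abs(mru_expected_hit_probes(f) - 2.5) < 1e-9
assert mru_miss_probes(4) == 5
```

With a uniform distribution the MRU order gives no benefit over the naive scheme; the more skewed the fᵢ, the closer the hit cost approaches the 2-probe floor (MRU lookup plus one tag probe).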
`
2.2. The Partial Compare Approach

We have carefully defined a probe to be the comparison of the incoming tag and the tag memory, without requiring that all bits of the tag memory come from the same stored tag. We now introduce the partial compare approach that uses a two-step process to often avoid reading all t bits of each stored tag. In step one, the partial compare approach reads t/a bits from each of the a stored tags and compares them with the corresponding bits of the incoming tag. Tags that fail this partial comparison cannot hit and need not be examined further on a cache lookup. In step two, all stored tags that passed step one are examined serially with t-bit (full) compares.
`
The implementation of partial compares is not costly, as it can use the same memory and comparators as the naive approach, assuming k, the partial compare width (k = ⌊t/a⌋), is a multiple of the memory chip and comparator width. Partial compares are done with the help of a few tricks. The first trick, illustrated in Figure 2b, is to provide slightly different addresses to each k-bit wide collection of memory chips, addressing the i-th collection with the address of the set concatenated with log₂i. The second trick is to divide the t-bit comparator into separate k-bit comparators⁵.

⁵ If k×a does not equal t, then ⌊t/a⌋×a bits of the tag can be used for partial compares, with another comparator for the extra bits.
`
`
This is straightforward, since wide comparators are often implemented by logically AND-ing the results of narrow comparators. Note how step two of this partial compare approach uses the same tag memory and comparators as step one, but does full tag compares rather than partial compares.
The performance of this approach depends on how well the partial compares eliminate stored tags from further consideration. For independent tags, the average number of probes will be minimized if each of the values [0, 2^k − 1] is equally likely for each of the k-bit patterns on which partial compares are done. While this condition may be true for physical address tags, it is unlikely to be true for the high order tag bits of virtual addresses. Nevertheless, we can use the randomness of the lower bits of the virtual address tag to make the distribution of the higher ones more uniform and independent. For example, one can transform a tag to a unique other tag by exclusive-oring the low-order k bits of the tag with each of the other k-bit pieces of the tag before it is stored in the tag memory. Incoming tags will go through the same transformation so that the incoming tag and the stored tag will match if the untransformed tags are the same. The original tags can be retrieved from the tag memory for writing back blocks on replacement via the same transformation in which they were stored (i.e., the transformation is its own inverse). This method is used throughout this paper to produce stored tags with better probabilistic characteristics. We will also analyze using no transformation, and using a more sophisticated one, in Section 3. We make the assumption in our analysis to follow that each of the values [0, 2^k − 1] is equally likely and independent for each partial compare. Our trace-driven simulation (in Section 3) tests this assumption.
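A minimal sketch of this exclusive-or scrambling, assuming t = 16 and k = 4 (the function name is ours, for illustration):

```python
def transform(tag, t=16, k=4):
    """Exclusive-or the low-order k bits of a tag into each of the
    other k-bit pieces.  The low piece itself is left unchanged, so
    applying the transformation twice recovers the original tag."""
    low = tag & ((1 << k) - 1)         # the low-order k bits
    for shift in range(k, t, k):       # each remaining k-bit piece
        tag ^= low << shift
    return tag

stored = transform(0xBEEF)             # what the tag memory would hold
assert transform(stored) == 0xBEEF     # the transformation is its own inverse
```

Because the low piece is untouched, the second application XORs each upper piece with the same value again, which is why the transformation inverts itself on write-back.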
The probability that an incoming tag partially matches a stored tag is 1/2^k. A false match is a partial tag match which will not lead to a match of the full tag. Given a hit, the expected number of false matches in step one is (a−1)/2^k, of which half will be examined in step two before a hit is determined. Thus, the expected number of probes on a hit is 1 + (a−1)/2^(k+1) + 1, where the terms of the expression are: the probe for the partial comparison (step one), the full tag comparisons (in step two) due to false matches, and the full tag match which produces the hit, respectively. On a miss, the expected number of probes is simply 1 + a/2^k, the probe for the partial comparison and the number of false matches, respectively.
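These expressions are easy to evaluate. The sketch below (ours, not from the paper) computes the expected probe counts for a 4-way cache with 4-bit partial compares:

```python
def partial_hit_probes(a, k):
    """One partial-compare probe, plus expected false matches examined
    before the true match, plus the matching full compare."""
    return 1 + (a - 1) / 2 ** (k + 1) + 1

def partial_miss_probes(a, k):
    """One partial-compare probe plus one full probe per false match."""
    return 1 + a / 2 ** k

# 4-way set-associativity with 4-bit partial compares:
assert partial_hit_probes(4, 4) == 2.09375   # about 2.09 probes per hit
assert partial_miss_probes(4, 4) == 1.25     # 1.25 probes per miss
```

Note that misses are cheaper than hits here: a miss needs no full compare at all unless a partial compare falsely matches.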
The partial compare scheme can lead to poor performance if many false matches are encountered in step two. Wider partial compares could eliminate some of these false matches. The partial compare width can be increased by partitioning the a stored tags of a set into s proper subsets (each containing a/s tags) and examining the subsets in series⁶. The step one and step two partial compare sequence is performed for each of the subsets to determine if there is a cache hit. The order in which the subsets are examined is arbitrary throughout this paper. Increasing the number of subsets will increase the partial compare width since fewer partial compares are done concurrently. For example, 2 subsets could be used in an 8-way set-associative cache, with 4 entries in each. A lookup in this cache would proceed as two 4-way (single subset) lookups, one after the other. With a 16-bit wide tag memory in this cache, partitioning into 2 subsets would result in 4-bit partial compares. This will result in fewer false matches than with the 2-bit partial compares without subsets. The number of probes per access decreases when using proper subsets if the expected number of false matches is reduced (due to wider partial compares) by more than the number of probes added due to the additional subsets. Subsets may be desirable for implementation considerations in addition to performance considerations if the memory chip or comparator width dictate that the partial compares be wider.

At one extreme (where s = a), partial compares with subsets would be implemented as the naive approach, while the other
`
`
⁶ Note that subsets are not useful with the naive and MRU approaches.
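The trade-off between wider partial compares and extra per-subset probes can be sketched as a small search over s (a Python sketch under our assumptions; the formulas follow the hit and miss expressions for s subsets, with partial compare width k = t·s/a):

```python
def probes_with_subsets(a, s, t=16):
    """Expected (hit, miss) probes with s subsets of an a-way set.
    The partial compare width grows to k = t*s/a as s increases."""
    k = t * s // a
    hit = 2 + (s - 1) / 2 + (a - 1) / 2 ** (k + 1)
    miss = s + a / 2 ** k
    return hit, miss

# 8-way, 16-bit tags: going from 1 to 2 subsets widens the partial
# compares from 2 to 4 bits and lowers both expected probe counts.
assert probes_with_subsets(8, 1) == (2.875, 3.0)
assert probes_with_subsets(8, 2) == (2.71875, 2.5)

def best_s(a, miss_ratio, t=16):
    """Pick the subset count minimizing probes, weighting hit and miss
    costs by an expected miss ratio (candidate s values illustrative)."""
    def cost(s):
        hit, miss = probes_with_subsets(a, s, t)
        return (1 - miss_ratio) * hit + miss_ratio * miss
    return min((1, 2, 4), key=cost)
```

For an 8-way cache with 16-bit tags and a 10% miss ratio, this search selects 2 subsets, consistent with the improvement predicted in Table 1.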
`
`
`
`
`
Method            Subsets   Tag Memory Width    Expected Probes (Hit)             Expected Probes (Miss)
Traditional       1         a×t                 1                                 1
Naive             1         t                   (1/2)(a−1) + 1                    a
MRU               1         t                   1 + Σᵢ₌₁ᵃ i·fᵢ                    1 + a
Partial Compare   1         max(t, a×k)         2 + (a−1)/2^(k+1)                 1 + a/2^k
Partial Compare   s         max(t, (a/s)×k)     2 + (1/2)(s−1) + (a−1)/2^(k+1)    s + a/2^k

[Numeric example columns for specific associativities (t = 16) are garbled in this scan and omitted.]

Table 1. Performance of Set-Associativity Implementations.
`
extreme (s = 1) can lead to many false matches. An important question to ask is: what number of subsets leads to the best performance (i.e., fewest number of probes per cache lookup)? The next three answers to this question vary from the most accurate to the most succinct. (1) One can compute the expected number of probes for each of s = 1, 2, 4, ..., a/2 and a using the equations for a hit and miss (from Table 1) weighted to reflect your expected miss ratio, and choose the minimum. (2) One can ignore misses (which are less common and never require more than twice the probes of hits), assume variables are continuous, and find the optimum partial compare width, k_opt = log₂t − 1/2, for hits only. The optimum number of subsets for hits and misses together is likely to be the value for s resulting from a partial compare width of ⌊k_opt⌋ or ⌈k_opt⌉. (3) Finally, one can observe that many tags in current caches are between 16 and 32 bits wide, implying the number of subsets that gives at least four-bit partial compares will work well.

Table 1 summarizes our analysis of the expected number of probes required for the traditional, naive, MRU and partial compare approaches to implementing set-associativity. Note that this table as well as most of the trace-driven simulation assumes 16-bit tags are used. We will examine the positive effect of increasing the tag width on the partial compare approach in Section 3. For various methods and associativities Table 1 gives the number of subsets, the tag memory width, the number of probes for a hit, and the number for a miss. The table assumes t-bit tags (t = 16), k-bit partial compares, and that the i-th most-recently-used tag matches with probability fᵢ on a hit. Note how an increase from 1 to 2 subsets improved the predicted performance of the partial compare approach at an associativity of 8.

Table 2 summarizes paper implementations of tag memory and comparison logic for a direct-mapped cache, a traditional implementation of set-associativity, and an implementation of set-associativity using MRU and partial compares. We found that the MRU and partial compare implementations have a slower access time than the traditional implementation of associativity but include no implementation surprises. Most notably, the control logic was found to be of reasonable complexity. The MRU and partial compare implementations use hardware similar to a direct-mapped cache and can make effective use of page-mode dynamic RAMs, as would other serial implementations of set-associativity. Page-mode dynamic RAMs are those in which the access time of multiple probes to the same set can be significantly less than if the probes were to other sets. Subsequent probes take less than half the time of the first probe to the set. Cache cost is reduced in two ways when using one of the alternative implementations of associativity. First, tag memory cost is directly reduced, by 1/3 to 1/2 in our design. Second, cache data memory cost is reduced since only 1, rather than a, words need to be read at a time.

3. Trace-Driven Performance Comparison

This section analyzes the performance of the low-cost schemes to implement set-associativity in level two caches using simulation with relatively short multiprogramming traces. We analyze associativity in the level two cache since the low cost implementations of associativity are more appropriate for level two (or higher) caches than for level one caches. We concentrate on presenting and characterizing the relative performance of the alternatives. We do not demonstrate the absolute utility of these approaches to important future cache configurations (e.g., multiple megabyte level two caches in multiprocessors) since our traces are for a single processor and are not sufficiently long to exercise very large caches.

The makeup of the traces and the assumed cache configurations are indicated in Table 3. We assume a uniprocessor system with a two level cache hierarchy (a level one cache and a larger level two cache). We simulate only two cache levels in our study because the traces were
`
`
[Table 2 data garbled in this scan. Recoverable fragments: memory packages with 40 ns basic access and cycle times (page-mode times n/a for static RAMs); 256K×(16,8)-bit parts; implementation access/cycle times of the form 65+55x, 150+50x, 250+50(x+u), 75+55(x+u), 65+55y, and 75+55y ns; 22 to 24 memory packages.]
`
Table 2. Trial Set-Associativity Implementations.

This table compares paper implementations of the tag memory and comparison logic for a direct-mapped and four-way set-associative cache holding 1 million 24-bit tags, assuming dynamic or static RAM chips housed in hybrid packages. The top half of the table summarizes the memory packages used to implement tag memory, while the bottom half gives cache implementation numbers. The MRU implementation assumes that the MRU list storage costs nothing extra (as it would if full LRU replacement is used). MRU access and cycle times are given assuming "x" is the expected number of probes after reading the MRU information ("x" is between 1 and a for hits, a for misses) and "u" is the probability that MRU information must be updated. Partial compare access and cycle times are given assuming "y" probes in step two ("y" is between 1 and a for hits and 0 and a for misses). The number of packages assumes some semi-custom logic and hybrid packages.
`
`
`
`
`
Trace-Driven Two-Level Cache Simulation

Traces: ATUM [Agar86] virtual address traces of a multiprogrammed operating system, described in [Hill87]. Operating system references are included as well as references of user-level processes. One very large trace (over 8 million references) was constructed as a concatenation of 23 individual ATUM traces, each of which is approximately 350,000 references. Cache flushes of the level one and level two caches were inserted between each of the 23 traces, thus each trace starts from a "cold" cache.

Level One Cache: A direct-mapped write-back cache. On misses causing replacement of a dirty block, the new block is first obtained via a read-in request, then a write-back is issued to the level two cache. Three level one caches are simulated: a 4 Kbyte cache with a 16 byte block size (4K-16); 16 Kbyte with 16 byte blocks (16K-16); and 16 Kbyte with 32 byte blocks (16K-32). The miss ratios corresponding to these level one caches are: 0.1181, 0.0657, and 0.0513, respectively.

Level Two Cache: An a-way set-associative write-back cache which services read-ins and write-backs from the level one cache. We compare different implementations of associativity in the level two cache. The least-recently-used entry in a set is replaced on a cache miss. We simulate five different level two caches: a 64 Kbyte cache with 16 byte block size (64K-16); 64 Kbyte with 32 byte blocks (64K-32); 256 Kbyte with 16 byte blocks (256K-16); 256 Kbyte with 32 byte blocks (256K-32); and 256 Kbyte with 64 byte blocks (256K-64). While multi-level inclusion is not enforced in this simulation, by monitoring the number of write-backs which missed when written back to the level two cache we were able to extrapolate that the maintenance of multi-level inclusion would have a very small effect (in most configurations studied, no effect) on the miss ratio of the level two cache (and no effect on the miss ratio of the level one cache).

Table 3. Detailed Information on the Trace-Driven Simulation.
`
`
uniprocessor traces. The level one cache is direct-mapped, while the level two cache is of varying associativity. Both caches are write-back caches, with the level two cache servicing read-in and write-back requests from the level one cache. We chose this write-back configuration to minimize the amount of communication between cache levels. This can be important in a shared memory multiprocessor since the level one cache will be utilized servicing processor references while the level two cache is servicing coherency invalidations, as in [Good88]. Also, it was found in [Shor88] that this configuration has better performance than if either cache is write-through. Cache sizes simulated here (up to 256 Kbytes) are limited by the size of the traces. We expect future level two (and higher) caches to be considerably larger (e.g., 4 Mbytes). Though the results presented are for "cold" caches, limited "warmer" results were found to be similar, except that the miss ratios were smaller.
`
The graphs in Figure 3 show the average number of probes versus the associativity of the level two cache for a 16K-16 (16 Kbyte capacity with 16 byte block size) level one cache and 256K-32 (256 Kbyte with 32 byte block size) level two cache. The tag width is 16 bits (t = 16) and the partial compare width 4 bits (k = 4) in all simulations unless otherwise specified. 1, 2, and 4 subsets were used for 4, 8, and 16-way set-associative partial compare implementations, respectively. The graphs indicate the general linearly increasing relationship between the number of probes required per search and the associativity. The number of probes per access is expected to increase for the alternative implementations of associativity as the associativity increases since there are more places where a given cache block can reside. A cache lookup simply must look in more places on the average. For wider associativity to be preferred, the added delay for these additional probes must be more than offset by the time saved servicing fewer misses. One would also expect the Naive and Partial schemes to have a linear relationship between probes per access and associativity. However, the fact that this relation is linear for MRU came as a surprise. We will examine the MRU and partial schemes more closely in subsequent figures. As will always be the case, the traditional implementation of associativity results in the minimum number of probes. These graphs show that the partial compare approach performs the best of the low cost implementations. The naive scheme performs the worst, with the MRU scheme between them.
`
Figure 3 also shows the performance benefit of a write-back optimization which can be made when the multi-level inclusion property is maintained with a cache hierarchy. The level one cache can be certain that all write-back requests will hit in the level two cache. It can also be certain that the block will reside in precisely the same position in which it was loaded in the level two cache from memory (if blocks do not change position in the level two cache from the time they are loaded to the time they are replaced). This implies that if the level one cache retains a log₂a-bit indicator of which position in the set the given cache block occupies (a is the associativity of the level two cache), write-backs can proceed without requiring tag probes. Note that even if multi-level inclusion is not maintained, the indicators in the level one cache can be used as hints, not always correct, of where the entry resides in the level two cache.
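The optimization above can be sketched as follows (the data structures are hypothetical, for illustration only): each level one entry carries a log₂a-bit way indicator, so a write-back can index the level two frame directly with no tag probes.

```python
class L1Entry:
    """Hypothetical level-one cache entry carrying a level-two way
    hint (log2(a) bits, where a is the level-two associativity)."""
    def __init__(self, tag, way_hint):
        self.tag = tag
        self.way_hint = way_hint

def write_back(l2_frames, set_index, entry, block, a=4):
    """Write a dirty block into its recorded level-two frame.
    With multi-level inclusion, the block is guaranteed to still occupy
    the frame it was loaded into, so no tag probes are required."""
    l2_frames[set_index * a + entry.way_hint] = block

l2_frames = [None] * (16 * 4)            # 16 sets, 4-way, flattened
entry = L1Entry(tag=0x1A, way_hint=2)
write_back(l2_frames, set_index=5, entry=entry, block="dirty block")
assert l2_frames[5 * 4 + 2] == "dirty block"
```

Without inclusion, the same lookup would serve only as a first guess, falling back to a normal tag search when the hint is stale.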
All the methods, Traditional, Naive, MRU, and Partial, require no probes to service a write-back request when using the write-back optimization. Since write-backs are approximately 20% of the requests to the level two cache (as shown in Table 4), this can result in significant performance improvements, as indicated in the figure. We feel the cost of implementing this optimization is sufficiently modest (2 bits per level one cache entry for a 4-way set-associative level two cache) to warrant its use when implementing one of the reduced cost implementations of associativity. We assume the write-back optimization is used, and all subsequent figures contain data for read-in requests only, since the different approaches perform the same on write-backs. Write-back requests are still considered references as they update the MRU list, determining the replacement policy of the cache.
Table 4, presented at the end of the paper, lists the number of probes required for various cache configurations when using the naive, MRU, and partial schemes. Note that the data in Table 4 assumes the write-back optimization is being used. Table 4 uses the terms global miss ratio and local miss ratio [Przy88b]. The global miss ratio is the fraction of processor requests which miss in both the level one and level two cache. The local miss ratio of the level two cache is the fraction of read-ins and write-backs from