Inexpensive Implementations of Set-Associativity
R. E. Kessler†, Richard Jooss, Alvin Lebeck and Mark D. Hill*

University of Wisconsin
Computer Sciences Department
Madison, Wisconsin 53706

ABSTRACT
The traditional approach to implementing wide set-associativity is expensive, requiring a wide tag memory (directory) and many comparators. Here we examine alternative implementations of associativity that use hardware similar to that used to implement a direct-mapped cache. One approach scans tags serially from most-recently used to least-recently used. Another uses a partial compare of a few bits from each tag to reduce the number of tags that must be examined serially. The drawback of both approaches is that they increase cache access time by a factor of two or more over the traditional implementation of set-associativity, making them inappropriate for cache designs in which a fast access time is crucial (e.g. level one caches, caches directly servicing processor requests).

These schemes are useful, however, if (1) the low miss ratio of wide set-associative caches is desired, (2) the low cost of a direct-mapped implementation is preferred, and (3) the slower access time of these approaches can be tolerated. We expect these conditions to be true for caches in multiprocessors designed to reduce memory interconnection traffic, caches implemented with large, narrow memory chips, and level two (or higher) caches in a cache hierarchy.
1. Introduction
The selection of associativity has a significant impact on cache performance and cost [Smit86] [Smit82] [Hill87] [Przy88a]. The associativity (degree of associativity, set size) of a cache is the number of places (block frames) in the cache where a block may reside. Increasing associativity reduces the probability that a block is not found in the cache (the miss ratio) by decreasing the chance that recently referenced blocks map to the same place [Smit78]. However, increased associativity may nonetheless result in longer effective access times since it can increase the latency to retrieve data on a cache hit [Hill88, Przy88a]. When it is important to minimize hit times, direct-mapped (associativity of one) caches may be preferred over caches with higher associativity.
† This work has been supported by graduate fellowships from the National Science Foundation and the University of Wisconsin-Madison.
* This work was sponsored in part by research initiation grants from the graduate school of the University of Wisconsin-Madison.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1989 ACM 0884-7495/89/0000/0131$01.50
Wide associativity is important when: (1) miss times are very long, or (2) memory and memory interconnect contention delay is significant or sensitive to the cache miss ratio. These points are likely to be true for shared memory multiprocessors. Multiprocessor caches typically service misses via a multistage interconnect or bus. When a multistage interconnect is used the miss latency can be large whether or not contention exists. Bus miss times with low utilizations may be small, but delays due to contention among processors can become large and are sensitive to the cache miss ratio. As the cost of a miss increases, the reduced miss ratio of wider associativity will result in better performance when compared to direct-mapped caches.

Associativity is even more useful for level two caches in a two-level multiprocessor cache hierarchy. While the level one cache must service references from the processor at the speed of the processor, the level two cache can be slower since it services only processor references that miss in the level one cache. The additional hit time delay incurred by associativity in the level two cache is not as important [Przy88b]. Reducing memory and memory interconnect traffic is a larger concern. Wide associativity also simplifies the maintenance of multi-level inclusion [Baer88]. This is the property that all data contained in lower level caches is contained in their corresponding higher level caches. Multi-level inclusion is useful for reducing coherency invalidations to level one caches. Finally, preliminary models indicate that increasing associativity reduces the average number of empty cache block frames when coherency invalidations are frequent¹. This implies that wider associativity will result in better utilization of the cache.
Unfortunately, increasing associativity is likely to increase the board area and cost of the cache relative to a direct-mapped cache. Traditional implementations of a-way set-associative caches read and compare all a tags of a set in parallel to determine where (and whether) a given block resides in the cache. With t-bit tags, this requires a tag memory that can provide a × t bits in parallel. A direct-mapped cache can use fewer, narrower, deeper chips since it requires only a t-bit wide tag memory. Traditional implementations of associativity also use a comparators (each t bits wide) rather than one, wider data memory, more buffers, and more multiplexors as compared to a direct-mapped cache. This adds to the board area needed for wider associativity. As the size of memory chips increases, it becomes more expensive to consume board area with multiplexors and other logic since the same area could hold more cache memory.

While numerous papers have examined associativity [Lipt68] [Kapl73] [Bell74] [Stre76] [Smit78] [Smit82] [Clar83] [Agar88], most have assumed the traditional implementation.
1 A miss to a set-associative cache can fill any empty block frame in the set, whereas a miss to a direct-mapped cache can fill only a single frame. Increasing associativity increases the chance that an invalidated block frame will be quickly used again by making more empty frames available for reuse on a miss.
[Figure 1 artwork: (a) the traditional parallel lookup; (b) the serial lookup of the naive approach, probing tags at addresses SET+0, SET+1, ..., SET+(a−1).]
Figure 1. Implementing Set-Associativity.
Part (a) of this figure (top) shows the traditional implementation of the logic to determine hit/miss in an a-way set-associative cache. This logic uses the "SET" field of the reference to select one t-bit tag from each of a banks. Each stored tag is compared to the incoming tag ("TAG"). A hit is declared if a stored tag matches the incoming tag, a miss otherwise.
Part (b) (bottom) shows a serial implementation of the same cache architecture. Here the a stored tags in a set are read from one bank and compared serially (the tags are addressed with "SET" concatenated with 0 through a − 1).
One of the few papers describing a cache with a non-traditional implementation of associativity is [Chan87]. It discusses a cache implemented for a System/370 CPU that has a one-cycle hit time to the most-recently-used (MRU) block in each set and a longer access time for other blocks in the set, similar to the Cray-1 instruction buffers [Cray76] and the biased set-associative translation buffer described in [Alex86].

This paper is about lower cost implementations of associativity, implementations other than the traditional. We introduce cache designs which combine the lower miss ratio of associativity and the lower cost of direct-mapped caches. In the new implementations the width of the comparison circuitry and tag memory is t, the width of one tag, instead of the a × t required by the traditional implementation. Implementations using tag widths of b × t (1 < b < a) are possible and can result in intermediate costs and performance, but are not considered here. This paper is not about level two caches per se, but we expect these low cost schemes to be applicable to level two caches. We organize this paper as follows. Section 2 introduces the new approaches to implementing associativity, shows how they cost less than the traditional implementation of associativity, and predicts how they will perform. Section 3 analyzes the approaches of Section 2 with trace-driven simulation.
2. Alternative Approaches to Implementing Set-Associativity

Let a, a power of two, be a cache's associativity and let t be the number of bits in each address tag. During a cache reference, an implementation must determine whether any of the a stored tags in the set of the reference match the incoming tag. Since at most one stored tag can match, the search can be terminated when a match is found (a cache hit). All a stored tags, however, must be examined on a cache miss.

Figure 1a illustrates the traditional implementation of the tag memory and comparators for an a-way set-associative cache, which reads and probes all tags in parallel. We define a probe as a comparison of the incoming tag and the tag memory. If any one of the stored tags matches, a hit is declared. We concentrate only on cache tag memory and comparators, because they are what we propose to implement differently. Additional memory (not shown) is required by any implementation of associativity with a cache replacement policy other than random. A direct-mapped cache does not require this memory. The memory for the cache data (also not shown) is traditionally accessed in parallel with the tag memory.

Figure 1b shows a naive way to do an inexpensive set-associative lookup. It uses hardware similar to a direct-mapped cache, but serially accesses the stored tags of a set until a match is found (a hit) or the tags of the set are exhausted (a miss). Note how it requires only a single comparator and a t-bit wide tag memory, whereas the traditional implementation requires a comparators and an (a × t)-bit wide tag memory.

Unfortunately, the naive approach is slow in comparison to the traditional implementation. For hits, each stored tag is equally likely to hold the data. Half the non-matching tags are examined before finding the tag that matches, making the average number of probes (a−1)/2 + 1. For a miss, all a stored tags must be examined in series, resulting in a probes. The traditional implementation requires only a single probe in both cases.
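To make the probe sequence concrete, the sketch below (ours, not from the paper) walks the naive serial lookup over one set. The tag array layout, the parameter values, and the function name are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative parameters (not from the paper): a 4-way cache with
 * 1024 sets and 16-bit tags, tag memory stored set-major so that the
 * tags of set s occupy entries s*ASSOC .. s*ASSOC + ASSOC-1.         */
#define ASSOC 4
#define NSETS 1024

static uint16_t tag_mem[NSETS * ASSOC];   /* t-bit-wide tag memory   */
static bool     valid[NSETS * ASSOC];

/* Naive serial lookup: probe the a stored tags of the set one at a
 * time until a match is found (hit) or all a tags are exhausted
 * (miss).  Each loop iteration is one "probe" in the paper's sense.
 * Returns the matching frame index, or -1 on a miss.                 */
int naive_lookup(uint32_t set, uint16_t tag, int *probes)
{
    *probes = 0;
    for (int i = 0; i < ASSOC; i++) {
        (*probes)++;                         /* one t-bit compare     */
        uint32_t idx = set * ASSOC + i;      /* "SET" concatenated with i */
        if (valid[idx] && tag_mem[idx] == tag)
            return i;                        /* hit: stop searching   */
    }
    return -1;                               /* miss: a probes used   */
}
```

Averaged over where the matching tag sits, this loop performs (a−1)/2 + 1 probes for a hit and a probes for a miss, matching the expressions above.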
2.1. The MRU Approach

The average number of probes needed for a hit may be reduced from that needed by the naive approach by ordering the stored tags so that the tags most likely to match are examined first. One proposed order [So88] [Matt70] is from most-recently-used (MRU) to least-recently-used (LRU). This order is effective for level one caches because of the temporal locality of processor reference streams [So88] [Chan87]. We find (in Section 3) that it is also effective for level two caches due to the temporal locality in streams of level one cache misses.

One way to enforce an MRU comparison order is to swap blocks to keep the most-recently-used block in block frame 0, the second most-recently-used block in block frame 1, etc. Since tags (and data) would have to be swapped between consecutive cache accesses in order to maintain the MRU order, this is not a viable implementation option for most set-associative caches². A better way to manage an MRU comparison order, illustrated in Figure 2a, is to store information for each set indicating its ordering. Fortunately, information similar to an MRU list per set is likely to be maintained anyway in a set-associative cache implementing a true LRU replacement policy. In this case there is no extra memory requirement to store the MRU information. We will also analyze (in Section 3) reducing the length of the MRU list, using approximate rather than full MRU searches, to further decrease memory requirements. Unfortunately, the lookup of MRU information must precede the probes of the tags³. This will lead to longer cache lookup times than would the swapping scheme.

If we assume that the initial MRU list lookup takes about the same time as one probe, the average number of probes required on a cache lookup resulting in a hit using the MRU approach is 1 + Σ_{i=1..a} i·f_i, where f_i is the probability that the i-th MRU tag matches, given that one of them will match⁴. The MRU scheme performs particularly poorly on cache misses, requiring 1 + a probes. This is one more than the naive implementation on misses since the MRU list is uselessly consulted.
2 While maintaining MRU order using swapping may be feasible for a 2-way set-associative cache, Agarwal's hash-rehash cache [Agar87] can be superior to MRU in this 2-way case.
3 While it is possible to look up the MRU information in parallel with the level-one-cache access, it is also possible to start level-two-cache accesses early for any of the other implementation approaches [Bren84].
4 Each f_i is equal to the probability of a reference to LRU distance i divided by the hit ratio, for a given number of sets [Smit78].
[Figure 2 artwork: (a) Using MRU Order: the MRU information for the set is read first, then tags are probed at SET+MRU1, SET+MRU2, ..., SET+MRUa. (b) Using Partial Compares: step one probes SET+0 through SET+(a−1) with k-bit partial compares, step two probes SET+PM1, SET+PM2, ... for the tags that partially matched.]
Figure 2. Improved Implementations of Serial Set-Associativity.
Part (a) of this figure (top) shows an implementation of serial set-associativity using ordering information. This approach first reads the MRU ordering information (left) and then probes the stored tags from the one most likely to match to the one least likely to match (right). Note "+" represents concatenation.
Part (b) (bottom) shows an implementation of serial set-associativity using partial compares. This approach first reads k (k = ⌊t/a⌋) bits from each stored tag and compares them with the corresponding bits of the incoming tag. The second step of this approach serially compares all stored tags that partially matched ("PM") with the incoming tag until a match is found or the tags are exhausted (right).
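A minimal sketch of the MRU probe order just described, under the same illustrative layout as the earlier sketch; the per-set MRU list encoding and the update on a hit are our assumptions, not the paper's hardware design.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define ASSOC 4
#define NSETS 1024

static uint16_t tag_mem[NSETS * ASSOC];
static bool     valid[NSETS * ASSOC];
/* Per-set MRU list: mru[set][0] is the most-recently-used frame,
 * mru[set][ASSOC-1] the least-recently-used.  A cache with true LRU
 * replacement keeps equivalent information anyway.                   */
static uint8_t  mru[NSETS][ASSOC];

/* Probe the stored tags from most- to least-recently used.  The MRU
 * list itself must be read before the first tag probe, which is why
 * the text charges roughly one extra probe for it.                   */
int mru_lookup(uint32_t set, uint16_t tag, int *probes)
{
    *probes = 1;                              /* reading the MRU list */
    for (int i = 0; i < ASSOC; i++) {
        uint8_t frame = mru[set][i];
        (*probes)++;
        uint32_t idx = set * ASSOC + frame;
        if (valid[idx] && tag_mem[idx] == tag) {
            /* promote the hit frame to the front of the MRU list */
            memmove(&mru[set][1], &mru[set][0], i * sizeof(uint8_t));
            mru[set][0] = frame;
            return frame;                     /* hit                  */
        }
    }
    return -1;                                /* miss: 1 + a probes   */
}
```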
2.2. The Partial Compare Approach

We have carefully defined a probe to be the comparison of the incoming tag and the tag memory, without requiring that all bits of the tag memory come from the same stored tag. We now introduce the partial compare approach, which uses a two step process to often avoid reading all t bits of each stored tag. In step one, the partial compare approach reads t/a bits from each of the a stored tags and compares them with the corresponding bits of the incoming tag. Tags that fail this partial comparison cannot hit and need not be examined further on a cache lookup. In step two, all stored tags that passed step one are examined serially with t-bit (full) compares.
The implementation of partial compares is not costly, as it can use the same memory and comparators as the naive approach, assuming k, the partial compare width (k = ⌊t/a⌋), is a multiple of the memory chip and comparator width. Partial compares are done with the help of a few tricks. The first trick, illustrated in Figure 2b, is to provide slightly different addresses to each k-bit wide collection of memory chips, addressing the i-th collection with the address of the set concatenated with i. The second trick is to divide the t-bit comparator into separate k-bit comparators⁵. This is straightforward, since wide comparators are often implemented by logically ANDing the results of narrow comparators. Note how step two of this partial compare approach uses the same tag memory and comparators as step one, but does full tag compares rather than partial compares.

5 If k × a does not equal t then ⌊t/a⌋ × a bits of the tag can be used for partial compares, with another comparator for the extra bits.
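The two-step procedure can be sketched in software as below. In hardware, step one is a single probe because the i-th k-bit chip collection is addressed with the set concatenated with i; here the per-tag partial compares are written as a loop. The bit-slicing helper and the constants are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define ASSOC 4               /* a */
#define NSETS 1024
#define TAG_BITS 16           /* t */
#define K (TAG_BITS / ASSOC)  /* k = t/a, the partial-compare width */

static uint16_t tag_mem[NSETS * ASSOC];
static bool     valid[NSETS * ASSOC];

/* k-bit slice i of a tag: the bits the i-th collection of memory
 * chips would deliver in step one.                                   */
static inline uint16_t slice(uint16_t tag, int i)
{
    return (uint16_t)((tag >> (i * K)) & ((1u << K) - 1u));
}

int partial_lookup(uint32_t set, uint16_t tag, int *probes)
{
    bool candidate[ASSOC];

    /* Step one: one probe reads k bits from each of the a stored tags
     * (collection i returns slice i of stored tag i) and compares
     * them with the corresponding bits of the incoming tag.          */
    *probes = 1;
    for (int i = 0; i < ASSOC; i++) {
        uint32_t idx = set * ASSOC + i;
        candidate[i] = valid[idx] && slice(tag_mem[idx], i) == slice(tag, i);
    }

    /* Step two: full t-bit compares, but only for the tags that
     * passed the partial compare; each one costs another probe.      */
    for (int i = 0; i < ASSOC; i++) {
        if (!candidate[i])
            continue;
        (*probes)++;
        if (tag_mem[set * ASSOC + i] == tag)
            return i;                         /* hit                  */
    }
    return -1;                                /* miss                 */
}
```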
The performance of this approach depends on how well the partial compares eliminate stored tags from further consideration. For independent tags, the average number of probes will be minimized if each of the values [0, 2^k − 1] is equally likely for each of the k-bit patterns on which partial compares are done. While this condition may be true for physical address tags, it is unlikely to be true for the high order tag bits of virtual addresses. Nevertheless, we can use the randomness of the lower bits of the virtual address tag to make the distribution of the higher ones more uniform and independent. For example, one can transform a tag to a unique other tag by exclusive-ORing the low-order k bits of the tag with each of the other k-bit pieces of the tag before it is stored in the tag memory. Incoming tags will go through the same transformation so that the incoming tag and the stored tag will match if the untransformed tags are the same. The original tags can be retrieved from the tag memory for writing back blocks on replacement via the same transformation in which they were stored (i.e. the transformation is its own inverse). This method is used throughout this paper to produce stored tags with better probabilistic characteristics. We will also analyze using no transformation, and using a more sophisticated one, in Section 3. We make the assumption in the analysis to follow that each of the values [0, 2^k − 1] is equally likely and independent for each partial compare. Our trace-driven simulation (in Section 3) tests this assumption.
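A sketch of the exclusive-OR transformation described above, for t = 16 and k = 4; applying the same function twice returns the original tag, which is what lets replacement write-backs recover the stored tag. The constants and the function name are illustrative.

```c
#include <stdint.h>

#define TAG_BITS 16   /* t */
#define K 4           /* partial-compare width */

/* XOR the low-order k bits of the tag into each of the other k-bit
 * pieces.  The low-order piece itself is left unchanged, so applying
 * the transformation a second time undoes it (it is its own inverse).
 * Incoming tags are transformed the same way, so transformed tags
 * match exactly when the original tags match.                        */
uint16_t scramble_tag(uint16_t tag)
{
    uint16_t low = tag & ((1u << K) - 1u);
    uint16_t out = tag;
    for (int i = 1; i < TAG_BITS / K; i++)
        out ^= (uint16_t)(low << (i * K));
    return out;
}
```

For any tag x, scramble_tag(scramble_tag(x)) == x, and the low-order bits spread their randomness over the higher-order pieces used in the partial compares.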
The probability that an incoming tag partially matches a stored tag is 1/2^k. A false match is a partial tag match which will not lead to a match of the full tag. Given a hit, the expected number of false matches in step one is (a−1)/2^k, of which half will be examined in step two before a hit is determined. Thus, the expected number of probes on a hit is 1 + (a−1)/2^(k+1) + 1, where the terms of the expression are: the probe for the partial comparison (step one), the full tag comparisons (in step two) due to false matches, and the full tag match which produces the hit, respectively. On a miss, the expected number of probes is simply 1 + a/2^k, the probe for the partial comparison and the number of false matches, respectively.
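These expressions can be evaluated directly; the small helper below (an illustration, not from the paper) computes the expected probe counts for the single-subset partial compare approach.

```c
#include <stdio.h>

/* Expected probes for the partial compare approach with one subset:
 * hit : 1 (partial compare) + (a-1)/2^(k+1) (false matches) + 1 (hit)
 * miss: 1 (partial compare) + a/2^k         (false matches)          */
static double partial_hit_probes(int a, int k)
{
    return 1.0 + (a - 1) / (double)(2u << k) + 1.0;  /* 2^(k+1) = 2<<k */
}

static double partial_miss_probes(int a, int k)
{
    return 1.0 + a / (double)(1u << k);
}

int main(void)
{
    /* Example: t = 16, so k = t/a = 4 for a 4-way cache. */
    printf("hit  = %.2f probes\n", partial_hit_probes(4, 4));   /* 2.09 */
    printf("miss = %.2f probes\n", partial_miss_probes(4, 4));  /* 1.25 */
    return 0;
}
```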
The partial compare scheme can lead to poor performance if many false matches are encountered in step two. Wider partial compares could eliminate some of these false matches. The partial compare width can be increased by partitioning the a stored tags of a set into s proper subsets (each containing a/s tags) and examining the subsets in series⁶. The step one and step two partial compare sequence is performed for each of the subsets to determine if there is a cache hit. The order in which the subsets are examined is arbitrary throughout this paper. Increasing the number of subsets will increase the partial compare width since fewer partial compares are done concurrently. For example, 2 subsets could be used in an 8-way set-associative cache, with 4 entries in each. A lookup in this cache would proceed as two 4-way (single subset) lookups, one after the other. With a 16-bit wide tag memory in this cache, partitioning into 2 subsets would result in 4-bit partial compares. This will result in fewer false matches than with the 2-bit partial compares without subsets. The number of probes per access decreases when using proper subsets if the expected number of false matches is reduced (due to wider partial compares) by more than the number of probes added due to the additional subsets. Subsets may be desirable for implementation considerations in addition to performance considerations if the memory chip or comparator width dictates that the partial compares be wider.
At one extreme (where s = a), partial compares with subsets would be implemented as the naive approach, while the other extreme (s = 1) can lead to many false matches. An important question to ask is: what number of subsets leads to the best performance (i.e. the fewest number of probes per cache lookup)? The next three answers to this question vary from the most accurate to the most succinct. (1) One can compute the expected number of probes for each of s = 1, 2, 4, ..., a/2 and a using the equations for a hit and a miss (from Table 1), weighted to reflect the expected miss ratio, and choose the minimum; a sketch of this calculation follows the footnote below. (2) One can ignore misses (which are less common and never require more than twice the probes of hits), assume the variables are continuous, and find the optimum partial compare width, k_opt = log2(t) − 1/2, for hits only. The optimum number of subsets for hits and misses together is likely to be the value of s resulting from a partial compare width of ⌊k_opt⌋ or ⌈k_opt⌉. (3) Finally, one can observe that many tags in current caches are between 16 and 32 bits wide, implying that the number of subsets that gives at least four-bit partial compares will work well.

6 Note that subsets are not useful with the naive and MRU approaches.
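A small sketch of answer (1): enumerate the candidate subset counts, evaluate the hit and miss expressions from Table 1 weighted by an expected miss ratio, and keep the minimum. The miss-ratio value and the parameter choices are illustrative assumptions.

```c
#include <stdio.h>

/* Partial compare with s subsets (a-way, t-bit tags, k = t/(a/s)):
 * hit : 2 + (s-1)/2 + (a-1)/2^(k+1)
 * miss: s + a/2^k                                                    */
static double subset_hit(int a, int s, int k)
{
    return 2.0 + (s - 1) / 2.0 + (a - 1) / (double)(2u << k);
}

static double subset_miss(int a, int s, int k)
{
    return s + a / (double)(1u << k);
}

int main(void)
{
    int a = 8, t = 16;
    double miss_ratio = 0.10;            /* assumed local miss ratio  */
    int best_s = 1;
    double best = 1e9;

    for (int s = 1; s <= a; s *= 2) {
        int k = t / (a / s);             /* wider compares with more subsets */
        double probes = (1.0 - miss_ratio) * subset_hit(a, s, k)
                      +        miss_ratio  * subset_miss(a, s, k);
        printf("s=%2d  k=%2d  expected probes=%.2f\n", s, k, probes);
        if (probes < best) { best = probes; best_s = s; }
    }
    printf("best: s=%d\n", best_s);
    return 0;
}
```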
Table 1. Performance of Set-Associativity Implementations.

  Method            Subsets   Tag Memory Width     Expected Probes (Hit)           Expected Probes (Miss)
  Traditional       1         a × t                1                               1
  Naive             1         t                    (a−1)/2 + 1                     a
  MRU               1         t                    1 + Σ_{i=1..a} i·f_i            1 + a
  Partial compare   s         max(t, (a/s) × k)    2 + (s−1)/2 + (a−1)/2^(k+1)     s + a/2^k

For various methods and associativities this table gives the number of subsets, the tag memory width, the number of probes for a hit, and the number for a miss. The table assumes t-bit tags (t = 16), k-bit partial compares, and that the i-th most-recently-used tag matches with probability f_i on a hit. For example, at an associativity of 8 the partial compare approach with one subset (k = 2 bits) needs 2.88 probes per hit and 3.00 per miss, while with two subsets (k = 4 bits) it needs 2.72 per hit and 2.50 per miss. Note how an increase from 1 to 2 subsets improved the predicted performance of the partial compare approach at an associativity of 8.
Table 1 summarizes our analysis of the expected number of probes required for the traditional, naive, MRU and partial compare approaches to implementing set-associativity. Note that this table, as well as most of the trace-driven simulation, assumes 16-bit tags. We will examine the positive effect of increasing the tag width on the partial compare approach in Section 3.
Table 2 summarizes paper implementations of tag memory and comparison logic for a direct-mapped cache, a traditional implementation of set-associativity, and an implementation of set-associativity using MRU and partial compares. We found that the MRU and partial compare implementations have a slower access time than the traditional implementation of associativity but include no implementation surprises. Most notably, the control logic was found to be of reasonable complexity. The MRU and partial compare implementations use hardware similar to a direct-mapped cache and can make effective use of page-mode dynamic RAMs, as would other serial implementations of set-associativity. Page-mode dynamic RAMs are those in which the access time of multiple probes to the same set can be significantly less than if the probes were to other sets. Subsequent probes take less than half the time of the first probe to the set. Cache cost is reduced in two ways when using one of the alternative implementations of associativity. First, tag memory cost is directly reduced, by 1/3 to 1/2 in our design. Second, cache data memory cost is reduced since only one word, rather than a words, needs to be read at a time.
3. Trace-Driven Performance Comparison

This section analyzes the performance of the low-cost schemes to implement set-associativity in level two caches using simulation with relatively short multiprogramming traces. We analyze associativity in the level two cache since the low cost implementations of associativity are more appropriate for level two (or higher) caches than for level one caches. We concentrate on presenting and characterizing the relative performance of the alternatives. We do not demonstrate the absolute utility of these approaches to important future cache configurations (e.g. multiple megabyte level two caches in multiprocessors) since our traces are for a single processor and are not sufficiently long to exercise very large caches.

The makeup of the traces and the assumed cache configurations are indicated in Table 3. We assume a uniprocessor system with a two level cache hierarchy (a level one cache and a larger level two cache); only two cache levels are used in our study because the traces are uniprocessor traces.
[Table 2 artwork: trial implementations of the tag memory and comparison logic, using static RAMs and using page-mode dynamic RAMs, for a direct-mapped cache and a 4-way set-associative cache (MRU and partial compare). Rows give memory package parameters (basic and page-mode access and cycle times in ns, and package size, e.g. 256K×(16,8) bits) and the resulting implementation figures (access times such as 65+55x ns, cycle times such as 75+55(x+u) ns, and package counts of roughly 22 to 24).]
Table 2. Trial Set-Associativity Implementations.
This table compares paper implementations of the tag memory and comparison logic for a direct-mapped and a four-way set-associative cache holding 1 million 24-bit tags, assuming dynamic or static RAM chips housed in hybrid packages. The top half of the table summarizes the memory packages used to implement tag memory, while the bottom half gives cache implementation numbers. The MRU implementation assumes that the MRU list storage costs nothing extra (as it would if full LRU replacement is used). MRU access and cycle times are given assuming "x" is the expected number of probes after reading the MRU information ("x" is between 1 and a for hits, a for misses) and "u" is the probability that the MRU information must be updated. Partial compare access and cycle times are given assuming "y" probes in step two ("y" is between 1 and a for hits and 0 and a for misses). The number of packages assumes some semi-custom logic and hybrid packages.
Trace-Driven Two-Level Cache Simulation

Traces: ATUM [Agar86] virtual address traces of a multiprogrammed operating system, described in [Hill87]. Operating system references are included as well as references of user-level processes. One very large trace (over 8 million references) was constructed as a concatenation of 23 individual ATUM traces, each of which is approximately 350,000 references. Cache flushes of the level one and level two caches were inserted between each of the 23 traces, thus each trace starts from a "cold" cache.

Level one cache: A direct-mapped write-back cache. On misses causing replacement of a dirty block, the new block is first obtained via a read-in request, then a write-back is issued to the level two cache. Three level one caches are simulated: a 4 Kbyte cache with a 16 byte block size (4K-16); 16 Kbyte with 16 byte blocks (16K-16); and 16 Kbyte with 32 byte blocks (16K-32). The miss ratios corresponding to these level one caches are 0.1181, 0.0657, and 0.0513, respectively.

Level two cache: An a-way set-associative write-back cache which services read-ins and write-backs from the level one cache. We compare different implementations of associativity in the level two cache. The least-recently-used entry in a set is replaced on a cache miss. We simulate five different level two caches: a 64 Kbyte cache with 16 byte block size (64K-16); 64 Kbyte with 32 byte blocks (64K-32); 256 Kbyte with 16 byte blocks (256K-16); 256 Kbyte with 32 byte blocks (256K-32); and 256 Kbyte with 64 byte blocks (256K-64). While multi-level inclusion is not enforced in this simulation, by monitoring the number of write-backs which missed when written back to the level two cache we were able to extrapolate that the maintenance of multi-level inclusion would have a very small effect (in most configurations studied, no effect) on the miss ratio of the level two cache (and no effect on the miss ratio of the level one cache).

Table 3. Detailed Information on the Trace-Driven Simulation.
The level one cache is direct-mapped, while the level two cache is of varying associativity. Both caches are write-back caches, with the level two cache servicing read-in and write-back requests from the level one cache. We chose this write-back configuration to minimize the amount of communication between cache levels. This can be important in a shared memory multiprocessor since the level one cache will be utilized servicing processor references while the level two cache is servicing coherency invalidations, as in [Good88]. Also, it was found in [Shor88] that this configuration has better performance than if either cache is write-through. Cache sizes simulated here (up to 256 Kbytes) are limited by the size of the traces. We expect future level two (and higher) caches to be considerably larger (e.g. 4 Mbytes). Though the results presented are for "cold" caches, limited "warmer" results were found to be similar, except that the miss ratios were smaller.
The graphs in Figure 3 show the average number of probes versus the associativity of the level two cache for a 16K-16 (16 Kbyte capacity with 16 byte block size) level one cache and a 256K-32 (256 Kbyte with 32 byte block size) level two cache. The tag width is 16 bits (t = 16) and the partial compare width 4 bits (k = 4) in all simulations unless otherwise specified. 1, 2, and 4 subsets were used for the 4, 8, and 16-way set-associative partial compare implementations, respectively. The graphs indicate the general linearly increasing relationship between the number of probes required per search and the associativity. The number of probes per access is expected to increase for the alternative implementations of associativity as the associativity increases since there are more places where a given cache block can reside. A cache lookup simply must look in more places on the average. For wider associativity to be preferred, the added delay for these additional probes must be more than offset by the time saved servicing fewer misses. One would also expect the Naive and Partial schemes to have a linear relationship between probes per access and associativity. However, the fact that this relation is linear for MRU came as a surprise. We will examine the MRU and Partial schemes more closely in subsequent figures. As will always be the case, the traditional implementation of associativity results in the minimum number of probes. These graphs show that the partial compare approach performs the best of the low cost implementations. The naive scheme performs the worst, with the MRU scheme between them.
Figure 3 also shows the performance benefit of a write-back optimization which can be made when the multi-level inclusion property is maintained within a cache hierarchy. The level one cache can be certain that all write-back requests will hit in the level two cache. It can also be certain that the block will reside in precisely the same position in which it was loaded in the level two cache from memory (if blocks do not change position in the level two cache from the time they are loaded to the time they are replaced). This implies that if the level one cache retains a log2(a)-bit indicator of which position in the set the given cache block occupies (a is the associativity of the level two cache), write-backs can proceed without requiring tag probes. Note that even if multi-level inclusion is not maintained, the indicators in the level one cache can be used as hints, not always correct, of where the entry resides in the level two cache.
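A sketch of this write-back optimization: each level one block keeps a log2(a)-bit record of the level two frame it was loaded from, so a write-back can address the level two data directly with no tag probes. The structure layout and field names are illustrative assumptions.

```c
#include <stdint.h>

#define L2_ASSOC 4                 /* a: level two associativity */

/* Each level one block remembers, in log2(a) bits, which frame of its
 * level two set it occupies.  The field is filled in when the block is
 * brought in through the level two cache.                             */
struct l1_block {
    uint32_t tag;
    uint8_t  dirty;
    uint8_t  l2_frame;             /* 2 bits suffice for a 4-way level two */
};

/* With multi-level inclusion (and blocks that never move within their
 * level two set), a write-back is guaranteed to hit, so the level two
 * data can be addressed directly and no tag probes are needed.  Without
 * inclusion the field is only a hint and must be verified.             */
static inline uint32_t l2_writeback_index(uint32_t l2_set,
                                          const struct l1_block *b)
{
    return l2_set * L2_ASSOC + b->l2_frame;
}
```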
All the methods (Traditional, Naive, MRU, and Partial) require no probes to service a write-back request when using the write-back optimization. Since write-backs are approximately 20% of the requests to the level two cache (as shown in Table 4), this can result in significant performance improvements, as indicated in the figure. We feel the cost of implementing this optimization is sufficiently modest (2 bits per level one cache entry for a 4-way set-associative level two cache) to warrant its use when implementing one of the reduced cost implementations of associativity. We assume the write-back optimization is used, and all subsequent figures contain data for read-in requests only, since the different approaches perform the same on write-backs. Write-back requests are still considered references as they update the MRU list, determining the replacement policy of the cache.
Table 4, presented at the end of the paper, lists the number of probes required for various cache configurations when using the naive, MRU, and partial schemes. Note that the data in Table 4 assumes the write-back optimization is being used. Table 4 uses the terms global miss ratio and local miss ratio [Przy88b]. The global miss ratio is the fraction of processor requests which miss in both the level one and level two cache. The local miss ratio of the level two cache is the fraction of read-ins and write-backs from
