`
`Evaluating Associativity in CPU Caches
`
`IEEE TRANSACTIONS ON COMPUTERS, VOL. 38, NO. 12, DECEMBER 1989
`
Abstract - Because of the infeasibility or expense of large fully-associative caches, cache memories are usually designed to be set-associative or direct-mapped. This paper 1) presents new and efficient algorithms for simulating alternative direct-mapped and set-associative caches, and 2) uses those algorithms to quantify the effect of limited associativity on the cache miss ratio.
We introduce a new algorithm, forest simulation, for simulating alternative direct-mapped caches and generalize one, which we call all-associativity simulation, for simulating alternative direct-mapped, set-associative, and fully-associative caches. We find that while all-associativity simulation is theoretically less efficient than forest simulation or stack simulation (a commonly used simulation algorithm), in practice it is not much slower and allows the simulation of many more caches with a single pass through an address trace.
We also provide data and insight into how varying associativity affects the miss ratio. We show: 1) how to use simulations of alternative caches to isolate the cause of misses; 2) that the principal reason why set-associative miss ratios are larger than fully-associative ones is (as one might expect) that too many active blocks map to a fraction of the sets even when blocks map to sets in a uniform random manner; and 3) that reducing associativity from eight-way to four-way, from four-way to two-way, and from two-way to direct-mapped causes relative miss ratio increases in our data of respectively about 5, 10, and 30 percent, consistently over a wide range of cache sizes and a range of line sizes.
`
Index Terms - Associativity, buffer, cache memory, computer architecture, direct-mapped, memory systems, performance evaluation, set-associative and trace-driven simulation algorithms.
`
`I. INTRODUCTION
THREE important CPU cache parameters are cache size, block (line) size, and associativity [27]. Cache size (buffer size, capacity) is so important that it is a part of almost all cache studies (for a partial bibliography see [29]). Block size (line size) has recently been examined in detail in [30]. Here we concentrate on associativity (degree of associativity, set size), which is the number of places in a cache where a block can reside.
`
Manuscript received February 14, 1989; revised July 10, 1989. The work presented here is based on research supported in part by the National Science Foundation under Grants CCR-8202591 and MIP-8713274, by the State of California under the MICRO program, by the Defense Advanced Research Projects Agency monitored by Naval Electronics Systems Command under Contract N00039-85-C-0269, the graduate school at the University of Wisconsin-Madison, and by IBM Corporation, Digital Equipment Corporation, Philips Research Laboratories/Signetics Corporation, Apple Computer Corporation, and the Hewlett Packard Corporation.
M. D. Hill is with the Department of Computer Sciences, University of Wisconsin, Madison, WI 53706.
A. J. Smith is with the Computer Science Division, Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720.
IEEE Log Number 8931174.
`
Selecting optimal associativity is important, because changing associativity has a significant impact on cache performance and cost. Increasing associativity improves the likelihood that a block is resident by decreasing the probability that too many recently-referenced blocks map to the same place and by allowing more blocks to be considered for replacement. The effect of associativity on cache miss ratio has never been isolated and quantified, and that is one of the major goals of this paper. Conversely, increasing associativity often increases cache cost and access time, since more blocks (frames) must be searched in parallel to find a reference [16].
Fig. 1 illustrates set-associativity. A set-associative cache uses a set-mapping function f to partition all blocks (data in an aligned, fixed-sized region of memory) into a number of equivalence classes. Some number of block frames in the cache are assigned to hold recently-referenced blocks from each equivalence class. Each group of block frames is called a set. The number of such groups, equal to the number of equivalence classes, is called the number of sets (s). The number of block frames in each set is called the associativity (degree of associativity, set size, n). The number of block frames in the cache (c) always equals the associativity times the number of sets (c = n * s). A cache is fully-associative if it contains only one set (n = c, s = 1), is direct-mapped if each set contains one block frame (n = 1, s = c), and is n-way set-associative otherwise (where n is the associativity, s = c/n).
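As an aside for concreteness, these relationships can be written down directly. The following C fragment is our own illustrative sketch (the paper itself gives no code); all names are hypothetical:

    /* Cache geometry as defined above: c = n * s always holds. */
    struct cache_geometry {
        unsigned n;   /* associativity (set size): block frames per set */
        unsigned s;   /* number of sets (equivalence classes under f)   */
        unsigned c;   /* total block frames in the cache: c = n * s     */
    };

    /* Classify a cache by its geometry, per the definitions above. */
    static const char *classify(const struct cache_geometry *g)
    {
        if (g->s == 1) return "fully-associative";   /* n = c, s = 1 */
        if (g->n == 1) return "direct-mapped";       /* n = 1, s = c */
        return "n-way set-associative";              /* s = c / n    */
    }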
On a reference to block x, the set-mapping function f feeds the "set decoder" with f(x) to select one set (one row), and then each block frame in the set is searched until x is found (a cache hit) or the set is exhausted (a cache miss). On a cache miss, one block in set f(x) is replaced with the block x obtained from memory. Finally, the word requested from block x is returned to the processor. Here for conceptual simplicity we show the word within the block selected last (in the box "compare block number with tags and select data word"). Many implementations, however, select the word within the block while selecting the set to reduce the number of bits that must be read; i.e., only words are gated into the multiplexer, not full lines. The most commonly used set-mapping function is the block number modulo the number of sets, where the number of sets is a power of two. This set-mapping function is called bit selection since the set number is just the number given by the low-order bits of the block address. For 256 sets, for example, f(x) = x mod 256 or f(x) = x AND 0xff, where mod is remainder and AND is bitwise AND.
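For concreteness, bit selection is a one-line computation. The sketch below is ours, not the paper's; NUM_SETS and the function name are hypothetical, and NUM_SETS must be a power of two (here 256, matching the example in the text):

    #define NUM_SETS 256u   /* must be a power of two for bit selection */

    /* Set number = low-order bits of the block number. */
    static unsigned set_of(unsigned long block_number)
    {
        return (unsigned)(block_number % NUM_SETS);    /* x mod 256  */
        /* equivalently: block_number & (NUM_SETS - 1u), x AND 0xff  */
    }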
The method we use for examining associativity in CPU caches is trace-driven simulation. It uses one or more (address) traces and a (cache) simulator.
`
`
`
[Figure] Fig. 1. Set-associative mapping. (The address divides into a block number and a block offset; the set decoder uses f(x) to select one of the sets; each set contains n block frames (associativity A = n); the number of sets is s = c/n; the block number is compared with the tags to select the data word.)
`
A trace is the log of a dynamic series of memory references, recorded during the execution of a program or workload. The information recorded for each reference must include the address of the reference and may include the reference's type (instruction fetch, data read, or data write), length, and other information. A simulator is a program that accepts a trace and parameters that describe one or more caches, mimics the behavior of those caches in response to the trace, and computes performance metrics (e.g., miss ratio) for each cache.
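To make the notion of a simulator concrete, the sketch below is a minimal trace-driven simulator for a single direct-mapped cache, reading one hexadecimal address per line and reporting the miss ratio. It is our own illustration under assumed parameters (32-byte blocks, 256 sets); the paper does not prescribe an implementation:

    #include <stdio.h>

    #define BLOCK_SHIFT 5u     /* 32-byte blocks: offset is 5 bits */
    #define NUM_SETS    256u   /* direct-mapped: one frame per set */

    int main(void)
    {
        unsigned long tags[NUM_SETS];
        int valid[NUM_SETS] = { 0 };
        unsigned long addr, refs = 0, misses = 0;

        while (scanf("%lx", &addr) == 1) {  /* one address per trace line */
            unsigned long block = addr >> BLOCK_SHIFT;
            unsigned set = (unsigned)(block % NUM_SETS);
            refs++;
            if (!valid[set] || tags[set] != block) { /* miss: fill frame */
                misses++;
                tags[set] = block;
                valid[set] = 1;
            }
        }
        if (refs)
            printf("miss ratio = %.4f\n", (double)misses / (double)refs);
        return 0;
    }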
We analyze associativity in caches with trace-driven simulation for the same reasons as are discussed in [28]. The principal advantage of trace-driven simulation over random number driven simulation or analytical modeling is that there exists no generally-accepted model for program behavior (at the cache level) with demonstrated validity and predictive power. The major disadvantage is that workload samples must be relatively short, due to disk space and simulation time limits.
The CPU time required to simulate many alternative caches with many traces can be enormous. Mattson et al. [19] addressed a similar problem for virtual memory simulation by developing a technique we call stack simulation, which allows miss ratios for all memory sizes to be computed simultaneously, during one pass through the address trace, subject to several constraints including a fixed page size. While stack simulation can be applied to caches, each cache configuration with a different number of sets requires a separate simulation. For this reason, this paper first examines better algorithms for simulating alternative direct-mapped and set-associative caches, and then uses those algorithms to study associativity in caches.
The rest of this paper is organized as follows. Section II reviews previous work on cache simulation algorithms and associativity in caches. In Section III, we explain our methods in more detail and describe our traces. Section IV discusses cache simulation algorithms, including properties that facilitate rapid simulation, a new algorithm for simulating alternative direct-mapped caches, and an extension to an algorithm for simulating alternative caches with arbitrary set-mapping functions. Section V examines the effect of associativity on miss ratio, including categorizing the cause of misses in set-associative caches, relating set-associative miss ratios to fully-associative ones, comparing miss ratios from similar set-associative caches, and extending the design target miss ratios from [28] and [30] to caches with reduced associativity. Readers interested in the effect of associativity on miss ratio but not in cache simulation algorithms may skip Section IV, as Section V is written to stand alone.
`
`II. RELATED WORK
A. Simulation Algorithms
The original paper on memory hierarchy simulation is by Mattson et al. [19]. They introduce inclusion, show when inclusion holds, and develop stack simulation, which uses inclusion to rapidly simulate alternative caches. Inclusion is the property that after any series of references, larger alternative caches always contain a superset of the blocks in smaller alternative caches.¹ Mattson et al. show inclusion holds between alternative caches that have the same block size, do no prefetching, use the same set-mapping function (and therefore have the same number of sets), and use replacement algorithms that before each reference induce a total priority ordering on all previously referenced blocks (that map to each set) and use only this priority ordering to make the next replacement decision. Replacement algorithms which meet the above condition, called stack algorithms, include LRU, OPTIMUM, and (if properly defined) RANDOM [6]. FIFO does not qualify since cache capacity affects a block's replacement priority. In Section IV-A, we will prove when inclusion holds for caches that use arbitrary set-mapping functions and LRU replacement.
Mattson et al. develop stack simulation to simulate alternative caches that have the same block size, do no prefetching, use the same set-mapping function, and use a stack replacement algorithm. Since inclusion holds, a single list per set, called a stack, can be used to represent caches of all associativities, with the first n elements of each stack representing the blocks in an n-way set-associative cache. For each reference, stack simulation performs three operations: 1) locate the reference in the stack, 2) update one or more metrics to indicate which caches contained the reference, and 3) update the stack to reflect the contents of the caches after the reference. We call these three operations FIND, METRIC, and UPDATE, and will show that the algorithms discussed later in Sections IV-B and IV-C use the same steps.

¹ Inclusion is different from multilevel inclusion defined by Baer and Wang [5]. While inclusion is a property relating alternative caches, multilevel inclusion relates caches in the same cache hierarchy.
The most straightforward implementation of stack simulation is to implement each stack with a linked list and record hits to position n by incrementing a counter distance[n]. After N references have been processed, the miss ratio of an n-way set-associative cache is simply 1 - (Σ_{i=1}^{n} distance[i])/N. Since performance with a linked list will be poor if many elements of a stack must be searched on each reference, other researchers have developed more complex implementations of stack simulation, using hash tables, m-ary trees, and AVL trees [8], [21], [33]. While these algorithms are useful for some memory hierarchy simulations, Thompson [33] concludes that linked list stack simulation is near optimal for most CPU cache simulations. Linked list stack simulation is fast when few links are traversed to find a reference. On average, this is the case in CPU cache simulations since 1) CPU references exhibit a high degree of locality, and 2) CPU caches usually have a large number of sets and limited associativity, dividing active blocks among many stacks and bounding maximum stack size; different results are found for file system and database traces. For this reason, we consider only linked list stack simulation further, and use stack simulation to refer to linked list stack simulation.
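The following sketch shows linked-list stack simulation for a single set under LRU, with the three operations marked; it is our own rendering of the algorithm described above, and a complete simulator would keep one such stack per set, indexed by f(x). MAX_DEPTH and the other names are ours:

    #include <stdlib.h>

    #define MAX_DEPTH 64   /* largest associativity of interest */

    struct node { unsigned long block; struct node *next; };

    static struct node *stack_top = NULL;
    static unsigned long distance[MAX_DEPTH + 1]; /* distance[d] = hits at depth d */

    static void reference(unsigned long block)
    {
        struct node **link = &stack_top, *n;
        int depth = 1;

        /* FIND: walk the stack looking for the referenced block. */
        while (*link && (*link)->block != block) {
            link = &(*link)->next;
            depth++;
        }
        if (*link) {
            /* METRIC: a hit at this depth is a hit in every cache of
               associativity >= depth (by inclusion). */
            if (depth <= MAX_DEPTH)
                distance[depth]++;
            n = *link;
            *link = n->next;           /* UPDATE, part 1: unlink the node */
        } else {
            n = malloc(sizeof *n);     /* miss: new node (check omitted)  */
            n->block = block;
        }
        n->next = stack_top;           /* UPDATE, part 2: move to front (LRU) */
        stack_top = n;
    }

After N references, the miss ratio of an n-way cache follows from the distance counters exactly as in the formula above.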
Mattson et al. also briefly mention a way of simulating caches with different numbers of sets (and therefore different set-mapping functions). In two technical reports, Traiger and Slutz extend the algorithms to simulate alternative caches with different numbers of sets and block sizes [34], and with different numbers of sets, block sizes, and subblock sizes (sector and block sizes, address and transfer block sizes) [24]. They require that all alternative caches use LRU replacement, bit selection for set mapping, and have block and subblock sizes that are powers of two. (Bit selection uses some of the bits of the block address as a binary number to specify the set.) In Section IV-C, we generalize to arbitrary set-mapping functions their algorithm for simulating alternative caches that use bit selection.
The speed of stack simulation can also be improved by deleting references (trace entries) that will hit and not affect replacement decisions in the caches to be simulated [25]. Puzak [23] shows that if all caches simulated use bit selection and LRU replacement, references that hit the most recently used element of a set can be deleted without affecting the total number of misses. We will show that this result trivially follows from properties we define in Section IV-A, allowing such references to be deleted from traces before using any of our simulation algorithms. (The total number of memory references in the original trace must be retained, in order to compute the miss ratio.)
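A sketch of this trace-stripping idea, in the same spirit (ours, with assumed parameters): a reference is deletable if it hits the most recently used block of its set in the cache with the fewest sets to be simulated, while every original reference is still counted:

    #define MIN_SETS    64u   /* fewest sets among the caches to simulate  */
    #define BLOCK_SHIFT 5u    /* 32-byte blocks, as elsewhere in this paper */

    static unsigned long mru[MIN_SETS];   /* most recently used block per set */
    static int mru_valid[MIN_SETS];

    /* Returns nonzero if the reference must be kept in the stripped trace. */
    static int keep_reference(unsigned long addr, unsigned long *total_refs)
    {
        unsigned long block = addr >> BLOCK_SHIFT;
        unsigned set = (unsigned)(block % MIN_SETS);

        (*total_refs)++;              /* retain the original reference count */
        if (mru_valid[set] && mru[set] == block)
            return 0;                 /* MRU hit: safe to delete */
        mru[set] = block;
        mru_valid[set] = 1;
        return 1;
    }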
`
`B. Associativity
Previous work on associativity can be broken into the following three categories: 1) papers that discuss associativity as part of a more general analysis of 32 kbyte and smaller caches, among the more notable of which are [18], [17], [7], [32], [27],² [11], and [13]; 2) papers that discuss associativity and other aspects of cache design for larger caches ([4], [2], and [22]); and 3) those that discuss only associativity ([26] and [16]). Since caches have been getting larger, papers in category 1) can also be characterized as older, while those in category 2) are more recent.

² This survey includes results for some large caches with wide associativity (e.g., 32-way set-associative 64 kbyte caches).
Papers in category 1) provide varying quantities of data regarding the effect of changing associativity in small caches. The qualitative trend they support is that changing associativity from direct-mapped to two-way set-associative improves miss ratio, doubling associativity to four-way produces a smaller improvement, doubling again to eight-way yields an even smaller improvement, and subsequent doublings yield no significant improvement. Our quantitative results are consistent with results in these papers. We extend their results by examining relative miss ratio changes to isolate the effect of associativity from other cache aspects, and by examining some larger caches.
Alexander et al. use trace-driven simulation to study small and large caches [4]. Unfortunately, the miss ratios they give are much lower than those that have been measured with hardware monitors and real workloads; see [28] for reports of real measurements.
Agarwal et al. use traces gathered by modifying the microcode of the VAX 8200 to study large caches and to try to separate operating system and multiprogramming effects [2]. They briefly examine associativity, where they find that associativity in large caches impacts multiprogramming workloads more strongly than uniprocessor workloads. They find for one workload that decreasing associativity from two-way to direct-mapped increases the multiprogramming miss ratio by 100 percent and the uniprogramming miss ratio by 43 percent. These numbers are much larger than the average miss ratio change we find (30 percent).
Przybylski et al. [22] examine cache implementation tradeoffs. They find that reducing associativity from two-way to direct-mapped increases miss ratio 25 percent, regardless of cache size, which is consistent with our results. One contribution of that paper is a method of translating the architectural impact of a proposed design change into time by computing the cache hit time increase that will exactly offset the benefit of the proposed change. A change improves performance only if the additional delay required to implement the change is less than the above increase. Przybylski et al. find that the architectural impact times for increasing associativity are often small, especially for large caches, calling into question the benefit of wide associativity.
The first paper to concentrate exclusively on associativity is [26]. That paper presents a model that allows miss ratios for set-associative caches to be accurately derived from the fully-associative miss ratio. In Section V-B, we further validate those results by showing that the model accurately relates the miss ratios of many caches, including large direct-mapped caches, to LRU distance probabilities.
The second paper to concentrate on associativity is [16], based on parts of [15]. It shows that many large single-level
`
caches in uniprocessors should be direct-mapped, since the drawbacks of direct-mapped caches (e.g., worse miss ratios and more-common worst case behavior) have small significance for large caches with small miss ratios, while the benefits of direct-mapped caches (lower cost and faster access time) do not diminish with increasing cache size. Here we examine miss ratio in more detail, but do not discuss implementation considerations.
`
III. METHODS AND TRACES
`In this section, we discuss the use of the miss ratio as a
`suitable metric (among others), describe the traces that we
`use, show how we estimate average steady-state miss ratios,
`and show that our traces yield results consistent with those
`observed from running systems.
To first order, the effective access time of a cache can be modeled as t_cache + miss_ratio * t_memory. (Additional factors which affect access time include the overhead of write backs, extra time for line crossers, page crossers, and TLB misses, and the fact that writes may be slower than reads. These latter delays are much less significant than those given in the expression.) The miss ratio is the number of cache misses divided by the number of memory references, t_memory is the time for a cache miss, and t_cache is the time to access the cache on a hit. The two latter parameters are implementation dependent, and in [15] there is a discussion of their effect on cache performance. As noted earlier, increases in associativity, while generally improving the miss ratio, can increase access time, and thus degrade overall performance. Here, we concentrate on miss ratio because it is easy to define, interpret, compute, and is implementation independent. This independence facilitates cache performance comparisons between caches not yet implemented and those implemented with different technologies and in different kinds of systems.
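As a worked illustration of the first-order model (our numbers, purely hypothetical): with t_cache = 1 cycle, t_memory = 20 cycles, and a miss ratio of 0.05, the effective access time is 1 + 0.05 * 20 = 2.0 cycles per reference. In C:

    /* First-order model from the text: t_cache + miss_ratio * t_memory. */
    static double effective_access_time(double t_cache, double miss_ratio,
                                        double t_memory)
    {
        return t_cache + miss_ratio * t_memory;
    }
    /* e.g., effective_access_time(1.0, 0.05, 20.0) == 2.0 cycles */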
`Results in this paper are based on two partially overlapping
`groups of traces, called the five-trace and 23-trace groups,
`respectively. Table I presents data on the traces. The first
`column gives the name of each trace sample. The second gives
`the fraction of all references that are instruction references.
`In these simulations, we do not distinguish between data reads
`and writes. The third column gives the length of the address
`traces in 1000’s of references. The final column gives the
`number of distinct bytes referenced by the trace, where any
`reference in an aligned 32-byte block is considered to have
`touched each byte in the block.
Each of the trace samples in the five-trace group comes from the second 500,000 references of a longer trace. The first three samples are user and system VAX-11 traces gathered with ATUM [1]. Trace mul2-2nd500K contains a circuit simulator and a microcode address allocator running concurrently under VMS. Trace mul8-2nd500K is an eight-job multiprogrammed workload under VMS: spice, alloc, a Fortran compile, a Pascal compile, an assembler, a string search in a file, jacobi (a numerical benchmark), and an octal dump. Trace ue-2nd500K consists of several copies of a program that simulates interactive users running under Ultrix. The other two samples in the trace group, mvs1-2nd500K and mvs2-2nd500K, are collections of IBM 370 references from
`
`1615
`
TABLE I
DATA ON TRACES

Trace Sample     Instruction      Length (1000's    Dynamic Size
Name             References (%)   of references)    (Kbytes)
mul2-2nd500K     53               500               218
mul8-2nd500K     51               500               292
ue-2nd500K       55               500               277
mvs1-2nd500K     52               500               163
mvs2-2nd500K     55               500               201

[The remainder of Table I gives the same columns for samples of the 23-trace group (names include fora, forf, and ue; lengths roughly 228-414 thousand references; dynamic sizes 54-221 Kbytes); the individual rows are not recoverable in this copy.]
`
`system calls invoked in two Amdahl standard MVS workloads
`1281.
The second trace group contains 23 samples of various workloads gathered on a VAX-11 with ATUM [1]. Trace samples that exhibit unstable behavior (e.g., a particular doubling of cache size or associativity alters the miss ratio observed by many factors of two) have been excluded from both groups.
We estimate the steady-state miss ratios for a trace sample using the miss ratio for a trace after the cache is warm (the warm-start miss ratio). A cache is warm if its future miss ratio is not significantly affected by the cache recently being empty [2]. We compute warm-start miss ratios using the second 250K references of each 500K-reference trace sample. We found that most caches with our traces are warm by 250K references by locating the knee in the graph of the cumulative misses to empty block frames versus references, a method of determining when caches are warm proposed in Agarwal et al. [2]. Furthermore, results for these multiprogrammed traces properly include cold-start effects whenever a process resumes execution.
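A minimal sketch of warm-start measurement as just described (counter names ours): misses are counted only in the second 250K references of each 500K-reference sample:

    #define WARMUP 250000UL   /* skip the first 250K references */

    static unsigned long refs, warm_refs, warm_misses;

    static void record_reference(int was_miss)
    {
        refs++;
        if (refs > WARMUP) {          /* second half of the sample */
            warm_refs++;
            if (was_miss)
                warm_misses++;
        }
    }
    /* warm-start miss ratio = (double)warm_misses / (double)warm_refs */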
Fig. 2(a) and (b) displays miss ratio data for unified caches (mixed, i.e., cache data and instructions together) with 32-byte blocks. Solid lines show the average warm-start miss ratios with different associativities (1, 2, 4, and 8). The average warm-start miss ratio is the arithmetic average of warm-start miss ratios for each of the five traces in the five-trace group. The arithmetic mean is used because it represents the miss ratio of a workload consisting of an equal number of references from each of the traces. Previous experiments (as were done for [31] and [15]) showed that little difference was observed when other averaging methods were used. The dashed line (labeled "inf") gives the warm-start miss ratio of an infinite cache, a cache so large that it never replaces any blocks. Measurements for the 23-trace group are similar.
`
`
[Figure] Fig. 2. Miss ratios for five-trace workload with caches of associativities of 1, 2, 4, and 8. The dashed line shows the miss ratio for an infinite cache. (a) Smaller caches. (b) Larger caches. (Both panels plot miss ratio against cache size in bytes.)
[Figure] Fig. 3. Comparison of our miss ratio data (solid lines) with other published data (A, B, C, D). (a) 16-byte blocks. (b) 32-byte blocks. (Both panels plot miss ratio against cache size in bytes.)

Fig. 3 compares miss ratios for the five-trace group in eight-way set-associative unified caches, having 16-byte and 32-byte blocks, to miss ratios from other sources. Line "cold" measures miss ratios from an empty cache, while line "warm"
does not count misses until after 250K references. Since the trace samples include multiprogramming effects, both contain some cold-start misses [12]. Lines labeled A and B show the design target miss ratios for fully-associative caches from [28] and [30]. The line labeled C from [2] shows four-way set-associative miss ratio results from Fig. 17 in that paper.
`
Finally, the line labeled D from [27] shows four-, six-, and eight-way set-associative miss ratios taken from hardware monitor measurements on an Amdahl 470 (Fig. 33 of that paper, assuming 50 percent supervisor execution). Fig. 3 demonstrates that the miss ratios of the five-trace group are consistent with those measured and/or proposed for actual operating environments.
`Despite the similarities with previously published data, miss
`ratio data for large caches (greater than 64K bytes) are subject
`to greater error, since only a few thousand misses may occur
`during a trace sample. To reduce sensitivity to such error,
`results in Section V concentrate on the relationship between
`the miss ratios of alternative caches rather than on the miss
`ratio values themselves.
IV. SIMULATION TECHNIQUES FOR ALTERNATIVE DIRECT-MAPPED AND SET-ASSOCIATIVE CACHES
In this section we first discuss two properties, set-refinement and inclusion, that facilitate the rapid simulation of alternative caches. We then develop a new algorithm that uses both set-refinement and inclusion to rapidly simulate alternative direct-mapped caches. Next we generalize an algorithm that simulates alternative set-associative caches using bit selection [34] to one that allows arbitrary set-mapping functions. Finally we compare implementations of the algorithms.
A. Properties that Facilitate Rapid Simulation
Two properties useful for simulating alternative direct-mapped and set-associative caches are set-refinement³ (introduced below) and inclusion (introduced in Mattson et al. [19]). Here we discuss these properties with respect to caches that have the same block size, do no prefetching, use LRU replacement, have arbitrary associativities, and can use arbitrary set-mapping functions. Let C1(A = n1, F = f1) and C2(A = n2, F = f2) be two such caches, where cache Ci has associativity ni and set-mapping function fi, i = 1, 2.
Definition 1: Set-refinement: Set-mapping function f2 refines set-mapping function f1 if f2(x) = f2(y) implies f1(x) = f1(y), for all blocks x and y.
Furthermore, cache C2(A = n2, F = f2) is said to refine an alternative cache C1(A = n1, F = f1) if set-mapping function f2 refines set-mapping function f1. Refines is so named because f2 refines f1 implies set-mapping function f2 induces a finer partition on all blocks than does f1. Since set-refinement is clearly transitive, if f(i+1) refines fi for each i = 1, ..., L - 1, then fj refines fi for all j > i, implying a hierarchy of sets. We will use set-refinement to facilitate the rapid simulation of alternative direct-mapped caches (Section IV-B) and set-associative caches (Section IV-C).
Definition 2: Inclusion: Cache C2(A = n2, F = f2) includes an alternative cache C1(A = n1, F = f1) if, for any block x after any series of references, x is resident in C1 implies x is resident in C2.
Thus, when cache C2 includes cache C1, C2 always contains a superset of the blocks in C1. Inclusion facilitates rapid simulation of alternative caches by allowing hits in larger caches to be inferred from hits detected in smaller ones. Mattson et al. [19] show when inclusion holds for alternative caches that use the same set-mapping function (and hence the same number of sets). Next we show when it holds with LRU replacement and arbitrary set-mapping functions.
Theorem 1: Given the same block size, no prefetching, and LRU replacement, cache C2(A = n2, F = f2) includes cache C1(A = n1, F = f1) if and only if set-mapping function f2 refines f1 (set-refinement) and associativity n2 >= n1 (nondecreasing associativity).
Proof: Suppose cache C2 includes cache C1. Suppose further that a large number of blocks map to each set in both caches, as is trivially true for practical set-mapping functions (e.g., bit selection). To demonstrate that inclusion implies both set-refinement and nondecreasing associativity, we show that a block can be replaced in cache C1 and still remain in cache C2, violating inclusion, if either 1) set-refinement does not hold or 2) set-refinement holds but the larger cache has the smaller associativity.
1) If cache C2 does not refine cache C1, then there exists at least one pair of blocks x and y such that f2(x) = f2(y) and f1(x) != f1(y). Since we assume many blocks map to each set, there exist many blocks zi for which f2(zi) = f2(x) = f2(y). Since f1(x) != f1(y), either f1(zi) != f1(x) or f1(zi) != f1(y) (or both), implying set-refinement is violated many times. Without loss of generality, assume that many zi's map to different f1 sets than x (otherwise, many map to different f1 sets than y). Let n2 of these be denoted by w1, ..., wn2.⁴ Consider references to x, w1, ..., wn2. Inclusion is now violated since x is in cache C1, but not in cache C2. It is in cache C1, because blocks w1, ..., wn2 mapped to other sets than x and could not force its replacement; x is replaced in n2-way set-associative cache C2, since LRU replacement is used and the n2 other blocks mapped to its set are more recently referenced.
2) Let x0, ..., xn2 be a collection of blocks that map to the same f2 set. Since we are assuming f2 refines f1, they also map to the same f1 set. Consider references to x0, x1, ..., xn2. Inclusion is now violated since x0 is in n1-way set-associative cache C1, but not in n2-way set-associative cache C2 (n1 > n2 implies n1 >= n2 + 1).
Suppose cache C2 refines cache C1 and n2 >= n1. Initially both caches are empty and inclusion holds, because everything (nothing) in cache C1 is also in cache C2. Consider the first time inclusion is violated, i.e., some block is in cache C1 that is not in cache C2. This can only occur when some block x0 is replaced from cache C2, but not from cache C1. A block x0 can only be replaced from cache C2 if n2 blocks, x1 through xn2, all mapping to f2(x0), are referenced after it. By set-refinement, f1(x0) = f1(x1) = ... = f1(xn2). Since n2 >= n1, x0 must also be replaced in cache C1. □
`Several corollaries, used to develop the cache simulation
`algorithms in the next two sections, follow directly from the
`above definitions and theorem.
1) If cache C2 refines cache C1 and their set-mapping functions f2 and f1 are different (partition blocks differently), then cache C2 has more sets than cache C1. The number of sets
³ Set-refinement is called set-hierarchy in [15].
`
⁴ Blocks w1, ..., wn2 exist if at least 2n2 blocks map to set f2(x).
`