Evaluating Associativity in CPU Caches

IEEE TRANSACTIONS ON COMPUTERS, VOL. 38, NO. 12, DECEMBER 1989
Abstract—Because of the infeasibility or expense of large fully-associative caches, cache memories are usually designed to be set-associative or direct-mapped. This paper 1) presents new and efficient algorithms for simulating alternative direct-mapped and set-associative caches, and 2) uses those algorithms to quantify the effect of limited associativity on the cache miss ratio.

We introduce a new algorithm, forest simulation, for simulating alternative direct-mapped caches, and generalize one, which we call all-associativity simulation, for simulating alternative direct-mapped, set-associative, and fully-associative caches. We find that while all-associativity simulation is theoretically less efficient than forest simulation or stack simulation (a commonly used simulation algorithm), in practice it is not much slower and allows the simulation of many more caches with a single pass through an address trace.

We also provide data and insight into how varying associativity affects the miss ratio. We show: 1) how to use simulations of alternative caches to isolate the cause of misses; 2) that the principal reason why set-associative miss ratios are larger than fully-associative ones is (as one might expect) that too many active blocks map to a fraction of the sets, even when blocks map to sets in a uniform random manner; and 3) that reducing associativity from eight-way to four-way, from four-way to two-way, and from two-way to direct-mapped causes relative miss ratio increases in our data of about 5, 10, and 30 percent, respectively, consistently over a wide range of cache sizes and a range of line sizes.
Index Terms—Associativity, buffer, cache memory, computer architecture, direct-mapped, memory systems, performance evaluation, set-associative, and trace-driven simulation algorithms.
I. INTRODUCTION

THREE important CPU cache parameters are cache size, block (line) size, and associativity [27]. Cache size (buffer size, capacity) is so important that it is a part of almost all cache studies (for a partial bibliography see [29]). Block size (line size) has recently been examined in detail in [30]. Here we concentrate on associativity (degree of associativity, set size), which is the number of places in a cache where a block can reside.

Manuscript received February 14, 1989; revised July 10, 1989. The work presented here is based on research supported in part by the National Science Foundation under Grants CCR-8202591 and MIP-8713274, by the State of California under the MICRO program, by the Defense Advanced Research Projects Agency monitored by Naval Electronics Systems Command under Contract N00039-85-C-0269, by the graduate school at the University of Wisconsin-Madison, and by IBM Corporation, Digital Equipment Corporation, Philips Research Laboratories/Signetics Corporation, Apple Computer Corporation, and the Hewlett-Packard Corporation.

M. D. Hill is with the Department of Computer Sciences, University of Wisconsin, Madison, WI 53706.

A. J. Smith is with the Computer Science Division, Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720.

IEEE Log Number 8931174.
Selecting optimal associativity is important, because changing associativity has a significant impact on cache performance and cost. Increasing associativity improves the likelihood that a block is resident by decreasing the probability that too many recently-referenced blocks map to the same place and by allowing more blocks to be considered for replacement. The effect of associativity on cache miss ratio has never been isolated and quantified, and that is one of the major goals of this paper. Conversely, increasing associativity often increases cache cost and access time, since more blocks (frames) must be searched in parallel to find a reference [16].
Fig. 1 illustrates set-associativity. A set-associative cache uses a set-mapping function f to partition all blocks (data in an aligned, fixed-sized region of memory) into a number of equivalence classes. Some number of block frames in the cache are assigned to hold recently-referenced blocks from each equivalence class. Each group of block frames is called a set. The number of such groups, equal to the number of equivalence classes, is called the number of sets (s). The number of block frames in each set is called the associativity (degree of associativity, set size, n). The number of block frames in the cache (c) always equals the associativity times the number of sets (c = n · s). A cache is fully-associative if it contains only one set (n = c, s = 1), is direct-mapped if each set contains one block frame (n = 1, s = c), and is n-way set-associative otherwise (where n is the associativity, s = c/n).

On a reference to block x, the set-mapping function f feeds the "set decoder" with f(x) to select one set (one row), and then each block frame in the set is searched until x is found (a cache hit) or the set is exhausted (a cache miss). On a cache miss, one block in set f(x) is replaced with the block x obtained from memory. Finally, the word requested from block x is returned to the processor. Here, for conceptual simplicity, we show the word within the block selected last (in the box "compare block number with tags and select data word"). Many implementations, however, select the word within the block while selecting the set, to reduce the number of bits that must be read; i.e., only words are gated into the multiplexer, not full lines. The most commonly used set-mapping function is the block number modulo the number of sets, where the number of sets is a power of two. This set-mapping function is called bit selection, since the set number is just the number given by the low-order bits of the block address. For 256 sets, for example, f(x) = x mod 256 or f(x) = x AND 0xff, where mod is remainder and AND is bitwise AND.
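To make bit selection concrete, here is a minimal C sketch of the mapping just described; the 32-byte block size and 256 sets are the example values from the text, and the function names are ours.

```c
#include <stdint.h>

#define BLOCK_SIZE 32u    /* bytes per block (line); example value from the text */
#define NUM_SETS   256u   /* s = 256, the text's example; must be a power of two */

/* Block number: the address with the offset-within-block bits stripped. */
static uint32_t block_number(uint32_t addr)
{
    return addr / BLOCK_SIZE;               /* addr >> 5 for 32-byte blocks */
}

/* Bit selection: f(x) = x mod s, i.e., the low-order log2(s) bits
   of the block number. For 256 sets this is x AND 0xff.            */
static uint32_t set_index(uint32_t addr)
{
    return block_number(addr) & (NUM_SETS - 1u);
}
```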
The method we use for examining associativity in CPU caches is trace-driven simulation. It uses one or more (address) traces and a (cache) simulator.
[Fig. 1. Set-associative mapping. An address is split into a block number and a block offset; the set decoder selects one of the s sets (S = c/n), and the block number is compared with the tags of the n block frames in that set (associativity A = n).]
A trace is the log of a dynamic series of memory references, recorded during the execution of a program or workload. The information recorded for each reference must include the address of the reference and may include the reference's type (instruction fetch, data read, or data write), length, and other information. A simulator is a program that accepts a trace and parameters that describe one or more caches, mimics the behavior of those caches in response to the trace, and computes performance metrics (e.g., miss ratio) for each cache.
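The paper does not prescribe a trace format or simulator interface; the following C skeleton is a hypothetical illustration of the loop such a simulator runs, with the record layout and the cache-model callback assumed for the example.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical trace record: the text requires only an address per
   reference; the type field is the optional extra information.     */
struct trace_ref {
    uint32_t addr;
    uint8_t  type;   /* e.g., 0 = instruction fetch, 1 = read, 2 = write */
};

/* Skeleton of a trace-driven simulator: replay each reference into a
   cache model (supplied as a callback) and report the miss ratio.   */
double simulate(FILE *trace, int (*cache_access)(uint32_t addr))
{
    struct trace_ref r;
    unsigned long refs = 0, misses = 0;

    while (fread(&r, sizeof r, 1, trace) == 1) {
        refs++;
        if (!cache_access(r.addr))   /* callback returns nonzero on a hit */
            misses++;
    }
    return refs ? (double)misses / (double)refs : 0.0;
}
```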
We analyze associativity in caches with trace-driven simulation for the same reasons as are discussed in [28]. The principal advantage of trace-driven simulation over random-number-driven simulation or analytical modeling is that there exists no generally-accepted model for program behavior (at the cache level) with demonstrated validity and predictive power. The major disadvantage is that workload samples must be relatively short, due to disk space and simulation time limits.
The CPU time required to simulate many alternative caches with many traces can be enormous. Mattson et al. [19] addressed a similar problem for virtual memory simulation by developing a technique we call stack simulation, which allows miss ratios for all memory sizes to be computed simultaneously, during one pass through the address trace, subject to several constraints including a fixed page size. While stack simulation can be applied to caches, each cache configuration with a different number of sets requires a separate simulation. For this reason, this paper first examines better algorithms for simulating alternative direct-mapped and set-associative caches, and then uses those algorithms to study associativity in caches.
The rest of this paper is organized as follows. Section II reviews previous work on cache simulation algorithms and associativity in caches. In Section III, we explain our methods in more detail and describe our traces. Section IV discusses cache simulation algorithms, including properties that facilitate rapid simulation, a new algorithm for simulating alternative direct-mapped caches, and an extension to an algorithm for simulating alternative caches with arbitrary set-mapping functions. Section V examines the effect of associativity on miss ratio, including categorizing the cause of misses in set-associative caches, relating set-associative miss ratios to fully-associative ones, comparing miss ratios from similar set-associative caches, and extending the design target miss ratios from [28] and [30] to caches with reduced associativity. Readers interested in the effect of associativity on miss ratio but not in cache simulation algorithms may skip Section IV, as Section V is written to stand alone.
II. RELATED WORK

A. Simulation Algorithms

The original paper on memory hierarchy simulation is by Mattson et al. [19]. They introduce inclusion, show when inclusion holds, and develop stack simulation, which uses inclusion to rapidly simulate alternative caches. Inclusion is the property that after any series of references, larger alternative caches always contain a superset of the blocks in smaller alternative caches.¹ Mattson et al. show inclusion holds between alternative caches that have the same block size, do no prefetching, use the same set-mapping function (and therefore have the same number of sets), and use replacement algorithms that before each reference induce a total priority ordering on all previously referenced blocks (that map to each set) and use only this priority ordering to make the next replacement decision. Replacement algorithms which meet the above condition, called stack algorithms, include LRU, OPTIMUM, and (if properly defined) RANDOM [6]. FIFO does not qualify, since cache capacity affects a block's replacement priority. In Section IV-A, we will prove when inclusion holds for caches that use arbitrary set-mapping functions and LRU replacement.

Mattson et al. develop stack simulation to simulate alternative caches that have the same block size, do no prefetching, use the same set-mapping function, and use a stack replacement algorithm. Since inclusion holds, a single list per set, called a stack, can be used to represent caches of all associativities, with the first n elements of each stack representing the blocks in an n-way set-associative cache. For each reference, stack simulation performs three operations: 1) locate the reference in the stack, 2) update one or more metrics to indicate which caches contained the reference, and 3) update the stack to reflect the contents of the caches after the reference.

¹Inclusion is different from multilevel inclusion defined by Baer and Wang [5]. While inclusion is a property relating alternative caches, multilevel inclusion relates caches in the same cache hierarchy.
We call these three operations FIND, METRIC, and UPDATE, and will show that the algorithms discussed later in Sections IV-B and IV-C use the same steps.

The most straightforward implementation of stack simulation is to implement each stack with a linked list and record hits to position n by incrementing a counter distance[n]. After N references have been processed, the miss ratio of an n-way set-associative cache is simply 1 − (Σ_{i=1..n} distance[i])/N. Since performance with a linked list will be poor if many elements of a stack must be searched on each reference, other researchers have developed more complex implementations of stack simulation, using hash tables, m-ary trees, and AVL trees [8], [21], [33]. While these algorithms are useful for some memory hierarchy simulations, Thompson [33] concludes that linked-list stack simulation is near optimal for most CPU cache simulations. Linked-list stack simulation is fast when few links are traversed to find a reference. On average, this is the case in CPU cache simulations since 1) CPU references exhibit a high degree of locality, and 2) CPU caches usually have a large number of sets and limited associativity, dividing active blocks among many stacks and bounding maximum stack size; different results are found for file system and database traces. For this reason, we consider only linked-list stack simulation further, and use stack simulation to refer to linked-list stack simulation.
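As an illustration of the three steps, here is a minimal C sketch of linked-list stack simulation for a single set (a real simulator keeps one stack per set); the MAX_DEPTH bound and all names are our assumptions, not the paper's.

```c
#include <stdlib.h>

#define MAX_DEPTH 64                  /* assumed bound on recorded depth */

struct node { unsigned block; struct node *next; };

static struct node *stack_top;                 /* LRU stack for one set  */
static unsigned long distance[MAX_DEPTH + 1];  /* distance[d]: hits at depth d */
static unsigned long infinite;                 /* first-time references  */

/* One reference: FIND the block's depth in the stack, record the
   METRIC, then UPDATE by moving the block to the top (MRU position).
   A hit at depth d is a hit in every cache of associativity >= d.   */
void reference(unsigned block)
{
    struct node **link = &stack_top;
    unsigned depth = 1;

    while (*link && (*link)->block != block) {
        link = &(*link)->next;
        depth++;
    }
    if (*link) {                               /* FIND succeeded        */
        struct node *n = *link;
        if (depth <= MAX_DEPTH)
            distance[depth]++;                 /* METRIC                */
        *link = n->next;                       /* unlink ...            */
        n->next = stack_top;                   /* ... and push on top:  */
        stack_top = n;                         /* UPDATE                */
    } else {                                   /* miss in every finite cache */
        struct node *n = malloc(sizeof *n);
        n->block = block;
        n->next = stack_top;
        stack_top = n;
        infinite++;
    }
}
```

After N references, 1 − (Σ_{i=1..n} distance[i])/N gives the n-way miss ratio for this set-mapping function, exactly as in the formula above.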
Mattson et al. also briefly mention a way of simulating caches with different numbers of sets (and therefore different set-mapping functions). In two technical reports, Traiger and Slutz extend the algorithms to simulate alternative caches with different numbers of sets and block sizes [34], and with different numbers of sets, block sizes, and subblock sizes (sector and block sizes, address and transfer block sizes) [24]. They require that all alternative caches use LRU replacement, use bit selection for set mapping, and have block and subblock sizes that are powers of two. (Bit selection uses some of the bits of the block address as a binary number to specify the set.) In Section IV-C, we generalize their algorithm for simulating alternative caches that use bit selection to arbitrary set-mapping functions.
The speed of stack simulation can also be improved by deleting references (trace entries) that will hit and not affect replacement decisions in the caches to be simulated [25]. Puzak [23] shows that if all caches simulated use bit selection and LRU replacement, references that hit the most recently used element of a set can be deleted without affecting the total number of misses. We will show that this result follows trivially from properties we define in Section IV-A, allowing such references to be deleted from traces before using any of our simulation algorithms. (The total number of memory references in the original trace must be retained, in order to compute the miss ratio.)
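A C sketch of this deletion rule, under the stated assumptions (bit selection and LRU in all simulated caches); choosing STRIP_SETS as the smallest number of sets among the simulated caches is our reading, since a block that is most recently used in a coarse set is also most recently used in any set of a refining (finer) mapping.

```c
#define STRIP_SETS 64u                 /* illustrative: smallest set count
                                          among the caches to be simulated */

static unsigned stripped_mru[STRIP_SETS];   /* MRU block number per set */
static int      stripped_valid[STRIP_SETS];

/* Returns nonzero if the reference must be kept in the stripped trace.
   A hit on the MRU block of its set can be deleted, but the original
   reference count N must still be retained for miss-ratio computation. */
int keep_reference(unsigned block)
{
    unsigned set = block & (STRIP_SETS - 1u);   /* bit selection */
    if (stripped_valid[set] && stripped_mru[set] == block)
        return 0;                               /* MRU hit: delete   */
    stripped_mru[set] = block;
    stripped_valid[set] = 1;
    return 1;                                   /* keep in the trace */
}
```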
B. Associativity

Previous work on associativity can be broken into the following three categories: 1) papers that discuss associativity as part of a more general analysis of 32 kbyte and smaller caches, among the more notable of which are [18], [17], [7], [32], [27],² [11], and [13]; 2) papers that discuss associativity and other aspects of cache design for larger caches ([4], [2], and [22]); and 3) those that discuss only associativity ([26] and [16]). Since caches have been getting larger, papers in category 1) can also be characterized as older, while those in category 2) are more recent.

²This survey includes results for some large caches with wide associativity (e.g., 32-way set-associative 64 kbyte caches).
Papers in category 1) provide varying quantities of data regarding the effect of changing associativity in small caches. The qualitative trend they support is that changing associativity from direct-mapped to two-way set-associative improves the miss ratio, doubling associativity to four-way produces a smaller improvement, doubling again to eight-way yields an even smaller improvement, and subsequent doublings yield no significant improvement. Our quantitative results are consistent with results in these papers. We extend their results by examining relative miss ratio changes to isolate the effect of associativity from other cache aspects, and by examining some larger caches.
Alexander et al. use trace-driven simulation to study small and large caches [4]. Unfortunately, the miss ratios they give are much lower than those that have been measured with hardware monitors and real workloads; see [28] for reports of real measurements.

Agarwal et al. use traces gathered by modifying the microcode of the VAX 8200 to study large caches and to try to separate operating system and multiprogramming effects [2]. They briefly examine associativity, where they find that associativity in large caches impacts multiprogramming workloads more strongly than uniprocessor workloads. They find for one workload that decreasing associativity from two-way to direct-mapped increases the multiprogramming miss ratio by 100 percent and the uniprogramming miss ratio by 43 percent. These numbers are much larger than the average miss ratio change we find (30 percent).
Przybylski et al. [22] examine cache implementation tradeoffs. They find that reducing associativity from two-way to direct-mapped increases miss ratio 25 percent, regardless of cache size, which is consistent with our results. One contribution of that paper is a method of translating the architectural impact of a proposed design change into time, by computing the cache hit time increase that will exactly offset the benefit of the proposed change. A change improves performance only if the additional delay required to implement the change is less than the above increase. Przybylski et al. find that the architectural impact times for increasing associativity are often small, especially for large caches, calling into question the benefit of wide associativity.
The first paper to concentrate exclusively on associativity is [26]. That paper presents a model that allows miss ratios for set-associative caches to be accurately derived from the fully-associative miss ratio. In Section V-B, we further validate those results by showing that the model accurately relates the miss ratios of many caches, including large direct-mapped caches, to LRU distance probabilities.

The second paper to concentrate on associativity is [16], based on parts of [15]. It shows that many large single-level
caches in uniprocessors should be direct-mapped, since the drawbacks of direct-mapped caches (e.g., worse miss ratios and more-common worst-case behavior) have small significance for large caches with small miss ratios, while the benefits of direct-mapped caches (lower cost and faster access time) do not diminish with increasing cache size. Here we examine miss ratio in more detail, but do not discuss implementation considerations.
III. METHODS AND TRACES

In this section, we discuss the use of the miss ratio as a suitable metric (among others), describe the traces that we use, show how we estimate average steady-state miss ratios, and show that our traces yield results consistent with those observed from running systems.

To first order, the effective access time of a cache can be modeled as t_cache + miss_ratio · t_memory. (Additional factors affect access time, including the overhead of write-backs, extra time for line crossers, page crossers, and TLB misses, and the fact that writes may be slower than reads. These latter delays are much less significant than those given in the expression.) The miss ratio is the number of cache misses divided by the number of memory references, t_memory is the time for a cache miss, and t_cache is the time to access the cache on a hit. The two latter parameters are implementation dependent, and in [15] there is a discussion of their effect on cache performance. As noted earlier, increases in associativity, while generally improving the miss ratio, can increase access time, and thus degrade overall performance. Here, we concentrate on the miss ratio because it is easy to define, interpret, and compute, and is implementation independent. This independence facilitates cache performance comparisons between caches not yet implemented and those implemented with different technologies and in different kinds of systems.
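As a worked instance of this model (the numbers are illustrative assumptions, not measurements from the paper): with t_cache = 1 cycle, t_memory = 10 cycles, and a miss ratio of 0.05, the effective access time is 1 + 0.05 · 10 = 1.5 cycles. An associativity increase that cut the miss ratio to 0.04 would save 0.1 cycles per reference, and pays off only if it lengthens t_cache by less than that amount; this is the kind of tradeoff quantified in [22].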
Results in this paper are based on two partially overlapping groups of traces, called the five-trace and 23-trace groups, respectively. Table I presents data on the traces. The first column gives the name of each trace sample. The second gives the fraction of all references that are instruction references. In these simulations, we do not distinguish between data reads and writes. The third column gives the length of the address traces in 1000's of references. The final column gives the number of distinct bytes referenced by the trace, where any reference in an aligned 32-byte block is considered to have touched each byte in the block.

Each of the trace samples in the five-trace group comes from the second 500,000 references of a longer trace. The first three samples are user and system VAX-11 traces gathered with ATUM [1]. Trace mul2-2nd500K contains a circuit simulator and a microcode address allocator running concurrently under VMS. Trace mul8-2nd500K is an eight-job multiprogrammed workload under VMS: spice, alloc, a Fortran compile, a Pascal compile, an assembler, a string search in a file, jacobi (a numerical benchmark), and an octal dump. Trace ue-2nd500K consists of several copies of a program that simulates interactive users running under Ultrix. The other two samples in the trace group, mvs1-2nd500K and mvs2-2nd500K, are collections of IBM 370 references from
TABLE I
DATA ON TRACES

Trace Sample Name   Instruction       Length (1000's    Dynamic Size
                    References (%)    of references)    (Kbytes)
mul2-2nd500K        53                500               218
mul8-2nd500K        51                500               292
ue-2nd500K          55                500               277
mvs1-2nd500K        52                500               163
mvs2-2nd500K        55                500               201

[The rows for the 23-trace group (instruction percentages of roughly 50-61, lengths of roughly 228-414 thousand references, and dynamic sizes of roughly 54-221 Kbytes) are not fully legible in this copy.]
system calls invoked in two Amdahl standard MVS workloads [28].

The second trace group contains 23 samples of various workloads gathered on a VAX-11 with ATUM [1]. Trace samples that exhibit unstable behavior (e.g., a particular doubling of cache size or associativity alters the observed miss ratio by many factors of two) have been excluded from both groups.

We estimate the steady-state miss ratio for a trace sample using the miss ratio for a trace after the cache is warm (the warm-start miss ratio). A cache is warm if its future miss ratio is not significantly affected by the cache recently being empty [2]. We compute warm-start miss ratios using the second 250K references of each 500K-reference trace sample. We found that most caches with our traces are warm by 250K references, by locating the knee in the graph of cumulative misses to empty block frames versus references, a method of determining when caches are warm proposed in Agarwal et al. [2]. Furthermore, results for these multiprogrammed traces properly include cold-start effects whenever a process resumes execution.
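A minimal C sketch of this measurement, assuming a per-reference hook into the simulator; the 250K warmup follows the text, and everything else is illustrative.

```c
#define WARMUP 250000UL    /* first half of a 500K-reference sample */

static unsigned long refs, warm_refs, warm_misses;

/* Called once per memory reference; 'hit' is the simulator's verdict.
   Misses during the first WARMUP references fill the cache but are
   not counted toward the warm-start miss ratio.                     */
void record_access(int hit)
{
    refs++;
    if (refs > WARMUP) {
        warm_refs++;
        if (!hit)
            warm_misses++;
    }
}

double warm_start_miss_ratio(void)
{
    return warm_refs ? (double)warm_misses / (double)warm_refs : 0.0;
}
```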
Fig. 2(a) and (b) displays miss ratio data for unified caches (mixed, i.e., data and instructions cached together) with 32-byte blocks. Solid lines show the average warm-start miss ratios for different associativities (1, 2, 4, and 8). The average warm-start miss ratio is the arithmetic average of the warm-start miss ratios of the five traces in the five-trace group. The arithmetic mean is used because it represents the miss ratio of a workload consisting of an equal number of references from each of the traces. Previous experiments (as were done for [31] and [15]) showed that little difference was observed when other averaging methods were used. The dashed line (labeled "inf") gives the warm-start miss ratio of an infinite cache, a cache so large that it never replaces any blocks. Measurements for the 23-trace group are similar.
[Fig. 2. Miss ratios for the five-trace workload with caches of associativities 1, 2, 4, and 8; miss ratio versus cache size (bytes). The dashed line shows the miss ratio for an infinite cache. (a) Smaller caches. (b) Larger caches.]
Fig. 3 compares miss ratios for the five-trace group in eight-way set-associative unified caches, having 16-byte and 32-byte blocks, to miss ratios from other sources. Line "cold" measures miss ratios from an empty cache, while line "warm" does not count misses until after 250K references.
[Fig. 3. Comparison of our miss ratio data (solid lines) with other published data (A, B, C, D); miss ratio versus cache size (bytes). (a) 16-byte blocks. (b) 32-byte blocks.]
Since the trace samples include multiprogramming effects, both contain some cold-start misses [12]. Lines labeled A and B show the design target miss ratios for fully-associative caches from [28] and [30]. The line labeled C, from [2], shows four-way set-associative miss ratio results from Fig. 17 of that paper.
Finally, the line labeled D, from [27], shows four-, six-, and eight-way set-associative miss ratios taken from hardware monitor measurements on an Amdahl 470 (Fig. 33 of that paper, assuming 50 percent supervisor execution). Fig. 3 demonstrates that the miss ratios of the five-trace group are consistent with those measured and/or proposed for actual operating environments.

Despite the similarities with previously published data, miss ratio data for large caches (greater than 64K bytes) are subject to greater error, since only a few thousand misses may occur during a trace sample. To reduce sensitivity to such error, results in Section V concentrate on the relationship between the miss ratios of alternative caches rather than on the miss ratio values themselves.
IV. SIMULATION TECHNIQUES FOR ALTERNATIVE DIRECT-MAPPED AND SET-ASSOCIATIVE CACHES

In this section we first discuss two properties, set-refinement and inclusion, that facilitate the rapid simulation of alternative caches. We then develop a new algorithm that uses both set-refinement and inclusion to rapidly simulate alternative direct-mapped caches. Next we generalize an algorithm that simulates alternative set-associative caches using bit selection [34] to one that allows arbitrary set-mapping functions. Finally we compare implementations of the algorithms.
A. Properties that Facilitate Rapid Simulation

Two properties useful for simulating alternative direct-mapped and set-associative caches are set-refinement³ (introduced below) and inclusion (introduced in Mattson et al. [19]). Here we discuss these properties with respect to caches that have the same block size, do no prefetching, use LRU replacement, have arbitrary associativities, and can use arbitrary set-mapping functions. Let C1(A = n1, F = f1) and C2(A = n2, F = f2) be two such caches, where cache Ci has associativity ni and set-mapping function fi, i = 1, 2.

Definition 1: Set-refinement: Set-mapping function f2 refines set-mapping function f1 if f2(x) = f2(y) implies f1(x) = f1(y), for all blocks x and y.

Furthermore, cache C2(A = n2, F = f2) is said to refine an alternative cache C1(A = n1, F = f1) if set-mapping function f2 refines set-mapping function f1. Refines is so named because f2 refines f1 implies that f2 induces a finer partition on all blocks than does f1. Since set-refinement is clearly transitive, if fᵢ₊₁ refines fᵢ for each i = 1, ..., L−1, then fⱼ refines fᵢ for all j > i, implying a hierarchy of sets. We will use set-refinement to facilitate the rapid simulation of alternative direct-mapped caches (Section IV-B) and set-associative caches (Section IV-C).
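For bit selection, refinement reduces to divisibility of the set counts: f2(x) = x mod s2 refines f1(x) = x mod s1 exactly when s1 divides s2 (which always holds for powers of two with s1 ≤ s2). A small C sketch of this check (names ours):

```c
#include <assert.h>

/* Bit-selection set-mapping: block number modulo the set count. */
static unsigned f(unsigned block, unsigned sets) { return block % sets; }

/* f2 (s2 sets) refines f1 (s1 sets) iff f2(x) == f2(y) implies
   f1(x) == f1(y); for modulo mappings this holds exactly when s1
   divides s2.                                                     */
static int bit_selection_refines(unsigned s2, unsigned s1)
{
    return s2 % s1 == 0;
}

/* Example: mod-8 refines mod-4, but not mod-3; the loop checks the
   definition exhaustively over a small block range.                */
static void example(void)
{
    assert(bit_selection_refines(8, 4));
    assert(!bit_selection_refines(8, 3));
    for (unsigned x = 0; x < 64; x++)
        for (unsigned y = 0; y < 64; y++)
            if (f(x, 8) == f(y, 8))
                assert(f(x, 4) == f(y, 4));   /* refinement in action */
}
```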
Definition 2: Inclusion: Cache C2(A = n2, F = f2) includes an alternative cache C1(A = n1, F = f1) if, for any block x, after any series of references, x is resident in C1 implies x is resident in C2.

Thus, when cache C2 includes cache C1, C2 always contains a superset of the blocks in C1. Inclusion facilitates rapid simulation of alternative caches by allowing hits in larger caches to be inferred from hits detected in smaller ones. Mattson et al. [19] show when inclusion holds for alternative caches that use the same set-mapping function (and hence the same number of sets). Next we show when it holds with LRU replacement and arbitrary set-mapping functions.
Theorem 1: Given the same block size, no prefetching, and LRU replacement, cache C2(A = n2, F = f2) includes cache C1(A = n1, F = f1) if and only if set-mapping function f2 refines f1 (set-refinement) and associativity n2 ≥ n1 (nondecreasing associativity).

Proof: Suppose cache C2 includes cache C1. Suppose further that a large number of blocks map to each set in both caches, as is trivially true for practical set-mapping functions (e.g., bit selection). To demonstrate that inclusion implies both set-refinement and nondecreasing associativity, we show that a block can be replaced in cache C1 and still remain in cache C2, violating inclusion, if either 1) set-refinement does not hold, or 2) set-refinement holds but the larger cache has the smaller associativity.

1) If cache C2 does not refine cache C1, then there exists at least one pair of blocks x and y such that f2(x) = f2(y) and f1(x) ≠ f1(y). Since we assume many blocks map to each set, there exist many blocks zᵢ for which f2(zᵢ) = f2(x) = f2(y). Since f1(x) ≠ f1(y), either f1(zᵢ) ≠ f1(x) or f1(zᵢ) ≠ f1(y) (or both), implying set-refinement is violated many times. Without loss of generality, assume that many zᵢ's map to different f1 sets than x (otherwise, many map to different f1 sets than y). Let n2 of these be denoted by w₁, ..., wₙ₂.⁴ Consider references to x, w₁, ..., wₙ₂. Inclusion is now violated, since x is in cache C1 but not in cache C2. It is in cache C1 because blocks w₁, ..., wₙ₂ mapped to other sets than x and could not force its replacement; x is replaced in n2-way set-associative cache C2, since LRU replacement is used and the n2 other blocks mapped to its set are more recently referenced.

2) Let x₀, ..., xₙ₂ be a collection of blocks that map to the same f2 set. Since we are assuming f2 refines f1, they also map to the same f1 set. Consider references to x₀, x₁, ..., xₙ₂. Inclusion is now violated, since x₀ is in n1-way set-associative cache C1 but not in n2-way set-associative cache C2 (n1 > n2 implies n1 ≥ n2 + 1).

Conversely, suppose cache C2 refines cache C1 and n2 ≥ n1. Initially both caches are empty and inclusion holds, because everything (nothing) in cache C1 is also in cache C2. Consider the first time inclusion is violated, i.e., some block is in cache C1 that is not in cache C2. This can only occur when some block x₀ is replaced from cache C2 but not from cache C1. A block x₀ can only be replaced from cache C2 if n2 blocks, x₁ through xₙ₂, all mapping to f2(x₀), are referenced after it. By set-refinement, f1(x₀) = f1(x₁) = ··· = f1(xₙ₂). Since n2 ≥ n1, x₀ must also be replaced in cache C1. □
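Theorem 1 can also be checked empirically. The following self-contained C sketch (parameters arbitrary and ours, not from the paper) simulates two bit-selection LRU caches where C2 refines C1 (S1 divides S2) and n2 ≥ n1, and asserts after every reference that each block resident in C1 is also resident in C2.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { S1 = 4, N1 = 2, S2 = 8, N2 = 2, BLOCKS = 64, REFS = 100000 };

static unsigned c1[S1][N1], c2[S2][N2];   /* each set ordered MRU-first */
static int      v1[S1][N1], v2[S2][N2];   /* valid bits                 */

/* Reference block b in one set: move it to the MRU slot, evicting
   the LRU block if it was not already resident.                    */
static void touch(unsigned *set, int *valid, int ways, unsigned b)
{
    int pos = ways - 1;                       /* default: replace LRU */
    for (int i = 0; i < ways; i++)
        if (valid[i] && set[i] == b) { pos = i; break; }
    memmove(&set[1], &set[0], (size_t)pos * sizeof set[0]);
    memmove(&valid[1], &valid[0], (size_t)pos * sizeof valid[0]);
    set[0] = b;
    valid[0] = 1;
}

static int resident(const unsigned *set, const int *valid, int ways, unsigned b)
{
    for (int i = 0; i < ways; i++)
        if (valid[i] && set[i] == b) return 1;
    return 0;
}

int main(void)
{
    for (int t = 0; t < REFS; t++) {
        unsigned b = (unsigned)rand() % BLOCKS;
        touch(c1[b % S1], v1[b % S1], N1, b);  /* f1: bit selection, S1 sets */
        touch(c2[b % S2], v2[b % S2], N2, b);  /* f2 refines f1: S1 | S2     */

        /* Inclusion: every block resident in C1 must be resident in C2. */
        for (int s = 0; s < S1; s++)
            for (int w = 0; w < N1; w++)
                if (v1[s][w]) {
                    unsigned x = c1[s][w];
                    assert(resident(c2[x % S2], v2[x % S2], N2, x));
                }
    }
    return 0;
}
```

Changing S1 to a value that does not divide S2, or making N2 smaller than N1, makes the assertion fire, matching the two violation constructions in the proof.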
Several corollaries, used to develop the cache simulation algorithms in the next two sections, follow directly from the above definitions and theorem.

1) If cache C2 refines cache C1 and their set-mapping functions f2 and f1 are different (partition blocks differently), then cache C2 has more sets than cache C1. The number of sets

³Set-refinement is called set-hierarchy in [15].

⁴Blocks w₁, ..., wₙ₂ exist if at least 2n₂ blocks map to set f2(x).