Cache Performance of Operating System and Multiprogramming Workloads
`
`ANANT AGARWAL, JOHN HENNESSY, and MARK HOROWITZ
`Stanford University
`
Large caches are necessary in current high-performance computer systems to provide the required high memory bandwidth. Because a small decrease in cache performance can result in significant system performance degradation, accurately characterizing the performance of large caches is important. Although measurements on actual systems have shown that operating systems and multiprogramming can affect cache performance, previous studies have not focused on these effects. We have developed a program tracing technique called ATUM (Address Tracing Using Microcode) that captures realistic traces of multitasking workloads including the operating system. Examining cache behavior using these traces from a VAX processor shows that both the operating system and multiprogramming activity significantly degrade cache performance, with an even greater proportional impact on large caches. From a careful analysis of the causes of this degradation, we explore various techniques to reduce this loss. While seemingly little can be done to mitigate the effect of system references, multitasking cache miss activity can be substantially reduced with small hardware additions.
`
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles-associative memories, cache memories, virtual memory; B.3.3 [Memory Structures]: Performance Analysis and Design Aids-formal models, simulation; C.4 [Computer Systems Organization]: Performance of Systems-design studies, measurement techniques, modeling techniques; D.4.1 [Operating Systems]: Process Management-multiprocessing/multiprogramming; D.4.8 [Operating Systems]: Performance-measurements

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Cache miss rate, cache performance, cold start, trace-driven simulation, virtual caches
`trace-driven
`
1. INTRODUCTION
High-performance processors need high memory bandwidth. Caches, which store frequently used instructions and data in high-speed memories close to the processor, are a cost-effective means of attaining this bandwidth. System performance is a decreasing function of the cache miss rate, the cache access time, and the number of processor cycles taken to service a miss. Let T_a and T_m represent the cache access time and the mean cache miss service time,
`
A. Agarwal is currently with the Laboratory for Computer Science (NE43-418), MIT, Cambridge, MA 02139.
Authors' address: Computer Systems Laboratory, Stanford, CA 94305.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1988 ACM 0734-2071/88/1100-0393 $01.50
`
ACM Transactions on Computer Systems, Vol. 6, No. 4, Nov. 1988, Pages 393-431.
`
`DOCKET NO.: IPR-5463750
`
`1
`
`TI-EXHIBIT 1015
`
`
`
`394
`
`l
`
`A. Agarwal et al.
`
respectively, and m the cache miss rate. Then, for a memory bandwidth limited system, performance is inversely related to the average memory access time T_a + mT_m. The ideal memory hierarchy minimizes this average memory access time. Because memory access times often do not scale up in proportion to processor speeds, both the cost of servicing a miss (T_m) and, to a smaller extent, the cache access time (T_a) become relatively more expensive as processors achieve greater speeds [15]. Thus minimizing average memory access time will become even more important as we try to build faster machines. In this paper we will concentrate on the miss rate and cache access time because the miss service time, T_m, is primarily a function of main memory and bus architectures. Some of the cache parameters we consider affect both m and T_m (e.g., the block size); in such cases we include T_m in our analysis of cache performance.
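For illustration, the average memory access time can be computed directly from the three quantities above. In the sketch below the cycle counts and miss rates are invented for illustration, not measurements from this paper:

```python
def avg_access_time(t_a, m, t_m):
    """Average memory access time: cache access time plus
    miss rate times mean miss service time (T_a + m * T_m)."""
    return t_a + m * t_m

# Illustrative numbers (in cycles), not data from the paper:
# a fast direct-mapped cache with a higher miss rate versus a
# slower set-associative cache with a lower miss rate.
direct_mapped = avg_access_time(t_a=1.0, m=0.05, t_m=20.0)
set_assoc = avg_access_time(t_a=1.2, m=0.04, t_m=20.0)
print(direct_mapped, set_assoc)
```

With these invented numbers the two organizations come out equal (2.0 cycles each), illustrating why neither miss rate nor access time alone determines performance.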
Simultaneously minimizing both cache access time and miss rate is virtually impossible, necessitating a careful trade-off between the two to minimize the average memory access time. At first glance, reducing miss rate may seem more important, due to the large miss penalty in fetching data from main memory. Unfortunately, using more complex cache organizations to reduce miss rates results in additional levels of logic and increases cache access times, which can affect the clock cycle time of the machine. Thus, as many recent studies have pointed out (e.g., [15, 22, 24]), implementation issues often favor simple organizations for improved overall system performance. Direct-mapped caches, which do not suffer from the complications of associative caches arising from replacement, multiplexing, and associative searches, may offer a cost-effective means of obtaining a low average memory access time. Because cache access time T_a is a major bottleneck in many current high-performance system designs (e.g., VAX 8800 [12]), and especially in RISC-style architectures (e.g., SPUR [16], MIPS-X [17], R2000 [21]), some designers have chosen to use caches with a low degree of associativity, causing an increased miss rate for the same total cache size. Direct-mapped caches can also show pathological performance for some workloads. Since the miss rate must be kept low to sustain performance, large caches become necessary. Although large caches have an access time penalty, it is often less than the clock-speed penalty associated with greater set sizes. Thus a viable design alternative is to overcome the absence of associativity in direct-mapped caches by substantially increasing the cache size. Large caches are also necessitated by the increasing performance of processors and the increasing gap this generates between processor bandwidth requirements and main-memory bandwidth capabilities.
The increased reliance on large caches introduces a number of important questions that have not been adequately addressed in published cache studies. For instance, most current research has focused on small (less than 64K bytes) caches for predominantly single-process user programs. Most earlier studies either roughly modeled the effects of multiprogramming and system references or excluded them altogether. This omission was not significant in the study of small caches, as shown by simulation results of Smith [27] that correspond reasonably well with actual measurements of systems [7]. In small caches the contribution to the miss rate generated by a single process dominates the total miss rate. However, the same argument does not apply to large caches, where
`
`
`
`
both system references and multitasking can play a large role in determining the overall miss rate.
Accurate characterization of cache performance for realistic workloads has been difficult thus far because large traces of multitasking workloads including the operating system were not generally available. A recent study of multiprogramming traces of operating system workloads [5] concentrates on small caches. Some preliminary results from our research were discussed in a paper about ATUM [4]; the data presented and analyzed here is much more extensive. The ATUM technique allows the capture of real working environments with little distortion. The traces obtained using ATUM can accurately characterize cache performance, taking into account system references and multiprogramming, even for large caches.
In this paper we analyze the transient behavior of caches in detail, introducing trace sampling for efficient cache analysis and trace stitching to obtain sufficiently long traces to achieve steady-state miss rates. Using finite-length trace samples of program execution (such as ATUM traces) to characterize cache performance for the entire trace, which we call trace sampling, is an effective means of reducing both computation time and trace storage space. Naive use of traces that are not long enough to reach steady state leads to very pessimistic miss rates. This is more severe for large caches, because the larger a cache the longer the traces need to be to achieve steady-state miss rates. Long realistic traces are hard to obtain; trace stitching overcomes these difficulties by allowing traces to be catenated under certain conditions.
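In essence, trace stitching replays several samples of the same workload through one simulated cache without resetting its state between samples, so later samples run against a warm cache. The sketch below is our own illustration of that idea (with a minimal direct-mapped cache and invented addresses), not the authors' tool:

```python
def make_cache(num_blocks, block_size):
    """Minimal direct-mapped cache state: one tag slot per frame."""
    return {"frames": [None] * num_blocks, "bs": block_size,
            "refs": 0, "misses": 0}

def access(cache, addr):
    """Reference one address, counting misses."""
    block = addr // cache["bs"]
    idx = block % len(cache["frames"])
    tag = block // len(cache["frames"])
    cache["refs"] += 1
    if cache["frames"][idx] != tag:
        cache["misses"] += 1
        cache["frames"][idx] = tag

def stitched_miss_rate(samples, num_blocks=1024, block_size=16):
    """Replay trace samples back-to-back through one cache,
    without flushing between samples (trace stitching)."""
    c = make_cache(num_blocks, block_size)
    for trace in samples:
        for addr in trace:
            access(c, addr)
    return c["misses"] / c["refs"]

# Two tiny synthetic samples of the same workload: the second
# sample hits blocks warmed by the first, unlike a cold restart.
s1 = [0, 16, 32, 48]
s2 = [0, 16, 32, 64]
print(stitched_miss_rate([s1, s2]))  # 5 misses in 8 refs -> 0.625
```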
Section 2 introduces the ATUM tracing technique and describes the traces used in this study. We also review pertinent cache terminology. An analysis of transient behavior is provided in Section 3 to validate our trace sampling and trace stitching methodology, and indicate the minimum trace length needed to derive steady-state cache miss rate statistics. These techniques are then used to analyze cache performance for system references in Section 4. Section 5 presents our results on multiprogramming issues, followed by our conclusions.
`
2. METHODOLOGY
We use trace-driven simulation (TDS) to study cache performance. TDS is the technique by which a model of a proposed system is evaluated using previously recorded address traces as the external stimuli. Because of its flexibility and ease of use, TDS is a popular technique for cache performance evaluation.
Some of the other techniques of cache performance evaluation, such as hardware measurement [7] or analytical modeling [3], are not suited to a study of this nature. Hardware measurement is an accurate technique, but is limited to existing cache organizations. Prototype hardware systems that can modify the cache organization at will are hard to build. Mathematical models can yield quick estimates of cache performance; however, they are inadequate if fine-tuned results are required. Trace-driven simulation does not suffer from these drawbacks: accurate cache performance statistics for any given cache organization can be obtained. Smith discusses further the merits of trace-driven simulation for cache studies [28].
`
`
`
`
`396
`
`l
`
`A. Agarwal et al.
`
One of the key properties of trace-driven simulation is that the results are only as accurate as the traces that drive the simulation. Earlier traces seldom included operating system and multiprogramming activity. Indeed, Clark [7] argues in favor of hardware measurement over trace-driven simulation for this reason. Exclusion of system and multiprogramming references has caused miss rate estimates to be optimistic because (1) the working sets of system references or multiprogramming workloads are much larger than single-process user workloads; (2) system code is less repetitive than user code; and (3) interruption of user activity by system calls, or by other user processes, tends to flush out live portions of user code and data.
Some studies have tried to model the effects of system references and multiprogramming. Smith [27] interleaves traces to simulate multiprogramming with round-robin scheduling and constant duration time slices. However, Kobayashi's results for MVS [19] and ours for VMS and Ultrix¹ indicate that actual traces show a wide variation in time slice duration and execution order. Therefore Smith's miss rate estimates are expected to be pessimistic because round-robin scheduling results in worst-case miss rates with respect to scheduling policy. Goodman [13] simulated the effect of multiprogramming and system activity by flushing the cache on every context switch or system kernel call. A similar model was used by Peuto and Shustek [23] to observe the effect of user-initiated supervisor calls on cache performance. The results presented here show that this model is inaccurate for large caches, where significant data retention occurs across interrupts.
Since ATUM traces do not exclude any workload component, trace-driven simulation results using ATUM traces are accurate. We briefly describe the ATUM methodology in this paper. For a detailed discussion of the ATUM tracing technique and its merits relative to other tracing schemes, please refer to [4].
`
2.1 ATUM: Address Tracing Using Microcode
The ATUM scheme is very simple and can collect complete address information without the need to build special-purpose hardware. The basic idea is to do the tracing "below" the operating system, in microcode. By making minor changes to the existing microcode of a machine, a trace of all addresses touched by the processor is stashed away in a reserved area of main memory, and periodically written out to secondary storage.
Microcode modification is possible in any processor that has either a writable or patchable control store. We used a VAX 8200 processor at Digital Equipment Corporation, Hudson, to obtain our traces. The patchable microcode of the 8200 was designed to allow microcode errors to be corrected without the expense of replacing all the ROMs in existing machines. We were able to use this facility to modify the microcode to record all memory traffic.
The addresses generated by appropriate microcode routines for macroinstruction fetches and data directly correspond to the addresses specified by the machine architecture. The addresses are not tainted by implementation-specific
`
¹ Ultrix and VMS are trademarks of Digital Equipment Corporation.
`
`
`
`
`
resources such as prefetch buffers, caches, or bus sizes. Implementation-specific details, if needed, can be incorporated into a trace by a postprocessor.
`
2.2 Traces Used in This Study
We traced both Ultrix and VMS operating systems running on a VAX 8200 processor and gathered a comprehensive set of traces covering a wide range of workloads and multiprogramming levels. The traces are a sampling of typical programs used in general-purpose systems. Each ATUM trace sample, roughly 400,000 references long, is a half-second snapshot of execution on a machine the speed of a VAX-11/780. In all, 30 traces are used in this study.
For the effect of system references we concentrated on traces that had just one or two processes to exclude multitasking behavior. Twenty traces of 11 programs running under VMS and Ultrix are used.
`
DEC0, DEC1-trace samples of DECSIM, a behavioral simulator at DEC, simulating some cache hardware (VMS);
ALLC-a trace sample of a microcode address allocator (VMS);
DIA0-a diagnostics program for the VAX computer (VMS);
IVEX0, IVEX3-two samples of a DEC program, Interconnect Verify, checking net lists in a VLSI chip (VMS);
UMILS-MIPS instruction level simulator running TLB, a TLB simulator (VMS);
FORL0, FORL1-FORTRAN compile of LINPACK (VMS);
PASC0, PASC1-Pascal compile of a microcode parser program (VMS);
SPIC0, SPIC1-SPICE simulating a 2-input tri-state NAND buffer (VMS);
LISP0, LISP1-LISP runs of BOYER (a theorem prover) (VMS);
NMLST-produces a name list from an executable file (Ultrix);
SAVEC0, SAVEC1, SAVEC2, SAVEC3-samples of the C compiler (Ultrix).
`
In the VMS traces, between 5 and 50 percent of the references are due to system activity, with an average user/system ratio of 4 to 1. In the Ultrix workloads, about 50 percent of the references in NMLST are system, while the SAVEC traces show about 75 percent system references.
We used nine VMS trace samples of multiprogramming levels three, six, and ten (three trace samples at each multiprogramming level) to study the impact of process switching. The trace samples for a particular multiprogramming level showed the same component processes. The three samples of each multiprogramming level were concatenated to form three large traces called MUL3, MUL6, and MUL10. The processes active in MUL3 are a FORTRAN compile of LINPACK, the microcode address allocator, and a directory search; those in MUL6 are two FORTRAN compiles of LINPACK and a program called 4 X 1 X 5, the microcode address allocator, a SPICE run, a Pascal compile of a microcode parser, and LINPACK; and those in MUL10 include the six programs in MUL6, a numerical benchmark JACOBI, a string search in a file, MACRO (an assembly level compile), an octal dump, and a linker. For multiprogramming under Ultrix,
`
`
`
`
`398
`
`.
`
`A. Agarwal et al.
`
we concatenated the four samples of the SAVEC traces. The resulting trace, called SAVEC, has the C compiler and two other tasks.
In addition to the above traces, we will use a VMS trace of Interconnect Verify stripped of all extraneous processes (called IVEX) for some of the examples. Table II in the Appendix summarizes benchmark trace characteristics.
`
2.3 Cache Terminology and Definitions
The cache and workload parameters pertinent to our discussion include the cache size, number of sets, set size, block size, and replacement algorithm [27]. The number of sets (S) has also been called the number of rows; the set size (D), or the scope of associative search, has also been called degree of associativity or number of columns. Block size (B) is defined as the amount of storage associated with an address tag and is sometimes referred to as line size. We assume that the entire block is fetched on a miss to any word in that block. The replacement algorithm is the process used to select one of the blocks in a given set for occupation by a newly referenced block. The important schemes include LRU (least recently used), random, and FIFO (first in first out). LRU replacement is used throughout this paper. Our simulations also assume demand fetch and write allocate. Demand fetch means that a block is fetched from memory into the cache only on a cache miss; and write allocate is the policy where an entire block is fetched into the cache on a write into the block if it is absent from the cache.
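As a concrete illustration of these parameters and policies, the following minimal sketch (our own, not the simulator used in this study) models a cache with S sets, set size D, and block size B under LRU replacement; with demand fetch and write allocate, reads and writes can be treated identically for miss counting:

```python
from collections import OrderedDict

class Cache:
    """Set-associative cache: S sets, set size D (associativity),
    block size B bytes, LRU replacement."""
    def __init__(self, num_sets, set_size, block_size):
        self.S, self.D, self.B = num_sets, set_size, block_size
        # Each set is an OrderedDict of resident block tags,
        # kept in LRU order (least recent first).
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.refs = self.misses = 0

    def access(self, addr):
        self.refs += 1
        block = addr // self.B        # block number
        index = block % self.S        # set index
        tag = block // self.S
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)        # hit: mark most recently used
        else:
            self.misses += 1          # miss: fetch the entire block
            if len(s) >= self.D:
                s.popitem(last=False) # evict the least recently used
            s[tag] = True

    def miss_rate(self):
        return self.misses / self.refs

# Tiny synthetic trace; real inputs would be ATUM address traces.
cache = Cache(num_sets=4, set_size=2, block_size=4)
for addr in [0, 4, 0, 64, 128, 0]:
    cache.access(addr)
print(cache.miss_rate())
```

Note how the last reference to address 0 misses: with set size 2, the two intervening blocks that map to the same set purge it, an interference effect discussed in Section 3.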
We present cache miss rate results throughout this paper because

(1) The performance impact of the cache on the processor can be easily determined once the miss rate is known;
(2) The miss rate is a cache performance metric largely independent of processor and system implementation; other metrics such as average memory access time involve assumptions about cache access time, bus bandwidth, memory bus traffic, etc.;
(3) Since most major studies have used the miss rate as a key metric, the miss rate is convenient for comparing the various results. However, we will also use the average cache access time and the bus traffic generated by the cache when consideration of just the miss rate might be misleading.
`
2.4 Simulation Models
One of the chief aims of cache performance evaluation is to obtain a "steady-state" miss rate for a given organization. What do we mean by steady-state cache miss rate? While it is hard to define steady-state cache miss rate precisely, generally speaking it is the average miss rate of the cache over a long period of time, long enough that a good cross-section of the intended workload is seen. Unfortunately, only measurement over this period can yield an exact steady-state miss rate. A slightly less accurate method is to derive the statistical mean of miss rates obtained by simulating the cache against long traces of a number of typical programs, preferably running to completion, either singly or multiprogrammed. While we will not actually simulate the cache against an entire execution trace of a program to get the steady-state program miss rate, we will examine more
`
`
`
`
practical alternate techniques, trace sampling and trace stitching, to obtain good estimates from several smaller trace samples.
`
3. TRANSIENT BEHAVIOR AND TRACE SAMPLING
Trace sampling is the method by which finite segments of a trace, relatively small in size compared to the entire trace generated by a program, are used to accurately estimate steady-state cache performance for the program. Studying trace sampling is important for two reasons. First, enormous potential exists for reducing the simulation time in characterizing large caches. Second, since ATUM traces are of finite length (typically a half-million references), it could be argued that artificial truncation of long execution traces into multiple small samples introduces start-up distortion in simulations of large caches. The need to study sampling effects and the transient behavior of caches is even greater in operating system and multiprogramming simulations, where start-up effects are more pronounced due to larger working sets and context switching. We will describe a technique to overcome these difficulties and show how trace sampling can be used to obtain accurate estimates of cache performance efficiently.
This section deals primarily with trace sample size. The number of samples required, an important consideration, will depend both on the desired accuracy and on workload characteristics. Nonuniformly behaved programs will need more samples than well-behaved ones. In general, a simple statistical analysis will yield confidence limits on the accuracy of the mean miss rate obtained from a number of samples (see [20] on a related topic).
After motivating the study of cache transient behavior, we provide a set of working definitions for the transient behavior of caches to assess start-up effects. The aim of this section will be to find the minimum trace length necessary to obtain steady-state cache miss rates for a given workload. From the results of this analysis, we then go on to show how small trace samples can be effectively used to obtain accurate steady-state cache performance statistics.
`
3.1 Transient Behavior Analysis
We examine the transient nature of the miss rate to motivate studying start-up behavior and finite trace-size effects, especially for large caches. Figure 1 shows a plot of miss rate versus time for cache sizes 4K and 64K bytes for the benchmark IVEX. While the dependence of miss rate on time is not strong for the 4K cache, it is a decreasing function of trace length for the larger cache, implying that start-up cost can dominate behavior. Use of a short trace (say 150,000 references for the IVEX benchmark) for the 64K cache will introduce severe start-up distortion. Even when long traces are available, one may have to wait a long time for the miss rate to stabilize. Clearly, an understanding of the exact dependence of miss rate on trace length is essential to obtain steady-state performance data from short trace lengths.
Easton and Fagin [11] and Strecker [31] also stress the need to accurately analyze transient effects in caches. Easton and Fagin define two types of miss rates for fully-associative caches to address this issue: cold start and warm start. By their definition, the cold-start miss rate is obtained by simulating an initially
`
`
`
`400
`
`*
`
`A. Agarwal et al.
`
Fig. 1. Miss rate versus time for benchmark IVEX. Set size is 1; block size is 4. [Plot: miss rate versus time (K refs) for 4K-byte and 64K-byte caches.]
`
empty cache, and the warm-start (or steady-state) miss rate by first allowing the cache to fill up. While many previous studies did not address the cold-start issue in their cache simulations, some studies that used small trace samples (e.g., [13, 27]) applied Easton and Fagin's warm-start terminology for fully-associative caches to their set-associative and direct-mapped caches as well. Although these definitions are satisfactory for small, fully associative caches (where the cache fills up quickly), they become inadequate for large low-associativity caches because one might have to wait an arbitrarily long time for the cache to fill up.
The basic problem with these definitions is the implicit notion of reaching steady state when the cache fills up. This notion is based on the assumption that a large proportion of cache misses in the steady state is due to interference. An interference miss is caused by a reference to a block that was purged from the cache by an intervening reference. Interference is predominantly seen in small caches, where multiple, simultaneously active, program blocks map into the same cache block, causing strong contention for limited cache resources.
The same situation does not apply to large caches. A new reference that purges a valid cache entry will cause an interference miss only if the purged block is still live. In other words, interference misses are caused only by purged blocks in the current working set of the program. Program working sets tend to be much smaller than overall program size owing to the locality property of programs [9]. Consequently, large caches, which can comfortably sustain the working sets of multiple processes, show little interference because purged blocks are often dead. The misses that occur in such a large cache are those caused by bringing the initial working sets into the cache for the first time, called start-up misses, and from then on, in fetching additional blocks for the first time due to changing program localities, or nonstationary misses [3].
The need for a modified definition of steady state becomes apparent from the following hypothetical example. Consider an infinite cache. Previous definitions based on cache fill-up would imply that such a cache would never reach steady state. But clearly a program would reach steady state after an initial transient. Recalling that our aim is to obtain the miss rate of the program, the desired steady-state miss rate of the trace sample should correspond to the program miss rate. Our definition will yield precisely this steady-state miss rate.
`
`3.1.1 Definitions. We define cold start and warm start using the more general
`notion of steady state. Let the warm-start
`region be the portion of a trace after
`
`
`
`
`
Fig. 2. Cumulative cold misses (or misses to empty cache block frames) for IVEX for a 16K-byte cache and block size 4 bytes. [Plot: cumulative cold misses versus time (K refs).]
`
the point where either the cache has filled up, or the initial working set of the program or workload is cache resident. The former effect is called cache saturation and the latter trace saturation. The steady-state miss rate or warm-start miss rate is the miss rate in the warm-start region. As before, the cold-start miss rate is for the entire trace assuming an initially empty cache. We also define a cold miss as a miss in an empty cache location, and cumulative cold misses as the aggregate number of cold misses in an initially empty cache. A nice feature of our warm-start definition is that it gracefully degenerates to Easton and Fagin's for a small fully-associative cache.
The mechanism by which steady state is reached is different for small and large caches. In small caches, cache saturation occurs first, and in large caches trace saturation sets in earlier. A uniform and convenient method of detecting the onset of the warm phase for both cases is to observe when the rate of cold misses in a cache drops sharply. Alternatively, if we plot the cumulative cold misses for a given trace and a given cache as a function of time, the knee of the curve suggests the beginning of the steady-state region.
Consider this curve for a 16K-byte cache and the IVEX trace depicted in Figure 2. The figure shows the bilinear nature of the cumulative cold misses curve, with the knee around 30K references. The initial high-slope portion corresponds to the cache acquiring parts of the program working set. In a small cache the knee occurs when there are no more empty cache blocks, or the cache saturates; in a large cache, the knee arises when the rate of new references drops sharply, or the trace saturates. The inclusion of trace saturation distinguishes our steady-state definition from earlier ones, which considered just cache saturation.
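The knee of such a curve can also be located numerically. The sketch below applies a simple heuristic of our own devising: mark the first interval in which the rate of new cold misses falls below a small fraction of its initial rate. The cumulative counts are invented to mimic the bilinear shape described above:

```python
def find_knee(cum_cold, threshold=0.1):
    """Given cumulative cold-miss counts sampled at regular
    intervals, return the index of the first interval where the
    per-interval cold-miss rate drops below `threshold` times the
    initial rate, approximating the start of the warm-start region."""
    rates = [b - a for a, b in zip(cum_cold, cum_cold[1:])]
    if not rates:
        return len(cum_cold)
    initial = rates[0]
    for i, r in enumerate(rates):
        if r < threshold * initial:
            return i + 1
    return len(cum_cold)   # no knee within the observed trace

# Cumulative cold misses per 10K references (invented numbers
# shaped like the curve of Figure 2, knee at the fourth interval):
cum = [900, 1700, 2400, 2900, 2950, 2990, 3010, 3030]
print(find_knee(cum))  # -> 4
```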
`
3.1.2 Analysis of Start-Up Effects in Single Process Traces. We will first address start-up effects in uniprogrammed traces. The goal is to obtain steady-state miss rates for the program from minimal-length traces and to ascertain whether the length of ATUM traces is sufficient for this purpose. The procedure involves observing the curve showing cumulative cold misses as a function of time, where the knee of the curve marks the onset of trace or cache saturation and the end of the cold-start portion in the trace. Trace length must be much longer than the cold-start period to obtain useful steady-state cache miss rates.
We will use the trace IVEX as an example to analyze how cache organization affects start-up behavior. Figure 3(a) shows cumulative cold misses for a number
`
`
`
`402
`
`l
`
`A. Agatwal
`
`et al.
`
Fig. 3. (a) Cumulative cold misses and (b) cumulative misses for direct-mapped caches. [Plots: misses versus time (K refs) for several cache sizes.]
`
of cache sizes. In small caches, the cold-start portion increases as the cache grows: from about 5K references in a 4K-byte cache, to 10K references in a 16K-byte cache. The length of the cold-start portion ultimately attains a fixed value of about 25K references irrespective of cache size. The dotted curves representing cumulative misses in Figure 3(b), which correspond to all types of misses, corroborate the above argument. The number of cumulative misses is very close to the number of cold misses for large caches, showing that most of the misses occur in bringing in the program references for the first time. In small caches, because cold misses account for a very small portion of the overall number of misses, most misses are caused by interference.
The length of the cold-start phase also depends on the other cache parameters, such as block size and associativity. The start-up period is longer for a higher degree of associativity because of reduced interference; it is shorter for larger block sizes because a small cache can fill up with fewer misses [1], and in a large cache the working set can be brought in with fewer misses (assuming that the entire block is fetched on a miss). Workload characteristics are also important because start-up is longer in programs with large working sets.
For IVEX, the longest cold-start region (equal to about 25K references) is for caches bigger than 64K bytes. The start-up region will only decrease if the block size is increased from its current value of 16 bytes or if the cache size is decreased. Therefore, in general, a good steady-state estimate of miss rate can be obtained by letting the cache warm up on the initial 25K references from the trace provided the block size is not decreased. Clea