Cache Performance of Operating System
and Multiprogramming Workloads

ANANT AGARWAL, JOHN HENNESSY, and MARK HOROWITZ
Stanford University

Large caches are necessary in current high-performance computer systems to provide the required high memory bandwidth. Because a small decrease in cache performance can result in significant system performance degradation, accurately characterizing the performance of large caches is important. Although measurements on actual systems have shown that operating systems and multiprogramming can affect cache performance, previous studies have not focused on these effects. We have developed a program tracing technique called ATUM (Address Tracing Using Microcode) that captures realistic traces of multitasking workloads including the operating system. Examining cache behavior using these traces from a VAX processor shows that both the operating system and multiprogramming activity significantly degrade cache performance, with an even greater proportional impact on large caches. From a careful analysis of the causes of this degradation, we explore various techniques to reduce this loss. While seemingly little can be done to mitigate the effect of system references, multitasking cache miss activity can be substantially reduced with small hardware additions.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles - associative memories, cache memories, virtual memory; B.3.3 [Memory Structures]: Performance Analysis and Design Aids - formal models, simulation; C.4 [Computer Systems Organization]: Performance of Systems - design studies, measurement techniques, modeling techniques; D.4.1 [Operating Systems]: Process Management - multiprocessing/multiprogramming; D.4.8 [Operating Systems]: Performance - measurements

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Cache miss rate, cache performance, cold start, trace-driven simulation, virtual caches

A. Agarwal is currently with the Laboratory for Computer Science (NE43-418), MIT, Cambridge, MA 02139. Authors' address: Computer Systems Laboratory, Stanford University, Stanford, CA 94305.

1. INTRODUCTION

High-performance processors need high memory bandwidth. Caches, which store frequently used instructions and data in high-speed memories close to the processor, are a cost-effective means of attaining this bandwidth. System performance is a decreasing function of the cache miss rate, the cache access time, and the number of processor cycles taken to service a miss. Let T_c and T_m represent the cache access time and the mean cache miss service time,
respectively, and m the cache miss rate. Then, for a memory-bandwidth-limited system, performance is inversely related to the average memory access time T_c + m T_m. The ideal memory hierarchy minimizes this average memory access time. Because memory access times often do not scale up in proportion to processor speeds, both the cost of servicing a miss (T_m) and, to a smaller extent, the cache access time (T_c) become relatively more expensive as processors achieve greater speeds [15]. Thus minimizing average memory access time will become even more important as we try to build faster machines. In this paper we will concentrate on the miss rate and cache access time because the miss service time, T_m, is primarily a function of main memory and bus architectures. Some of the cache parameters we consider affect both m and T_m (e.g., the block size); in such cases we include T_m in our analysis of cache performance.
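To make the relation concrete, here is the average-access-time formula with one worked instance; the cycle counts below are illustrative assumptions, not measurements from this paper.

    % Average memory access time, from the definitions above:
    %   T_c = cache access time, T_m = mean miss service time, m = miss rate
    \[
      T_{avg} = T_c + m \, T_m
    \]
    % Assumed values: T_c = 1 cycle, T_m = 10 cycles.
    %   m = 0.05:  T_avg = 1 + 0.05 x 10 = 1.5 cycles
    %   m = 0.02:  T_avg = 1 + 0.02 x 10 = 1.2 cycles
    % Even a 3-point drop in miss rate cuts the average access time by
    % 20 percent under these assumptions.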
Simultaneously minimizing both cache access time and miss rate is virtually impossible, necessitating a careful trade-off between the two to minimize the average memory access time. At first glance, reducing the miss rate may seem more important, due to the large miss penalty in fetching data from main memory. Unfortunately, using more complex cache organizations to reduce miss rates results in additional levels of logic and increases cache access times, which can affect the clock cycle time of the machine. Thus, as many recent studies have pointed out (e.g., [15, 22, 24]), implementation issues often favor simple organizations for improved overall system performance. Direct-mapped caches, which do not suffer from the complications of associative caches arising from replacement, multiplexing, and associative searches, may offer a cost-effective means of obtaining a low average memory access time. Because cache access time T_c is a major bottleneck in many current high-performance system designs (e.g., VAX 8800 [12]), and especially in RISC-style architectures (e.g., SPUR [16], MIPS-X [17], R2000 [21]), some designers have chosen to use caches with a low degree of associativity, causing an increased miss rate for the same total cache size. Direct-mapped caches can also show pathological performance for some workloads. Since the miss rate must be kept low to sustain performance, large caches become necessary. Although large caches have an access-time penalty, it is often less than the clock-speed penalty associated with greater set sizes. Thus a viable design alternative is to overcome the absence of associativity in direct-mapped caches by substantially increasing the cache size. Large caches are also necessitated by the increasing performance of processors and the increasing gap this generates between processor bandwidth requirements and main-memory bandwidth capabilities.
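A hedged numerical sketch of this trade-off, with assumed (not measured) parameters: suppose going to two-way associativity saves a little miss rate but lengthens the access time.

    % Assumed: T_m = 10 cycles in both designs.
    \[
    \begin{aligned}
      \text{direct-mapped: } & T_c = 1.0,\; m = 0.040:\;
        T_{avg} = 1.0 + 0.040 \times 10 = 1.40 \text{ cycles} \\
      \text{two-way: }       & T_c = 1.2,\; m = 0.035:\;
        T_{avg} = 1.2 + 0.035 \times 10 = 1.55 \text{ cycles}
    \end{aligned}
    \]
    % Under these assumptions the direct-mapped cache wins despite its
    % higher miss rate: the access-time penalty of associativity
    % outweighs the miss-rate gain.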
The increased reliance on large caches introduces a number of important questions that have not been adequately addressed in published cache studies. For instance, most current research has focused on small (less than 64K-byte) caches for predominantly single-process user programs. Most earlier studies either roughly modeled the effects of multiprogramming and system references or excluded them altogether. This omission was not significant in the study of small caches, as shown by simulation results of Smith [27] that correspond reasonably well with actual measurements of systems [7]. In small caches the contribution to the miss rate generated by a single process dominates the total miss rate. However, the same argument does not apply to large caches, where
both system references and multitasking can play a large role in determining the overall miss rate.
Accurate characterization of cache performance for realistic workloads has been difficult thus far because large traces of multitasking workloads including the operating system were not generally available. A recent study of multiprogramming traces of operating system workloads [5] concentrates on small caches. Some preliminary results from our research were discussed in a paper about ATUM [4]; the data presented and analyzed here is much more extensive. The ATUM technique allows the capture of real working environments with little distortion. The traces obtained using ATUM can accurately characterize cache performance, taking into account system references and multiprogramming, even for large caches.
In this paper we analyze the transient behavior of caches in detail, introducing trace sampling for efficient cache analysis and trace stitching to obtain sufficiently long traces to achieve steady-state miss rates. Using finite-length trace samples of program execution (such as ATUM traces) to characterize cache performance for the entire trace, which we call trace sampling, is an effective means of reducing both computation time and trace storage space. Naive use of traces that are not long enough to reach steady state leads to very pessimistic miss rates. This is more severe for large caches, because the larger a cache, the longer the traces need to be to achieve steady-state miss rates. Long realistic traces are hard to obtain; trace stitching overcomes these difficulties by allowing traces to be concatenated under certain conditions.
Section 2 introduces the ATUM tracing technique and describes the traces used in this study. We also review pertinent cache terminology. An analysis of transient behavior is provided in Section 3 to validate our trace sampling and trace stitching methodology, and to indicate the minimum trace length needed to derive steady-state cache miss rate statistics. These techniques are then used to analyze cache performance for system references in Section 4. Section 5 presents our results on multiprogramming issues, followed by our conclusions.

2. METHODOLOGY

We use trace-driven simulation (TDS) to study cache performance. TDS is the technique by which a model of a proposed system is evaluated using previously recorded address traces as the external stimuli. Because of its flexibility and ease of use, TDS is a popular technique for cache performance evaluation.
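As a sketch of the TDS mechanics, the fragment below replays a file of recorded addresses against a direct-mapped cache model and reports the miss rate. The trace format (one raw 32-bit address per record) and all constants are illustrative assumptions, not the tools used in this study; a real ATUM trace also records reference type and process, which this sketch ignores.

    /* Minimal trace-driven simulation sketch: a direct-mapped cache
       replayed against a stream of 32-bit reference addresses. */
    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_BYTES (64 * 1024)               /* total cache size  */
    #define BLOCK_BYTES 16                        /* block (line) size */
    #define NUM_BLOCKS  (CACHE_BYTES / BLOCK_BYTES)

    int main(void)
    {
        static uint32_t tags[NUM_BLOCKS];
        static int      valid[NUM_BLOCKS];
        uint32_t addr;
        unsigned long refs = 0, misses = 0;

        while (fread(&addr, sizeof addr, 1, stdin) == 1) {
            uint32_t block = addr / BLOCK_BYTES;  /* block address */
            uint32_t idx   = block % NUM_BLOCKS;  /* cache index   */
            refs++;
            if (!valid[idx] || tags[idx] != block) {
                misses++;                         /* demand-fetch the block */
                valid[idx] = 1;
                tags[idx]  = block;
            }
        }
        if (refs > 0)
            printf("refs=%lu misses=%lu miss rate=%.4f\n",
                   refs, misses, (double)misses / (double)refs);
        return 0;
    }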
Some of the other techniques of cache performance evaluation, such as hardware measurement [7] or analytical modeling [3], are not suited to a study of this nature. Hardware measurement is an accurate technique, but is limited to existing cache organizations. Prototype hardware systems that can modify the cache organization at will are hard to build. Mathematical models can yield quick estimates of cache performance; however, they are inadequate if fine-tuned results are required. Trace-driven simulation does not suffer from these drawbacks: accurate cache performance statistics for any given cache organization can be obtained. Smith discusses further the merits of trace-driven simulation for cache studies [28].
One of the key properties of trace-driven simulation is that the results are only as accurate as the traces that drive the simulation. Earlier traces seldom included operating system and multiprogramming activity. Indeed, Clark [7] argues in favor of hardware measurement over trace-driven simulation for this reason. Exclusion of system and multiprogramming references has caused miss rate estimates to be optimistic because (1) the working sets of system references or multiprogramming workloads are much larger than those of single-process user workloads; (2) system code is less repetitive than user code; and (3) interruption of user activity by system calls, or by other user processes, tends to flush out live portions of user code and data.
Some studies have tried to model the effects of system references and multiprogramming. Smith [27] interleaves traces to simulate multiprogramming with round-robin scheduling and constant-duration time slices. However, Kobayashi's results for MVS [19] and ours for VMS and Ultrix indicate that actual traces show a wide variation in time-slice duration and execution order. Therefore Smith's miss rate estimates are expected to be pessimistic, because round-robin scheduling results in worst-case miss rates with respect to scheduling policy. Goodman [13] simulated the effect of multiprogramming and system activity by flushing the cache on every context switch or system kernel call. A similar model was used by Peuto and Shustek [23] to observe the effect of user-initiated supervisor calls on cache performance. The results presented here show that this model is inaccurate for large caches, where significant data retention occurs across interrupts.
Since ATUM traces do not exclude any workload component, trace-driven simulation results using ATUM traces are accurate. We briefly describe the ATUM methodology in this paper. For a detailed discussion of the ATUM tracing technique and its merits relative to other tracing schemes, please refer to [4].

2.1 ATUM: Address Tracing Using Microcode

The ATUM scheme is very simple and can collect complete address information without the need to build special-purpose hardware. The basic idea is to do the tracing "below" the operating system, in microcode. By making minor changes to the existing microcode of a machine, a trace of all addresses touched by the processor is stashed away in a reserved area of main memory and periodically written out to secondary storage.
Microcode modification is possible in any processor that has either a writable or patchable control store. We used a VAX 8200 processor at Digital Equipment Corporation, Hudson, to obtain our traces. The patchable microcode of the 8200 was designed to allow microcode errors to be corrected without the expense of replacing all the ROMs in existing machines. We were able to use this facility to modify the microcode to record all memory traffic.
The addresses generated by appropriate microcode routines for macroinstruction fetches and data directly correspond to the addresses specified by the machine architecture. The addresses are not tainted by implementation-specific
resources such as prefetch buffers, caches, or bus sizes. Implementation-specific details, if needed, can be incorporated into a trace by a postprocessor.

2.2 Traces Used in This Study

We traced both Ultrix and VMS operating systems running on a VAX 8200 processor and gathered a comprehensive set of traces covering a wide range of workloads and multiprogramming levels. The traces are a sampling of typical programs used in general-purpose systems. Each ATUM trace sample, roughly 400,000 references long, is a half-second snapshot of execution on a machine the speed of a VAX-11/780. In all, 30 traces are used in this study.

For the effect of system references we concentrated on traces that had just one or two processes, to exclude multitasking behavior. Twenty traces of 11 programs running under VMS and Ultrix are used.

ALLC - a trace sample of a microcode address allocator (VMS);
DIA0 - a diagnostics program for the VAX computer (VMS);
IVEX0, IVEX3 - two samples of a DEC program, Interconnect Verify, checking net lists in a VLSI chip (VMS);
UMILS - MIPS instruction-level simulator running TLB, a TLB simulator (VMS);
DEC0, DEC1 - trace samples of DECSIM, a behavioral simulator at DEC, simulating some cache hardware (VMS);
FORL0, FORL1 - FORTRAN compile of LINPACK (VMS);
PASC0, PASC1 - Pascal compile of a microcode parser program (VMS);
SPIC0, SPIC1 - SPICE simulating a 2-input tri-state NAND buffer (VMS);
LISP0, LISP1 - LISP runs of BOYER (a theorem prover) (VMS);
NMLST - produces a name list from an executable file (Ultrix);
SAVEC0, SAVEC1, SAVEC2, SAVEC3 - samples of the C compiler (Ultrix).

In the VMS traces, between 5 and 50 percent of the references are due to system activity, with an average user/system ratio of 4 to 1. In the Ultrix workloads, about 50 percent of the references in NMLST are system, while the SAVEC traces show about 75 percent system references.
We used nine VMS trace samples of multiprogramming levels three, six, and ten (three trace samples at each multiprogramming level) to study the impact of process switching. The trace samples for a particular multiprogramming level showed the same component processes. The three samples of each multiprogramming level were concatenated to form three large traces called MUL3, MUL6, and MUL10. The processes active in MUL3 are a FORTRAN compile of LINPACK, the microcode address allocator, and a directory search; those in MUL6 are two FORTRAN compiles of LINPACK and a program called 4 X 1 X 5, the microcode address allocator, a SPICE run, a Pascal compile of a microcode parser, and LINPACK; and those in MUL10 include the six programs in MUL6, a numerical benchmark JACOBI, a string search in a file, MACRO (an assembly-level compile), an octal dump, and a linker. For multiprogramming under Ultrix,
we concatenated the four samples of the SAVEC traces. The resulting trace, called SAVEC, has the C compiler and two other tasks.

In addition to the above traces, we will use a VMS trace of Interconnect Verify stripped of all extraneous processes, called IVEX, for some of the examples. Table II in the Appendix summarizes benchmark trace characteristics.

2.3 Cache Terminology and Definitions

The cache and workload parameters pertinent to our discussion include the cache size, number of sets, set size, block size, and replacement algorithm [27]. The number of sets (S) has also been called the number of rows; the set size (D), or the scope of associative search, has also been called the degree of associativity or the number of columns. Block size (B) is defined as the amount of storage associated with an address tag and is sometimes referred to as line size. We assume that the entire block is fetched on a miss to any word in that block. The replacement algorithm is the process used to select one of the blocks in a given set for occupation by a newly referenced block. The important schemes include LRU (least recently used), random, and FIFO (first in first out). LRU replacement is used throughout this paper. Our simulations also assume demand fetch and write allocate. Demand fetch means that a block is fetched from memory into the cache only on a cache miss; write allocate is the policy where an entire block is fetched into the cache on a write into the block if it is absent from the cache.
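To make these policies concrete, the following sketch models one set of a D-way cache under LRU replacement, demand fetch, and write allocate; the data layout and names are illustrative assumptions, not the simulator used in this study. With D = 1 the set degenerates to a single direct-mapped frame and the age bookkeeping disappears.

    /* One set of a D-way set-associative cache with LRU replacement.
       Reads and writes are handled identically on a miss (write
       allocate), and a block is fetched only on a miss (demand fetch).
       Ages run from 0 (most recently used) upward. */
    #include <stdint.h>

    #define D 4                          /* set size (associativity) */

    struct cache_set {
        uint32_t tag[D];
        int      valid[D];
        int      age[D];                 /* 0 = most recently used */
    };

    /* Returns 1 on a hit, 0 on a miss. */
    static int access_set(struct cache_set *s, uint32_t tag)
    {
        int i, victim = -1;

        for (i = 0; i < D; i++) {
            if (s->valid[i] && s->tag[i] == tag) {      /* hit */
                int j;
                for (j = 0; j < D; j++)                 /* LRU update */
                    if (s->valid[j] && s->age[j] < s->age[i])
                        s->age[j]++;
                s->age[i] = 0;
                return 1;
            }
        }
        for (i = 0; i < D; i++)                         /* empty frame first */
            if (!s->valid[i]) { victim = i; break; }
        if (victim < 0) {                               /* else evict the LRU */
            victim = 0;
            for (i = 1; i < D; i++)
                if (s->age[i] > s->age[victim]) victim = i;
        }
        for (i = 0; i < D; i++)                         /* age all residents */
            if (s->valid[i]) s->age[i]++;
        s->valid[victim] = 1;                           /* demand-fetch block */
        s->tag[victim]   = tag;
        s->age[victim]   = 0;
        return 0;
    }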
We present cache miss rate results throughout this paper because:

(1) the performance impact of the cache on the processor can be easily determined once the miss rate is known;
(2) the miss rate is a cache performance metric largely independent of processor and system implementation; other metrics such as average memory access time involve assumptions about cache access time, bus bandwidth, memory bus traffic, etc.;
(3) since most major studies have used the miss rate as a key metric, the miss rate is convenient for comparing the various results.

However, we will also use the average cache access time and the bus traffic generated by the cache when consideration of just the miss rate might be misleading.

2.4 Simulation Models

One of the chief aims of cache performance evaluation is to obtain a "steady-state" miss rate for a given organization. What do we mean by steady-state cache miss rate? While it is hard to define the steady-state cache miss rate precisely, generally speaking it is the average miss rate of the cache over a long period of time, long enough that a good cross-section of the intended workload is seen. Unfortunately, only measurement over this period can yield an exact steady-state miss rate. A slightly less accurate method is to derive the statistical mean of miss rates obtained by simulating the cache against long traces of a number of typical programs, preferably running to completion, either singly or multiprogrammed. While we will not actually simulate the cache against an entire execution trace of a program to get the steady-state program miss rate, we will examine more
practical alternate techniques, trace sampling and trace stitching, to obtain good estimates from several smaller trace samples.

3. TRANSIENT BEHAVIOR AND TRACE SAMPLING

Trace sampling is the method by which finite segments of a trace, relatively small in size compared to the entire trace generated by a program, are used to accurately estimate steady-state cache performance for the program. Studying trace sampling is important for two reasons. First, enormous potential exists for reducing the simulation time in characterizing large caches. Second, since ATUM traces are of finite length (typically a half-million references), it could be argued that artificial truncation of long execution traces into multiple small samples introduces start-up distortion in simulations of large caches. The need to study sampling effects and the transient behavior of caches is even greater in operating system and multiprogramming simulations, where start-up effects are more pronounced due to larger working sets and context switching. We will describe a technique to overcome these difficulties and show how trace sampling can be used to obtain accurate estimates of cache performance efficiently.
This section deals primarily with trace sample size. The number of samples required, an important consideration, will depend both on the desired accuracy and on workload characteristics. Nonuniformly behaved programs will need more samples than well-behaved ones. In general, a simple statistical analysis will yield confidence limits on the accuracy of the mean miss rate obtained from a number of samples (see [20] on a related topic).
`After motivating
`the study of cache transient
`working definitions
`for the transient behavior of caches to assess start-up effects.
`The aim of this section will be to find
`the minimum
`trace length necessary
`to
`obtain steady-state cache miss rates for a given workload. From
`the results of
`this analysis, we then go on to show how small trace samples can be effectively
`used to obtain accurate steady-state cache performance statistics.

3.1 Transient Behavior Analysis

We examine the transient nature of the miss rate to motivate studying start-up behavior and finite trace-size effects, especially for large caches. Figure 1 shows a plot of miss rate versus time for cache sizes 4K and 64K bytes for the benchmark IVEX. While the dependence of miss rate on time is not strong for the 4K cache, it is a decreasing function of trace length for the larger cache, implying that start-up cost can dominate behavior. Use of a short trace (say 150,000 references for the IVEX benchmark) for the 64K cache will introduce severe start-up distortion. Even when long traces are available, one may have to wait a long time for the miss rate to stabilize. Clearly, an understanding of the exact dependence of miss rate on trace length is essential to obtain steady-state performance data from short trace lengths.

[Fig. 1. Miss rate versus time for benchmark IVEX. Set size is 1; block size is 4.]
Easton and Fagin [11] and Strecker [31] also stress the need to accurately analyze transient effects in caches. Easton and Fagin define two types of miss rates for fully-associative caches to address this issue: cold start and warm start. By their definition, the cold-start miss rate is obtained by simulating an initially
empty cache, and the warm-start (or steady-state) miss rate by first allowing the cache to fill up. While many previous studies did not address the cold-start issue in their cache simulations, some studies that used small trace samples (e.g., [13, 27]) applied Easton and Fagin's warm-start terminology for fully-associative caches to their set-associative and direct-mapped caches as well. Although these definitions are satisfactory for small, fully associative caches (where the cache fills up quickly), they become inadequate for large low-associativity caches because one might have to wait an arbitrarily long time for the cache to fill up.
The basic problem with these definitions is the implicit notion of reaching steady state when the cache fills up. This notion is based on the assumption that a large proportion of cache misses in the steady state is due to interference. An interference miss is caused by a reference to a block that was purged from the cache by an intervening reference. Interference is predominantly seen in small caches, where multiple, simultaneously active program blocks map into the same cache block, causing strong contention for limited cache resources.
The same situation does not apply to large caches. A new reference that purges a valid cache entry will cause an interference miss only if the purged block is still live. In other words, interference misses are caused only by purged blocks in the current working set of the program. Program working sets tend to be much smaller than overall program size owing to the locality property of programs [9]. Consequently, large caches, which can comfortably sustain the working sets of multiple processes, show little interference because purged blocks are often dead. The misses that occur in such a large cache are those caused by bringing the initial working sets into the cache for the first time, called start-up misses, and from then on, by fetching additional blocks for the first time due to changing program localities, or nonstationary misses [3].
The need for a modified definition of steady state becomes apparent from the following hypothetical example. Consider an infinite cache. Previous definitions based on cache fill-up would imply that such a cache would never reach steady state. But clearly a program would reach steady state after an initial transient. Recalling that our aim is to obtain the miss rate of the program, the desired steady-state miss rate of the trace sample should correspond to the program miss rate. Our definition will yield precisely this steady-state miss rate.

3.1.1 Definitions. We define cold start and warm start using the more general notion of steady state. Let the warm-start region be the portion of a trace after
the point where either the cache has filled up, or the initial working set of the program or workload is cache resident. The former effect is called cache saturation and the latter trace saturation. The steady-state miss rate or warm-start miss rate is the miss rate in the warm-start region. As before, the cold-start miss rate is for the entire trace assuming an initially empty cache. We also define a cold miss as a miss in an empty cache location, and cumulative cold misses as the aggregate number of cold misses in an initially empty cache. A nice feature of our warm-start definition is that it gracefully degenerates to Easton and Fagin's for a small fully-associative cache.

[Fig. 2. Cumulative cold misses (or misses to empty cache block frames) for IVEX for a 16K-byte cache and block size 4 bytes.]
`cache.
`for small and
`is different
`The mechanism by which steady state is reached
`large caches. In small caches, cache saturation
`occurs first, and in large caches
`trace saturation
`sets in earlier. A uniform and convenient method of detecting
`the onset of the warm phase for both cases is to observe when
`the rate of cold
`misses in a cache drops sharply. Alternatively,
`if we plot
`the cumulative
`cold
`misses for a given trace and a given cache as a function of time, the knee of the
`curve suggests the beginning of the steady-state
`region.
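A sketch of how this detection rule might run inside a simulator follows; the window size and drop threshold are assumed values, since the paper locates the knee by inspecting the plotted curve rather than by an online test.

    /* Flag the onset of the warm-start region by watching for a sharp
       drop in the rate of cold misses (misses that fill empty block
       frames). WINDOW and THRESHOLD are illustrative assumptions. */
    #define WINDOW    10000   /* references per observation window        */
    #define THRESHOLD 0.001   /* cold misses per reference = "sharp drop" */

    static unsigned long refs = 0, window_cold = 0;

    /* Call once per reference; cold is nonzero if this access filled an
       empty cache frame. Returns 1 at the end of the first window whose
       cold-miss rate fell below THRESHOLD (cache or trace saturation). */
    int warm_start_reached(int cold)
    {
        refs++;
        if (cold)
            window_cold++;
        if (refs % WINDOW == 0) {
            double rate = (double)window_cold / WINDOW;
            window_cold = 0;
            if (rate < THRESHOLD)
                return 1;
        }
        return 0;
    }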
Consider this curve for a 16K-byte cache and the IVEX trace, depicted in Figure 2. The figure shows the bilinear nature of the cumulative cold misses curve, with the knee around 30K references. The initial high-slope portion corresponds to the cache acquiring parts of the program working set. In a small cache the knee occurs when there are no more empty cache blocks, or the cache saturates; in a large cache, the knee arises when the rate of new references drops sharply, or the trace saturates. The inclusion of trace saturation distinguishes our steady-state definition from earlier ones, which considered just cache saturation.

3.1.2 Analysis of Start-Up Effects in Single Process Traces. We will first address start-up effects in uniprogrammed traces. The goal is to obtain steady-state miss rates for the program from minimal-length traces and to ascertain whether the length of ATUM traces is sufficient for this purpose. The procedure involves observing the curve showing cumulative cold misses as a function of time, where the knee of the curve marks the onset of trace or cache saturation and the end of the cold-start portion of the trace. Trace length must be much longer than the cold-start period to obtain useful steady-state cache miss rates. We will use the trace IVEX as an example to analyze how cache organization affects start-up behavior. Figure 3(a) shows cumulative cold misses for a number of cache sizes.
[Fig. 3. (a) Cumulative cold misses and (b) cumulative misses for direct-mapped caches.]
In small caches, the cold-start portion increases as the cache grows: from about 5K references in a 4K-byte cache, to 10K references in a 16K-byte cache. The length of the cold-start portion ultimately attains a fixed value of about 25K references irrespective of cache size. The dotted curves representing cumulative misses in Figure 3(b), which correspond to all types of misses, corroborate the above argument. The number of cumulative misses is very close to the number of cold misses for large caches, showing that most of the misses occur in bringing in the program references for the first time. In small caches, because cold misses account for a very small portion of the overall number of misses, most misses are caused by interference.
The length of the cold-start phase also depends on the other cache parameters such as block size and associativity. The start-up period is longer for a higher degree of associativity because of reduced interference; it is shorter for larger block sizes because a small cache can fill up with fewer misses [1], and in a large cache the working set can be brought in with fewer misses (assuming that the entire block is fetched on a miss). Workload characteristics are also important because start-up is longer in programs with large working sets.
For IVEX, the longest cold-start region (equal to about 25K references) is for caches bigger than 64K bytes. The start-up region will only decrease if the block size is increased from its current value of 16 bytes or if the cache size is decreased. Therefore, in general, a good steady-state estimate of miss rate can be obtained by letting the cache warm up on the initial 25K references from the trace, provided the block size is not decreased. Clea
