`
`ALAN JAY SMITH
`
`University of California, Berkeley, California 94720
`
Cache memories are used in modern, medium and high-speed CPUs to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Since instructions and data in cache memories can usually be referenced in 10 to 25 percent of the time required to access main memory, cache memories permit the execution rate of the machine to be substantially increased. In order to function effectively, cache memories must be carefully designed and implemented. In this paper, we explain the various aspects of cache memories and discuss in some detail the design features and trade-offs. A large number of original, trace-driven simulation results are presented. Consideration is given to practical implementation questions as well as to more abstract design issues.

Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
`
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles-cache memories; B.3.3 [Memory Structures]: Performance Analysis and Design Aids; C.0 [Computer Systems Organization]: General; C.4 [Computer Systems Organization]: Performance of Systems

General Terms: Design, Experimentation, Measurement, Performance

Additional Key Words and Phrases: Buffer memory, paging, prefetching, TLB, store-through, Amdahl 470, IBM 3033, BIAS
`
`INTRODUCTION
`
`Definition and Rationale
Cache memories are small, high-speed buffer memories used in modern computer systems to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Information located in cache memory may be accessed in much less time than that located in main memory (for reasons discussed throughout this paper). Thus, a central processing unit (CPU) with a cache memory needs to spend far less time waiting for instructions and operands to be fetched and/or stored. For example, in typical large, high-speed computers (e.g., Amdahl 470V/7, IBM 3033), main memory can be accessed in 300 to 600 nanoseconds; information can be obtained from a cache, on the other hand, in 50 to 100 nanoseconds. Since the performance of such machines is already limited in instruction execution rate by cache memory access time, the absence of any cache memory at all would produce a very substantial decrease in execution speed. Virtually all modern large computer
`
`CONTENTS
`
`
INTRODUCTION
Definition and Rationale
Overview of Cache Design
Cache Aspects
1. DATA AND MEASUREMENTS
1.1 Rationale
1.2 Trace-Driven Simulation
1.3 Simulation Evaluation
1.4 The Traces
1.5 Simulation Methods
2. ASPECTS OF CACHE DESIGN AND OPERATION
2.1 Cache Fetch Algorithm
2.2 Placement Algorithm
2.3 Line Size
2.4 Replacement Algorithm
2.5 Write-Through versus Copy-Back
2.6 Effect of Multiprogramming: Cold-Start and Warm-Start
2.7 Multicache Consistency
2.8 Data/Instruction Cache
2.9 Virtual Address Cache
2.10 User/Supervisor Cache
2.11 Input/Output through the Cache
2.12 Cache Size
2.13 Cache Bandwidth, Data Path Width, and Access Resolution
2.14 Multilevel Cache
2.15 Pipelining
2.16 Translation Lookaside Buffer
2.17 Translator
2.18 Memory-Based Cache
2.19 Specialized Caches and Cache Components
3. DIRECTIONS FOR RESEARCH AND DEVELOPMENT
3.1 On-Chip Cache and Other Technology Advances
3.2 Multicache Consistency
3.3 Implementation Evaluation
3.4 Hit Ratio versus Size
3.5 TLB Design
3.6 Cache Parameters versus Architecture and Workload
APPENDIX: EXPLANATION OF TRACE NAMES
ACKNOWLEDGMENTS
REFERENCES
`
systems have cache memories; for example, the Amdahl 470, the IBM 3081 [IBM82, REIL82, GUST82], 3033, 370/168, and 360/195, the Univac 1100/80, and the Honeywell 66/80. Also, many medium and small size machines have cache memories; for example, the DEC VAX 11/780 and 11/750 [ARMS81], the PDP-11/70 [STRE76, SNOW78], and the Apollo, which uses a Motorola 68000 microprocessor. We believe that within two to four years, circuit speed and density will progress sufficiently to permit cache memories in one-chip microcomputers. (On-chip addressable memory is planned for the Texas Instruments 99000 [LAFF81, ELEC81].) Even microcomputers could benefit substantially from an on-chip cache, since on-chip access times are much smaller than off-chip access times. Thus, the material presented in this paper should be relevant to almost the full range of computer architecture implementations.
The success of cache memories has been explained by reference to the "property of locality" [DENN72]. The property of locality has two aspects, temporal and spatial. Over short periods of time, a program distributes its memory references nonuniformly over its address space, and which portions of the address space are favored remain largely the same for long periods of time. This first property, called temporal locality, or locality by time, means that the information which will be in use in the near future is likely to be in use already. This type of behavior can be expected from program loops in which both data and instructions are reused. The second property, locality by space, means that portions of the address space which are in use generally consist of a fairly small number of individually contiguous segments of that address space. Locality by space, then, means that the loci of reference of the program in the near future are likely to be near the current loci of reference. This type of behavior can be expected from common knowledge of programs: related data items (variables, arrays) are usually stored together, and instructions are mostly executed sequentially. Since the cache memory buffers segments of information that have been recently used, the property of locality implies that needed information is also likely to be found in the cache.
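
Both kinds of locality are visible in even the simplest code. As an illustration (ours, not drawn from the traces discussed later), consider a C fragment that sums an array: the loop instructions and the variable sum are reused on every iteration (locality by time), while successive iterations reference adjacent elements of a (locality by space).

    /* Illustrative only: a loop exhibiting both kinds of locality. */
    double sum_array(const double a[], int n)
    {
        double sum = 0.0;             /* sum (and the loop code itself) is  */
        for (int i = 0; i < n; i++)   /* reused every iteration: locality   */
            sum += a[i];              /* by time; a[0], a[1], ... are       */
        return sum;                   /* contiguous: locality by space      */
    }
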
`Optimizing the design of a cache memory
`generally has four aspects:
(1) maximizing the probability of finding a memory reference's target in the cache (the hit ratio),
(2) minimizing the time to access information that is indeed in the cache (access time),
(3) minimizing the delay due to a miss, and
(4) minimizing the overheads of updating main memory, maintaining multicache consistency, etc.
(All of these have to be accomplished under suitable cost constraints, of course.) There is also a trade-off between hit ratio and access time. This trade-off has not been sufficiently stressed in the literature and it is one of our major concerns in this paper. In this paper, each aspect of cache memories is discussed at length and, where available, measurement results are presented. In order for these detailed discussions to be meaningful, a familiarity with many of the aspects of cache design is required. In the remainder of this section, we explain the operation of a typical cache memory, and then we briefly discuss several aspects of cache memory design. These discussions are expanded upon in Section 2. At the end of this paper, there is an extensive bibliography in which we have attempted to cite all relevant literature. Not all of the items in the bibliography are referenced in the paper, although we have referred to items there as appropriate. The reader may wish in particular to refer to BADE79, BARS72, GIBS67, and KAPL73 for other surveys of some aspects of cache design. CLAR81 is particularly interesting as it discusses the design details of a real cache. (See also LAMP80.)
`
`Overview of Cache Design
`
Many CPUs can be partitioned, conceptually and sometimes physically, into three parts: the I-unit, the E-unit, and the S-unit. The I-unit (instruction) is responsible for instruction fetch and decode. It may have some local buffers for lookahead prefetching of instructions. The E-unit (execution) does most of what is commonly referred to as executing an instruction, and it contains the logic for arithmetic and logical operations. The S-unit (storage) provides the memory interface between the I-unit and E-unit. (IBM calls the S-unit the PSCF, or processor storage control function.)

The S-unit is the part of the CPU of primary interest in this paper. It contains several parts or functions, some of which are shown in Figure 1. The major component of the S-unit is the cache memory.
`
`
[Figure 1. A typical CPU design and the S-unit. The CPU consists of the I-unit, the E-unit, and the S-unit; the S-unit contains the cache, the TLB, the translator, and write-through buffers, and interfaces to main memory.]
`
There is usually a translator, which translates virtual to real memory addresses, and a TLB (translation lookaside buffer) which buffers (caches) recently generated (virtual address, real address) pairs. Depending on machine design, there can be an ASIT (address space identifier table), a BIAS (buffer invalidation address stack), and some write-through buffers. Each of these is discussed in later sections of this paper.

Figure 2 is a diagram of portions of a typical S-unit, showing only the more important parts and data paths, in particular the cache and the TLB. This design is typical of that used by IBM (in the 370/168 and 3033) and by Amdahl (in the 470 series). Figure 3 is a flowchart that corresponds to the operation of the design in Figure 2. A discussion of this flowchart follows.
The operation of the cache commences with the arrival of a virtual address, generally from the CPU, and the appropriate control signal. The virtual address is passed to both the TLB and the cache storage. The TLB is a small associative memory which maps virtual to real addresses. It is often organized as shown, as a number of groups (sets) of elements, each consisting of a virtual address and a real address. The TLB accepts the virtual page number, randomizes it, and uses that hashed number to select a set of elements. That set of elements is then searched associatively for a match to the virtual address. If a match is found, the corresponding real address is passed along to the comparator to determine whether the target line is in the cache. Finally, the replacement status of each entry in the TLB set is updated.
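
For concreteness, the sketch below renders this lookup in C under assumptions of our own: a two-way set-associative TLB with 64 sets, an exclusive-OR folding of the virtual page number as the randomizing function, and a timestamp as the replacement status. It mirrors the steps just described; it is not the TLB of any particular machine.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_SETS 64              /* assumed geometry, for illustration */
    #define TLB_WAYS 2

    typedef struct {
        bool          valid;
        uint32_t      vpage;         /* virtual page number (the tag)        */
        uint32_t      rpage;         /* corresponding real page number       */
        unsigned long stamp;         /* replacement status: time of last use */
    } tlb_entry;

    static tlb_entry     tlb[TLB_SETS][TLB_WAYS];
    static unsigned long tlb_now;

    /* Randomize (hash) the virtual page number to select a TLB set. */
    static unsigned tlb_hash(uint32_t vpage)
    {
        return (vpage ^ (vpage >> 6)) % TLB_SETS;
    }

    /* Search the selected set associatively; true on a TLB hit. */
    bool tlb_lookup(uint32_t vpage, uint32_t *rpage)
    {
        tlb_entry *set = tlb[tlb_hash(vpage)];
        for (int w = 0; w < TLB_WAYS; w++)
            if (set[w].valid && set[w].vpage == vpage) {
                *rpage = set[w].rpage;
                set[w].stamp = ++tlb_now;    /* update replacement status */
                return true;
            }
        return false;                        /* miss: invoke the translator */
    }

    /* The pair returned by the translator replaces the least recently
       used entry of the set (unused entries have stamp 0). */
    void tlb_insert(uint32_t vpage, uint32_t rpage)
    {
        tlb_entry *set = tlb[tlb_hash(vpage)];
        int victim = 0;
        for (int w = 1; w < TLB_WAYS; w++)
            if (set[w].stamp < set[victim].stamp)
                victim = w;
        set[victim] = (tlb_entry){ true, vpage, rpage, ++tlb_now };
    }
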
If the TLB does not contain the (virtual address, real address) pair needed for the translation, then the translator (not shown in Figure 2) is invoked. It uses the high-order bits of the virtual address as an entry into the segment and page tables for the
`
[Figure 2. A typical cache and TLB design. The virtual address from the CPU is presented both to the TLB, where a hash of the virtual page number selects a set that is searched associatively, and to the cache address and data arrays, where the line number selects a set. The real address produced by the TLB (or by the translator on a TLB miss) is compared with the address tags of the selected cache set; on a match, the target bytes are selected, aligned, and returned, and on a miss the real address is sent to main memory.]
`
[Figure 3. Cache operation flow chart. Receive the virtual address; search the TLB, sending the virtual address to the translator on a TLB miss; update the replacement status in the TLB; compare addresses in the cache; on a hit, update the cache replacement status and return the data; on a miss, send the real address to main memory and store the incoming line in the cache.]
`
`
`
`
process and then returns the address pair to the TLB (which retains it for possible future use), thus replacing an existing TLB entry.

The virtual address is also passed along initially to a mechanism which uses the middle part of the virtual address (the line number) as an index to select a set of entries in the cache. Each entry consists primarily of a real address tag and a line of data (see Figure 4). The line is the quantum of storage in the cache. The tags of all the elements of the selected set are read into a comparator and compared with the real address from the TLB. (Sometimes the cache storage stores the data and address tags together, as shown in Figures 2 and 4. Other times, the address tags and data are stored separately in the "address array" and "data array," respectively.) If a match is found, the line (or a part of it) containing the target locations is read into a shift register and the replacement status of the entries in the cache set is updated. The shift register is then shifted to select the target bytes, which are in turn transmitted to the source of the original data request.

If a miss occurs (i.e., the address tags in the cache do not match), then the real address of the desired line is transmitted to the main memory. The replacement status information is used to determine which line to remove from the cache to make room for the target line. If the line to be removed from the cache has been modified, and main memory has not yet been updated with the modification, then the line is copied back to main memory; otherwise, it is simply deleted from the cache. After some number of machine cycles, the target line arrives from main memory and is loaded into the cache storage. The line is also passed to the shift register for the target bytes to be selected.
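
The read path just described, including the handling of a miss, can be summarized in code. The sketch below is our own simplification: the geometry (32-byte lines, 64 sets, set size 4), the timestamp form of the replacement status, and the simulated main memory are all assumptions made for illustration, and a real S-unit performs these steps in parallel hardware rather than sequentially.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define LINE_BYTES 32            /* assumed parameters, for illustration */
    #define NUM_SETS   64
    #define SET_SIZE    4            /* elements searched associatively */

    typedef struct {
        bool          valid, dirty;  /* dirty = modified since being fetched */
        uint32_t      tag;           /* real address tag                     */
        unsigned long stamp;         /* replacement status: time of last use */
        uint8_t       data[LINE_BYTES];
    } cache_line;

    static cache_line    cache_mem[NUM_SETS][SET_SIZE];
    static unsigned long now;
    static uint8_t       main_memory[1 << 21];   /* simulated main memory */

    static void fetch_line(uint32_t addr, uint8_t *buf)
    { memcpy(buf, &main_memory[addr], LINE_BYTES); }

    static void write_line(uint32_t addr, const uint8_t *buf)
    { memcpy(&main_memory[addr], buf, LINE_BYTES); }

    /* One read; addresses are assumed to fall in the simulated memory. */
    uint8_t cache_read_byte(uint32_t real_addr)
    {
        uint32_t line_addr = real_addr / LINE_BYTES;
        unsigned set_ix    = line_addr % NUM_SETS; /* line number picks set */
        uint32_t tag       = line_addr / NUM_SETS;
        cache_line *set    = cache_mem[set_ix];

        for (int w = 0; w < SET_SIZE; w++)         /* compare the tags */
            if (set[w].valid && set[w].tag == tag) {
                set[w].stamp = ++now;              /* update replacement status */
                return set[w].data[real_addr % LINE_BYTES];
            }

        /* Miss: the least recently used entry is the victim. */
        int victim = 0;
        for (int w = 1; w < SET_SIZE; w++)
            if (set[w].stamp < set[victim].stamp)
                victim = w;

        /* Copy the victim back only if it has been modified. */
        if (set[victim].valid && set[victim].dirty)
            write_line((set[victim].tag * NUM_SETS + set_ix) * LINE_BYTES,
                       set[victim].data);

        set[victim].valid = true;
        set[victim].dirty = false;
        set[victim].tag   = tag;
        set[victim].stamp = ++now;
        fetch_line(line_addr * LINE_BYTES, set[victim].data);
        return set[victim].data[real_addr % LINE_BYTES];
    }
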
`Cache Aspects
`The cache description given above is both
`simplified and specific; it does not show
`design alternatives. Below, we point out
`some of the design alternatives for the
`cache memory.
Cache Fetch Algorithm. The cache fetch algorithm is used to decide when to bring information into the cache.
`
`
[Figure 4. Structure of cache entry and cache set. Each cache entry holds a real address tag, data, and a valid bit; a cache set consists of entries 1 through E together with their replacement status.]
`
Several possibilities exist: information can be fetched on demand (when it is needed) or prefetched (before it is needed). Prefetch algorithms attempt to guess what information will soon be needed and obtain it in advance. It is also possible for the cache fetch algorithm to omit fetching some information (selective fetch) and to designate some information, such as shared writeable code (semaphores), as unfetchable. Further, there may be no fetch-on-write in systems which use write-through (see below).
Cache Placement Algorithm. Information is generally retrieved from the cache associatively, and because large associative memories are usually very expensive and somewhat slow, the cache is generally organized as a group of smaller associative memories. Thus, only one of the associative memories has to be searched to determine whether the desired information is located in the cache. Each such (small) associative memory is called a set and the number of elements over which the associative search is conducted is called the set size. The placement algorithm is used to determine in which set a piece (line) of information will be placed. Later in this paper we consider the problem of selecting the number of sets, the set size, and the placement algorithm in such a set-associative memory.
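
When the line size and the number of sets are powers of two (an assumption of this sketch, though a common choice), the set to be searched falls directly out of the address:

    #include <stdint.h>

    #define LINE_BYTES 32        /* assumed powers of two, for illustration */
    #define NUM_SETS   64

    /* Split an address into the byte offset within the line, the set
       index (the "line number"), and the tag kept in the cache entry. */
    void decompose(uint32_t addr,
                   uint32_t *byte_in_line, uint32_t *set, uint32_t *tag)
    {
        *byte_in_line = addr % LINE_BYTES;
        *set          = (addr / LINE_BYTES) % NUM_SETS;
        *tag          = (addr / LINE_BYTES) / NUM_SETS;
    }
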
Line Size. The fixed-size unit of information transfer between the cache and main memory is called the line. The line corresponds conceptually to the page, which is the unit of transfer between the main memory and secondary storage. Selecting the line size is an important part of the memory system design. (A line is also sometimes referred to as a block.)
Replacement Algorithm. When information is requested by the CPU from main memory and the cache is full, some information in the cache must be selected for
replacement. Various replacement algorithms are possible, such as FIFO (first in, first out), LRU (least recently used), and random. Later, we consider the first two of these.
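
FIFO and LRU differ only in when the replacement status of a set is updated. In the sketch below (ours; the status values within a set are assumed to form a permutation of 0 through SET_SIZE - 1, with the largest value marking the replacement candidate), both policies apply the same rule when a new line is filled, but only LRU also applies it on every hit:

    #define SET_SIZE 4    /* assumed set size, as before */

    /* status[w] is the age rank of element w: 0 is the youngest entry
       and SET_SIZE - 1 is the next candidate for replacement. */

    /* FIFO: ranks change only when a new line enters the set. */
    void fifo_on_fill(unsigned status[], int filled)
    {
        for (int w = 0; w < SET_SIZE; w++)
            if (status[w] < status[filled])
                status[w]++;
        status[filled] = 0;
    }

    /* LRU: the same update is also made on every hit, so ranks track
       the time of last use rather than the time of entry. */
    void lru_on_hit(unsigned status[], int hit)
    {
        for (int w = 0; w < SET_SIZE; w++)
            if (status[w] < status[hit])
                status[w]++;
        status[hit] = 0;
    }
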
Main Memory Update Algorithm. When the CPU performs a write (store) to memory, that operation can actually be reflected in the cache and main memories in a number of ways. For example, the cache memory can receive the write and the main memory can be updated when that line is replaced in the cache. This strategy is known as copy-back. Copy-back may also require that the line be fetched if it is absent from the cache (i.e., fetch-on-write). Another strategy, known as write-through, immediately updates main memory when a write occurs. Write-through may specify that if the information is in the cache, the cache is either updated or purged. If the information is not in the cache, it may or may not be fetched. The choice between copy-back and write-through strategies is also influenced by the need to maintain consistency among the cache memories in a tightly coupled multiprocessor system. This requirement is discussed later.
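
The two strategies are easy to contrast in code. The sketch below is ours; cache_find, cache_fill, and memory_store are hypothetical helpers standing for the cache lookup, the fetch-on-write, and the store path to main memory:

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 32          /* assumed, as in the earlier sketches */

    typedef struct {
        bool    valid, dirty;
        uint8_t data[LINE_BYTES];
    } cache_line;

    /* Hypothetical helpers: find the line holding addr (NULL on a miss),
       fetch it into the cache, and store one byte into main memory. */
    extern cache_line *cache_find(uint32_t addr);
    extern cache_line *cache_fill(uint32_t addr);
    extern void        memory_store(uint32_t addr, uint8_t value);

    /* Copy-back: the cache absorbs the store; main memory is updated
       only when the modified (dirty) line is eventually replaced. */
    void store_copy_back(uint32_t addr, uint8_t value)
    {
        cache_line *l = cache_find(addr);
        if (l == NULL)
            l = cache_fill(addr);           /* fetch-on-write */
        l->data[addr % LINE_BYTES] = value;
        l->dirty = true;
    }

    /* Write-through: main memory is updated at once; here a resident
       line is updated as well (it could instead be purged), and nothing
       is fetched on a write miss. */
    void store_write_through(uint32_t addr, uint8_t value)
    {
        cache_line *l = cache_find(addr);
        if (l != NULL)
            l->data[addr % LINE_BYTES] = value;
        memory_store(addr, value);
    }

Note the trade-off this makes visible: write-through keeps main memory current at the cost of sending every store to memory, while copy-back aggregates stores into a single line transfer at replacement time.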
Cold-Start versus Warm-Start Miss Ratios and Multiprogramming. Most computer systems with cache memories are multiprogrammed; many processes run on the CPU, though only one can run at a time, and they alternate every few milliseconds. This means that a significant fraction of the cache miss ratio is due to loading data and instructions for a new process, rather than to a single process which has been running for some time. Miss ratios that are measured when starting with an empty cache are called cold-start miss ratios, and those that are measured from the time the cache becomes full are called warm-start miss ratios. Our simulation studies consider this multiprogramming environment.
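
Both figures can be collected in a single simulation pass by keeping two sets of counters, as in this sketch (ours; a fixed warm-up count of references could serve in place of the cache-full test):

    #include <stdbool.h>

    /* Miss-ratio bookkeeping for one simulation run. */
    struct miss_counts {
        long cold_refs, cold_misses;   /* counted from the empty cache      */
        long warm_refs, warm_misses;   /* counted once the cache has filled */
    };

    void record_reference(struct miss_counts *c, bool missed, bool cache_full)
    {
        c->cold_refs++;
        if (missed) c->cold_misses++;
        if (cache_full) {              /* warm-start counting begins here */
            c->warm_refs++;
            if (missed) c->warm_misses++;
        }
    }

    double miss_ratio(long misses, long refs)
    {
        return refs > 0 ? (double)misses / (double)refs : 0.0;
    }
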
User/Supervisor Cache. The frequent switching between user and supervisor state in most systems results in high miss ratios because the cache is often reloaded (i.e., cold-start). One way to address this is to incorporate two cache memories, and allow the supervisor to use one cache and
the user programs to use the other. Potentially, this could result in both the supervisor and the user programs more frequently finding upon initiation what they need in the cache.
Multicache Consistency. A multiprocessor system with multiple caches faces the problem of making sure that all copies of a given piece of information (which potentially could exist in every cache, as well as in the main memory) are the same. A modification of any one of these copies should somehow be reflected in all others. A number of solutions to this problem are possible. The three most popular solutions are essentially: (1) to transmit all stores to all caches and memories, so that all copies are updated; (2) to transmit the addresses of all stores to all other caches, and purge the corresponding lines from all other caches; or (3) to permit data that are writeable (page or line flagged to permit modification) to be in only one cache at a time. A centralized or distributed directory may be used to control the making and updating of copies.
Input/Output. Input/output (from and to I/O devices) is an additional source of references to information in memory. It is important that an output request stream reference the most current values for the information transferred. Similarly, it is also important that input data be immediately reflected in any and all copies of those lines in memory. Several solutions to this problem are possible. One is to direct the I/O stream through the cache itself (in a single processor system); another is to use a write-through policy and broadcast all writes so as to update or invalidate the target line wherever found. In the latter case, the channel accesses main memory rather than the cache.
Data/Instruction Cache. Another cache design strategy is to split the cache into two parts: one for data and one for instructions. This has the advantages that the bandwidth of the cache is increased and the access time (for reasons discussed later) can be decreased. Several problems occur: the overall miss ratio may increase, the two caches must be kept consistent, and self-modifying code and execute instructions must be accommodated.
`
`
`
`
Virtual versus Real Addressing. In computer systems with virtual memory, the cache may potentially be accessed either with a real address (real address cache) or a virtual address (virtual address cache). If real addresses are to be used, the virtual addresses generated by the processor must first be translated as in the example above (Figure 2); this is generally done by a TLB. The TLB is itself a cache memory which stores recently used address translation information, so that translation can occur quickly. Direct virtual address access is faster (since no translation is needed), but causes some problems. In a virtual address cache, inverse mapping (real to virtual address) is sometimes needed; this can be done by an RTB (reverse translation buffer).
Cache Size. It is obvious that the larger the cache, the higher the probability of finding the needed information in it. Cache sizes cannot be expanded without limit, however, for several reasons: cost (the most important reason in many machines, especially small ones), physical size (the cache must fit on the boards and in the cabinets), and access time. (The larger the cache, the slower it may become; reasons for this are discussed in Section 2.12.) Later, we address the question of how large is large enough.
Multilevel Cache. As the cache grows in size, there comes a point where it may be usefully split into two levels: a small, high-level cache, which is faster, smaller, and more expensive per byte, and a larger, second-level cache. This two-level cache structure solves some of the problems that afflict caches when they become too large.
Cache Bandwidth. The cache bandwidth is the rate at which data can be read from and written to the cache. The bandwidth must be sufficient to support the proposed rate of instruction execution and I/O. Bandwidth can be improved by increasing the width of the data path, interleaving the cache, and decreasing access time.
`
`1. DATA AND MEASUREMENTS
`
1.1 Rationale

As noted earlier, our in-depth studies of some aspects of cache design and optimization are based on extensive trace-driven
simulation. In this section, we explain the importance of this approach, and then discuss the presentation of our results.
One difficulty in providing definitive statements about aspects of cache operation is that the effectiveness of a cache memory depends on the workload of the computer system; further, to our knowledge, there has never been any (public) effort to characterize that workload with respect to its effect on the cache memory. Along the same lines, there is no generally accepted model for program behavior, and still less is there one for its effect on the uppermost level of the memory hierarchy. (But see AROR72 for some measurements, and LEHM78 and LEHM80, in which a model is used.)

For these reasons, we believe that, for many aspects of cache design, it is possible to make statements about relative performance only when those statements are based on trace-driven simulation or direct measurement. We have therefore tried throughout, when examining certain aspects of cache memories, to present a large number of simulation results and, if possible, to generalize from those measurements. We have also made an effort to locate and reference other measurement and trace-driven simulation results reported in the literature. The reader may wish, for example, to read WIND78, in which the author discusses the set of data used for his simulations.
`
1.2 Trace-Driven Simulation

Trace-driven simulation is an effective method for evaluating the behavior of a memory hierarchy. A trace is usually gathered by interpretively executing a program and recording every main memory location referenced by the program during its execution. (Each address may be tagged in any way desired, e.g., instruction fetch, data fetch, data store.) One or more such traces are then used to drive a simulation model of a cache (or main) memory. By varying parameters of the simulation model, it is possible to simulate directly any cache size, placement, fetch or replacement algorithm, line size, and so forth. Programming techniques allow a range of values for many of these parameters to be measured
`Computing Surveys, Vol. 14, No. 3, September 1982
`
`Qualcomm, Ex. 1017, Page 7
`
`
`
`480
`
`•
`
`A. J. Smith
`
simultaneously, during the same simulation run [GECS74, MATT70, SLUT72]. Trace-driven simulation has been a mainstay of memory hierarchy evaluation for the last 12 to 15 years; see BELA66 for an early example of this technique, or see POHM73. We assume only a single cache in the system, the one that we simulate. Note that our model does not include the additional buffers commonly found in the instruction decode and ALU portions of many CPUs.
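
In outline, the simulator is a loop over the recorded references. The driver below is a sketch under assumptions of our own: a trace in a hypothetical text format of tagged references ('I', 'D', or 'S', followed by a hexadecimal address) read from standard input, and a cache_access routine, such as a miss-reporting version of the lookup sketched in the Introduction, as the model being driven:

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical: perform one cache reference; true on a miss. */
    extern bool cache_access(char kind, uint32_t addr);

    /* Drive the simulated cache from a trace of tagged references,
       e.g. a line of the form "I 0041F2" for an instruction fetch. */
    int main(void)
    {
        char kind;             /* 'I' ifetch, 'D' data fetch, 'S' store */
        unsigned addr;
        long refs = 0, misses = 0;

        while (scanf(" %c %x", &kind, &addr) == 2) {
            refs++;
            if (cache_access(kind, (uint32_t)addr))
                misses++;
        }
        printf("references %ld  misses %ld  miss ratio %.4f\n",
               refs, misses, refs ? (double)misses / refs : 0.0);
        return 0;
    }
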
In many cases, trace-driven simulation is preferred to actual measurement. Actual measurements require access to a computer and hardware measurement tools. Thus, if the results of the experiments are to be even approximately repeatable, standalone time is required. Also, if one is measuring an actual machine, one is unable to vary most (if any) hardware parameters. Trace-driven simulation has none of these difficulties; parameters can be varied at will and experiments can be repeated and reproduced precisely. The principal advantage of measurement over simulation is that it requires 1 to 0.1 percent as much running time and is thus very valuable in establishing a genuine, workload-based, actual level of performance (for validation). Actual workloads also include supervisor code, interrupts, context switches, and other aspects of workload behavior which are hard to imitate with traces. The results in this paper are mostly of the trace-driven variety.
`
1.3 Simulation Evaluation

There are two aspects to the performance of a cache memory. The first is access time: How long does it take to get information from or put information into the cache? It is very difficult to make exact statements about the effect of design changes on access time without specifying a circuit technology and a circuit diagram. One can, though, indicate trends, and we do that throughout this paper.

The second aspect of cache performance is the miss ratio: What fraction of all memory references attempt to access something which is not resident in the cache memory? Every such miss requires that the CPU wait until the desired information can be reached. Note that the miss ratio is a
function not only of how the cache design affects the number of misses, but also of how the machine design affects the number of cache memory references. (A memory reference represents a cache access. A given instruction requires a varying number of memory references, depending on the specific implementation of the machine.) For example, a different number of memory references would be required if one word at a time were obtained from the cache than if two words were obtained at once. Almost all of our trace-driven studies assume a cache with a one-word data path (370 word = 4 bytes, PDP-11 word = 2 bytes). The WATEX, WATFIV, FFT, and APL traces assume a two-word (eight-byte) data path. We measure the miss ratio and use it as the major figure of merit for most of our studies. We display many of these results as x/y plots of miss ratios versus cache size in order to show the dependence of various cache design parameters on the cache size.
`
1.4 The Traces

We have obtained 19 program address traces, 3 of them for the PDP-11 and the other 16 for the IBM 360/370 series of computers. Each trace is for a program developed for normal production use. (These traces are listed in the Appendix, with a brief description of each.) They have been used in groups to simulate multiprogramming; five such groups were formed. Two represent a scientific workload (WFV, APL, WTX, FFT, and FGO1, FGO2, FGO3, FGO4), one a business (commercial) workload (CGO1, CGO2, CGO3, PGO2), one a miscellaneous workload, including compilations and a utility program (PGO1, CCOMP, FCOMP, IEBDG), and one a PDP-11 workload (ROFFAS, EDC, TRACE). The miss ratio as a function of cache size is shown in Figure 5 for most of the traces; see SMIT79 for the miss ratios of the remaining traces. The miss ratios for each of the traces in Figure 5 are cold-start values based on simulations of 250,000 memory references for the IBM traces, and 333,333 for the PDP-11 traces.
`
1.5 Simulation Methods

Almost all of the simulations that were run used 3 or 4 traces and simulated
`
`
`
`
`~ 0.005
`\ ·-.
`I-
`<
`1-,1-....,.._.......,....,._........,........,...,.._ ....................................................... _ _._.""'-l-_......,.......,_.......,._._ __ 9'-I
`O::'.
`(f) 0.050
`(f)
`~ 0.020
`0.010
`
`--R0FFAS
`·EDC
`- - - -TRACEEDC
`
`.....
`
`.....
`
`----
`
`-PG02
`·· · ·· CC0MP
`---·FC0MP
`·---·-IEBDG
`.....
`0.020
`_ _ _ _ _ _ ..;;i0.010
`
`0.100
`
`0.050
`
`0.020
`
`0.010
`
`0.002
`
`0.100
`
`0.050
`
`0.005
`
`0.002
`
`Cache Memories
`TRACE MISS RA TI0S
`40000
`60000
`10000
`20000
`
`30000
`
`•
`
`481
`
`-FG 1
`· · ··· · FG02
`---- FG03
`·-·- - FG04
`
`20000
`
`0.100
`
`0.050
`
`0.020
`
`0.010
`
`0.005
`
`0.002
`
`0.001
`
`.... ·····
`
`.. . . . ...
`
`" ·-
`
`2000
`
`20000
`6000
`4000
`MEM0RY CAPACITY
`
`40000
`
`Figure 5.
`
`Individual trace miss ratios.
`
multiprogramming by switching the trace in use every Q time-units (where Q was usually 10,000, a cache memory reference takes 1 time-unit, and a miss requires 10). Multiprogrammed simulations are used for two reasons: they are considered to be more representative of usual computer system operation than uniprogrammed ones, and they also allow many more traces to be included without increasing the n