Cache Memories

ALAN JAY SMITH

University of California, Berkeley, California 94720
Cache memories are used in modern, medium and high-speed CPUs to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Since instructions and data in cache memories can usually be referenced in 10 to 25 percent of the time required to access main memory, cache memories permit the execution rate of the machine to be substantially increased. In order to function effectively, cache memories must be carefully designed and implemented. In this paper, we explain the various aspects of cache memories and discuss in some detail the design features and trade-offs. A large number of original, trace-driven simulation results are presented. Consideration is given to practical implementation questions as well as to more abstract design issues.

Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles-cache memories; B.3.3 [Memory Structures]: Performance Analysis and Design Aids; C.0 [Computer Systems Organization]: General; C.4 [Computer Systems Organization]: Performance of Systems

General Terms: Design, Experimentation, Measurement, Performance

Additional Key Words and Phrases: Buffer memory, paging, prefetching, TLB, store-through, Amdahl 470, IBM 3033, BIAS
INTRODUCTION

Definition and Rationale
Cache memories are small, high-speed buffer memories used in modern computer systems to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Information located in cache memory may be accessed in much less time than that located in main memory (for reasons discussed throughout this paper). Thus, a central processing unit (CPU) with a cache memory needs to spend far less time waiting for instructions and operands to be fetched and/or stored. For example, in typical large, high-speed computers (e.g., Amdahl 470V/7, IBM 3033), main memory can be accessed in 300 to 600 nanoseconds; information can be obtained from a cache, on the other hand, in 50 to 100 nanoseconds. Since the performance of such machines is already limited in instruction execution rate by cache memory access time, the absence of any cache memory at all would produce a very substantial decrease in execution speed.
CONTENTS
INTRODUCTION
Definition and Rationale
Overview of Cache Design
Cache Aspects
1. DATA AND MEASUREMENTS
1.1 Rationale
1.2 Trace-Driven Simulation
1.3 Simulation Evaluation
1.4 The Traces
1.5 Simulation Methods
2. ASPECTS OF CACHE DESIGN AND OPERATION
2.1 Cache Fetch Algorithm
2.2 Placement Algorithm
2.3 Line Size
2.4 Replacement Algorithm
2.5 Write-Through versus Copy-Back
2.6 Effect of Multiprogramming: Cold-Start and Warm-Start
2.7 Multicache Consistency
2.8 Data/Instruction Cache
2.9 Virtual Address Cache
2.10 User/Supervisor Cache
2.11 Input/Output through the Cache
2.12 Cache Size
2.13 Cache Bandwidth, Data Path Width, and Access Resolution
2.14 Multilevel Cache
2.15 Pipelining
2.16 Translation Lookaside Buffer
2.17 Translator
2.18 Memory-Based Cache
2.19 Specialized Caches and Cache Components
3. DIRECTIONS FOR RESEARCH AND DEVELOPMENT
3.1 On-Chip Cache and Other Technology Advances
3.2 Multicache Consistency
3.3 Implementation Evaluation
3.4 Hit Ratio versus Size
3.5 TLB Design
3.6 Cache Parameters versus Architecture and Workload
APPENDIX: EXPLANATION OF TRACE NAMES
ACKNOWLEDGMENTS
REFERENCES
Virtually all modern large computer systems have cache memories; for example, the Amdahl 470, the IBM 3081 [IBM82, REIL82, GUST82], 3033, 370/168, 360/195, the Univac 1100/80, and the Honeywell 66/80. Also, many medium and small size machines have cache memories; for example, the DEC VAX 11/780, 11/750 [ARMS81], and PDP-11/70 [STRE76, SNOW78], and the Apollo, which uses a Motorola 68000 microprocessor. We believe that within two to four years, circuit speed and density will progress sufficiently to permit cache memories in one chip microcomputers. (On-chip addressable memory is planned for the Texas Instruments 99000 [LAFF81, ELEC81].) Even microcomputers could benefit substantially from an on-chip cache, since on-chip access times are much smaller than off-chip access times. Thus, the material presented in this paper should be relevant to almost the full range of computer architecture implementations.
The success of cache memories has been explained by reference to the "property of locality" [DENN72]. The property of locality has two aspects, temporal and spatial. Over short periods of time, a program distributes its memory references nonuniformly over its address space, and which portions of the address space are favored remain largely the same for long periods of time. This first property, called temporal locality, or locality by time, means that the information which will be in use in the near future is likely to be in use already. This type of behavior can be expected from program loops in which both data and instructions are reused. The second property, locality by space, means that portions of the address space which are in use generally consist of a fairly small number of individually contiguous segments of that address space. Locality by space, then, means that the loci of reference of the program in the near future are likely to be near the current loci of reference. This type of behavior can be expected from common knowledge of programs: related data items (variables, arrays) are usually stored together, and instructions are mostly executed sequentially. Since the cache memory buffers segments of information that have been recently used, the property of locality implies that needed information is also likely to be found in the cache.
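As a minimal sketch of both aspects (an illustration of ours, with arbitrary sizes), consider the following C loop. The sequential array traversal exhibits locality by space, while the reuse of the accumulator, the loop index, and the loop instructions themselves exhibits locality by time.

```c
#include <stdio.h>

int main(void) {
    int a[1000];
    long sum = 0;

    for (int i = 0; i < 1000; i++)
        a[i] = i;              /* sequential stores touch adjacent
                                  locations: locality by space */

    for (int i = 0; i < 1000; i++)
        sum += a[i];           /* i, sum, and the loop code are reused
                                  every iteration: locality by time */

    printf("%ld\n", sum);
    return 0;
}
```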
Optimizing the design of a cache memory generally has four aspects:

(1) maximizing the probability of finding a memory reference's target in the cache (the hit ratio),
(2) minimizing the time to access information that is indeed in the cache (access time),
(3) minimizing the delay due to a miss, and
(4) minimizing the overheads of updating main memory, maintaining multicache consistency, etc.

(All of these have to be accomplished under suitable cost constraints, of course.) There is also a trade-off between hit ratio and access time. This trade-off has not been sufficiently stressed in the literature and it is one of our major concerns in this paper. In this paper, each aspect of cache memories is discussed at length and, where available, measurement results are presented. In order for these detailed discussions to be meaningful, a familiarity with many of the aspects of cache design is required. In the remainder of this section, we explain the operation of a typical cache memory, and then we briefly discuss several aspects of cache memory design. These discussions are expanded upon in Section 2. At the end of this paper, there is an extensive bibliography in which we have attempted to cite all relevant literature. Not all of the items in the bibliography are referenced in the paper, although we have referred to items there as appropriate. The reader may wish in particular to refer to BADE79, BARS72, GIBS67, and KAPL73 for other surveys of some aspects of cache design. CLAR81 is particularly interesting as it discusses the design details of a real cache. (See also LAMP80.)
Overview of Cache Design
Many CPUs can be partitioned, conceptually and sometimes physically, into three parts: the I-unit, the E-unit, and the S-unit. The I-unit (instruction) is responsible for instruction fetch and decode. It may have some local buffers for lookahead prefetching of instructions. The E-unit (execution) does most of what is commonly referred to as executing an instruction, and it contains the logic for arithmetic and logical operations. The S-unit (storage) provides the memory interface between the I-unit and E-unit. (IBM calls the S-unit the PSCF, or processor storage control function.)

The S-unit is the part of the CPU of primary interest in this paper. It contains several parts or functions, some of which are shown in Figure 1. The major component of the S-unit is the cache memory.
Figure 1. A typical CPU design and the S-unit. (Diagram: the I-unit and E-unit of the CPU access main memory through the S-unit, which contains the cache, the TLB, the translator, and write-through buffers.)
There is usually a translator, which translates virtual to real memory addresses, and a TLB (translation lookaside buffer) which buffers (caches) recently generated (virtual address, real address) pairs. Depending on machine design, there can be an ASIT (address space identifier table), a BIAS (buffer invalidation address stack), and some write-through buffers. Each of these is discussed in later sections of this paper.
Figure 2 is a diagram of portions of a typical S-unit, showing only the more important parts and data paths, in particular the cache and the TLB. This design is typical of that used by IBM (in the 370/168 and 3033) and by Amdahl (in the 470 series). Figure 3 is a flowchart that corresponds to the operation of the design in Figure 2. A discussion of this flowchart follows.
The operation of the cache commences with the arrival of a virtual address, generally from the CPU, and the appropriate control signal. The virtual address is passed to both the TLB and the cache storage. The TLB is a small associative memory which maps virtual to real addresses. It is often organized as shown, as a number of groups (sets) of elements, each consisting of a virtual address and a real address. The TLB accepts the virtual page number, randomizes it, and uses that hashed number to select a set of elements. That set of elements is then searched associatively for a match to the virtual address. If a match is found, the corresponding real address is passed along to the comparator to determine whether the target line is in the cache. Finally, the replacement status of each entry in the TLB set is updated.
If the TLB does not contain the (virtual address, real address) pair needed for the translation, then the translator (not shown in Figure 2) is invoked. It uses the high-order bits of the virtual address as an entry into the segment and page tables for the process and then returns the address pair to the TLB (which retains it for possible future use), thus replacing an existing TLB entry.
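The following C sketch shows the TLB lookup just described: hash the virtual page number to select a set, search that set associatively, and invoke the translator on a miss. This is purely illustrative; the sizes, the hash function, and the translate() stub are invented, and no particular machine's organization is implied.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_SETS 64          /* illustrative sizes, not from any machine */
#define TLB_WAYS 2           /* two-way set associative */

typedef struct {
    bool     valid;
    uint32_t vpage;          /* virtual page number (the tag) */
    uint32_t rpage;          /* corresponding real page number */
    uint8_t  age;            /* replacement status within the set */
} tlb_entry;

static tlb_entry tlb[TLB_SETS][TLB_WAYS];

/* Stub standing in for the translator's segment/page-table walk. */
static uint32_t translate(uint32_t vpage) { return vpage; }

/* Randomize (hash) the virtual page number to select a set. */
static unsigned tlb_hash(uint32_t vpage) {
    return (vpage ^ (vpage >> 6)) % TLB_SETS;
}

uint32_t tlb_lookup(uint32_t vpage) {
    tlb_entry *set = tlb[tlb_hash(vpage)];

    /* Search the selected set associatively. */
    for (int w = 0; w < TLB_WAYS; w++)
        if (set[w].valid && set[w].vpage == vpage) {
            set[w].age = 0;              /* update replacement status */
            set[1 - w].age = 1;          /* (two-way: other entry is LRU) */
            return set[w].rpage;         /* TLB hit */
        }

    /* TLB miss: invoke the translator and retain the new pair,
       replacing the least recently used entry of the set. */
    uint32_t rpage = translate(vpage);
    int v = (set[0].age >= set[1].age) ? 0 : 1;
    set[v] = (tlb_entry){ .valid = true, .vpage = vpage,
                          .rpage = rpage, .age = 0 };
    set[1 - v].age = 1;
    return rpage;
}
```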
Figure 2. A typical cache and TLB design. (Diagram: the CPU's virtual address is split into a page number, a line number, and a byte within the line. The page number is hashed to select a TLB set, which is searched associatively; a match yields the real address, and a miss invokes the translator. In parallel, the line number indexes a set in the cache address and data arrays; the real address from the TLB is compared with the address tags of that set, and on a match the data are selected, byte-aligned, and returned. On a cache miss, the real address is sent to main memory.)
Figure 3. Cache operation flow chart. (Flow: receive the virtual address; search the TLB, sending the virtual address to the translator on a TLB miss, and update the replacement status in the TLB; compare the cache address tags with the real address; on a hit, update the replacement status; on a miss, send the real address to main memory and store the arriving line in the cache.)
The virtual address is also passed along initially to a mechanism which uses the middle part of the virtual address (the line number) as an index to select a set of entries in the cache. Each entry consists primarily of a real address tag and a line of data (see Figure 4). The line is the quantum of storage in the cache. The tags of all the elements of the selected set are read into a comparator and compared with the real address from the TLB. (Sometimes the cache storage stores the data and address tags together, as shown in Figures 2 and 4. Other times, the address tags and data are stored separately in the "address array" and "data array," respectively.) If a match is found, the line (or a part of it) containing the target locations is read into a shift register and the replacement status of the entries in the cache set is updated. The shift register is then shifted to select the target bytes, which are in turn transmitted to the source of the original data request.
If a miss occurs (i.e., the address tags in the cache do not match), then the real address of the desired line is transmitted to main memory. The replacement status information is used to determine which line to remove from the cache to make room for the target line. If the line to be removed from the cache has been modified, and main memory has not yet been updated with the modification, then the line is copied back to main memory; otherwise, it is simply deleted from the cache. After some number of machine cycles, the target line arrives from main memory and is loaded into the cache storage. The line is also passed to the shift register for the target bytes to be selected.
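A compact C sketch of this read path follows. It indexes a set by the line number, compares tags, and on a miss copies back a modified victim before loading the target line. The geometry, the memory stubs, and the LRU bookkeeping are all invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32          /* illustrative parameters only */
#define NUM_SETS  128
#define SET_SIZE  4           /* elements per set */

typedef struct {
    bool     valid, dirty;
    uint32_t tag;             /* real address tag */
    uint8_t  data[LINE_SIZE];
    uint8_t  age;             /* replacement status (LRU age) */
} cache_entry;

static cache_entry cache[NUM_SETS][SET_SIZE];

/* Stubs standing in for the main-memory interface. */
static void mem_read_line(uint32_t addr, uint8_t *buf)
    { (void)addr; memset(buf, 0, LINE_SIZE); }
static void mem_write_line(uint32_t addr, const uint8_t *buf)
    { (void)addr; (void)buf; }

static void touch(cache_entry *set, int used) {
    for (int i = 0; i < SET_SIZE; i++)      /* age every entry ... */
        if (set[i].age < 255) set[i].age++;
    set[used].age = 0;                      /* ... reset the referenced one */
}

uint8_t cache_read_byte(uint32_t real_addr) {
    uint32_t line = real_addr / LINE_SIZE;
    uint32_t set_idx = line % NUM_SETS;     /* middle bits: line number */
    uint32_t tag = line / NUM_SETS;         /* high bits: address tag */
    uint32_t byte = real_addr % LINE_SIZE;
    cache_entry *set = cache[set_idx];

    /* Compare the tags of all elements of the selected set. */
    for (int i = 0; i < SET_SIZE; i++)
        if (set[i].valid && set[i].tag == tag) {
            touch(set, i);
            return set[i].data[byte];       /* hit: select target byte */
        }

    /* Miss: choose the victim from the replacement status; copy a
       modified line back to main memory before overwriting it. */
    int victim = 0;
    for (int i = 1; i < SET_SIZE; i++)
        if (set[i].age > set[victim].age) victim = i;
    if (set[victim].valid && set[victim].dirty)
        mem_write_line((set[victim].tag * NUM_SETS + set_idx) * LINE_SIZE,
                       set[victim].data);
    mem_read_line(line * LINE_SIZE, set[victim].data);
    set[victim].valid = true;
    set[victim].dirty = false;
    set[victim].tag = tag;
    touch(set, victim);
    return set[victim].data[byte];
}
```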
Figure 4. Structure of cache entry and cache set. (A cache entry consists of a real address tag, a line of data, and a valid bit; a cache set consists of entry 1 through entry E together with their replacement status.)

Cache Aspects

The cache description given above is both simplified and specific; it does not show design alternatives. Below, we point out some of the design alternatives for the cache memory.

Cache Fetch Algorithm. The cache fetch algorithm is used to decide when to bring information into the cache.
Several possibilities exist: information can be fetched on demand (when it is needed) or prefetched (before it is needed). Prefetch algorithms attempt to guess what information will soon be needed and obtain it in advance. It is also possible for the cache fetch algorithm to omit fetching some information (selective fetch) and designate some information, such as shared writeable code (semaphores), as unfetchable. Further, there may be no fetch-on-write in systems which use write-through (see below).
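As a toy contrast of demand fetch with one simple prefetch rule (one-block lookahead: a reference to line i also fetches line i+1), consider the sketch below. The presence bitmap stands in for a real cache; everything here is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINES 1024
static bool resident[LINES];   /* toy presence bitmap, not a real cache */

static void fetch_line(uint32_t line) {
    resident[line % LINES] = true;   /* stand-in for a real cache fill */
}

/* Demand fetch, with optional one-block-lookahead prefetch. */
static void reference_line(uint32_t line, bool prefetch) {
    if (!resident[line % LINES]) {
        printf("miss: demand fetch of line %u\n", line);
        fetch_line(line);
    }
    if (prefetch && !resident[(line + 1) % LINES])
        fetch_line(line + 1);        /* fetched before it is needed */
}

int main(void) {
    reference_line(7, true);   /* miss on line 7; line 8 is prefetched */
    reference_line(8, true);   /* hit, thanks to the prefetch */
    return 0;
}
```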
Cache Placement Algorithm. Information is generally retrieved from the cache associatively, and because large associative memories are usually very expensive and somewhat slow, the cache is generally organized as a group of smaller associative memories. Thus, only one of the associative memories has to be searched to determine whether the desired information is located in the cache. Each such (small) associative memory is called a set and the number of elements over which the associative search is conducted is called the set size. The placement algorithm is used to determine in which set a piece (line) of information will be placed. Later in this paper we consider the problem of selecting the number of sets, the set size, and the placement algorithm in such a set-associative memory.
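The most common placement rule is easy to state in code: strip the byte-within-line bits and take the line number modulo the number of sets (with power-of-two sizes, simple bit selection). A sketch, with assumed illustrative parameters:

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 32   /* example values; the set size (associativity) */
#define NUM_SETS  64   /* is independent of this mapping */

static unsigned set_of(uint32_t addr) {
    uint32_t line = addr / LINE_SIZE;   /* strip byte-within-line bits */
    return line % NUM_SETS;             /* middle bits select the set */
}

int main(void) {
    /* Addresses one line apart map to adjacent sets; addresses
       LINE_SIZE * NUM_SETS bytes apart collide in the same set. */
    printf("%u %u %u\n",
           set_of(0x1000), set_of(0x1020), set_of(0x1000 + 32 * 64));
    return 0;
}
```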
Line Size. The fixed-size unit of information transfer between the cache and main memory is called the line. The line corresponds conceptually to the page, which is the unit of transfer between the main memory and secondary storage. Selecting the line size is an important part of the memory system design. (A line is also sometimes referred to as a block.)
Replacement Algorithm. When information is requested by the CPU from main memory and the cache is full, some information in the cache must be selected for
replacement. Various replacement algorithms are possible, such as FIFO (first in, first out), LRU (least recently used), and random. Later, we consider the first two of these.
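A minimal sketch of the first two policies, expressed as victim selection within a single set (SET_SIZE is an assumed parameter):

```c
#include <stdint.h>

#define SET_SIZE 4   /* elements per set; an assumed, illustrative value */

/* FIFO: replace in arrival order; one pointer per set suffices. */
typedef struct { uint32_t tag[SET_SIZE]; int next; } fifo_set;

int fifo_victim(fifo_set *s) {
    int v = s->next;
    s->next = (s->next + 1) % SET_SIZE;   /* advance on each replacement */
    return v;
}

/* LRU: replace the least recently used; ages reset on each reference. */
typedef struct { uint32_t tag[SET_SIZE]; unsigned age[SET_SIZE]; } lru_set;

void lru_touch(lru_set *s, int i) {
    for (int j = 0; j < SET_SIZE; j++)
        s->age[j]++;
    s->age[i] = 0;                        /* i is now most recently used */
}

int lru_victim(const lru_set *s) {
    int v = 0;
    for (int j = 1; j < SET_SIZE; j++)
        if (s->age[j] > s->age[v])
            v = j;                        /* largest age = least recent */
    return v;
}
```

Note that FIFO needs only a counter per set, while LRU must track every reference; this bookkeeping cost is part of the trade-off between the two.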
Main Memory Update Algorithm. When the CPU performs a write (store) to memory, that operation can actually be reflected in the cache and main memories in a number of ways. For example, the cache memory can receive the write and the main memory can be updated when that line is replaced in the cache. This strategy is known as copy-back. Copy-back may also require that the line be fetched if it is absent from the cache (i.e., fetch-on-write). Another strategy, known as write-through, immediately updates main memory when a write occurs. Write-through may specify that if the information is in the cache, the cache be either updated or purged. If the information is not in the cache, it may or may not be fetched. The choice between copy-back and write-through strategies is also influenced by the need to maintain consistency among the cache memories in a tightly coupled multiprocessor system. This requirement is discussed later.
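The contrast between the two policies is easy to see in code. Below is a toy one-line model of our own; it is not any machine's actual interface, just the update logic in miniature.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { WRITE_THROUGH, COPY_BACK } policy;

typedef struct { bool valid, dirty; uint32_t addr, value; } cline;

static uint32_t memory[1024];   /* toy main memory, word addressed */
static cline line;              /* toy one-line cache */

static void cpu_store(uint32_t addr, uint32_t value, policy p) {
    if (p == WRITE_THROUGH) {
        memory[addr] = value;               /* memory updated at once */
        if (line.valid && line.addr == addr)
            line.value = value;             /* or: purge the cached copy */
        /* no fetch-on-write is required */
    } else {                                /* copy-back */
        if (!(line.valid && line.addr == addr)) {
            if (line.valid && line.dirty)
                memory[line.addr] = line.value;  /* copy victim back */
            line.valid = true;                   /* fetch-on-write */
            line.addr = addr;
            line.value = memory[addr];
        }
        line.value = value;
        line.dirty = true;   /* memory is updated only at replacement */
    }
}

int main(void) {
    cpu_store(5, 1, WRITE_THROUGH);
    cpu_store(5, 2, COPY_BACK);
    return 0;
}
```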
Cold-Start versus Warm-Start Miss Ratios and Multiprogramming. Most computer systems with cache memories are multiprogrammed; many processes run on the CPU, though only one can run at a time, and they alternate every few milliseconds. This means that a significant fraction of the cache miss ratio is due to loading data and instructions for a new process, rather than to a single process which has been running for some time. Miss ratios that are measured when starting with an empty cache are called cold-start miss ratios, and those that are measured from the time the cache becomes full are called warm-start miss ratios. Our simulation studies consider this multiprogramming environment.
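The distinction is only a matter of where counting begins, as the following self-contained toy simulation makes explicit (the synthetic reference stream and direct-mapped cache are ours, chosen only so the program runs as-is):

```c
#include <stdio.h>

#define CACHE_LINES 64

int main(void) {
    unsigned tags[CACHE_LINES] = {0};
    int valid[CACHE_LINES] = {0};
    long refs = 0, cold_misses = 0, warm_refs = 0, warm_misses = 0;
    int filled = 0, lines_valid = 0;

    for (unsigned t = 0; t < 100000; t++) {
        unsigned addr = (t * 7) % 4096;     /* synthetic reference stream */
        unsigned set = addr % CACHE_LINES, tag = addr / CACHE_LINES;
        int miss = !valid[set] || tags[set] != tag;

        refs++;
        if (miss) {                         /* cold-start: count from the */
            cold_misses++;                  /* very first, empty-cache ref */
            if (!valid[set]) { valid[set] = 1; lines_valid++; }
            tags[set] = tag;
        }
        if (!filled && lines_valid == CACHE_LINES)
            filled = 1;                     /* the cache just became full */
        else if (filled) {                  /* warm-start: count from here */
            warm_refs++;
            if (miss) warm_misses++;
        }
    }
    printf("cold-start miss ratio: %f\n", (double)cold_misses / refs);
    printf("warm-start miss ratio: %f\n", (double)warm_misses / warm_refs);
    return 0;
}
```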
User/Supervisor Cache. The frequent switching between user and supervisor state in most systems results in high miss ratios because the cache is often reloaded (i.e., cold-start). One way to address this is to incorporate two cache memories, and allow the supervisor to use one cache and the user programs to use the other. Potentially, this could result in both the supervisor and the user programs more frequently finding upon initiation what they need in the cache.
Multicache Consistency. A multiprocessor system with multiple caches faces the problem of making sure that all copies of a given piece of information (which potentially could exist in every cache, as well as in the main memory) are the same. A modification of any one of these copies should somehow be reflected in all others. A number of solutions to this problem are possible. The three most popular solutions are essentially: (1) to transmit all stores to all caches and memories, so that all copies are updated; (2) to transmit the addresses of all stores to all other caches, and purge the corresponding lines from all other caches; or (3) to permit data that are writeable (page or line flagged to permit modification) to be in only one cache at a time. A centralized or distributed directory may be used to control making and updating of copies.
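Solution (2) can be sketched in a few lines. In the toy model below (ours; presence bitmaps stand in for real caches, and the data movement itself is omitted), every store broadcasts its address and purges matching lines elsewhere:

```c
#include <stdbool.h>

#define NCPUS 4
#define LINES 128

/* One presence bitmap per cache: does this cache hold this line? */
static bool present[NCPUS][LINES];

static void store_to_line(int cpu, unsigned line) {
    present[cpu][line % LINES] = true;      /* writer keeps (or gets) it */
    for (int c = 0; c < NCPUS; c++)         /* broadcast the store address */
        if (c != cpu)
            present[c][line % LINES] = false;   /* purge other copies */
}

int main(void) {
    present[1][5] = true;    /* CPU 1 holds a copy of line 5 */
    store_to_line(0, 5);     /* CPU 0 writes it ... */
    return !(present[0][5] && !present[1][5]);  /* ... CPU 1's copy is gone */
}
```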
Input/Output. Input/output (from and to I/O devices) is an additional source of references to information in memory. It is important that an output request stream reference the most current values for the information transferred. Similarly, it is also important that input data be immediately reflected in any and all copies of those lines in memory. Several solutions to this problem are possible. One is to direct the I/O stream through the cache itself (in a single processor system); another is to use a write-through policy and broadcast all writes so as to update or invalidate the target line wherever found. In the latter case, the channel accesses main memory rather than the cache.
Data/Instruction Cache. Another cache design strategy is to split the cache into two parts: one for data and one for instructions. This has the advantages that the bandwidth of the cache is increased and the access time (for reasons discussed later) can be decreased. Several problems occur: the overall miss ratio may increase, the two caches must be kept consistent, and self-modifying code and execute instructions must be accommodated.
Virtual versus Real Addressing. In computer systems with virtual memory, the cache may potentially be accessed either with a real address (real address cache) or a virtual address (virtual address cache). If real addresses are to be used, the virtual addresses generated by the processor must first be translated, as in the example above (Figure 2); this is generally done by a TLB. The TLB is itself a cache memory which stores recently used address translation information, so that translation can occur quickly. Direct virtual address access is faster (since no translation is needed), but causes some problems. In a virtual address cache, inverse mapping (real to virtual address) is sometimes needed; this can be done by an RTB (reverse translation buffer).
Cache Size. It is obvious that the larger the cache, the higher the probability of finding the needed information in it. Cache sizes cannot be expanded without limit, however, for several reasons: cost (the most important reason in many machines, especially small ones), physical size (the cache must fit on the boards and in the cabinets), and access time. (The larger the cache, the slower it may become. Reasons for this are discussed in Section 2.12.) Later, we address the question of how large is large enough.
Multilevel Cache. As the cache grows in size, there comes a point where it may be usefully split into two levels: a small, high-level cache, which is faster, smaller, and more expensive per byte, and a larger, second-level cache. This two-level cache structure solves some of the problems that afflict caches when they become too large.
Cache Bandwidth. The cache bandwidth is the rate at which data can be read from and written to the cache. The bandwidth must be sufficient to support the proposed rate of instruction execution and I/O. Bandwidth can be improved by increasing the width of the data path, interleaving the cache, and decreasing access time.
1. DATA AND MEASUREMENTS

1.1 Rationale

As noted earlier, our in-depth studies of some aspects of cache design and optimization are based on extensive trace-driven simulation. In this section, we explain the importance of this approach, and then discuss the presentation of our results.
One difficulty in providing definitive statements about aspects of cache operation is that the effectiveness of a cache memory depends on the workload of the computer system; further, to our knowledge, there has never been any (public) effort to characterize that workload with respect to its effect on the cache memory. Along the same lines, there is no generally accepted model for program behavior, and still less is there one for its effect on the uppermost level of the memory hierarchy. (But see AROR72 for some measurements, and LEHM78 and LEHM80, in which a model is used.)
For these reasons, we believe that for many aspects of cache design it is possible to make statements about relative performance only when those statements are based on trace-driven simulation or direct measurement. We have therefore tried throughout, when examining certain aspects of cache memories, to present a large number of simulation results and, if possible, to generalize from those measurements. We have also made an effort to locate and reference other measurement and trace-driven simulation results reported in the literature. The reader may wish, for example, to read WIND78, in which that author discusses the set of data used for his simulations.
1.2 Trace-Driven Simulation

Trace-driven simulation is an effective method for evaluating the behavior of a memory hierarchy. A trace is usually gathered by interpretively executing a program and recording every main memory location referenced by the program during its execution. (Each address may be tagged in any way desired, e.g., instruction fetch, data fetch, data store.) One or more such traces are then used to drive a simulation model of a cache (or main) memory. By varying parameters of the simulation model, it is possible to simulate directly any cache size, placement, fetch or replacement algorithm, line size, and so forth. Programming techniques allow a range of values for many of these parameters to be measured
simultaneously, during the same simulation run [GECS74, MATT70, SLUT72]. Trace-driven simulation has been a mainstay of memory hierarchy evaluation for the last 12 to 15 years; see BELA66 for an early example of this technique, or see POHM73. We assume only a single cache in the system, the one that we simulate. Note that our model does not include the additional buffers commonly found in the instruction decode and ALU portions of many CPUs.
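The core of such a simulator is a short loop: read tagged references, feed them to the cache model, accumulate the miss ratio. The sketch below is ours; the record format, the direct-mapped model, and all parameters are invented for illustration and do not correspond to the tools used for the studies in this paper.

```c
#include <inttypes.h>
#include <stdio.h>

#define LINE_SIZE 32
#define LINES 1024

static uint32_t tags[LINES];
static int valid[LINES];

/* Direct-mapped model: returns 1 on a miss (and fetches the line). */
static int access_cache(uint32_t addr) {
    uint32_t line = addr / LINE_SIZE;
    uint32_t slot = line % LINES, tag = line / LINES;
    if (valid[slot] && tags[slot] == tag)
        return 0;                      /* hit */
    valid[slot] = 1;
    tags[slot] = tag;                  /* demand fetch on the miss */
    return 1;
}

int main(void) {
    char type;                         /* tag: 'I'nstr, 'D'ata, 'S'tore */
    uint32_t addr;
    long refs = 0, misses = 0;

    /* Each trace record is a tagged reference, e.g. "I 40001c". */
    while (scanf(" %c %" SCNx32, &type, &addr) == 2) {
        refs++;
        misses += access_cache(addr);
    }
    if (refs)
        printf("miss ratio %.4f over %ld references\n",
               (double)misses / refs, refs);
    return 0;
}
```

Varying the parameters of such a model (here, LINE_SIZE, LINES, or the replacement logic) is what allows one simulation framework to evaluate many different cache designs against the same trace.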
In many cases, trace-driven simulation is preferred to actual measurement. Actual measurements require access to a computer and hardware measurement tools. Thus, if the results of the experiments are to be even approximately repeatable, standalone time is required. Also, if one is measuring an actual machine, one is unable to vary most (if any) hardware parameters. Trace-driven simulation has none of these difficulties; parameters can be varied at will and experiments can be repeated and reproduced precisely. The principal advantage of measurement over simulation is that it requires 1 to 0.1 percent as much running time and is thus very valuable in establishing a genuine, workload-based, actual level of performance (for validation). Actual workloads also include supervisor code, interrupts, context switches, and other aspects of workload behavior which are hard to imitate with traces. The results in this paper are mostly of the trace-driven variety.
1.3 Simulation Evaluation

There are two aspects to the performance of a cache memory. The first is access time: How long does it take to get information from or put information into the cache? It is very difficult to make exact statements about the effect of design changes on access time without specifying a circuit technology and a circuit diagram. One can, though, indicate trends, and we do that throughout this paper.

The second aspect of cache performance is the miss ratio: What fraction of all memory references attempt to access something which is not resident in the cache memory? Every such miss requires that the CPU wait until the desired information can be reached. Note that the miss ratio is a
function not only of how the cache design affects the number of misses, but also of how the machine design affects the number of cache memory references. (A memory reference represents a cache access. A given instruction requires a varying number of memory references, depending on the specific implementation of the machine.) For example, a different number of memory references would be required if one word at a time were obtained from the cache than if two words were obtained at once. Almost all of our trace-driven studies assume a cache with a one-word data path (370 word = 4 bytes, PDP-11 word = 2 bytes). The WATEX, WATFIV, FFT, and APL traces assume a two-word (eight-byte) data path. We measure the miss ratio and use it as the major figure of merit for most of our studies. We display many of these results as x/y plots of miss ratios versus cache size in order to show the dependence of various cache design parameters on the cache size.
1.4 The Traces

We have obtained 19 program address traces, 3 of them for the PDP-11 and the other 16 for the IBM 360/370 series of computers. Each trace is for a program developed for normal production use. (These traces are listed in the Appendix, with a brief description of each.) They have been used in groups to simulate multiprogramming; five such groups were formed. Two represent a scientific workload (WFV, APL, WTX, FFT, and FGO1, FGO2, FGO3, FGO4), one a business (commercial) workload (CGO1, CGO2, CGO3, PGO2), one a miscellaneous workload, including compilations and a utility program (PGO1, CCOMP, FCOMP, IEBDG), and one a PDP-11 workload (ROFFAS, EDC, TRACE). The miss ratio as a function of cache size is shown in Figure 5 for most of the traces; see SMIT79 for the miss ratios of the remaining traces. The miss ratios for each of the traces in Figure 5 are cold-start values based on simulations of 250,000 memory references for the IBM traces, and 333,333 for the PDP-11 traces.
Figure 5. Individual trace miss ratios. (Plots of cold-start miss ratio, on a logarithmic scale from roughly 0.001 to 0.100, versus memory capacity in bytes, for individual traces, including ROFFAS, EDC, and TRACE; PGO2, CCOMP, FCOMP, and IEBDG; and FGO1, FGO2, FGO3, and FGO4.)
1.5 Simulation Methods

Almost all of the simulations that were run used 3 or 4 traces and simulated multiprogramming by switching the trace in use every Q time-units (where Q was usually 10,000, a cache memory reference takes 1 time-unit, and a miss requires 10). Multiprogrammed simulations are used for two reasons: they are considered to be more representative of usual computer system operation than uniprogrammed ones, and they also allow many more traces to be included without increasing the n
