Efficient (Stack) Algorithms for Write-Back and Sector Memories
`
`JAMES G. THOMPSON
`U.S. Air Force
`and
`ALAN JAY SMITH
`University of California, Berkeley
`
For the class of replacement algorithms known as stack algorithms, existing analysis techniques permit the computation of memory miss ratios for all memory sizes simultaneously in one pass over a memory reference string. We extend the class of computations possible by this methodology in two ways. First, we show how to compute the effects of copy-backs in write-back caches. The key observation here is that a given block is clean for all memory sizes less than or equal to C blocks and is dirty for all larger memory sizes. Our technique permits efficient computations for algorithms or systems using periodic write-back and/or block deletion. The second extension permits stack analysis simulation for sector (or subblock) caches, in which a sector (associated with an address tag) consists of subsectors (or subblocks) that can be loaded independently. The key observation here is that a subsector is present only in caches of size C or greater. Load forward prefetching in a sector cache is shown to be a stack algorithm and is easily simulated using our technique. Running times for our methods are only slightly higher than for a simulation of a single memory size using nonstack techniques.
`
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles; B.3.3 [Memory Structures]: Performance Analysis and Design Aids; B.4.4 [Input/Output and Data Communications]: Performance Analysis and Design Aids; C.4 [Computer Systems Organization]: Performance of Systems
`
`General Terms: Design, Theory
`
`Additional Key Words and Phrases: Cache, copyback, disk cache, memory hierarchy analysis, sector
`cache, simulation,
`stack analysis
`
The material presented here is based on research supported in part by the National Science Foundation under grants CCR-8202591 and MIP-8713274, by NASA under consortium agreement NCA2-128, by the State of California under the MICRO program, and by the International Business Machines Corporation, Digital Equipment Corporation, Hewlett-Packard, and Signetics.

Authors' current addresses: Lt. Col. J. G. Thompson, Joint Data Systems Support Center (JDSSC/CB10), The Pentagon, Washington, DC 20301-7010; A. J. Smith, University of California, Dept. of Electrical Engineering and Computer Sciences, Berkeley, CA 94720.

1. INTRODUCTION
Analysis of memory system performance is one of the most important aspects of computer system design. Frequently this analysis is done using the technique of trace-driven simulation, in which a trace of memory references from a similar
system is used as input to a simulation of the system under study. This technique has been applied to all levels of the memory hierarchy from microprocessor caches to file system design.
Trace-driven simulation became much more practical with the discovery by Mattson, Gecsei, Slutz, and Traiger [22] that, for certain replacement policies, the performance of all memory sizes could be determined with a single pass through the trace file. Their stack analysis technique relies on the inclusion property of these policies, such that the contents of any size memory is a subset of the contents of any larger memory. Thus, the contents of all memories can be represented by a stack, where the top k items in the stack are the contents of a memory of size k. Policies that obey this property are known as stack algorithms. An equivalent characteristic of stack algorithms is that each possesses a total priority ordering of all blocks at any instant in time that is independent of memory size.
Until now, stack analysis has not been applied to some important situations, forcing the designer to fall back on the one-size-at-a-time method. One example of this is the write-back policy, also called copy-back, where a write to a block causes the block to be marked "dirty" in the cache, but the write to secondary storage is delayed until some later time. Write-back is particularly desirable where memory bandwidth may be a limiting resource, such as in a shared-bus multiprocessor or network file system. The alternative policy, write-through or store-through, in which all modifications go directly to secondary storage, severely restricts the performance improvement due to caching. The consideration of writes is important even with write-back since, in many cases, a write can cause twice the memory accesses of a read; one to fetch the block prior to modification and another to rewrite the block (copy-back). Discussions of stack analysis, however, have either ignored writes altogether, or considered only write-through.
The problem with stack analysis of write-back is that it appears to violate the inclusion property. For example, suppose that a dirty block at level k in the stack is read. It must come to the top of the stack, but it is now clean for some sizes and dirty for others. We show that by maintaining a "dirty level" for each block, the stack analysis technique can be extended to analyze write-back. This dirty level is the smallest cache in which the block is dirty, or equivalently, the lowest level in the stack to which the block has been pushed since its last write, or infinity if the block has never been written. The dirty level is used to count the number of write-backs for each cache size by assuming that each write results in a write-back, then waiting to see where the write is avoided. If a dirty block is rewritten, then the previous write has been avoided in all dirty sizes since both the previous and current write can be written back with one memory access.
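To make this bookkeeping concrete, the sketch below is ours, not the algorithm of Section 2: it maintains a per-block dirty level in an LRU stack simulation (LRU, the simplest stack algorithm, is reviewed in Section 1.3), and counts, for every cache size, the dirty blocks pushed from that size. The trace format and names such as dirty_level and dirty_pushes are illustrative assumptions.

```python
# Illustrative sketch only: an LRU write-back simulation that tracks, for each
# block, the deepest stack level reached since its last write (its dirty level)
# and counts dirty blocks pushed from every cache size in one pass.
from collections import defaultdict

INF = float("inf")          # "never written", i.e., clean

def lru_write_back(trace):
    """trace: iterable of (op, block) pairs with op in {'r', 'w'}."""
    stack = []                          # stack[0] is level 1 (the top)
    dirty_level = {}                    # block -> dirty level (INF if clean)
    hits = defaultdict(int)             # hits[L]: references found at level L
    dirty_pushes = defaultdict(int)     # dirty_pushes[C]: dp(C) of the text

    for op, x in trace:
        found = x in stack
        if found:
            delta = stack.index(x) + 1  # stack distance
            hits[delta] += 1
            stack.pop(delta - 1)
        # Every block that was above x (every block, on a miss) moves down one
        # level, i.e., it is pushed from the cache whose size equals its old
        # level.  If it is dirty in that cache, the push is a copy-back.
        moved = (delta - 1) if found else len(stack)
        for i in range(moved):
            b, old_level = stack[i], i + 1
            if dirty_level.get(b, INF) <= old_level:
                dirty_pushes[old_level] += 1
                dirty_level[b] = old_level + 1   # the dirty level only deepens
        stack.insert(0, x)              # referenced block comes to the top
        if op == 'w':
            dirty_level[x] = 1          # after a write the block is dirty everywhere
        elif not found:
            dirty_level[x] = INF        # a freshly fetched block is clean
    return hits, dirty_pushes
```

Combined with the per-level hit counts, these push counts give a write-back miss ratio for every cache size in the same single pass.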
Stack analysis can be similarly extended to analyze sector or subblock caches [14, 21]. In a microprocessor cache, access time, on-chip data path widths, and pin-counts favor small blocks, whereas the size of associative lookup circuitry and tags favor fewer, larger blocks. A possible compromise is to break each block into independent subblocks, any or all of which may be present. Again, we show that stack analysis can be applied by maintaining additional data with each block.
The remainder of this section is a review of previous stack analysis techniques and definition of terms. Section 2 presents the stack algorithm for write-back,
followed in Section 3 by several simple extensions to handle deletions, periodic write-back, and cache flush. Section 4 similarly presents the algorithm for sector caches, including an extension for a useful form of prefetch. Section 5 presents a comparison of the time required to perform analysis using the stack technique. Finally, we show how stack analysis makes possible other useful measurements, exemplified by a computation of the probability of a write-back as a function of memory size.
`
1.1 Cache Memories
Analysis of memory systems has always been an important part of the design of computer systems. The early concentration on virtual program memory [3] has been recently replaced by emphasis on high-speed processor caches [32] and by disk or file system buffering or caching [34]. In reality, these can all be viewed as parts of a hierarchy of memory types, where the upper levels of the hierarchy are faster but generally more costly. Although many of the analysis techniques discussed in this paper generalize to multilevel hierarchies [23], we restrict our discussions to a two-level hierarchy and refer to the top level as a cache.
The purpose of caching is to improve the average access time to items in memory by keeping the most frequently used items in a small, fast, cache memory and by leaving the remainder in a larger, slower memory. The contents of cache are checked on each reference; if the referenced item is present in cache, then the item is available at the speed of the cache. If not, then the item is read into cache from memory, replacing something already cached. The speed of the combined memory system is a function of the two memory speeds and the probability that the referenced item is in cache.
Caches are effective because of the principle of locality [7]. This principle says that the items most likely to be referenced next are those "near" the items that have been recently referenced. Two aspects of locality are temporal locality and spatial locality. Temporal locality implies that an item that has been recently referenced has a good chance of being referenced again in a short time. Spatial locality implies that items close to a referenced item are also likely to be referenced. This is particularly evident in the sequential reference behavior observed in instructions and within files.
There are a large number of design parameters to any cache, most of which must be considered in any analysis of that design. We briefly present definitions of a number of these. For more detail see [31].
`
Blocking. The cache may be divided into fixed-size blocks or variable-size segments. Blocks are also referred to as pages in the context of virtual memory and lines or sectors in the context of a processor cache. The cache block or line size may be equal to the amount of data retrievable in one memory cycle, or it may require several memory cycles to fetch a block. A larger block size reduces per-block overhead and provides a form of prefetch, discussed below. This paper discusses only block caches, although most of the algorithms presented can be generalized to analyze segment caches [23], for example, a virtual memory that manages entire programs as a unit or file system caches that manage whole files.
Replacement Policy. The replacement policy determines which block to remove when the cache is full and a new block must be fetched. Commonly suggested
policies include the Least Recently Used (LRU) policy, First-In First-Out (FIFO), Least Frequently Used (LFU), and Random (RAND). An optimal policy, MIN, exists, but is unrealizable in practice because it requires knowledge of the future [22]. The MIN policy does not consider writes or deletes and is known to be nonoptimal if writes are considered [38].
Write Policy. The write policy determines when a modification is presented to secondary storage. Writes may always go directly to secondary storage using the write-through or store-through policy. Alternatively, the write may go to the cache to be written at some later time, usually when the block is about to be replaced, using the write-back or copy-back policy. Write-back is motivated by the expectation that the block will be modified several times before it has to be written. Clearly, write-back can never cause more accesses than write-through and usually far fewer. On the other hand, since it deals in blocks rather than words, write-back may increase the number of bytes written. In addition, dirty blocks may remain in the cache for a long time, leading to reliability issues in large volatile caches such as file system caches in main memory. The decrease in memory traffic from write-back makes it very valuable in systems with limited memory bandwidth such as shared-bus multiprocessor systems. Write-back is also desirable in file system caches because many files are temporary and may never have to be written.
Write Allocate. When a written block is not present in a write-through cache, the block may be inserted in the cache (write allocate) or the cache may be bypassed altogether. Write allocate is again motivated by locality: the expectation that the written block will soon be referenced again. A write-back cache always allocates a cache block to the written block.
Write-Fetch. If write allocate is used by a cache where partial-block modification is allowed, and the block to be written is not in the cache (a write miss), then it is usually necessary to fetch the block prior to modifying it. This write-fetch is needed, for example, if one word of a multiword block is being written. The alternative is to keep track of the portion(s) of the cache block that are "valid," which becomes costly when several disjoint portions of a large block are written. However, there are situations in which write-fetch can be avoided, such as when the entire block is being overwritten, or when the contents of the rest of the block are predictable (e.g., when the block is a "new" block in a file system).
Prefetch. Because of spatial locality, a reference to a block often implies that the next block will soon be referenced. It is possible to take advantage of this anticipated reference and to prefetch the next block in advance. This reduces the delay when the next block is actually referenced. Prefetch is advantageous when it can be overlapped with processing of other references or when two or more blocks can be fetched in much less time than all of them individually, as is the case with disk secondary storage. Although it reduces the delay, prefetch increases memory traffic unless all prefetched blocks are referenced before they are replaced. It may also result in memory pollution, in which a soon-to-be-referenced block is displaced to make room to prefetch an unnecessary block [30]. If a prefetch is only permitted in conjunction with a fetch, then the policy is a demand prefetch policy. Demand prefetch is desirable when the overhead of a fault is
large; demand prefetch amortizes this over two (or more) blocks. With modern memory systems and file system caches, it is simple and inexpensive to initiate a prefetch even if the referenced block is present.
`
1.2 Metrics
The performance of a memory system can be measured in several ways. Perhaps the most widely used is the miss ratio, which is the fraction of references that were not satisfied by the cache. Conversely, the hit ratio is the fraction that were satisfied by the cache. The miss ratio is a latency metric since it determines the apparent access time of the memory system. The effective access time for any multilevel hierarchy is given by Σ_i t_i·h_i, where t_i is the access time to the ith level, and h_i is the fraction of references satisfied by the ith level cache. Sometimes overlooked is the fact that the access time to each level should include any queuing delays. These are usually negligible in a single-processor system, but may become important when several processors compete for access to a single secondary store [1].
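As a small worked example (the access times and hit fractions below are invented purely for illustration, not taken from the paper), the effective access time of a two-level hierarchy is just this weighted sum:

```python
# Effective access time = sum over levels of t_i * h_i.
# The numbers below are made up only to illustrate the formula.
levels = [
    (50e-9,  0.95),   # level 1 (cache): 50 ns, satisfies 95% of references
    (500e-9, 0.05),   # level 2 (memory): 500 ns, satisfies the remaining 5%
]
effective_access_time = sum(t_i * h_i for t_i, h_i in levels)
print(effective_access_time)   # 7.25e-08 seconds, i.e., 72.5 ns for these values
```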
The actual computation of miss ratios during simulation varies with the parameters of the cache. Let N be the total number of references, and m(C) be the number of misses to a cache of size C. If all references are assumed to be reads, then the miss ratio for a cache of size C is given by

MR_R(C) = m(C)/N,    (1.1)

hence the name.
With write-through, where every write is a "miss" (i.e., causes an access to secondary storage), the miss ratio is

MR_WT(C) = (m_r(C) + W)/N,    (1.2)

where m_r(C) is the number of reads that "miss," and W is the number of write references.

When write-back is used, a write could result in two accesses to secondary storage, one to fetch the block and another later to write it. The miss ratio is now given by

MR_WB(C) = (m_r(C) + m_w(C) + dp(C))/N,    (1.3)

where m_w(C) is the number of write misses (i.e., write fetches), and dp(C) is the number of dirty blocks "pushed" from a cache of size C. This becomes

MR_WB(C) = (m(C) + dp(C))/N    (1.4)

by using the fact that a write-fetch is actually just a read reference and occurs if the block reference "misses."
`for the write
`the processor must wait
`All of the expressions so far assume that
`to secondary storage
`to complete before continuing.
`It
`is often
`reasonable
`to
`buffer
`the writes so that
`the processor can continue almost
`immediately.
`In this
`case delay occurs only
`if there are enough accesses to create contention.
`In [31]
`it
`is observed
`that when memory bandwidth
`is adequate,
`four store-through
`
buffers are sufficient to largely eliminate queuing for writes. Under this assumption, the write-back miss ratio with write-fetch is again simply

MR_WF(C) = m(C)/N.    (1.5)
`
`A related metric is the traffic ratio, which is the ratio of traffic between cache
`and secondary storage, measured in bytes, compared to the traffic that would be
`present without a cache [14]. The traffic ratio is increasingly
`important
`for
`analyzing shared-bus systems such as multiprocessor architectures or a network
`file system. Although buffering may eliminate write-back from consideration in
`the miss ratio, the write traffic is not eliminated, so writes must be considered
`in the traffic ratio. Also, prefetch may result in increased traffic since some
`prefetched blocks may not be actually referenced.
The traffic ratio is dependent on the same factors as the miss ratio and, in addition, depends on the size of the data blocks transferred. Suppose that the processor accesses B_p bytes per average memory reference. The traffic without a cache is then B_p times the number of references. Frequently, the cache block size, B_c, is larger than B_p. We assume that each cache miss causes B_c bytes to be transferred. Then a large cache block size may act as a form of prefetch and reduce the miss ratio, but it may also increase the amount of traffic.
`The general form of the traffic ratio computation is
`
TR(C, B_c) = [m_r(C) + m_w(C) + f(C) + dp(C)]·B_c/(N·B_p),    (1.6)

where m_w(C) is again the number of write misses; dp(C) is again the number of write-backs; and f(C) is the number of prefetched blocks. This expression assumes that write-fetch is used. Notice that the traffic ratio is identical to the miss ratio when there is no prefetching, no write buffering, and the cache block size is the same as B_p.
A third metric is the transfer ratio, which is the ratio of secondary storage accesses with and without cache [30]. This metric has also been called the transaction ratio [G. Gibson, personal communication, 1986], the I/O ratio [18], and the swapping ratio [19]. The transfer ratio is similar to the traffic ratio but is more appropriate when performance is dominated by the cost of a memory access, relatively independent of the number of bytes transferred. Thus it is appropriate for disk caches and often for networks using small (1K or less) messages. For example, the transfer ratio decreases if two blocks are read from disk in a single I/O, whereas the traffic ratio is the same regardless of the number of I/Os used to transfer the data. The transfer ratio also has an indirect effect on the access time if there are enough transfers to create contention, particularly in multiple processor systems with shared memory. Assuming that prefetches occur only when the referenced block is not in cache (demand prefetch), then they do not affect the transfer ratio. A general expression for the transfer ratio is
`
[m_r(C) + m_w(C) + dp(C)]/N,    (1.7)

which is almost proportional to the traffic ratio using constant block sizes.
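The same counters, plus the block sizes and a prefetch count, give Equations (1.6) and (1.7); again this is only an illustrative sketch:

```python
# Traffic ratio (1.6) and transfer ratio (1.7).
# B_c: cache block size in bytes, B_p: bytes per processor reference,
# f: number of prefetched blocks.  Write-fetch is assumed, as in the text.
def traffic_ratio(m_r, m_w, f, dp, B_c, B_p, N):
    return (m_r + m_w + f + dp) * B_c / (N * B_p)

def transfer_ratio(m_r, m_w, dp, N):
    # Demand prefetch assumed, so prefetches do not add transfers.
    return (m_r + m_w + dp) / N
```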
1.3 Trace-Driven Simulation
One common method for calculating these metrics is to use trace-driven simulation. Memory references are gathered from a system believed to be similar to the system being modeled. These references are then used to drive a simulation of the system under study with varying design parameters. To the extent that the traces apply to the modeled system, simulation is a relatively simple way to observe the effect of changes to the memory hierarchy. Unfortunately, it could take a large number of simulations if only a single combination of memory sizes could be simulated at a time.
In a classic paper, Mattson et al. showed that for certain replacement policies the miss ratios for all cache sizes could be calculated in a single pass over the reference trace [22]. These policies are collectively known as stack algorithms. The technique depends on the inclusion property of these policies; the contents of any size cache includes (i.e., is a superset of) the contents of any smaller cache. Thus the cache at any time can be represented as a stack, with the upper k elements of the stack representing the blocks present in a cache of size k. The current stack level of any block is therefore the minimum cache size for which the block is resident. If a block is referenced while at level k, it is a "hit," and therefore resident, for all sizes k and larger. The level at which the block is found is referred to as its stack distance; see Figure 1. Using stack analysis, it is possible to compute the miss ratio of Equation (1.1) for all sizes by recording the hits to each level. The miss ratio for a cache size C is
`
MR_R(C) = (N - Σ_{i=1}^{C} hits(i))/N,    (1.8)

where N is again the total number of references. Notice that, since hits(i) is never negative, this is a nonincreasing function of cache size. All stack algorithms possess this characteristic, whereas nonstack algorithms may show points at which performance declines with increased cache size [4].
The simplest example of a stack algorithm is the Least Recently Used (LRU) policy. The stack always contains the blocks in order of last reference, with the most recently referenced block on the top. For any cache size C, the LRU block for that cache size is the block at level C in the stack. When a block at level k is referenced, it is not in any cache smaller than k, and therefore it must be fetched. The block that must be removed from any cache of size j, j smaller than k, is the block at level j. The stack is updated by simply "pulling" the referenced block out of the stack and placing it on top. All blocks down to level k are effectively "pushed" down one level. Since the referenced block was in all caches k or larger, all blocks below level k remain unchanged. Figure 1 illustrates these operations for the case where the referenced block is at stack level 4 and the case in which the block is not currently in the stack.
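For LRU the whole one-pass technique fits in a few lines; the sketch below (ours, purely illustrative) records hits by stack level and then applies Equation (1.8) for any cache size:

```python
# One-pass LRU stack simulation: per-level hit counts yield the miss ratio of
# every cache size via Equation (1.8).  Illustrative sketch only.
from collections import defaultdict

def lru_hits_by_level(trace):
    stack, hits = [], defaultdict(int)
    for x in trace:
        if x in stack:
            hits[stack.index(x) + 1] += 1   # stack distance of this reference
            stack.remove(x)
        stack.insert(0, x)                  # pull the referenced block to the top
    return hits

def miss_ratio(hits, N, C):
    return (N - sum(hits[i] for i in range(1, C + 1))) / N   # Equation (1.8)

trace = "AABCABDAB"
hits = lru_hits_by_level(trace)
print([round(miss_ratio(hits, len(trace), C), 2) for C in range(1, 5)])
```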
More generally, Mattson et al. [22] showed that any stack algorithm possesses a "priority function" which imposes a total ordering on all blocks at any given time, independent of cache size. Notice that LRU imposes such an ordering based on the time of last reference. However, in the more general case the relative priority of two blocks may change without either of them being referenced. (See, for example, Figure 4 where the relative positions of blocks A and B reverse
[Figure omitted: two panels, one with x_t in the stack at level 4 and one with x_t not in the stack.]
Fig. 1. Examples of stack maintenance using LRU replacement. The referenced block is always "pulled" to the top of the cache stack. All blocks with smaller stack distance are pushed down one level.

[Figure omitted: two panels, one with x_t in the stack at level 4 and one with x_t not in the stack.]
Fig. 2. Examples of stack maintenance using a stack replacement algorithm. For each level C a single comparison (indicated by a circled cross) between the prior block at the level (s_{t-1}(C)) and the block pushed from above (y_t(C - 1)) determines the new block at the level and the block pushed from the level. Update continues down to the current level of the referenced block or to the bottom of the stack if x_t is not in the stack.
`
between times 5 and 6.) It is no longer the case that the block at level j is necessarily the one to be pushed from that size cache. This complicates the stack update procedure, but only slightly. The stack can still be updated in a single pass that is similar to one pass of a bubble sort. A single comparison at each level determines the new block at the level and the block pushed from the level. First, the referenced block is still pulled to the top of the stack since it must become resident in all cache sizes. Using the terminology from [22], let y_t(C) be the block pushed ("yanked") from a cache of size C. To make room for the referenced block, the top block in the stack, s_{t-1}(1), must be pushed from a one-block cache, becoming y_t(1). Some block must also be pushed from a two-block cache, the one with the lowest priority. A single comparison between y_t(1) = s_{t-1}(1) and s_{t-1}(2) determines which becomes y_t(2). (Ties are broken by some arbitrary rule.) Similarly, the block pushed from a three-block cache should be
Algorithm 1. General Stack Analysis Algorithm

1. FOR 1 ≤ t ≤ N DO                                        For all events.
2.    IF x_t ∉ S_{t-1} THEN Δ = ∞                          If not referenced before.
3.    ELSE DO
4.       FIND Δ SUCH THAT s_{t-1}(Δ) = x_t                 Find the stack distance.
5.       rh(Δ) = rh(Δ) + 1                                 Update the read hits.
6.    IF Δ ≠ 1                                             If stack needs updating.
7.       y_t(1) = s_{t-1}(1)                               Calculate push from each level
         FOR 2 ≤ i ≤ Δ DO                                  down to the referenced block.
            y_t(i) = pmin[y_t(i - 1), s_{t-1}(i)]          Push is the minimum of the block
                                                           at the level and the push from above.
         FOR i > Δ DO y_t(i) = φ                           No pushes below reference.
8.    s_t(1) = x_t                                         Establish new stack.
      FOR i > 1 DO s_t(i) = s_{t-1}(i) + y_t(i - 1) - y_t(i)

Notes:
(1) In step 5, all counts that are not incremented are assumed to remain unchanged at the next time interval; thus the subscript t is not used.
(2) In step 7, pmin returns the block with the lowest priority, as defined by the replacement algorithm. Pmin is the comparison function in the circles of Figure 2.
(3) In step 8, plus and minus have the intuitive meaning of adding a member to a set or removing a member. In this context, adding a member that is already present, or subtracting a member that is not present, has no effect. The same is true of adding or subtracting φ. Thus, the block kept at level i, s_t(i), is either s_{t-1}(i) or the block pushed from above, y_t(i - 1), whichever is not pushed from level i.

Figure 3
`
the lowest priority of the three blocks previously present. However, the lowest priority block can be determined in a single comparison of y_t(2) and s_{t-1}(3) since the third block, now s_t(2), has already "won" a comparison against y_t(2), and thus cannot have the lowest priority. Similar logic applies for all levels down to level k, the original level of the referenced block; only the block currently at the level and the one pushed from above need be compared to find the block to be pushed. The contents of all sizes larger than k are again unchanged.
The stack analysis algorithm is formally presented in Figure 3. This algorithm is used as the basis for the extensions in Section 2. Let:
`
X = x_1, x_2, ..., x_N be a trace, where x_t is the reference at time t.
S_t = the cache stack just after reference to x_t, with s_t(C) = the block at stack level C. s_0(C) = φ for all C.
Δ = the stack distance of x_t, that is, s_{t-1}(Δ) = x_t.
y_t(C) = the block pushed ("yanked") from cache of size C by reference x_t.
rh(C) = a count of the number of hits to level C by time t.
`
Note that, in practice, it is possible to search the stack for the referenced block and update the stack simultaneously, since the priority function cannot depend on where (or even if) the referenced block is in the stack. The update stops when the referenced block is found. The block being pushed takes the place of the referenced block, which is inserted on top of the stack.
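For readers who prefer running code, the sketch below is our interpretation of this update (cf. Figure 3), with illustrative identifiers: the replacement policy is supplied as a priority function, and a single comparison per level decides which block is pushed.

```python
# One reference of the general stack analysis algorithm (cf. Figure 3).
# stack[0] is level 1; priority(b) gives b's current priority; hits is a
# mapping with default 0 (e.g., collections.defaultdict(int)).
# Our own illustrative rendering, not the authors' code.
def stack_update(stack, x, priority, hits):
    if not stack:
        stack.append(x)                 # first reference: trivially a miss
        return
    try:
        delta = stack.index(x) + 1      # stack distance
        hits[delta] += 1
    except ValueError:
        delta = None                    # not resident in any cache size
    if delta == 1:
        return                          # already on top: nothing moves
    carry = stack[0]                    # y_t(1), pushed from the one-block cache
    stack[0] = x                        # referenced block comes to the top
    end = (delta - 1) if delta else len(stack)
    for i in range(1, end):             # levels 2 .. delta-1 (or the bottom)
        resident = stack[i]
        # pmin: the lower-priority of the two is pushed from this level;
        # on a tie the carried block keeps falling (ties broken arbitrarily).
        if priority(carry) > priority(resident):
            stack[i], carry = carry, resident
    if delta:
        stack[end] = carry              # the push takes the referenced block's old place
    else:
        stack.append(carry)             # x was not resident: the stack grows by one
```

As in the text, each reference costs one comparison per level down to the referenced block (or to the bottom of the stack on a miss).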
Time             1    2    3    4    5    6    7    8    9
Reference        A    A    A    B    B    C    C    D    B

Cache stack      A1   A2   A3   B1   B2   C1   C2   D1   B3
(top to bottom)                 A3   A3   A3   A3   A3   A3
                                          B2   B2   B2   D1
                                                    C2   C2

Fig. 4. Cache contents using the Least Frequently Used policy. The number beside the block is the priority, that is, the number of references.
`
As an example, consider the application of a Least Frequently Used policy to the reference string {AAABBCCDB}. Using this policy, the block pushed from any cache is the one that has been used the fewest total times. Figure 4 shows the contents of the stack after each reference, where the number beside each block is the priority (i.e., the number of uses of the block). Notice that a block may be pushed several levels because of a reference, as seen at time 8. Note too that blocks below the level where the referenced block is found are unchanged, even though they may have higher priorities, as seen after the last reference.
`
1.4 Nonstack Algorithms
The prohibition against a priority function that depends on cache size prevents some otherwise simple policies from being stack algorithms, such as the First-In First-Out (FIFO) rule [22]. Another common technique that is not a stack algorithm is the use of demand prefetch or prefetch-on-miss [30]. Suppose that the prefetch policy is to fetch the following block along with any fetched block, but not to prefetch if the referenced block is already present. Assume an arbitrary stack algorithm for replacement. It is easy to construct counterexamples that violate inclusion because the priority of a prefetched block depends on when it is fetched, which varies with cache size. For example, consider the examples of Figure 5, where the contents of a larger cache are clearly not a superset of a smaller cache after the final reference.
It is possible to construct prefetch policies that are stack algorithms. For example, the nondemand policy that always prefetches the next block, regardless of whether the referenced block is resident, is a stack algorithm. This policy is a form of One Block Lookahead or OBL [2]. From the point of view of the stack, this is equivalent to the insertion of a reference to the next block after each reference. Nondemand prefetch is not practical if the cost of a fault is high, as it is in virtual memory systems, for example, because the penalty for faulting to prefetch a block that may not be needed is greater than the potential gain. Nondemand prefetch is practical when it is possible to look for the next block in the cache and prefetch it if necessary without significantly slowing down processing the current reference. This is the case for many