COMPUTER ARCHITECTURE
A QUANTITATIVE APPROACH

JOHN L. HENNESSY
&
DAVID A. PATTERSON
`
`
`
`
`
`
`Sponsoring Editor Bruce Spatz
`Production Manager Shirley Jowell
`Technical Writer Walker Cunningham
`Text Design Gary Head
`Cover Design David Lance Goines
`Copy Editor Linda Medoff
`Proofreader Paul Medoff
`Computer Typesetting and Graphics Fifth Street Computer Services
`
`Library of Congress Cataloging-in-Publication Data
Patterson, David A.
Computer architecture : a quantitative approach / David A.
Patterson, John L. Hennessy.
p. cm.
Includes bibliographical references.
ISBN 1-55860-069-8
1. Computer architecture. 2. Electronic digital computers--Design and construction.
I. Hennessy, John L. II. Title.
QA76.9.A73P377 1990
004.2'2--dc20    89-85227 CIP
`
Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953
`
`©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
`
`89-85227
`CIP
`
No part of this publication may be reproduced, stored in a retrieval system, or transmitted
`in any form or by any means—electronic, mechanical, recording, or otherwise—without
`the prior permission of the publisher.
`
`All instruction sets and other design information of the DLX computer system contained
`herein is copyrighted by the publisher and may not be incorporated in other publications
`or distributed by media without formal acknowledgement and written consent from the
`publisher. Use of the DLX in other publications for educational purposes is encouraged
`and application for permission is welcomed.
ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or
intended for the authors should be addressed to the editorial offices of Morgan Kaufmann
Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged.
`Any error sightings that are accepted for correction in subsequent printings will be
`rewarded by the authors with a payment of $1.00 (U.S.) per correction. Electronic mail
`can be sent to bugs@vsop.stanford.edu.
`INSTRUCTOR SUPPORT: For information on classroom software and other instructor
`materials available to adopters, please contact the editorial offices of Morgan Kaufmann
`Publishers, Inc.
`
`
`
`
`
`
`
`
`408
`
`8.3 Caches
`
`
`
`
`
`
`8.3
`
`Caches
`
`Cache: a safe place for hiding or storing things.
`
`
`
Webster's New World Dictionary of the American Language, Second College Edition
`
`
Cache is the name first chosen to represent the level of the memory hierarchy
between the CPU and main memory, and that is the dominant use of the term.
While the concept of caches is younger than the IBM 360 architecture, caches
appear today in every class of computer and in some computers more than once.
In fact, the word has become so popular that it has replaced "buffer" in many
computer-science circles.

The general terms defined in the prior section can be used for caches,
although the word line is often used instead of block. Figure 8.5 shows the
typical range of memory-hierarchy parameters for caches.
`
`
`
`
`
`
Block (line) size     4 - 128 bytes
Hit time              1 - 4 clock cycles (normally 1)
Miss penalty          8 - 32 clock cycles
  (Access time)       (6 - 10 clock cycles)
  (Transfer time)     (2 - 22 clock cycles)
Miss rate             1% - 20%
Cache size            1 KB - 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in
workstations and minicomputers.
`
Now let's examine caches in more detail by answering the four memory-
hierarchy questions.
`
`Q1: Where Can a Block Be Placed in a Cache?
`
Restrictions on where a block is placed create three categories of cache
organization:

■ If each block has only one place it can appear in the cache, the cache is said
to be direct mapped. The mapping is usually (block-frame address) modulo
(number of blocks in cache).

■ If a block can be placed anywhere in the cache, the cache is said to be fully
associative.
`
`
`
`
`
`
`
■ If a block can be placed in a restricted set of places in the cache, the cache is
said to be set associative. A set is a group of two or more blocks in the cache.
A block is first mapped onto a set, and then the block can be placed anywhere
within the set. The set is usually chosen by bit selection; that is, (block-frame
address) modulo (number of sets in cache). If there are n blocks in a set, the
cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a
continuum of levels of set associativity: Direct mapped is simply one-way set
associative and a fully associative cache with m blocks could be called m-way
set associative. Figure 8.6 shows where block 12 can be placed in a cache
according to the block-placement policy.

FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-
associative organization has 4 sets with 2 blocks per set, called two-way set associative.
(Real caches contain hundreds of blocks and real memories contain hundreds of thousands
of blocks.) Assume that there is nothing in the cache and that the block-frame address in
question identifies lower-level block 12. The three options for caches are shown left to right.
In fully associative, block 12 from the lower level can go into any of the 8 blocks of the
cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set
associative, which has some of both features, allows the block to be placed anywhere in set
0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block
0 or block 1 of the cache.
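As a concrete sketch of these placement rules, the following C fragment computes where block 12 may go under each organization. The function names and the tiny 8-block geometry are illustrative, mirroring Figure 8.6, not any real machine:

#include <stdio.h>

/* Where may a block go? A sketch of the three placement policies for a
   cache of NUM_BLOCKS blocks; the 8-block cache mirrors Figure 8.6. */
#define NUM_BLOCKS 8

int direct_mapped_frame(unsigned block_addr) {
    return block_addr % NUM_BLOCKS;        /* exactly one legal frame */
}

int set_associative_set(unsigned block_addr, int num_sets) {
    return block_addr % num_sets;          /* any frame within this set */
}

int main(void) {
    unsigned block = 12;
    printf("direct mapped: frame %d\n", direct_mapped_frame(block));  /* 12 mod 8 = 4 */
    printf("two-way set associative: set %d\n",
           set_associative_set(block, 4)); /* 12 mod 4 = 0: frame 0 or 1 */
    /* fully associative: any of the 8 frames, so no computation is needed */
    return 0;
}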
`
`
`
`410
`
`8.3 Caches
`
`Q2: How Is a Block Found If It Is in the Cache?
`
`
`
`
`
`
Caches include an address tag on each block that gives the block-frame address.
The tag of every cache block that might contain the desired information is
checked to see if it matches the block-frame address from the CPU. Figure 8.7
gives an example. Because speed is of the essence, all possible tags are searched
in parallel; serial search would make set associativity counterproductive.
`
`
`
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can
appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is
found in cache block 6 in this example. In direct-mapped placement, there is only one cache
block where memory block 12 can be found. In set-associative placement, with 4 sets,
memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are
checked. In this case the data is found in cache block 1. Speed of cache access dictates
that searching must be performed in parallel for fully associative and set-associative
mappings.
`
`
There must be a way to know that a cache block does not contain valid
information. The most common procedure is to add a valid bit to the tag to say
whether or not this entry contains a valid address. If the bit is not set, there
cannot be a match on this address.
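A sketch of the per-block state this implies, with illustrative field names:

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 8   /* bytes per block, as in the VAX-11/780 example later */

/* One cache block: the tag identifies which memory block is resident, and
   the valid bit says whether the tag (and data) mean anything at all. */
struct cache_block {
    bool     valid;            /* if false, a tag match must be ignored */
    uint32_t tag;              /* upper bits of the block-frame address */
    uint8_t  data[BLOCK_SIZE]; /* the cached copy of the memory block */
};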
`
`
A common omission in finding the cost of caches is to forget the cost of the
tag memory. One tag is required for each block. An advantage of larger
block sizes is that the tag overhead per cache entry becomes a smaller fraction of
the total cost of the cache.
`
`
Before proceeding to the next question, let's explore the relationship of a
CPU address to the cache. Figure 8.8 shows how an address is divided into three
fields to find data in a set-associative cache: the block-offset field used to select
the desired data from the block, the index field used to select the set, and the tag
field used for the comparison. While the comparison could be made on more of
the address than the tag, there is no need:
`
`
`
`
`
`
`
`
■ Checking the index would be redundant, since it was used to select the set to
be checked (an address stored in set 0, for example, must have 0 in the index
field or it couldn't be stored in set 0).

■ The offset is unnecessary in the comparison because all block offsets match
and the entire block is present or not.
`
If the total size is kept the same, increasing associativity increases the number of
`blocks per set, thereby decreasing the size of the index and increasing the size of
`the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with
`increasing associativity.
`
`
`
`
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache.
The tag is used to check all the blocks in the set and the index is used to select the set. The
block offset is the address of the desired data within the block.
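A sketch of this division in C, using the VAX-11/780 geometry described later in this section; the helper names are illustrative:

#include <stdint.h>
#include <stdio.h>

/* Split a byte address into tag / index / block offset, as in Figure 8.8.
   Sizes are in bytes and must be powers of two. */
static unsigned log2u(unsigned x) {        /* bits needed to address x items */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned cache_size = 8192, block_size = 8, assoc = 2;     /* VAX-11/780 */
    unsigned num_sets    = cache_size / (block_size * assoc);  /* 512 sets   */
    unsigned offset_bits = log2u(block_size);                  /* 3 bits     */
    unsigned index_bits  = log2u(num_sets);                    /* 9 bits     */

    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & (block_size - 1);
    uint32_t index  = (addr >> offset_bits) & (num_sets - 1);
    uint32_t tag    = addr >> (offset_bits + index_bits);

    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}

Increasing associativity at the same total size shrinks num_sets, so index_bits falls and the tag grows, which is exactly the rightward movement of the tag/index boundary described above.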
`
`
`
`
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't,
then it would be easy to select which block to replace. Alas, the high hit rate of
caches means that the overwhelming decision is between blocks that have valid
data.

A benefit of direct-mapped placement is that hardware decisions are
simplified. In fact, so simple that there is no choice: Only one block is checked
for a hit, and only that block can be replaced. With fully associative or set-
associative placement, there are several blocks to choose from on a miss. There
are two primary strategies employed for selecting which block to replace:

■ Random—To spread allocation uniformly, candidate blocks are randomly
selected. Some systems use a scheme for spreading data across a set of blocks
in a pseudorandomized manner to get reproducible behavior, which is
particularly useful during hardware debugging.

■ Least-recently used (LRU)—To reduce the chance of throwing out informa-
tion that will be needed soon, accesses to blocks are recorded. The block
replaced is the one that has been unused for the longest time. This makes use
of a corollary of temporal locality: If recently used blocks are likely to be
used again, then the best candidate for disposal is the least recently used.
Figure 8.9 (page 412) shows which block is the least-recently used for a
sequence of block-frame addresses in a fully associative memory hierarchy.
`
`
`
`
`
`
A virtue of random is that it is simple to build in hardware. As the number of
blocks to keep track of increases, LRU becomes increasingly expensive and is
frequently only approximated. Figure 8.10 shows the difference in miss rates
between LRU and random replacement. Replacement policy plays a greater role
in smaller caches than in larger caches where there are more choices of what to
replace.
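For concreteness, here is a minimal sketch of exact LRU bookkeeping for a small fully associative cache in the spirit of Figure 8.9. The names are illustrative, and real hardware usually only approximates this:

#include <stdio.h>

/* LRU bookkeeping: last_used[i] records the "time" block i was last
   referenced, and the victim is the block with the smallest timestamp. */
#define NUM_BLOCKS 4

static unsigned last_used[NUM_BLOCKS];  /* 0 means never referenced */
static unsigned now = 0;

void touch(int block) { last_used[block] = ++now; }

int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < NUM_BLOCKS; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}

int main(void) {
    touch(2); touch(0); touch(3);            /* block 1 never referenced */
    printf("LRU victim: block %d\n", lru_victim());   /* prints 1 */
    return 0;
}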
`
`
`
FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in
a fully associative memory hierarchy. This assumes that there are 4 blocks and that in
the beginning the LRU block is number 0. The LRU block number is shown below each new
block reference. Another policy, first-in-first-out (FIFO), simply discards the block that was
used N unique accesses before, independent of its reference pattern in the last N - 1
references. Random replacement generally outperforms FIFO and it is easier to implement.
`
Associativity:     2-way              4-way              8-way
Size         LRU      Random     LRU      Random     LRU      Random
16 KB        5.18%    5.69%      4.67%    5.29%      4.39%    4.96%
64 KB        1.88%    2.01%      1.54%    1.66%      1.39%    1.53%
256 KB       1.15%    1.17%      1.13%    1.13%      1.12%    1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement
for several sizes and associativities. This data was collected for a block size of 16 bytes
using one of the VAX traces containing user and operating system code (SAVE0). This
trace is included in the software supplement for course use. There is little difference
between LRU and random for larger size caches in this trace.
`
`Q4: What Happens on a Write?
`
Reads dominate cache accesses. All instruction accesses are reads, and most
instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9%
stores and 17% loads for four DLX programs, making writes less than 10% of
the memory traffic. Making the common case fast means optimizing caches for
reads, but Amdahl's Law reminds us that high-performance designs cannot
neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block
can be read at the same time that the tag is read and compared, so the block read
begins as soon as the block-frame address is available. If the read is a hit, the
block is passed on to the CPU immediately. If it is a miss, there is no benefit,
but also no harm.
`
`
`
`
`
Such is not the case for writes. The processor specifies the size of the write,
usually between 1 and 8 bytes; only that portion of a block can be changed. In
general this means a read-modify-write sequence of operations on the block:
read the original block, modify one portion, and write the new block value.
Moreover, modifying a block cannot begin until the tag is checked to see if it is
a hit. Because tag checking cannot occur in parallel, then, writes normally take
longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are
two basic options when writing to the cache:

■ Write through (or store through)—The information is written to both the
block in the cache and to the block in the lower-level memory.

■ Write back (also called copy back or store in)—The information is written
only to the block in the cache. The modified cache block is written to main
memory only when it is replaced.
`
`
`
Write-back cache blocks are called clean or dirty, depending on whether the
information in the cache differs from that in lower-level memory. To reduce the
frequency of writing back blocks on replacement, a feature called the dirty bit is
commonly used. This status bit indicates whether or not the block was modified
while in the cache. If it wasn't, the block is not written, since the lower level has
the same information as the cache.

Both write back and write through have their advantages. With write back,
writes occur at the speed of the cache memory, and multiple writes within a
block require only one write to the lower-level memory. Since every write
doesn't go to memory, write back uses less memory bandwidth, making write
back attractive in multiprocessors. With write through, read misses don't result
in writes to the lower level, and write through is easier to implement than write
back. Write through also has the advantage that main memory has the most
current copy of the data. This is important in multiprocessors and for I/O, which
we shall examine in Section 8.8. Hence, multiprocessors want write back to
reduce the memory traffic per processor and write through to keep the cache and
memory consistent.

When the CPU must wait for writes to complete during write throughs, the
CPU is said to write stall. A common optimization to reduce write stalls is a
write buffer, which allows the processor to continue while the memory is
updated. As we shall see in Section 8.8, write stalls can occur even with write
buffers.
`
There are two options on a write miss:

■ Write allocate (also called fetch on write)—The block is loaded, followed by
the write-hit actions above. This is similar to a read miss.

■ No write allocate (also called write around)—The block is modified in the
lower level and not loaded into the cache.
`
`
`
`
`
`
`
`
`414
`
`8.3 Caches
`
While either write-miss policy could be used with write through or write back,
generally write-back caches use write allocate (hoping that subsequent writes to
that block will be captured by the cache) and write-through caches often use no
write allocate (since subsequent writes to that block will still have to go to
memory).
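The memory-traffic argument behind these pairings can be sketched in a few lines; this is an idealized count, assuming every write goes to a single resident block:

#include <stdbool.h>
#include <stdio.h>

/* Contrast the two common policy pairs on a stream of writes to one block:
   write through sends every write to the lower level, while write back with
   write allocate sends one block write when the dirty block is replaced. */
int main(void) {
    int writes_to_block = 10;

    /* write through + no write allocate: every write goes below */
    int wt_traffic = writes_to_block;

    /* write back + write allocate: writes hit in the cache; one write-back
       of the dirty block happens on replacement */
    bool dirty = writes_to_block > 0;
    int wb_traffic = dirty ? 1 : 0;

    printf("write through: %d writes to memory\n", wt_traffic);  /* 10 */
    printf("write back:    %d writes to memory\n", wb_traffic);  /* 1  */
    return 0;
}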
`
`An Example Cache: The VAX-11/780 Cache
`
To give substance to these ideas, Figure 8.11 shows the organization of the
cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte
blocks with two-way set-associative placement, random replacement, write
through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11.
(The five steps are shown as circled numbers.) The address coming into the
cache is divided into two fields: the 29-bit block-frame address and 3-bit block
offset. The block-frame address is further divided into an address tag and cache
index. Step 1 shows this division.

The cache index selects the set to be tested to see if the block is in the cache.
(A set is one block from each bank in Figure 8.11.) The size of the index
depends on cache size, block size, and set associativity. In this case, a 9-bit
index results:
`
Blocks/Bank = Cache size / (Block size × Set associativity) = 8192 / (8 × 2) = 512 = 2^9
`
In a two-way set-associative cache, the index is sent to both banks. This is
step 2.

After reading an address tag from each bank, the tag portion of the block-
frame address is compared to the tags. This is step 3 in the figure. To be sure the
tag contains valid information, the valid bit must be set, or the results of the
comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to
select the block from the matching set. Why can't both tags match? It is the job
of the replacement algorithm to make sure that an address appears in only one
block. To reduce the hit time, the data is read at the same time as the address
tags; thus, by the time the block multiplexer is ready, the data is also ready.
This step is needed in set-associative caches, but it can be omitted from
direct-mapped caches since there is no selection to be made. The multiplexer
used in this step can be on the critical timing path, endangering the clock cycle
time of the CPU. (The example on pages 418-419 and the fallacy on page 451
explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a
single CPU clock cycle.
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way
set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is
selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order
of occurrence, label this organization. The line from memory to the cache is used on a miss
to load the cache. Such multiplexing is not needed in a direct-mapped cache. Note that the
offset is connected to chip select of the data SRAMs to allow the proper words to be sent to
the 2:1 multiplexer.

What happens on a miss? The cache sends a stall signal to the CPU telling it
to wait, and two words (eight bytes) are read from memory. That takes 6 clock
cycles on the VAX-11/780 (ignoring bus interference). When the data arrives,
`
the cache must pick a block to replace; the VAX-11/780 selects one of the two
sets at random. Replacing a block means updating the data, the address tag, and
the valid bit. Once this is done, the cache goes through a regular hit cycle and
returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If
the word to be written is in the cache, the first four steps are the same. The next
step is to write the data in the block, then write the changed-data portion into the
`
`
`
`
cache. The VAX-11/780 uses no write allocate. Consequently, on a write miss
the CPU writes "around" the cache to lower-level memory and does not affect
the cache.

Since this is a write-through cache, the process isn't yet over. The word is
also sent to a one-word write buffer. If the write buffer is empty, the word and
full address are written in the buffer, and we are finished. The CPU continues
working while the write buffer writes the word to memory. If the buffer is full,
the cache (and CPU) must wait until the buffer is empty.
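A sketch of that one-word buffer's control logic, with illustrative names:

#include <stdbool.h>
#include <stdint.h>

/* A one-word write buffer in the style of the VAX-11/780: the CPU deposits
   the word and its address if the buffer is empty and keeps running; if the
   buffer is still draining to memory, the CPU must stall. */
struct write_buffer {
    bool     full;
    uint32_t addr;
    uint32_t word;
};

/* Returns true if the CPU may continue, false if it must stall this cycle. */
bool buffer_write(struct write_buffer *wb, uint32_t addr, uint32_t word) {
    if (wb->full)
        return false;     /* buffer busy: cache (and CPU) wait */
    wb->addr = addr;
    wb->word = word;
    wb->full = true;      /* memory clears this when the write retires */
    return true;
}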
`
`Cache Performance
`
CPU time can be divided into the clock cycles the CPU spends executing the
program and the clock cycles the CPU spends waiting for the memory system.
Thus,

CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) × Clock cycle time

To simplify evaluation of cache alternatives, sometimes designers assume
that all memory stalls are due to the cache. This is true for many machines; on
machines where this is not true, the cache still dominates stalls that are not
exclusively due to the cache. We use this simplifying assumption here, but it is
important to account for all memory stalls when calculating final performance.

The formula above raises the question whether the clock cycles for a cache
access should be considered part of CPU-execution clock cycles or part of mem-
ory-stall clock cycles. While either convention is defensible, the most widely
accepted is to include hit clock cycles in CPU-execution clock cycles.
`acceptedis to include hit clock cycles in CPU-execution clock cycles.
Memory-stall clock cycles can then be defined in terms of the number of
memory accesses per program, miss penalty (in clock cycles), and miss rate for
reads and writes:

Memory-stall clock cycles = Reads/Program × Read miss rate × Read miss penalty
                          + Writes/Program × Write miss rate × Write miss penalty

We simplify the complete formula by combining the reads and writes together:

Memory-stall clock cycles = Memory accesses/Program × Miss rate × Miss penalty

Factoring instruction count (IC) from execution time and memory-stall
cycles, we now get a CPU-time formula that includes memory accesses per
instruction, miss rate, and miss penalty:

CPU time = IC × (CPI_execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
`
`
`
`
`
`
Some designers prefer measuring miss rate as misses per instruction rather
than misses per memory reference:

Misses/Instruction = Memory accesses/Instruction × Miss rate
`
The advantage of this measure is that it is independent of the hardware
implementation. For example, the VAX-11/780 instruction unit can make
repeated references to a single byte (see Section 8.7), which can artificially
reduce the miss rate if measured as misses per memory reference rather than per
instruction executed. The drawback is that this measure is architecture
dependent, thus it is most popular with architects working with a single
computer family. They then use this version of the CPU-time formula:
`
`
CPU time = IC × (CPI_execution + Misses/Instruction × Miss penalty) × Clock cycle time

We can now explore the consequences of caches on performance.
`
Example

Let's use the VAX-11/780 as a first example. The cache miss penalty is 6 clock
cycles, and all instructions normally take 8.5 clock cycles (ignoring memory
stalls). Assume the miss rate is 11%, and there is an average of 3.0 memory
references per instruction. What is the impact on performance when behavior of
the cache is included?
`
Answer

CPU time = IC × (CPI_execution + Memory-stall clock cycles/Instruction) × Clock cycle time

The performance, including cache misses, is

CPU time_with cache = IC × (8.5 + 3.0 × 11% × 6) × Clock cycle time
                    = Instruction count × 10.5 × Clock cycle time

The clock cycle time and instruction count are the same, with or without a
cache, so CPU time increases with CPI from 8.5 to 10.5. Hence, the impact of
the memory hierarchy is to stretch the CPU time by 24%.
`
`
Example

Let's now calculate the impact on performance when behavior of the cache is
included on a machine with a lower CPI. Assume that the cache miss penalty is
10 clock cycles and, on average, instructions take 1.5 clock cycles; the miss rate
is 11%, and there is an average of 1.4 memory references per instruction.
`
Answer

CPU time = IC × (CPI_execution + Memory-stall clock cycles/Instruction) × Clock cycle time

Making the same assumptions as in the previous example on cache hits, the per-
formance, including cache misses, is

CPU time_with cache = IC × (1.5 + 1.4 × 11% × 10) × Clock cycle time
                    = Instruction count × 3.0 × Clock cycle time

The clock cycle time and instruction count are the same, with or without a
cache, so CPU time increases with CPI from 1.5 to 3.0. Including cache
behavior doubles execution time.
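The arithmetic of both examples is easy to check mechanically; a minimal sketch:

#include <stdio.h>

/* Effective CPI from the formula above:
   CPI_execution + accesses/instruction * miss rate * miss penalty. */
double effective_cpi(double cpi, double refs_per_instr,
                     double miss_rate, double miss_penalty) {
    return cpi + refs_per_instr * miss_rate * miss_penalty;
}

int main(void) {
    /* VAX-11/780 example: 8.5 CPI, 3.0 refs/instr, 11% misses, 6-cycle penalty */
    printf("VAX-11/780: %.1f CPI\n", effective_cpi(8.5, 3.0, 0.11, 6));  /* 10.5 */
    /* low-CPI machine: 1.5 CPI, 1.4 refs/instr, 11% misses, 10-cycle penalty */
    printf("low CPI:    %.1f CPI\n", effective_cpi(1.5, 1.4, 0.11, 10)); /* 3.0 */
    return 0;
}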
`
As these examples illustrate, cache-behavior penalties range from significant
to enormous. Furthermore, cache misses have a double-barreled impact on a
CPU with a low CPI and a fast clock:

1. The lower the CPI, the more pronounced the impact is.

2. Independent of the CPU, main memories have similar memory-access times,
since they are built from the same memory chips. When calculating CPI, the
cache miss penalty is measured in CPU clock cycles needed for a miss.
Therefore, a higher CPU clock rate leads to a larger miss penalty, even if
main memories are the same speed.

The importance of the cache for CPUs with low CPI and high clock rates is thus
greater; and, consequently, greater is the danger of neglecting cache behavior in
assessing performance of such machines. While minimizing average memory-
access time is a reasonable goal and we will use it in much of this chapter, keep
in mind that the final goal is to reduce CPU execution time.
`
Average memory-access time = Hit time + Miss rate × Miss penalty

Example
`
What is the impact of two different cache organizations on the performance of a
CPU? Assume that the CPI is normally 1.5 with a clock cycle time of 20 ns, that
there are 1.3 memory references per instruction, and that the size of both caches
is 64 KB. One cache is direct mapped and the other is two-way set associative.
Since the speed of the CPU is tied directly to the speed of the caches, assume the
CPU clock cycle time must be stretched 8.5% to accommodate the selection
multiplexer of the set-associative cache (step 4 in Figure 8.11 on page 415). To
the first approximation, the cache miss penalty is 200 ns for either cache
organization. (In practice it must be rounded up or down to an integer number of
clock cycles.) First, calculate the average memory-access time, and then CPU
performance.
`
Answer

Figure 8.12 on page 421 shows that the miss rate of a direct-mapped 64-KB
cache is 3.9% and the miss rate for a two-way set-associative cache of the same
size is 3.0%. Average memory-access time is

Average memory-access time = Hit time + Miss rate × Miss penalty

Thus, the time for each organization is

Average memory-access time_1-way = 20 + .039 × 200 = 27.8 ns
Average memory-access time_2-way = 20 × 1.085 + .030 × 200 = 27.7 ns

The average memory-access time is better for the two-way set-associative
cache. CPU performance is

CPU time = IC × (CPI_execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
         = IC × (CPI_execution × Clock cycle time
                 + Memory accesses/Instruction × Miss rate × Miss penalty × Clock cycle time)

Substituting 200 ns for (Miss penalty × Clock cycle time), the performance of
each cache organization is

CPU time_1-way = IC × (1.5 × 20 + 1.3 × 0.039 × 200) = 40.1 × IC
CPU time_2-way = IC × (1.5 × 20 × 1.085 + 1.3 × 0.030 × 200) = 40.4 × IC

and relative performance is

CPU time_2-way / CPU time_1-way = (40.4 × Instruction count) / (40.1 × Instruction count) = 1.01

In contrast to the results of the average access-time comparison, the direct-mapped
cache leads to slightly better performance. Since CPU time is our bottom-line
evaluation (and direct mapped is simpler to build), the preferred cache is direct
mapped in this example. (See the fallacy on page 451 for more on this kind of
trade-off.)
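A sketch that reproduces this comparison, with the constants taken from the example above (it prints 40.14 and 40.35, which the text rounds to 40.1 × IC and 40.4 × IC):

#include <stdio.h>

/* Direct-mapped cache with a 20-ns clock versus a two-way set-associative
   cache whose clock is stretched 8.5%; 200-ns miss penalty for both. */
int main(void) {
    double clk = 20.0, stretch = 1.085, penalty_ns = 200.0;
    double cpi = 1.5, refs = 1.3;
    double miss1 = 0.039, miss2 = 0.030;   /* 64-KB miss rates, Figure 8.12 */

    double amat1 = clk + miss1 * penalty_ns;            /* 27.8 ns */
    double amat2 = clk * stretch + miss2 * penalty_ns;  /* 27.7 ns */

    double time1 = cpi * clk + refs * miss1 * penalty_ns;           /* 40.14 */
    double time2 = cpi * clk * stretch + refs * miss2 * penalty_ns; /* 40.35 */

    printf("1-way: AMAT %.1f ns, CPU time %.2f x IC\n", amat1, time1);
    printf("2-way: AMAT %.1f ns, CPU time %.2f x IC\n", amat2, time2);
    return 0;
}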
`
`
`
`
`
`
`
`
`
The Three Sources of Cache Misses: Compulsory,
Capacity, and Conflicts

An intuitive model of cache behavior attributes all misses to one of three
sources:

■ Compulsory—The first access to a block is not in the cache, so the block
must be brought into the cache. These are also called cold-start misses or
first-reference misses.

■ Capacity—If the cache cannot contain all the blocks needed during execution
of a program, capacity misses will occur due to blocks being discarded and
later retrieved.

■ Conflict—If the block-placement strategy is set associative or direct mapped,
conflict misses (in addition to compulsory and capacity misses) will occur
because a block can be discarded and later retrieved if too many blocks map
to its set. These are also called collision misses.
`
Figure 8.12 shows the relative frequency of cache misses, broken down by
the "three Cs." To show the benefit of associativity, conflict misses are divided
into misses caused by each decrease in associativity. The categories are labeled
n-way, meaning the misses caused by going to the lower level of associativity
from the next one above. Here are the four categories:

8-way: from fully associative (no conflicts) to 8-way associative

4-way: from 8-way associative to 4-way associative

2-way: from 4-way associative to 2-way associative

1-way: from 2-way associative to 1-way associative (direct mapped)

Figure 8.13 (page 422) presents the same data graphically. The top graph shows
absolute miss rates; the bottom graph plots percentage of all the misses by cache
size.

Having identified the three Cs, what can a computer designer do about them?
Conceptually, conflicts are the easiest: Fully associative placement avoids all
conflict misses. Associativity is expensive in hardware, however, and may slow
access time (see the example above or the second fallacy in Section 8.10),
leading to lower overall performance. There is little to be done about capacity
except to buy larger memory chips. If the upper-level memory is much smaller
than what is needed for a program, and a significant percentage of the time is
spent moving data between two levels in the hierarchy, the memory hierarchy is
said to thrash. Because so many replacements are required, thrashing means the
machine runs close to the speed of the lower-level memory, or maybe even
slower due to the miss overhead. Making blocks larger reduces the number of
compulsory misses, but it can increase conflict misses.

The three Cs give insight into the cause of misses, but this simple model has
its limits. For example, increasing cache size reduces conflict misses as well as
capacity misses, since a larger cache spreads out references. Thus, a miss might
move from one category to the other as parameters change. The three Cs ignore
replacement policy, since it is difficult to model and since, in general, it is of less
significance. In specific circumstances the replacement policy can actually lead
to anomalous behavior, such as poorer miss rates for larger associativity, which
is directly contradictory to the three Cs model.
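These categories can be made concrete by simulation: run one address trace through an "infinite" cache, a fully associative LRU cache, and the actual cache, and attribute each miss accordingly. A minimal sketch under those assumptions (tiny sizes, illustrative names, LRU for the fully associative reference cache):

#include <stdio.h>
#include <string.h>

/* First-touch misses are compulsory; extra misses of the fully associative
   cache are capacity; extra misses of the real (here: direct-mapped) cache
   are conflict. */
#define CACHE_BLOCKS 4
#define MAX_BLOCKS   1024     /* size of the "infinite" cache's universe */

static int seen[MAX_BLOCKS];                        /* infinite cache */
static int fa[CACHE_BLOCKS], fa_age[CACHE_BLOCKS];  /* fully assoc. LRU */
static int dm[CACHE_BLOCKS];                        /* direct mapped */
static int now;

static int fa_access(int b) {                       /* returns 1 on miss */
    int victim = 0;
    for (int i = 0; i < CACHE_BLOCKS; i++) {
        if (fa[i] == b) { fa_age[i] = ++now; return 0; }
        if (fa_age[i] < fa_age[victim]) victim = i;
    }
    fa[victim] = b; fa_age[victim] = ++now;         /* replace LRU block */
    return 1;
}

int main(void) {
    int trace[] = {0, 4, 0, 8, 0, 4, 12, 0};        /* block-frame addresses */
    int compulsory = 0, capacity = 0, conflict = 0;

    memset(fa, -1, sizeof fa); memset(dm, -1, sizeof dm);
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        int b = trace[i];
        int first = !seen[b]; seen[b] = 1;
        int fa_miss = fa_access(b);
        int dm_miss = (dm[b % CACHE_BLOCKS] != b);
        dm[b % CACHE_BLOCKS] = b;

        if (first)        compulsory++;
        else if (fa_miss) capacity++;
        else if (dm_miss) conflict++;   /* misses only in the real cache */
    }
    printf("compulsory %d, capacity %d, conflict %d\n",
           compulsory, capacity, conflict);   /* 4, 0, 4 for this trace */
    return 0;
}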
`
`
`
`
`
`
Cache    Degree       Total      Miss-rate components (relative percent)
size     associative  miss rate  (Sum = 100% of total miss rate)
                                 Compulsory    Capacity      Conflict
1 KB     1-way        0.191      0.009   5%   0.141  73%    0.042  22%
1 KB     2-way        0.161      0.009   6%   0.141  87%    0.012   7%
1 KB     4-way        0.152      0.009   6%   0.141  92%    0.003   2%
1 KB     8-way        0.149      0.009   6%   0.141  94%    0.000   0%
2 KB     1-way        0.148      0.009   6%   0.103  70%    0.036  24%
2 KB     2-way        0.122      0.009   7%   0.103  84%    0.010   8%
2 KB     4-way        0.115      0.009   8%   0.103  90%    0.003   2%
2 KB     8-way        0.113      0.009   8%   0.103  91%    0.001   1%
4 KB     1-way        0.109      0.009   8%   0.073  67%    0.027  25%
4 KB     2-way        0.095      0.009   9%   0.073  77%    0.013  14%
4 KB     4-way        0.087      0.009  10%   0.073  84%    0.005   6%
4 KB     8-way        0.084      0.009  11%   0.073  87%    0.002   2%
8 KB     1-way        0.087      0.009  10%   0.052  60%    0.026  30%
8 KB     2-way        0.069      0.009  13%   0.052  75%    0.008  12%
8 KB     4-way        0.065      0.009  14%   0.052  80%    0.004   6%
8 KB     8-way        0.063      0.009  14%   0.052  83%    0.002   3%
16 KB    1-way        0.066      0.009  14%   0.038  57%    0.019  29%
16 KB    2-way        0.054      0.009  17%   0.038  70%    0.007  13%
16 KB    4-way        0.049      0.009  18%   0.038  76%    0.003   6%

FIGURE 8.12 Total miss rate for each size of cache and percentage of each
according to the three Cs.