`
COMPUTER ARCHITECTURE:
A QUANTITATIVE APPROACH

DAVID A. PATTERSON
JOHN L. HENNESSY
`
`
`
`Sponsoring Editor Bruce Spatz
`Production Manager Shirley Jowell
`Technical Writer Walker Cunningham
Text Design Gary Head
`Cover Design David Lance Goines
`Copy Editor Linda Medoff
`Proofreader Paul Medoff
`Computer Typesetting and Graphics Fifth Street Computer Services
`
`Library of Congress Cataloging-in-Publication Data
`Patterson, David A.
Computer architecture : a quantitative approach / David A.
Patterson, John L. Hennessy.
    p. cm.
Includes bibliographical references.
ISBN 1-55860-069-8
`
1. Computer architecture. 2. Electronic digital computers
--Design and construction. I. Hennessy, John L. II. Title.
QA76.9.A73P377 1990
004.2'2--dc20                                                89-85227
                                                                  CIP
`
Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
`
`89-85227
`CIP
`
No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means—electronic, mechanical, recording, or otherwise—without
the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained
herein is copyrighted by the publisher and may not be incorporated in other publications
or distributed by media without formal acknowledgement and written consent from the
publisher. Use of the DLX in other publications for educational purposes is encouraged
and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or
intended for the authors should be addressed to the editorial offices of Morgan Kaufmann
Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged.
Any error sightings that are accepted for correction in subsequent printings will be
rewarded by the authors with a payment of $1.00 (US) per correction. Electronic mail
can be sent to bugs@vsop.stanford.edu.

INSTRUCTOR SUPPORT: For information on classroom software and other instructor
materials available to adopters, please contact the editorial offices of Morgan Kaufmann
Publishers, Inc.
`
`
`
`
`8 .3 Caches
`
`
`
Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language,
Second College Edition
`
Cache is the name first chosen to represent the level of the memory hierarchy
between the CPU and main memory, and that is the dominant use of the term.
While the concept of caches is younger than the IBM 360 architecture, caches
appear today in every class of computer and in some computers more than once.
In fact, the word has become so popular that it has replaced "buffer" in many
computer-science circles.

The general terms defined in the prior section can be used for caches,
although the word line is often used instead of block. Figure 8.5 shows the
typical range of memory-hierarchy parameters for caches.
`
`
`-
`
`4 M 128 bytes
`@Ck (hr-Em:
`I - 4 clock cycles (normally I) _‘
`i Hit time
`8 — 32 clock cycles
`I—J'
`bliss penalty
`
`(Access time}
`(6 ~ 10 clock cycles)
`I
`
`
`i
`(Transfer time]
`(2 — 12 clock cycles)
`Miss rate
`1% — 20%
`l__ — _ —
`
`1 KB — 256 KB
`
`i Cache size
`
`—
`
`FIGURE 8.5 Typical values of key memory-hierarchy parameters tor caches in
`workstations and mlnlcomputers.
`
Now let's examine caches in more detail by answering the four memory-
hierarchy questions.
`
Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache
organization:

- If each block has only one place it can appear in the cache, the cache is said
  to be direct mapped. The mapping is usually (block-frame address) modulo
  (number of blocks in cache).

- If a block can be placed anywhere in the cache, the cache is said to be fully
  associative.
`
`
`
- If a block can be placed in a restricted set of places in the cache, the cache is
  said to be set associative. A set is a group of two or more blocks in the cache.
  A block is first mapped onto a set, and then the block can be placed anywhere
  within the set. The set is usually chosen by bit selection; that is, (block-frame
  address) modulo (number of sets in cache). If there are n blocks in a set, the
  cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a
continuum of levels of set associativity: Direct mapped is simply one-way set
associative and a fully associative cache with m blocks could be called m-way
set associative. Figure 8.6 shows where block 12 can be placed in a cache
according to the block-placement policy.
`
[Figure 8.6: diagram of the three cache organizations (fully associative, direct
mapped, and set associative), showing where block 12 of a 32-block memory can
be placed in an 8-block cache.]
`
FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-
associative organization has 4 sets with 2 blocks per set, called two-way set associative.
(Real caches contain hundreds of blocks and real memories contain hundreds of thousands
of blocks.) Assume that there is nothing in the cache and that the block-frame address in
question identifies lower-level block 12. The three options for caches are shown left to right.
In fully associative, block 12 from the lower level can go into any of the 8 blocks of the
cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set
associative, which has some of both features, allows the block to be placed anywhere in set
0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block
0 or block 1 of the cache.
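To make the three placement rules concrete, here is a minimal sketch in C that computes where lower-level block 12 may go under each organization, using the Figure 8.6 parameters (an 8-block cache, 4 sets of 2 blocks); the program itself is illustrative, not from the text.

```c
#include <stdio.h>

int main(void) {
    unsigned block_frame  = 12;  /* the lower-level block in question */
    unsigned cache_blocks = 8;   /* blocks in the cache               */
    unsigned sets         = 4;   /* 4 sets of 2 blocks: two-way       */

    /* Direct mapped: exactly one legal cache block. */
    printf("direct mapped:     block %u\n", block_frame % cache_blocks);

    /* Set associative: anywhere within one set, chosen by bit selection. */
    printf("set associative:   anywhere in set %u\n", block_frame % sets);

    /* Fully associative: any cache block at all. */
    printf("fully associative: any of blocks 0..%u\n", cache_blocks - 1);
    return 0;
}
```

Running it prints block 4 and set 0, matching the figure.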
`
`
`Q2: How Is a Block Found If It Is in the Cache?
`
`
`
Caches include an address tag on each block that gives the block-frame address.
The tag of every cache block that might contain the desired information is
checked to see if it matches the block-frame address from the CPU. Figure 8.7
gives an example. Because speed is of the essence, all possible tags are searched
in parallel; serial search would make set associativity counterproductive.
`
`
`
`
[Figure 8.7: diagram of the tag search for block-frame address 12 under fully
associative, direct-mapped, and set-associative placement, showing the tag and
data arrays for each organization.]
`
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can
appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is
found in cache block 6 in this example. In direct-mapped placement there is only one cache
block where memory block 12 can be found. In set-associative placement, with 4 sets,
memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are
checked. In this case the data is found in cache block 1. Speed of cache access dictates
that searching must be performed in parallel for fully associative and set-associative
mappings.
`
`
`
`
There must be a way to know that a cache block does not have valid
information. The most common procedure is to add a valid bit to the tag to say
whether or not this entry contains a valid address. If the bit is not set, there
cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the
tag memory. One tag is required for each block. An advantage of larger
block sizes is that the tag overhead per cache entry becomes a smaller fraction of
the total cost of the cache.
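The effect of block size on tag overhead can be seen with a little arithmetic. The following sketch assumes a 32-bit address, a 64-KB direct-mapped cache, and one valid bit per block; none of these parameters come from the text.

```c
#include <stdio.h>

static unsigned log2u(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    const unsigned addr_bits = 32, cache_bytes = 64 * 1024;
    for (unsigned block = 4; block <= 128; block *= 2) {
        unsigned blocks      = cache_bytes / block;
        unsigned index_bits  = log2u(blocks);
        unsigned offset_bits = log2u(block);
        unsigned tag_bits    = addr_bits - index_bits - offset_bits;
        /* tag plus valid bit, as a fraction of the bits in the data block */
        double overhead = (double)(tag_bits + 1) / (8.0 * block);
        printf("%3u-byte blocks: %u tag bits, %4.1f%% overhead\n",
               block, tag_bits, 100.0 * overhead);
    }
    return 0;
}
```

With these assumptions the tag stays 16 bits (index plus offset always cover the 16 bits of cache size in a direct-mapped cache), so the overhead falls from about 53% of the data bits at 4-byte blocks to under 2% at 128-byte blocks.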
`
`
Before proceeding to the next question, let's explore the relationship of a
CPU address to the cache. Figure 8.8 shows how an address is divided into three
fields to find data in a set-associative cache: the block-offset field used to select
the desired data from the block, the index field used to select the set, and the tag
field used for the comparison. While the comparison could be made on more of
the address than the tag, there is no need:
`
`
- Checking the index would be redundant, since it was used to select the set to
  be checked (an address stored in set 0, for example, must have 0 in the index
  field or it couldn't be stored in set 0).

- The offset is unnecessary in the comparison because all block offsets match
  and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of
blocks per set, thereby decreasing the size of the index and increasing the size of
the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with
increasing associativity.
`
`
`
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache.
The tag is used to check all the blocks in the set and the index is used to select the set. The
block offset is the address of the desired data within the block.
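In code, the Figure 8.8 division is a pair of shifts and masks. The sketch below assumes 32-byte blocks (5 offset bits) and 256 sets (8 index bits) purely for illustration.

```c
#include <stdio.h>

struct fields { unsigned tag, index, offset; };

/* Divide an address into the tag, index, and block-offset fields of
   Figure 8.8 for the given field widths. */
static struct fields split(unsigned addr, unsigned offset_bits, unsigned index_bits) {
    struct fields f;
    f.offset = addr & ((1u << offset_bits) - 1);
    f.index  = (addr >> offset_bits) & ((1u << index_bits) - 1);
    f.tag    = addr >> (offset_bits + index_bits);
    return f;
}

int main(void) {
    struct fields f = split(0x12345678u, 5, 8);
    printf("tag=0x%x index=0x%x offset=0x%x\n", f.tag, f.index, f.offset);
    return 0;
}
```

Doubling the associativity at a fixed total size removes one index bit and adds one tag bit, which is exactly the rightward boundary movement described above.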
`
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't,
then it would be easy to select which block to replace. Alas, the high hit rate of
caches means that the overwhelming decision is between blocks that have valid
data.

A benefit of direct-mapped placement is that hardware decisions are
simplified. In fact, so simple that there is no choice: Only one block is checked
for a hit, and only that block can be replaced. With fully associative or set-
associative placement, there are several blocks to choose from on a miss. There
are two primary strategies employed for selecting which block to replace:

- Random—To spread allocation uniformly, candidate blocks are randomly
  selected. Some systems use a scheme for spreading data across a set of blocks
  in a pseudorandomized manner to get reproducible behavior, which is
  particularly useful during hardware debugging.

- Least-recently used (LRU)—To reduce the chance of throwing out informa-
  tion that will be needed soon, accesses to blocks are recorded. The block
  replaced is the one that has been unused for the longest time. This makes use
  of a corollary of temporal locality: If recently used blocks are likely to be
  used again, then the best candidate for disposal is the least recently used.
  Figure 8.9 (page 412) shows which block is the least-recently used for a
  sequence of block-frame addresses in a fully associative memory hierarchy
  (a code sketch of this bookkeeping follows).
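As a sketch of the LRU bookkeeping just described, the following C fragment timestamps each access to a 4-block fully associative cache and picks the block with the oldest timestamp as the victim; the access pattern is made up, not the trace of Figure 8.9.

```c
#include <stdio.h>

#define NBLOCKS 4

static unsigned last_used[NBLOCKS];  /* time of most recent use, 0 = never */
static unsigned now;

static void touch(int block) { last_used[block] = ++now; }

static int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < NBLOCKS; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}

int main(void) {
    touch(2); touch(0); touch(3); touch(0);           /* invented pattern */
    printf("LRU victim is block %d\n", lru_victim()); /* block 1: never used */
    return 0;
}
```

Real hardware rarely keeps full timestamps; as the text notes next, LRU is usually approximated as the number of blocks grows.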
`
`
`
`
`
`
[Figure 8.9: table listing a sequence of block-frame addresses with the LRU
block number shown below each reference.]
`
A virtue of random is that it is simple to build in hardware. As the number of
blocks to keep track of increases, LRU becomes increasingly expensive and is
frequently only approximated. Figure 8.10 shows the difference in miss rates
between LRU and random replacement. Replacement policy plays a greater role
in smaller caches than in larger caches where there are more choices of what to
replace.
`
`
`
`
FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in
a fully associative memory hierarchy. This assumes that there are 4 blocks and that in
the beginning the LRU block is number 0. The LRU block number is shown below each new
block reference. Another policy, first-in—first-out (FIFO), simply discards the block that was
used N unique accesses before, independent of its reference pattern in the last N - 1
references. Random replacement generally outperforms FIFO and it is easier to implement.
`
`
`
`
`
`Associativity:
`
`2—way
`
`4-way
`
`8-way
`
`
`
`_ Size
`16 KB
`
`Random
`LRL‘
`
` 5.13% 5.69%
`
`Random
`LRL'
`167% 5.299%-
`
`
`
`LRU
`Ramh
`4.39% 4.96%
`
`
`i5?!-
`I 6F 1.33%
`2.01%
`1.54% 1.66%
`1.39%
`
`llifiKB
`
`FIGUHE 8.10 Miss rates comparing least-recently used versus random repl -
`for several sizes and associatlvities. This data was collected fora block size ol TE -
`using one of the VAX traces containing user and operating system code (SAVED). Thi
`trace is included in the software supplement for course use. There is little ditlerence
`between LRU and random for larger size caches in this trace.
`
Q4: What Happens on a Write?
`
Reads dominate cache accesses. All instruction accesses are reads, and most
instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9%
stores and 17% loads for four DLX programs, making writes less than 10% of
the memory traffic. Making the common case fast means optimizing caches for
reads, but Amdahl's Law reminds us that high-performance designs cannot
neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block
can be read at the same time that the tag is read and compared, so the block read
begins as soon as the block-frame address is available. If the read is a hit, the
block is passed on to the CPU immediately. If it is a miss, there is no benefit—
but also no harm.
`
Such is not the case for writes. The processor specifies the size of the write,
usually between 1 and 8 bytes; only that portion of a block can be changed. In
general this means a read-modify-write sequence of operations on the block:
read the original block, modify one portion, and write the new block value.
Moreover, modifying a block cannot begin until the tag is checked to see if it is
a hit. Because tag checking cannot occur in parallel, then, writes normally take
longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are
two basic options when writing to the cache:
`
- Write through (or store through)—The information is written to both the
  block in the cache and to the block in the lower-level memory.

- Write back (also called copy back or store in)—The information is written
  only to the block in the cache. The modified cache block is written to main
  memory only when it is replaced.

Write-back cache blocks are called clean or dirty, depending on whether the
information in the cache differs from that in lower-level memory. To reduce the
frequency of writing back blocks on replacement, a feature called the dirty bit is
commonly used. This status bit indicates whether or not the block was modified
while in the cache. If it wasn't, the block is not written, since the lower level has
the same information as the cache.
`
Both write back and write through have their advantages. With write back,
writes occur at the speed of the cache memory, and multiple writes within a
block require only one write to the lower-level memory. Since every write
doesn't go to memory, write back uses less memory bandwidth, making write
back attractive in multiprocessors. With write through, read misses don't result
in writes to the lower level, and write through is easier to implement than write
back. Write through also has the advantage that main memory has the most
current copy of the data. This is important in multiprocessors and for I/O, which
we shall examine in Section 8.8. Hence, multiprocessors want write back to
reduce the memory traffic per processor and write through to keep the cache and
memory consistent.

When the CPU must wait for writes to complete during write throughs, the
CPU is said to write stall. A common optimization to reduce write stalls is a
write buffer, which allows the processor to continue while the memory is
updated. As we shall see in Section 8.8, write stalls can occur even with write
buffers.
`
There are two options on a write miss:

- Write allocate (also called fetch on write)—The block is loaded, followed by
  the write-hit actions above. This is similar to a read miss.

- No write allocate (also called write around)—The block is modified in the
  lower level and not loaded into the cache.
`
`
While either write-miss policy could be used with write through or write back,
generally write-back caches use write allocate (hoping that subsequent writes to
that block will be captured by the cache) and write-through caches often use no
write allocate (since subsequent writes to that block will still have to go to
memory).
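The four combinations of write policy and write-miss policy reduce to a small decision procedure. This is a control-flow sketch only; the printed actions stand in for the real datapath operations.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy;
typedef enum { WRITE_ALLOCATE, NO_WRITE_ALLOCATE } miss_policy;

static void handle_write(bool hit, write_policy wp, miss_policy mp) {
    if (!hit) {
        if (mp == NO_WRITE_ALLOCATE) {
            puts("write around the cache to lower-level memory");
            return;
        }
        puts("load the block, then proceed as a write hit");
    }
    if (wp == WRITE_THROUGH)
        puts("write the cache block and the lower-level block");
    else
        puts("write the cache block only and set its dirty bit");
}

int main(void) {
    /* the common pairings described in the text */
    handle_write(false, WRITE_BACK, WRITE_ALLOCATE);
    handle_write(false, WRITE_THROUGH, NO_WRITE_ALLOCATE);
    return 0;
}
```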
`
An Example Cache: The VAX-11/780 Cache

To give substance to these ideas, Figure 8.11 shows the organization of the
cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte
blocks with two-way-set-associative placement, random replacement, write
through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11.
(The five steps are shown as circled numbers.) The address coming into the
cache is divided into two fields: the 29-bit block-frame address and 3-bit block
offset. The block-frame address is further divided into an address tag and cache
index. Step 1 shows this division.
The cache index selects the set to be tested to see if the block is in the cache.
(A set is one block from each bank in Figure 8.11.) The size of the index
depends on cache size, block size, and set associativity. In this case, a 9-bit
index results:

Blocks/Bank = Cache size / (Block size × Set associativity) = 8192 / (8 × 2) = 512 = 2^9

In a two-way-set-associative cache, the index is sent to both banks. This is
step 2.
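A quick check of this arithmetic in C (the 29-bit block-frame address is from the text; the code itself is just a sketch):

```c
#include <stdio.h>

static unsigned log2u(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    unsigned cache_bytes = 8192, block_bytes = 8, assoc = 2;
    unsigned blocks_per_bank = cache_bytes / (block_bytes * assoc); /* 512 */
    unsigned index_bits  = log2u(blocks_per_bank);                  /* 9   */
    unsigned offset_bits = log2u(block_bytes);                      /* 3   */
    unsigned tag_bits    = 29 - index_bits;                         /* 20  */
    printf("%u blocks/bank: %u index, %u offset, %u tag bits\n",
           blocks_per_bank, index_bits, offset_bits, tag_bits);
    return 0;
}
```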
After reading an address tag from each bank, the tag portion of the block-
frame address is compared to the tags. This is step 3 in the figure. To be sure the
tag contains valid information, the valid bit must be set, or the results of the
comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to
select the block from the matching set. Why can't both tags match? It is the job
of the replacement algorithm to make sure that an address appears in only one
block. To reduce the hit time, the data is read at the same time as the address
tags; thus, by the time the block multiplexer is ready, the data is also ready.
This step is needed in set-associative caches, but it can be omitted from
direct-mapped caches since there is no selection to be made. The multiplexer
used in this step can be on the critical timing path, endangering the clock cycle
time of the CPU. (The example on pages 418-419 and the fallacy on page 481
explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a
single CPU clock cycle.

What happens on a miss? The cache sends a stall signal to the CPU telling it
to wait, and two words (eight bytes) are read from memory. That takes 6 clock
cycles on the VAX-11/780 (ignoring bus interference). When the data arrives,
`
`
`
`
the cache must pick a block to replace; the VAX-11/780 selects one of the two
sets at random. Replacing a block means updating the data, the address tag, and
the valid bit. Once this is done, the cache goes through a regular hit cycle and
returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If
the word to be written is in the cache, the first four steps are the same. The next
step is to write the data in the block, then write the changed-data portion into the
cache.
`
`
`
[Figure 8.11: block diagram of the two-way-set-associative VAX-11/780 cache,
showing the CPU address divided into tag, index, and block offset, the two banks
of 512 blocks with valid bits, the tag comparators, and the 2:1 multiplexer, with
the five steps of a read hit circled.]
`
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way
set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is
selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order
of occurrence, label this organization. The line from memory to the cache is used on a miss
to load the cache. Such multiplexing is not needed in a direct-mapped cache. Note that the
offset is connected to chip select of the data SRAMs to allow the proper words to be sent to
the 2:1 multiplexer.
`
`
`
The VAX-11/780 uses no write allocate. Consequently, on a write miss
the CPU writes "around" the cache to lower-level memory and does not affect
the cache.

Since this is a write-through cache, the process isn't yet over. The word is
also sent to a one-word write buffer. If the write buffer is empty, the word and
full address are written in the buffer, and we are finished. The CPU continues
working while the write buffer writes the word to memory. If the buffer is full,
the cache (and CPU) must wait until the buffer is empty.
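The one-word write buffer behaves like a single-entry queue between the CPU and memory. A minimal sketch, with made-up addresses and data:

```c
#include <stdbool.h>
#include <stdio.h>

struct write_buffer { bool full; unsigned addr, word; };

/* Returns true if the CPU must stall because the buffer is occupied. */
static bool buffer_write(struct write_buffer *wb, unsigned addr, unsigned word) {
    if (wb->full)
        return true;            /* wait until memory drains the buffer */
    wb->addr = addr;
    wb->word = word;
    wb->full = true;
    return false;               /* CPU continues working */
}

static void memory_drain(struct write_buffer *wb) { wb->full = false; }

int main(void) {
    struct write_buffer wb = { false, 0, 0 };
    printf("stall? %d\n", buffer_write(&wb, 0x100, 42)); /* 0: buffer empty */
    printf("stall? %d\n", buffer_write(&wb, 0x104, 43)); /* 1: buffer full  */
    memory_drain(&wb);                                   /* memory finishes */
    printf("stall? %d\n", buffer_write(&wb, 0x104, 43)); /* 0 again         */
    return 0;
}
```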
`
Cache Performance

CPU time can be divided into the clock cycles the CPU spends executing the
program and the clock cycles the CPU spends waiting for the memory system.
Thus,

CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) × Clock cycle time
`
To simplify evaluation of cache alternatives, sometimes designers assume
that all memory stalls are due to the cache. This is true for many machines; on
machines where this is not true, the cache still dominates stalls that are not
exclusively due to the cache. We use this simplifying assumption here, but it is
important to account for all memory stalls when calculating final performance.

The formula above raises the question whether the clock cycles for a cache
access should be considered part of CPU-execution clock cycles or part of mem-
ory-stall clock cycles. While either convention is defensible, the most widely
accepted is to include hit clock cycles in CPU-execution clock cycles.

Memory-stall clock cycles can then be defined in terms of the number of
memory accesses per program, miss penalty (in clock cycles), and miss rate for
reads and writes:
`
Memory-stall clock cycles = Reads/Program × Read miss rate × Read miss penalty
                          + Writes/Program × Write miss rate × Write miss penalty

We simplify the complete formula by combining the reads and writes together:

Memory-stall clock cycles = Memory accesses/Program × Miss rate × Miss penalty
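The combined form is exact when reads and writes see the same miss penalty, since the overall miss rate is just the access-weighted average of the two. A numeric check with made-up traffic figures:

```c
#include <stdio.h>

int main(void) {
    double reads = 1.17e9, writes = 0.09e9; /* accesses per program (made up) */
    double read_mr = 0.02, write_mr = 0.05; /* miss rates (made up)           */
    double penalty = 10.0;                  /* clock cycles, same for both    */

    double separate = reads * read_mr * penalty + writes * write_mr * penalty;

    double accesses  = reads + writes;
    double miss_rate = (reads * read_mr + writes * write_mr) / accesses;
    double combined  = accesses * miss_rate * penalty;

    printf("separate: %.4g cycles, combined: %.4g cycles\n", separate, combined);
    return 0;
}
```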
`
Factoring instruction count (IC) from execution time and memory-stall clock
cycles, we now get a CPU-time formula that includes memory accesses per
instruction, miss rate, and miss penalty:

CPU time = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
`Qualcomm, Ex. 1018, Page 11
`
`Qualcomm, Ex. 1018, Page 11
`
`
`
`
`
`I write miss
`s not affect
`
`he word is
`3 word and
`i continuesr
`tier is fair
`
`Memory-Hierarchy Design
`
`417
`
Some designers prefer measuring miss rate as misses per instruction rather
than misses per memory reference:

Misses/Instruction = Memory accesses/Instruction × Miss rate
`
The advantage of this measure is that it is independent of the hardware
implementation. For example, the VAX-11/780 instruction unit can make
repeated references to a single byte (see Section 8.7), which can artificially
reduce the miss rate if measured as misses per memory reference rather than per
instruction executed. The drawback is that this measure is architecture
dependent, thus it is most popular with architects working with a single
computer family. They then use this version of the CPU-time formula:

CPU time = IC × (CPI_Execution + Misses/Instruction × Miss penalty) × Clock cycle time
`
`We can now explore the consequences of caches on performance.
`
Example

Let's use the VAX-11/780 as a first example. The cache miss penalty is 6 clock
cycles, and all instructions normally take 8.5 clock cycles (ignoring memory
stalls). Assume the miss rate is 11%, and there is an average of 3.0 memory
references per instruction. What is the impact on performance when behavior of
the cache is included?
`
Answer

CPU time = IC × (CPI_Execution + Memory-stall clock cycles/Instruction) × Clock cycle time

The performance, including cache misses, is

CPU time_with cache = IC × (8.5 + 3.0 × 11% × 6) × Clock cycle time
                    = Instruction count × 10.5 × Clock cycle time

The clock cycle time and instruction count are the same, with or without a
cache, so CPU time increases with CPI from 8.5 to 10.5. Hence, the impact of
the memory hierarchy is to stretch the CPU time by 24%.
`
`
Example

Let's now calculate the impact on performance when behavior of the cache is
included on a machine with a lower CPI. Assume that the cache miss penalty is
10 clock cycles and, on average, instructions take 1.5 clock cycles; the miss rate
is 11%, and there is an average of 1.4 memory references per instruction.
`
Answer

CPU time = IC × (CPI_Execution + Memory-stall clock cycles/Instruction) × Clock cycle time

Making the same assumptions as in the previous example on cache hits, the per-
formance, including cache misses, is

CPU time_with cache = IC × (1.5 + 1.4 × 11% × 10) × Clock cycle time
                    = Instruction count × 3.0 × Clock cycle time

The clock cycle time and instruction count are the same, with or without a
cache, so CPU time increases with CPI from 1.5 to 3.0. Including cache
behavior doubles execution time.
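Both examples follow from one line of arithmetic; a small sketch that reproduces them:

```c
#include <stdio.h>

/* Effective CPI = CPI_execution + refs/instr * miss rate * miss penalty */
static double effective_cpi(double cpi_exec, double refs_per_instr,
                            double miss_rate, double miss_penalty) {
    return cpi_exec + refs_per_instr * miss_rate * miss_penalty;
}

int main(void) {
    /* VAX-11/780: 8.5 CPI, 3.0 refs/instr, 11% misses, 6-cycle penalty */
    printf("%.2f\n", effective_cpi(8.5, 3.0, 0.11, 6.0));  /* 10.48, ~10.5 */
    /* low-CPI machine: 1.5 CPI, 1.4 refs/instr, 11%, 10-cycle penalty   */
    printf("%.2f\n", effective_cpi(1.5, 1.4, 0.11, 10.0)); /* 3.04, ~3.0  */
    return 0;
}
```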
`
As these examples illustrate, cache-behavior penalties range from significant
to enormous. Furthermore, cache misses have a double-barreled impact on a
CPU with a low CPI and a fast clock:

1. The lower the CPI, the more pronounced the impact is.

2. Independent of the CPU, main memories have similar memory-access times,
   since they are built from the same memory chips. When calculating CPI,
   the cache miss penalty is measured in CPU clock cycles needed for a miss.
   Therefore, a higher CPU clock rate leads to a larger miss penalty, even if
   main memories are the same speed.

The importance of the cache for CPUs with low CPI and high clock rates is thus
greater; and, consequently, greater is the danger of neglecting cache behavior in
assessing performance of such machines.

While minimizing average memory-access time is a reasonable goal and we
will use it in much of this chapter, keep in mind that the final goal is to reduce
CPU execution time.

Average memory-access time = Hit time + Miss rate × Miss penalty
`
Example

What is the impact of two different cache organizations on the performance of a
CPU? Assume that the CPI is normally 1.5 with a clock cycle time of 20 ns, that
there are 1.3 memory references per instruction, and that the size of both caches
is 64 KB. One cache is direct mapped and the other is two-way set associative.
Since the speed of the CPU is tied directly to the speed of the caches, assume the
CPU clock cycle time must be stretched 8.5% to accommodate the selection
multiplexer of the set-associative cache (step 4 in Figure 8.11 on page 415). To
the first approximation, the cache miss penalty is 200 ns for either cache
organization. (In practice it must be rounded up or down to an integer number of
clock cycles.) First, calculate the average memory-access time, and then CPU
performance.
`
`-
`
`Figure 8.12 on page 421 shims that the miss rate of a direct-mapped 6-1- !_
`cache is 3.9% and the miss rate for a two—W'ay~sct—associati\fe cache of the
`size is 3.0%. Average memory-access time is
`
`Qualcomm, Ex. 1018, Page 13
`
`Qualcomm, Ex. 1018, Page 13
`
`
`
`
`
` Memory-Hierarchy Design
`
`Thus. the time for each organization is
`
`
`Average memory-aceess timc1_way = 20 + .O39*200 = 27.8 ns
`
`Average memory-access timezway = 20*1085 + 030*200 = 27.7 ns
`
`
The average memory-access time is better for the two-way-set-associative
cache.

CPU performance is

CPU time = IC × (CPI_Execution + Misses/Instruction × Miss penalty) × Clock cycle time

         = IC × (CPI_Execution × Clock cycle time +
                 Memory accesses/Instruction × Miss rate × Miss penalty × Clock cycle time)

Substituting 200 ns for (Miss penalty × Clock cycle time), the performance of
each cache organization is

CPU time_1-way = IC × (1.5 × 20 + 1.3 × 0.039 × 200) = 40.1 × IC
CPU time_2-way = IC × (1.5 × 20 × 1.085 + 1.3 × 0.030 × 200) = 40.4 × IC

and relative performance is

CPU time_2-way / CPU time_1-way = (40.4 × Instruction count) / (40.1 × Instruction count) = 1.01
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
In contrast to the results of average access-time comparison, the direct-mapped
cache leads to slightly better performance. Since CPU time is our bottom-line
evaluation (and direct mapped is simpler to build), the preferred cache is direct
mapped in this example. (See the fallacy on page 481 for more on this kind of
tradeoff.)
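The whole comparison fits in a few lines of C; the inputs below are exactly the example's assumptions.

```c
#include <stdio.h>

int main(void) {
    double clock = 20.0;             /* ns, direct-mapped machine          */
    double stretch = 1.085;          /* 8.5% slower clock for two-way      */
    double cpi = 1.5, refs = 1.3;    /* base CPI, memory refs per instr    */
    double penalty_ns = 200.0;       /* miss penalty                       */
    double mr1 = 0.039, mr2 = 0.030; /* miss rates from Figure 8.12        */

    printf("AMAT: 1-way %.1f ns, 2-way %.1f ns\n",
           clock + mr1 * penalty_ns,
           clock * stretch + mr2 * penalty_ns);              /* 27.8 vs 27.7 */

    printf("CPU time/instr: 1-way %.1f ns, 2-way %.1f ns\n",
           cpi * clock + refs * mr1 * penalty_ns,
           cpi * clock * stretch + refs * mr2 * penalty_ns); /* 40.1 vs 40.4 */
    return 0;
}
```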
`
`
`
The Three Sources of Cache Misses: Compulsory,
Capacity, and Conflicts

An intuitive model of cache behavior attributes all misses to one of three
sources:

- Compulsory—The first access to a block is not in the cache, so the block
  must be brought into the cache. These are also called cold-start misses or
  first-reference misses.

- Capacity—If the cache cannot contain all the blocks needed during execution
  of a program, capacity misses will occur due to blocks being discarded and
  later retrieved.
`
`
`‘
`
`I
`_
`
`Having identified the three Cs. what can a computcr designer tin 11m
`Conceptually. conflicts are the easiest: Fully associative ptacemem -
`conflict misses. Associativity is expensive in hardware. however. and -.
`access time {see the example ahove or the second fallacy in Sccticu -
`leading to lower overall performance. There is little to he done about ..
`except to buy larger memory chips. If the upper-level memory is much
`than what is needed for a program, and a significant percentage oi the
`spent moving data between two levels in the hierarchy, the memory In
`-
`said to thrash. Because so many replacements are required. thrashing
`machine runs close to the speed of the lowcrxlevel memory. or may!!!
`slower due to the mists overhead. Making blocks larger reduces the
`compulsory misses. but it can increase conflict misses.
`The three C‘s give insight into the cause of misses. but this simple
`its limits. For example. increasing cache size reduces conflict misses a:
`capacity misses. since a larger cache spreads out references. Thus, a turn
`move from one category to the other as parameters change. Thrcr: C's.”
`:
`replacement policy, since it is difficult to model and since. in general. it it -
`significance. In specific circumstances the replacement policy can actually .
`to anomalous behavior, such as poorer miss rates for larger associativil)‘.
`
`is directly contradictory to the three C’s model.
`
`. Confirm—1f the black—plum mug u «at am urban
`conflict misses tin addition to compulsory in! alpaca} man:
`because a block can be discarded and Inc-r retrieved If we may -'
`to its set. These are also called I‘t’flrtrflh’lfl miner
`
`-
`
Figure 8.12 shows the relative frequency of cache misses, broken down by
the "three Cs." To show the benefit of associativity, conflict misses are divided
into misses caused by each decrease in associativity. The categories are labeled
n-way, meaning the misses caused by going to the lower level of associativity
from the next one above. Here are the four categories:

8-way: from fully associative (no conflicts) to 8-way associative

4-way: from 8-way associative to 4-way associative

2-way: from 4-way associative to 2-way associative

1-way: from 2-way associative to 1-way associative (direct mapped)

Figure 8.13 (page 422) presents the same data graphically. The top graph shows
absolute miss rates; the bottom graph plots percentage of all the misses by cache
size.

Having identified the three Cs, what can a computer designer do about them?
Conceptually, conflicts are the easiest: Fully associative placement avoids all
conflict misses. Associativity is expensive in hardware, however, and may slow
access time (see the example above or the second fallacy in Section 8.10),
leading to lower overall performance. There is little to be done about capacity
except to buy larger memory chips. If the upper-level memory is much smaller
than what is needed for a program, and a significant percentage of the time is
spent moving data between two levels in the hierarchy, the memory hierarchy is
said to thrash. Because so many replacements are required, thrashing means the
machine runs close to the speed of the lower-level memory, or maybe even
slower due to the miss overhead. Making blocks larger reduces the number of
compulsory misses, but it can increase conflict misses.

The three Cs give insight into the cause of misses, but this simple model has
its limits. For example, increasing cache size reduces conflict misses as well as
capacity misses, since a larger cache spreads out references. Thus, a miss might
move from one category to the other as parameters change. The three Cs ignore
replacement policy, since it is difficult to model and since, in general, it is of less
significance. In specific circumstances the replacement policy can actually lead
to anomalous behavior, such as poorer miss rates for larger associativity, which
is directly contradictory to the three Cs model.
`
`
`
[Figure 8.12: table giving, for a range of cache sizes (1 KB and up) and
associativities of 1-way through 8-way, the total miss rate and its three-Cs
breakdown: the compulsory, capacity, and conflict components in absolute terms
and as percentages of all misses. The compulsory component is 0.009 at every
size shown; the 64-KB direct-mapped and two-way total miss rates of 0.039 and
0.030 cited in the example above come from this table.]