COMPUTER ARCHITECTURE
A QUANTITATIVE APPROACH

JOHN L. HENNESSY & DAVID A. PATTERSON
Sponsoring Editor: Bruce Spatz
Production Manager: Shirley Jowell
Technical Writer: Walker Cunningham
Text Design: Gary Head
Cover Design: David Lance Goines
Copy Editor: Linda Medoff
Proofreader: Paul Medoff
Computer Typesetting and Graphics: Fifth Street Computer Services
Library of Congress Cataloging-in-Publication Data

Patterson, David A.
    Computer architecture : a quantitative approach / David A.
    Patterson, John L. Hennessy.
        p. cm.
    Includes bibliographical references.
    ISBN 1-55860-069-8
    1. Computer architecture. 2. Electronic digital computers--Design
    and construction. I. Hennessy, John L. II. Title.
    QA76.9.A73P377 1990
    004.2'2--dc20                                              89-85227
                                                                    CIP

Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, recording, or otherwise—without the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained herein is copyrighted by the publisher and may not be incorporated in other publications or distributed by media without formal acknowledgement and written consent from the publisher. Use of the DLX in other publications for educational purposes is encouraged and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to the editorial offices of Morgan Kaufmann Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (U.S.) per correction. Electronic mail can be sent to bugs@vsop.stanford.edu.

INSTRUCTOR SUPPORT: For information on classroom software and other instructor materials available to adopters, please contact the editorial offices of Morgan Kaufmann Publishers, Inc.
8.3 Caches
Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language,
Second College Edition
Cache is the name first chosen to represent the level of the memory hierarchy between the CPU and main memory, and that is the dominant use of the term. While the concept of caches is younger than the IBM 360 architecture, caches appear today in every class of computer and in some computers more than once. In fact, the word has become so popular that it has replaced "buffer" in many computer-science circles.

The general terms defined in the prior section can be used for caches, although the word line is often used instead of block. Figure 8.5 shows the typical range of memory-hierarchy parameters for caches.
Block (line) size        4 – 128 bytes
Hit time                 1 – 4 clock cycles (normally 1)
Miss penalty             8 – 32 clock cycles
    (Access time)        (6 – 10 clock cycles)
    (Transfer time)      (2 – 22 clock cycles)
Miss rate                1% – 20%
Cache size               1 KB – 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in workstations and minicomputers.
Now let's examine caches in more detail by answering the four memory-hierarchy questions.
Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache organization:

- If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (block-frame address) modulo (number of blocks in cache).

- If a block can be placed anywhere in the cache, the cache is said to be fully associative.
- If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of two or more blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within the set. The set is usually chosen by bit selection; that is, (block-frame address) modulo (number of sets in cache). If there are n blocks in a set, the cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative and a fully associative cache with m blocks could be called m-way set associative. Figure 8.6 shows where block 12 can be placed in a cache according to the block-placement policy.

FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-associative organization has 4 sets with 2 blocks per set, called two-way set associative. (Real caches contain hundreds of blocks and real memories contain hundreds of thousands of blocks.) Assume that there is nothing in the cache and that the block-frame address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the 8 blocks of the cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.
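To make the three placement policies concrete, here is a minimal sketch (not from the book; the function names are ours) that lists the candidate cache blocks for lower-level block 12 under each policy, using the 8-block cache of Figure 8.6:

    CACHE_BLOCKS = 8

    def direct_mapped(block_frame_addr):
        # one candidate: (block-frame address) modulo (number of blocks in cache)
        return [block_frame_addr % CACHE_BLOCKS]

    def fully_associative(block_frame_addr):
        # any block in the cache is a candidate
        return list(range(CACHE_BLOCKS))

    def set_associative(block_frame_addr, n_way):
        # map to a set first, then place anywhere within that set
        n_sets = CACHE_BLOCKS // n_way
        first = (block_frame_addr % n_sets) * n_way
        return list(range(first, first + n_way))

    print(direct_mapped(12))        # [4]      (12 modulo 8)
    print(set_associative(12, 2))   # [0, 1]   (set 0 = 12 modulo 4)
    print(fully_associative(12))    # [0, 1, 2, 3, 4, 5, 6, 7]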
Q2: How Is a Block Found If It Is in the Cache?

Caches include an address tag on each block that gives the block-frame address. The tag of every cache block that might contain the desired information is checked to see if it matches the block-frame address from the CPU. Figure 8.7 gives an example. Because speed is of the essence, all possible tags are searched in parallel; serial search would make set associativity counterproductive.
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can appear in any of the 8 blocks; thus, all 8 tags must be searched. The data is found in cache block 6 in this example. In direct-mapped placement there is only one cache block where memory block 12 can be found. In set-associative placement, with 4 sets, memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are checked. In this case the data is found in cache block 1. Speed of cache access dictates that searching must be performed in parallel for fully associative and set-associative mappings.
There must be a way to know that a cache block does not contain valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the tag memory. One tag is required for each block. An advantage of larger block sizes is that the tag overhead per cache entry becomes a smaller fraction of the total cost of the cache.
Before proceeding to the next question, let's explore the relationship of a CPU address to the cache. Figure 8.8 shows how an address is divided into three fields to find data in a set-associative cache: the block-offset field used to select the desired data from the block, the index field used to select the set, and the tag field used for the comparison. While the comparison could be made on more of the address than the tag, there is no need:
- Checking the index would be redundant, since it was used to select the set to be checked (an address stored in set 0, for example, must have 0 in the index field or it couldn't be stored in set 0).

- The offset is unnecessary in the comparison because all block offsets match and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with increasing associativity.
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set and the index is used to select the set. The block offset is the address of the desired data within the block.
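The field division and the tag check can be sketched in a few lines (ours, not the book's; the dictionary block representation and function names are assumptions, and block_size and n_sets are taken to be powers of 2):

    def split_address(addr, block_size, n_sets):
        block_offset = addr % block_size
        block_frame = addr // block_size      # block-frame address
        index = block_frame % n_sets          # selects the set
        tag = block_frame // n_sets           # compared against the stored tags
        return tag, index, block_offset

    def lookup(cache_sets, addr, block_size, n_sets):
        tag, index, offset = split_address(addr, block_size, n_sets)
        # Hardware checks every tag in the set in parallel, with a valid bit
        # guarding each comparison; a loop stands in for that search here.
        for block in cache_sets[index]:
            if block["valid"] and block["tag"] == tag:
                return block["data"][offset]  # hit
        return None                           # miss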
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't, then it would be easy to select which block to replace. Alas, the high hit rate of caches means that the overwhelming decision is between blocks that have valid data.

A benefit of direct-mapped placement is that hardware decisions are simplified. In fact, so simple that there is no choice: Only one block is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are several blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:

- Random—To spread allocation uniformly, candidate blocks are randomly selected. Some systems use a scheme for spreading data across a set of blocks in a pseudorandomized manner to get reproducible behavior, which is particularly useful during hardware debugging.

- Least-recently used (LRU)—To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least recently used. Figure 8.9 (page 412) shows which block is the least-recently used for a sequence of block-frame addresses in a fully associative memory hierarchy.
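The bookkeeping behind the two policies fits in a small sketch (ours, not the book's; a list of last-access times stands in for the usage bits a real cache would keep):

    import random

    def victim_random(n_way):
        # spread allocation uniformly across the set
        return random.randrange(n_way)

    def victim_lru(last_access):
        # last_access[i] is when block i of the set was last referenced;
        # replace the block that has gone unused the longest
        return min(range(len(last_access)), key=lambda i: last_access[i])

    print(victim_lru([50, 12, 48, 33]))   # 1: block 1 is least recently used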
A virtue of random is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 8.10 shows the difference in miss rates between LRU and random replacement. Replacement policy plays a greater role in smaller caches than in larger caches where there are more choices of what to replace.
FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in a fully associative memory hierarchy. This assumes that there are 4 blocks and that in the beginning the LRU block is number 0. The LRU block number is shown below each new block reference. Another policy, first-in-first-out (FIFO), simply discards the block that was used N unique accesses before, independent of its reference pattern in the last N – 1 references. Random replacement generally outperforms FIFO and it is easier to implement.
Associativity:      2-way               4-way               8-way
Size           LRU      Random     LRU      Random     LRU      Random
16 KB          5.18%    5.69%      4.67%    5.29%      4.39%    4.96%
64 KB          1.88%    2.01%      1.54%    1.66%      1.39%    1.53%
256 KB         1.15%    1.17%      1.12%    1.12%      1.13%    1.13%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement for several sizes and associativities. This data was collected for a block size of 16 bytes using one of the VAX traces containing user and operating system code. This trace is included in the software supplement for course use. There is little difference between LRU and random for larger size caches in this trace.
Q4: What Happens on a Write?

Reads dominate cache accesses. All instruction accesses are reads, and most instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9% stores and 17% loads for four DLX programs, making writes less than 10% of the memory traffic. Making the common case fast means optimizing caches for reads, but Amdahl's Law reminds us that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read at the same time that the tag is read and compared, so the block read begins as soon as the block-frame address is available. If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there is no benefit, but also no harm.
Such is not the case for writes. The processor specifies the size of the write, usually between 1 and 8 bytes; only that portion of a block can be changed. In general this means a read-modify-write sequence of operations on the block: read the original block, modify one portion, and write the new block value. Moreover, modifying a block cannot begin until the tag is checked to see if it is a hit. Because tag checking cannot occur in parallel, writes normally take longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are two basic options when writing to the cache:

- Write through (or store through)—The information is written to both the block in the cache and to the block in the lower-level memory.

- Write back (also called copy back or store in)—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
Write-back cache blocks are called clean or dirty, depending on whether the information in the cache differs from that in lower-level memory. To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether or not the block was modified while in the cache. If it wasn't, the block is not written, since the lower level has the same information as the cache.

Both write back and write through have their advantages. With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since every write doesn't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses don't result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that main memory has the most current copy of the data. This is important in multiprocessors and for I/O, which we shall examine in Section 8.8. Hence, multiprocessors want write back to reduce the memory traffic per processor and write through to keep the cache and memory consistent.

When the CPU must wait for writes to complete during write throughs, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue while the memory is updated. As we shall see in Section 8.8, write stalls can occur even with write buffers.
There are two options on a write miss:

- Write allocate (also called fetch on write)—The block is loaded, followed by the write-hit actions above. This is similar to a read miss.

- No write allocate (also called write around)—The block is modified in the lower level and not loaded into the cache.
While either write-miss policy could be used with write through or write back, generally write-back caches use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no write allocate (since subsequent writes to that block will still have to go to memory).
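How the write-hit and write-miss policies combine can be seen in a short sketch (ours, not a real cache controller: cache maps block-frame addresses to blocks, lower stands for the lower-level memory, and the write-back of a dirty victim at replacement time is omitted):

    def write(cache, lower, addr, data, write_back=True, write_allocate=True):
        block = cache.get(addr)                # present means a write hit
        if block is None and write_allocate:
            # write allocate: load the block, then proceed as a write hit
            block = cache[addr] = {"data": lower[addr], "dirty": False}
        if block is not None:
            block["data"] = data
            if write_back:
                block["dirty"] = True          # memory updated only at replacement
            else:
                lower[addr] = data             # write through: update memory now
        else:
            lower[addr] = data                 # no write allocate: write around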
An Example Cache: The VAX-11/780 Cache

To give substance to these ideas, Figure 8.11 shows the organization of the cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte blocks with two-way set-associative placement, random replacement, write through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11. (The five steps are shown as circled numbers.) The address coming into the cache is divided into two fields: the 29-bit block-frame address and 3-bit block offset. The block-frame address is further divided into an address tag and cache index. Step 1 shows this division.
The cache index selects the set to be tested to see if the block is in the cache. (A set is one block from each bank in Figure 8.11.) The size of the index depends on cache size, block size, and set associativity. In this case, a 9-bit index results:

    Blocks/Bank = Cache size / (Block size * Set associativity) = 8192 / (8 * 2) = 512 = 2^9
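The same calculation in code (a sketch using the VAX-11/780 parameters just given):

    import math

    cache_size = 8192      # bytes
    block_size = 8         # bytes per block
    associativity = 2      # two-way set associative

    blocks_per_bank = cache_size // (block_size * associativity)
    index_bits = int(math.log2(blocks_per_bank))
    print(blocks_per_bank, index_bits)   # 512 blocks per bank -> 9-bit index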
In a two-way-set-associative cache, the index is sent to both banks. This is step 2.

After reading an address tag from each bank, the tag portion of the block-frame address is compared to the tags. This is step 3 in the figure. To be sure the tag contains valid information, the valid bit must be set, or the results of the comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to select the block from the matching set. Why can't both tags match? It is the job of the replacement algorithm to make sure that an address appears in only one block. To reduce the hit time, the data is read at the same time as the address tags; thus, by the time the block multiplexer is ready, the data is also ready.

This step is needed in set-associative caches, but it can be omitted from direct-mapped caches since there is no selection to be made. The multiplexer used in this step can be on the critical timing path, endangering the clock cycle time of the CPU. (The example on pages 418–419 and the fallacy on page 451 explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a single CPU clock cycle.
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order of occurrence, label this organization. The line from memory to the cache is used on a miss to load the cache. Such multiplexing is not needed in a direct-mapped cache. Note that the offset is connected to chip select of the data SRAMs to allow the proper words to be sent to the 2:1 multiplexer.
What happens on a miss? The cache sends a stall signal to the CPU telling it to wait, and two words (eight bytes) are read from memory. That takes 6 clock cycles on the VAX-11/780 (ignoring bus interference). When the data arrives, the cache must pick a block to replace; the VAX-11/780 selects one of the two sets at random. Replacing a block means updating the data, the address tag, and the valid bit. Once this is done, the cache goes through a regular hit cycle and returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If the word to be written is in the cache, the first four steps are the same. The next step is to write the data in the block, then write the changed-data portion into the cache.
The VAX-11/780 uses no write allocate. Consequently, on a write miss the CPU writes "around" the cache to lower-level memory and does not affect the cache.

Since this is a write-through cache, the process isn't yet over. The word is also sent to a one-word write buffer. If the write buffer is empty, the word and full address are written in the buffer, and we are finished. The CPU continues working while the write buffer writes the word to memory. If the buffer is full, the cache (and CPU) must wait until the buffer is empty.
Cache Performance

CPU time can be divided into the clock cycles the CPU spends executing the program and the clock cycles the CPU spends waiting for the memory system. Thus,

    CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) * Clock cycle time

To simplify evaluation of cache alternatives, sometimes designers assume that all memory stalls are due to the cache. This is true for many machines; on machines where this is not true, the cache still dominates stalls that are not exclusively due to the cache. We use this simplifying assumption here, but it is important to account for all memory stalls when calculating final performance.

The formula above raises the question whether the clock cycles for a cache access should be considered part of CPU-execution clock cycles or part of memory-stall clock cycles. While either convention is defensible, the most widely accepted is to include hit clock cycles in CPU-execution clock cycles.

Memory-stall clock cycles can then be defined in terms of the number of memory accesses per program, miss penalty (in clock cycles), and miss rate for reads and writes:

    Memory-stall clock cycles = Reads/Program * Read miss rate * Read miss penalty
                              + Writes/Program * Write miss rate * Write miss penalty

We simplify the complete formula by combining the reads and writes together:

    Memory-stall clock cycles = Memory accesses/Program * Miss rate * Miss penalty

Factoring instruction count (IC) from execution time and memory-stall cycles, we now get a CPU-time formula that includes memory accesses per instruction, miss rate, and miss penalty:

    CPU time = IC * (CPI_execution + Memory accesses/Instruction * Miss rate * Miss penalty) * Clock cycle time
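As a sketch, this final formula translates directly into code (ours; the parameter names are illustrative, with clock being the cycle time in seconds):

    def cpu_time(ic, cpi_execution, mem_accesses_per_instr,
                 miss_rate, miss_penalty, clock):
        # memory-stall clock cycles per instruction
        stalls = mem_accesses_per_instr * miss_rate * miss_penalty
        return ic * (cpi_execution + stalls) * clock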
Some designers prefer measuring miss rate as misses per instruction rather than misses per memory reference:

    Misses/Instruction = Memory accesses/Instruction * Miss rate

The advantage of this measure is that it is independent of the hardware implementation. For example, the VAX-11/780 instruction unit can make repeated references to a single byte (see Section 8.7), which can artificially reduce the miss rate if measured as misses per memory reference rather than per instruction executed. The drawback is that this measure is architecture dependent, thus it is most popular with architects working with a single computer family. They then use this version of the CPU-time formula:
    CPU time = IC * (CPI_execution + Misses/Instruction * Miss penalty) * Clock cycle time

We can now explore the consequences of caches on performance.
Example
Let's use the VAX-11/780 as a first example. The cache miss penalty is 6 clock cycles, and all instructions normally take 8.5 clock cycles (ignoring memory stalls). Assume the miss rate is 11%, and there is an average of 3.0 memory references per instruction. What is the impact on performance when behavior of the cache is included?

Answer
    CPU time = IC * (CPI_execution + Memory-stall clock cycles/Instruction) * Clock cycle time

The performance, including cache misses, is

    CPU time_with cache = IC * (8.5 + 3.0 * 11% * 6) * Clock cycle time
                        = Instruction count * 10.5 * Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 8.5 to 10.5. Hence, the impact of the memory hierarchy is to stretch the CPU time by 24%.
Example
Let's now calculate the impact on performance when behavior of the cache is included on a machine with a lower CPI. Assume that the cache miss penalty is 10 clock cycles and, on average, instructions take 1.5 clock cycles; the miss rate is 11%, and there is an average of 1.4 memory references per instruction.

Answer
    CPU time = IC * (CPI_execution + Memory-stall clock cycles/Instruction) * Clock cycle time
Making the same assumptions as in the previous example on cache hits, the performance, including cache misses, is

    CPU time_with cache = IC * (1.5 + 1.4 * 11% * 10) * Clock cycle time
                        = Instruction count * 3.0 * Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 1.5 to 3.0. Including cache behavior doubles execution time.
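Both examples can be re-run in a few lines (a sketch; the text rounds 10.48 to 10.5 and 3.04 to 3.0):

    for cpi, refs, penalty in [(8.5, 3.0, 6), (1.5, 1.4, 10)]:
        effective_cpi = cpi + refs * 0.11 * penalty   # add memory-stall cycles
        print(cpi, "->", round(effective_cpi, 2))
    # prints: 8.5 -> 10.48 and 1.5 -> 3.04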
As these examples illustrate, cache-behavior penalties range from significant to enormous. Furthermore, cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock:

1. The lower the CPI, the more pronounced the impact is.

2. Independent of the CPU, main memories have similar memory-access times, since they are built from the same memory chips. When calculating CPI, the cache miss penalty is measured in CPU clock cycles needed for a miss. Therefore, a higher CPU clock rate leads to a larger miss penalty, even if main memories are the same speed.

The importance of the cache for CPUs with low CPI and high clock rates is thus greater; and, consequently, greater is the danger of neglecting cache behavior in assessing performance of such machines. While minimizing average memory-access time is a reasonable goal and we will use it in much of this chapter, keep in mind that the final goal is to reduce CPU execution time.
Example
What is the impact of two different cache organizations on the performance of a CPU? Assume that the CPI is normally 1.5 with a clock cycle time of 20 ns, that there are 1.3 memory references per instruction, and that the size of both caches is 64 KB. One cache is direct mapped and the other is two-way set associative. Since the speed of the CPU is tied directly to the speed of the caches, assume the CPU clock cycle time must be stretched 8.5% to accommodate the selection multiplexer of the set-associative cache (step 4 in Figure 8.11 on page 415). To the first approximation, the cache miss penalty is 200 ns for either cache organization. (In practice it must be rounded up or down to an integer number of clock cycles.) First, calculate the average memory-access time, and then CPU performance.
Answer
Figure 8.12 on page 421 shows that the miss rate of a direct-mapped 64-KB cache is 3.9% and the miss rate for a two-way set-associative cache of the same size is 3.0%. Average memory-access time is

    Average memory-access time = Hit time + Miss rate * Miss penalty
Thus, the time for each organization is

    Average memory-access time_1-way = 20 + .039 * 200 = 27.8 ns
    Average memory-access time_2-way = 20 * 1.085 + .030 * 200 = 27.7 ns

The average memory-access time is better for the two-way set-associative cache. CPU performance is

    CPU time = IC * (CPI_execution + Misses/Instruction * Miss penalty) * Clock cycle time
             = IC * (CPI_execution * Clock cycle time
                     + Memory accesses/Instruction * Miss rate * Miss penalty * Clock cycle time)

Substituting 200 ns for (Miss penalty * Clock cycle time), the performance of each cache organization is

    CPU time_1-way = IC * (1.5 * 20 + 1.3 * 0.039 * 200) = 40.1 * IC
    CPU time_2-way = IC * (1.5 * 20 * 1.085 + 1.3 * 0.030 * 200) = 40.4 * IC

and relative performance is

    CPU time_2-way / CPU time_1-way = 40.4 * Instruction count / 40.1 * Instruction count = 1.01

In contrast to the results of average access-time comparison, the direct-mapped cache leads to slightly better performance. Since CPU time is our bottom-line evaluation (and direct mapped is simpler to build), the preferred cache is direct mapped in this example. (See the fallacy on page 451 for more on this kind of trade-off.)
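The example's arithmetic also fits in a short sketch (ours; every constant below comes from the example above):

    clock = 20.0          # ns per cycle, direct-mapped organization
    stretch = 1.085       # 2-way clock stretched 8.5% for the select multiplexer
    penalty = 200.0       # miss penalty in ns for either organization

    amat_1way = clock + 0.039 * penalty                        # 27.8 ns
    amat_2way = clock * stretch + 0.030 * penalty              # 27.7 ns

    time_1way = 1.5 * clock + 1.3 * 0.039 * penalty            # 40.1 ns per instruction
    time_2way = 1.5 * clock * stretch + 1.3 * 0.030 * penalty  # 40.4 ns per instruction
    print(amat_1way, amat_2way, time_2way / time_1way)
    # ratio is about 1.005 unrounded; the text's rounded times give 1.01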
The Three Sources of Cache Misses: Compulsory, Capacity, and Conflicts

An intuitive model of cache behavior attributes all misses to one of three sources:

- Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.

- Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
- Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses.

Figure 8.12 shows the relative frequency of cache misses, broken down by the "three Cs." To show the benefit of associativity, conflict misses are divided into misses caused by each decrease in associativity. The categories are labeled n-way, meaning the misses caused by going to the lower level of associativity from the next one above. Here are the four categories:

8-way: from fully associative (no conflicts) to 8-way associative
4-way: from 8-way associative to 4-way associative
2-way: from 4-way associative to 2-way associative
1-way: from 2-way associative to 1-way associative (direct mapped)

Figure 8.13 (page 422) presents the same data graphically. The top graph shows absolute miss rates; the bottom graph plots the percentage of all the misses by cache size.

Having identified the three Cs, what can a computer designer do about them? Conceptually, conflicts are the easiest: Fully associative placement avoids all conflict misses. Associativity is expensive in hardware, however, and may slow access time (see the example above or the second fallacy in Section 8.10), leading to lower overall performance. There is little to be done about capacity except to buy larger memory chips. If the upper-level memory is much smaller than what is needed for a program, and a significant percentage of the time is spent moving data between two levels in the hierarchy, the memory hierarchy is said to thrash. Because so many replacements are required, thrashing means the machine runs close to the speed of the lower-level memory, or maybe even slower due to the miss overhead. Making blocks larger reduces the number of compulsory misses, but it can increase conflict misses.

The three Cs give insight into the cause of misses, but this simple model has its limits. For example, increasing cache size reduces conflict misses as well as capacity misses, since a larger cache spreads out references. Thus, a miss might move from one category to the other as parameters change. The model also ignores replacement policy, since it is difficult to model and since, in general, it is of less significance. In specific circumstances the replacement policy can actually lead to anomalous behavior, such as poorer miss rates for larger associativity, which is directly contradictory to the three Cs model.
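One common way to attribute misses to the three Cs in a simulator is: compulsory = first references, capacity = the additional misses of a fully associative LRU cache of equal size, and conflict = whatever the real organization adds beyond that. The sketch below (ours, not the book's exact methodology; the figure's finer n-way breakdown repeats this comparison at each associativity level) follows that recipe, with real_misses being the miss count of the organization under study on the same trace:

    from collections import OrderedDict

    def classify_misses(trace, real_misses, n_blocks):
        seen = set()                 # every block-frame address ever referenced
        fa_cache = OrderedDict()     # fully associative LRU cache of n_blocks
        compulsory = capacity = 0
        for addr in trace:
            if addr not in seen:
                seen.add(addr)
                compulsory += 1
            elif addr not in fa_cache:
                capacity += 1        # seen before, but evicted even with full associativity
            fa_cache[addr] = None
            fa_cache.move_to_end(addr)
            if len(fa_cache) > n_blocks:
                fa_cache.popitem(last=False)   # evict the LRU block
        # conflict can even come out negative, the anomaly noted above
        conflict = real_misses - (compulsory + capacity)
        return compulsory, capacity, conflict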
Cache size   Degree        Total       Miss-rate components (relative percent)
             associative   miss rate   (Sum = 100% of total miss rate)
                                       Compulsory      Capacity        Conflict
1 KB         1-way         0.191       0.009    5%     0.141   73%     0.042   22%
1 KB         2-way         0.161       0.009    6%     0.141   87%     0.012    7%
1 KB         4-way         0.152       0.009    6%     0.141   92%     0.003    2%
1 KB         8-way         0.149       0.009    6%     0.141   94%     0.000    0%
2 KB         1-way         0.148       0.009    6%     0.103   70%     0.036   24%
2 KB         2-way         0.122       0.009    7%     0.103   84%     0.010    8%
2 KB         4-way         0.115       0.009    8%     0.103   90%     0.003    2%
2 KB         8-way         0.113       0.009    8%     0.103   91%     0.001    1%
4 KB         1-way         0.109       0.009    8%     0.073   67%     0.027   25%
4 KB         2-way         0.095       0.009    9%     0.073   77%     0.013   14%
4 KB         4-way         0.087       0.009   10%     0.073   84%     0.005    6%
4 KB         8-way         0.084       0.009   11%     0.073   87%     0.002    2%
8 KB         1-way         0.087       0.009   10%     0.052   60%     0.026   30%
8 KB         2-way         0.069       0.009   13%     0.052   75%     0.008   12%
8 KB         4-way         0.065       0.009   14%     0.052   80%     0.004    6%
8 KB         8-way         0.063       0.009   14%     0.052   83%     0.002    3%
16 KB        1-way         0.066       0.009   14%     0.038   57%     0.019   29%
16 KB        2-way         0.054       0.009   17%     0.038   70%     0.007   13%
16 KB        4-way         0.049       0.009   18%     0.038   76%     0.003    6%

FIGURE 8.12 Total miss rate for each size cache and percentage of each according to the three Cs.