COMPUTER ARCHITECTURE: A QUANTITATIVE APPROACH

DAVID A. PATTERSON
JOHN L. HENNESSY
Sponsoring Editor  Bruce Spatz
Production Manager  Shirley Jowell
Technical Writer  Walker Cunningham
Text Design  Gary Head
Cover Design  David Lance Goines
Copy Editor  Linda Medoff
Proofreader  Paul Medoff
Computer Typesetting and Graphics  Fifth Street Computer Services

Library of Congress Cataloging-in-Publication Data

Patterson, David A.
    Computer architecture : a quantitative approach / David A.
Patterson, John L. Hennessy.
        p. cm.
    Includes bibliographical references.
    ISBN 1-55860-069-8
    1. Computer architecture. 2. Electronic digital computers--Design
and construction. I. Hennessy, John L. II. Title.
QA76.9.A73P377 1990
004.2'2--dc20                                             89-85227
                                                               CIP

Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, recording, or otherwise—without the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained herein is copyrighted by the publisher and may not be incorporated in other publications or distributed by media without formal acknowledgement and written consent from the publisher. Use of the DLX in other publications for educational purposes is encouraged and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to the editorial offices of Morgan Kaufmann Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (US) per correction. Electronic mail can be sent to bugs@vsop.stanford.edu.

INSTRUCTOR SUPPORT: For information on classroom software and other instructor materials available to adopters, please contact the editorial offices of Morgan Kaufmann Publishers, Inc.
8.3 Caches

Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language, Second College Edition
Cache is the name first chosen to represent the level of the memory hierarchy between the CPU and main memory, and that is the dominant use of the term. While the concept of caches is younger than the IBM 360 architecture, caches appear today in every class of computer and in some computers more than once. In fact, the word has become so popular that it has replaced "buffer" in many computer-science circles.

The general terms defined in the prior section can be used for caches, although the word line is often used instead of block. Figure 8.5 shows the typical range of memory-hierarchy parameters for caches.
Block (line) size      4 - 128 bytes
Hit time               1 - 4 clock cycles (normally 1)
Miss penalty           8 - 32 clock cycles
   (Access time)       (6 - 10 clock cycles)
   (Transfer time)     (2 - 12 clock cycles)
Miss rate              1% - 20%
Cache size             1 KB - 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in workstations and minicomputers.
Now let's examine caches in more detail by answering the four memory-hierarchy questions.
Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache organization:

- If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (block-frame address) modulo (number of blocks in cache).

- If a block can be placed anywhere in the cache, the cache is said to be fully associative.
`
`
`Qualcomm, Ex. 1018. Page 3
`
`
`Qualcomm, Ex. 1018, Page 3
`
`

`

`409
`
`
`u
`
- If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of two or more blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within the set. The set is usually chosen by bit selection; that is, (block-frame address) modulo (number of sets in cache). If there are n blocks in a set, the cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative and a fully associative cache with m blocks could be called m-way set associative. Figure 8.6 shows where block 12 can be placed in a cache according to the block-placement policy.
[Figure 8.6: diagram showing where block 12 can go in an 8-block cache under fully associative, direct-mapped, and two-way set-associative placement; see the caption below.]
FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-associative organization has 4 sets with 2 blocks per set, called two-way set associative. (Real caches contain hundreds of blocks and real memories contain hundreds of thousands of blocks.) Assume that there is nothing in the cache and that the block-frame address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the 8 blocks of the cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.
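To see the placement rules as code, here is a minimal C sketch (our own illustration, not the book's; only the 8-block cache and the modulo mappings come from Figure 8.6):

    #include <stdio.h>

    #define CACHE_BLOCKS 8   /* the cache of Figure 8.6 holds 8 blocks */

    /* Direct mapped: exactly one candidate position. */
    unsigned direct_mapped(unsigned block_frame_addr) {
        return block_frame_addr % CACHE_BLOCKS;
    }

    /* n-way set associative: any block within one set, chosen by
       bit selection (block-frame address modulo number of sets). */
    void set_associative(unsigned block_frame_addr, unsigned ways,
                         unsigned *first, unsigned *last) {
        unsigned sets = CACHE_BLOCKS / ways;
        unsigned set  = block_frame_addr % sets;
        *first = set * ways;
        *last  = set * ways + ways - 1;
    }

    int main(void) {
        unsigned first, last;
        printf("direct mapped: block %u\n", direct_mapped(12));
        set_associative(12, 2, &first, &last);
        printf("two-way set associative: blocks %u..%u\n", first, last);
        /* Fully associative is the ways == CACHE_BLOCKS case:
           block 12 may go in any of the 8 blocks. */
        return 0;
    }

For block-frame address 12 this prints block 4 for direct mapped (12 modulo 8) and blocks 0 and 1 for two-way set associative (set 0 = 12 modulo 4), matching the figure.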
Q2: How Is a Block Found If It Is in the Cache?

Caches include an address tag on each block that gives the block-frame address. The tag of every cache block that might contain the desired information is checked to see if it matches the block-frame address from the CPU. Figure 8.7 gives an example. Because speed is of the essence, all possible tags are searched in parallel; serial search would make set associativity counterproductive.
[Figure 8.7: diagram of the tag searches for block-frame address 12 under fully associative, direct-mapped, and set-associative placement; see the caption below.]
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is found in cache block 6 in this example. In direct-mapped placement there is only one cache block where memory block 12 can be found. In set-associative placement, with 4 sets, memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are checked. In this case the data is found in cache block 1. Speed of cache access dictates that searching must be performed in parallel for fully associative and set-associative mappings.
There must be a way to know that a cache block does not have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the tag memory. One tag is required for each block. An advantage of larger block sizes is that the tag overhead per cache entry becomes a smaller fraction of the total cost of the cache.
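As a rough illustration of that effect (the arithmetic here is ours, not the book's), this C sketch computes the tag-storage overhead of a hypothetical 64-KB direct-mapped cache with 32-bit addressing as the block size grows:

    #include <stdio.h>

    static int log2u(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

    int main(void) {
        const unsigned addr_bits = 32, cache_bytes = 64 * 1024;
        for (unsigned block = 4; block <= 128; block *= 2) {
            unsigned blocks      = cache_bytes / block;
            unsigned offset_bits = log2u(block);
            unsigned index_bits  = log2u(blocks);
            unsigned tag_bits    = addr_bits - index_bits - offset_bits;
            /* one valid bit per tag; overhead relative to data bits */
            double overhead = (double)(tag_bits + 1) / (block * 8);
            printf("%4u-byte blocks: %2u tag bits, %.1f%% overhead\n",
                   block, tag_bits, 100.0 * overhead);
        }
        return 0;
    }

The tag stays 16 bits (plus a valid bit) in this configuration, so each doubling of the block size roughly halves the tag overhead as a fraction of the data stored.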
Before proceeding to the next question, let's explore the relationship of the CPU address to the cache. Figure 8.8 shows how an address is divided into three fields to find data in a set-associative cache: the block-offset field used to select the desired data from the block, the index field used to select the set, and the tag field used for the comparison. While the comparison could be made on more of the address than the tag, there is no need:
- Checking the index would be redundant, since it was used to select the set to be checked (an address stored in set 0, for example, must have 0 in the index field or it couldn't be stored in set 0).

- The offset is unnecessary in the comparison because all block offsets match and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with increasing associativity.
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set and the index is used to select the set. The block offset is the address of the desired data within the block.
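The division is easy to express in code. A small C sketch (the function names are ours; the field widths match the VAX-11/780 example later in this section, with 8-byte blocks and 512 sets):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BYTES 8     /* 3 block-offset bits */
    #define SETS        512   /* 9 index bits        */

    /* Only the tag is stored and compared on a lookup: the index already
       selected the set, and all offsets within a block match. */
    static uint32_t offset_of(uint32_t addr) { return addr % BLOCK_BYTES; }
    static uint32_t index_of(uint32_t addr)  { return (addr / BLOCK_BYTES) % SETS; }
    static uint32_t tag_of(uint32_t addr)    { return addr / (BLOCK_BYTES * SETS); }

    int main(void) {
        uint32_t addr = 10895;   /* arbitrary example address */
        printf("tag %u, index %u, offset %u\n",
               tag_of(addr), index_of(addr), offset_of(addr));
        return 0;
    }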
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't, then it would be easy to select which block to replace. Alas, the high hit rate of caches means that the overwhelming decision is between blocks that have valid data.

A benefit of direct-mapped placement is that hardware decisions are simplified. In fact, so simple that there is no choice: Only one block is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are several blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:
- Random—To spread allocation uniformly, candidate blocks are randomly selected. Some systems use a scheme for spreading data across a set of blocks in a pseudorandomized manner to get reproducible behavior, which is particularly useful during hardware debugging.

- Least-recently used (LRU)—To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least-recently used. Figure 8.9 (page 412) shows which block is the least-recently used for a sequence of block-frame addresses in a fully associative memory hierarchy.
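As a sketch of how the two policies pick a victim in one set of a four-way cache (our own C illustration; real hardware approximates LRU rather than keeping full timestamps):

    #include <stdlib.h>

    #define WAYS 4

    typedef struct {
        int      valid[WAYS];
        unsigned tag[WAYS];
        unsigned last_used[WAYS];   /* time of most recent access */
    } set_t;

    /* Random: any block of the set may be the victim. */
    int victim_random(void) { return rand() % WAYS; }

    /* LRU: the block with the oldest (smallest) access time. */
    int victim_lru(const set_t *s) {
        int v = 0;
        for (int i = 1; i < WAYS; i++)
            if (s->last_used[i] < s->last_used[v]) v = i;
        return v;
    }

    int main(void) {
        set_t s = { {1, 1, 1, 1}, {10, 11, 12, 13}, {5, 2, 9, 7} };
        /* Block 1 (last used at time 2) has been idle longest. */
        return victim_lru(&s) == 1 ? 0 : 1;
    }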
[Figure 8.9: a sequence of block-frame addresses with the least-recently used block number shown below each reference; see the caption below.]
A virtue of random is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 8.10 shows the difference in miss rates between LRU and random replacement. Replacement policy plays a greater role in smaller caches than in larger caches where there are more choices of what to replace.

FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in a fully associative memory hierarchy. This assumes that there are 4 blocks and that in the beginning the LRU block is number 0. The LRU block number is shown below each new block reference. Another policy, first-in-first-out (FIFO), simply discards the block used N unique accesses before, independent of its reference pattern in the last N - 1 references. Random replacement generally outperforms FIFO and it is easier to implement.
Associativity:     2-way             4-way             8-way
Size            LRU    Random     LRU    Random     LRU    Random
16 KB           5.18%  5.69%      4.67%  5.29%      4.39%  4.96%
64 KB           1.88%  2.01%      1.54%  1.66%      1.39%  1.53%
256 KB          1.15%  1.17%      1.13%  1.13%      1.12%  1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement for several sizes and associativities. This data was collected for a block size of 16 bytes using one of the VAX traces containing user and operating system code (SAVED). This trace is included in the software supplement for course use. There is little difference between LRU and random for larger size caches in this trace.
Q4: What Happens on a Write?
Reads dominate cache accesses. All instruction accesses are reads, and most instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9% stores and 17% loads for four DLX programs, making writes less than 10% of the memory traffic. Making the common case fast means optimizing caches for reads, but Amdahl's Law reminds us that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read at the same time that the tag is read and compared, so the block read begins as soon as the block-frame address is available. If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there is no benefit—but also no harm.
Such is not the case for writes. The processor specifies the size of the write, usually between 1 and 8 bytes; only that portion of a block can be changed. In general this means a read-modify-write sequence of operations on the block: read the original block, modify one portion, and write the new block value. Moreover, modifying a block cannot begin until the tag is checked to see if it is a hit. Because tag checking cannot occur in parallel, then, writes normally take longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are two basic options when writing to the cache:

- Write through (or store through)—The information is written to both the block in the cache and to the block in the lower-level memory.

- Write back (also called copy back or store in)—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

Write-back cache blocks are called clean or dirty, depending on whether the information in the cache differs from that in lower-level memory. To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether or not the block was modified while in the cache. If it wasn't, the block is not written, since the lower level has the same information as the cache.
Both write back and write through have their advantages. With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since every write doesn't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses don't result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that main memory has the most current copy of the data. This is important in multiprocessors and for I/O, which we shall examine in Section 8.8. Hence, multiprocessors want write back to reduce the memory traffic per processor and write through to keep the cache and memory consistent.

When the CPU must wait for writes to complete during write throughs, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue while the memory is updated. As we shall see in Section 8.8, write stalls can occur even with write buffers.
There are two options on a write miss:

- Write allocate (also called fetch on write)—The block is loaded, followed by the write-hit actions above. This is similar to a read miss.

- No write allocate (also called write around)—The block is modified in the lower level and not loaded into the cache.
While either write-miss policy could be used with write through or write back, generally write-back caches use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no write allocate (since subsequent writes to that block will still have to go to memory).
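The two pairings are easy to contrast in code. This C sketch is a behavioral stand-in (not any real cache's logic), using a direct-mapped cache of one-word blocks:

    #include <stdbool.h>
    #include <stdio.h>

    enum policy { WRITE_THROUGH_NO_ALLOCATE, WRITE_BACK_ALLOCATE };

    typedef struct { bool valid, dirty; unsigned tag, data; } block_t;

    unsigned memory[1024];   /* stand-in for lower-level memory */
    block_t  cache[64];      /* direct mapped, one word per block */

    void write_word(enum policy p, unsigned addr, unsigned value) {
        block_t *b = &cache[addr % 64];
        bool hit = b->valid && b->tag == addr / 64;

        if (p == WRITE_THROUGH_NO_ALLOCATE) {
            if (hit) b->data = value;  /* update the cached copy on a hit */
            memory[addr] = value;      /* always write memory; on a miss,
                                          write around the cache */
        } else {                       /* write back + write allocate */
            if (!hit) {                /* allocate: fetch block, evict old */
                if (b->valid && b->dirty)      /* write back dirty victim */
                    memory[b->tag * 64 + addr % 64] = b->data;
                b->valid = true;
                b->tag   = addr / 64;
                b->data  = memory[addr];
            }
            b->data  = value;          /* write only the cached copy */
            b->dirty = true;
        }
    }

    int main(void) {
        write_word(WRITE_BACK_ALLOCATE, 5, 42);
        printf("memory[5] = %u\n", memory[5]);  /* still 0: stale until
                                                   the block is replaced */
        return 0;
    }

The deferred memory update in the write-back path is exactly why the dirty bit, and the consistency concerns mentioned above, exist.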
An Example Cache: The VAX-11/780 Cache

To give substance to these ideas, Figure 8.11 shows the organization of the cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte blocks with two-way-set-associative placement, random replacement, write through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11. (The five steps are shown as circled numbers.) The address coming into the cache is divided into two fields: the 29-bit block-frame address and 3-bit block offset. The block-frame address is further divided into an address tag and cache index. Step 1 shows this division.

The cache index selects the set to be tested to see if the block is in the cache. (A set is one block from each bank in Figure 8.11.) The size of the index depends on cache size, block size, and set associativity. In this case, a 9-bit index results:
    Blocks per bank = Cache size / (Block size * Set associativity) = 8192 / (8 * 2) = 512 = 2^9
In a two-way-set-associative cache, the index is sent to both banks. This is step 2.

After reading an address tag from each bank, the tag portion of the block-frame address is compared to the tags. This is step 3 in the figure. To be sure the tag contains valid information, the valid bit must be set, or the results of the comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to select the block from the matching set. Why can't both tags match? It is the job of the replacement algorithm to make sure that an address appears in only one block. To reduce the hit time, the data is read at the same time as the address tags; thus, by the time the block multiplexer is ready, the data is also ready.

This step is needed in set-associative caches, but it can be omitted from direct-mapped caches since there is no selection to be made. The multiplexer used in this step can be on the critical timing path, endangering the clock cycle time of the CPU. (The example on pages 418-419 and the fallacy on page 481 explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a single CPU clock cycle.

What happens on a miss? The cache sends a stall signal to the CPU telling it to wait, and two words (eight bytes) are read from memory. That takes 6 clock cycles on the VAX-11/780 (ignoring bus interference).
When the data arrives, the cache must pick a block to replace; the VAX-11/780 selects one of the two sets at random. Replacing a block means updating the data, the address tag, and the valid bit. Once this is done, the cache goes through a regular hit cycle and returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If the word to be written is in the cache, the first four steps are the same. The next step is to write the data in the block, then write the changed-data portion into the cache.
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order of occurrence, label this organization. The line from memory to the cache is used on a miss to load the cache. Such multiplexing is not needed in a direct-mapped cache. Note that the offset is connected to chip select of the data SRAMs to allow the proper words to be sent to the 2:1 multiplexer.
The VAX-11/780 uses no write allocate. Consequently, on a write miss the CPU writes "around" the cache to lower-level memory and does not affect the cache.

Since this is a write-through cache, the word is also sent to a one-word write buffer. If the write buffer is empty, the word and full address are written in the buffer, and we are finished. The CPU continues working while the write buffer writes the word to memory. If the buffer is full, the cache (and CPU) must wait until the buffer is empty.
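A behavioral sketch of that one-word write buffer (our own model; in hardware the buffered word drains to memory in the background while the CPU proceeds):

    #include <stdbool.h>

    typedef struct { bool full; unsigned addr, word; } write_buffer_t;

    /* Returns 1 if the CPU had to stall for the buffer to drain first. */
    int buffer_write(write_buffer_t *wb, unsigned addr, unsigned word,
                     void (*write_memory)(unsigned, unsigned)) {
        int stalled = 0;
        if (wb->full) {                       /* buffer busy: wait for it */
            write_memory(wb->addr, wb->word);
            stalled = 1;
        }
        wb->addr = addr;                      /* word and full address are */
        wb->word = word;                      /* latched; the CPU goes on  */
        wb->full = true;
        return stalled;
    }

    static void to_memory(unsigned a, unsigned w) { (void)a; (void)w; }

    int main(void) {
        write_buffer_t wb = { false, 0, 0 };
        return buffer_write(&wb, 8, 1, to_memory);  /* no stall: buffer empty */
    }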
Cache Performance

CPU time can be divided into the clock cycles the CPU spends executing the program and the clock cycles the CPU spends waiting for the memory system. Thus,

    CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) * Clock cycle time

To simplify evaluation of cache alternatives, sometimes designers assume that all memory stalls are due to the cache. This is true for many machines; for machines where this is not true, the cache still dominates stalls that are not exclusively due to the cache. We use this simplifying assumption here, but it is important to account for all memory stalls when calculating final performance.

The formula above raises the question whether the clock cycles for a cache access should be considered part of CPU-execution clock cycles or part of memory-stall clock cycles. While either convention is defensible, the most widely accepted is to include hit clock cycles in CPU-execution clock cycles.

Memory-stall clock cycles can then be defined in terms of the number of memory accesses per program, miss penalty (in clock cycles), and miss rate for reads and writes:
    Memory-stall clock cycles =
          Reads/Program * Read miss rate * Read miss penalty
        + Writes/Program * Write miss rate * Write miss penalty

We simplify the complete formula by combining the reads and writes together:

    Memory-stall clock cycles = Memory accesses/Program * Miss rate * Miss penalty

Factoring instruction count (IC) from execution time and memory-stall cycles, we now get a CPU-time formula that includes memory accesses per instruction, miss rate, and miss penalty:

    CPU time = IC * (CPI_Execution + Memory accesses/Instruction * Miss rate * Miss penalty) * Clock cycle time
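Packaged as a function, the formula reads as follows (our wrapper; the parameters mirror the terms above, and the numbers in main are made up purely for illustration):

    #include <stdio.h>

    double cpu_time(double ic, double cpi_execution,
                    double accesses_per_instr, double miss_rate,
                    double miss_penalty_cycles, double clock_cycle) {
        double stall_cpi = accesses_per_instr * miss_rate * miss_penalty_cycles;
        return ic * (cpi_execution + stall_cpi) * clock_cycle;
    }

    int main(void) {
        /* e.g., 1M instructions, CPI 2.0, 1.3 accesses per instruction,
           2% miss rate, 20-cycle miss penalty, 40-ns clock cycle */
        printf("%g seconds\n", cpu_time(1e6, 2.0, 1.3, 0.02, 20, 40e-9));
        return 0;
    }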
Some designers prefer measuring miss rate as misses per instruction rather than misses per memory reference:

    Misses/Instruction = Memory accesses/Instruction * Miss rate

The advantage of this measure is that it is independent of the hardware implementation. For example, the VAX-11/780 instruction unit can make repeated references to a single byte (see Section 8.7), which can artificially reduce the miss rate if measured as misses per memory reference rather than per instruction executed. The drawback is that this measure is architecture dependent, thus it is most popular with architects working with a single computer family. They then use this version of the CPU-time formula:

    CPU time = IC * (CPI_Execution + Misses/Instruction * Miss penalty) * Clock cycle time

We can now explore the consequences of caches on performance.
Example

Let's use the VAX-11/780 as a first example. The cache miss penalty is 6 clock cycles, and all instructions normally take 8.5 clock cycles (ignoring memory stalls). Assume the miss rate is 11%, and there is an average of 3.0 memory references per instruction. What is the impact on performance when behavior of the cache is included?

Answer

    CPU time = IC * (CPI_Execution + Memory-stall clock cycles/Instruction) * Clock cycle time

The performance, including cache misses, is

    CPU time with cache = IC * (8.5 + 3.0 * 11% * 6) * Clock cycle time
                        = Instruction count * 10.5 * Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 8.5 to 10.5. Hence, the impact of the memory hierarchy is to stretch the CPU time by 24%.
Example

Let's now calculate the impact on performance when behavior of the cache is included on a machine with a lower CPI. Assume that the cache miss penalty is 10 clock cycles and, on average, instructions take 1.5 clock cycles; the miss rate is 11%, and there is an average of 1.4 memory references per instruction.

Answer

    CPU time = IC * (CPI_Execution + Memory-stall clock cycles/Instruction) * Clock cycle time
Making the same assumptions as in the previous example on cache hits, the performance, including cache misses, is

    CPU time with cache = IC * (1.5 + 1.4 * 11% * 10) * Clock cycle time
                        = Instruction count * 3.0 * Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 1.5 to 3.0. Including cache behavior doubles execution time.
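Plugging both examples into the formula confirms the arithmetic (all values from the text):

    #include <stdio.h>

    int main(void) {
        double cpi1 = 8.5 + 3.0 * 0.11 * 6;    /* VAX-11/780: 10.48, ~10.5     */
        double cpi2 = 1.5 + 1.4 * 0.11 * 10;   /* low-CPI machine: 3.04, ~3.0  */
        printf("effective CPI: %.2f and %.2f\n", cpi1, cpi2);
        return 0;
    }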
As these examples illustrate, cache-behavior penalties range from significant to enormous. Furthermore, cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock:

1. The lower the CPI, the more pronounced the impact is.

2. Independent of the CPU, main memories have similar memory-access times, since they are built from the same memory chips. When calculating CPI, the cache miss penalty is measured in CPU clock cycles needed for a miss. Therefore, a higher CPU clock rate leads to a larger miss penalty, even if main memories are the same speed.

The importance of the cache for CPUs with low CPI and high clock rates is thus greater; and, consequently, greater is the danger of neglecting cache behavior in assessing performance of such machines.

While minimizing average memory-access time is a reasonable goal and we will use it in much of this chapter, keep in mind that the final goal is to reduce CPU execution time.

    Average memory-access time = Hit time + Miss rate * Miss penalty
Example

What is the impact of two different cache organizations on the performance of a CPU? Assume that the CPI is normally 1.5 with a clock cycle time of 20 ns, that there are 1.3 memory references per instruction, and that the size of both caches is 64 KB. One cache is direct mapped and the other is two-way set associative. Since the speed of the CPU is tied directly to the speed of the caches, assume the CPU clock cycle time must be stretched 8.5% to accommodate the selection multiplexer of the set-associative cache (step 4 in Figure 8.11 on page 415). To the first approximation, the cache miss penalty is 200 ns for either cache organization. (In practice it must be rounded up or down to an integer number of clock cycles.) First, calculate the average memory-access time, and then CPU performance.

Answer

Figure 8.12 on page 421 shows that the miss rate of a direct-mapped 64-KB cache is 3.9% and the miss rate for a two-way-set-associative cache of the same size is 3.0%. Average memory-access time is
    Average memory-access time = Hit time + Miss rate * Miss penalty

Thus, the time for each organization is

    Average memory-access time 1-way = 20 + .039 * 200 = 27.8 ns
    Average memory-access time 2-way = 20 * 1.085 + .030 * 200 = 27.7 ns

The average memory-access time is better for the two-way-set-associative cache.

CPU performance is

    CPU time = IC * (CPI_Execution + Misses/Instruction * Miss penalty) * Clock cycle time
             = IC * (CPI_Execution * Clock cycle time + Memory accesses/Instruction * Miss rate * Miss penalty * Clock cycle time)

Substituting 200 ns for (Miss penalty * Clock cycle time), the performance of each cache organization is

    CPU time 1-way = IC * (1.5 * 20 + 1.3 * 0.039 * 200) = 40.1 * IC
    CPU time 2-way = IC * (1.5 * 20 * 1.085 + 1.3 * 0.030 * 200) = 40.4 * IC

and relative performance is

    CPU time 2-way / CPU time 1-way = (40.4 * Instruction count) / (40.1 * Instruction count) = 1.01

In contrast to the results of the average access-time comparison, the direct-mapped cache leads to slightly better performance. Since CPU time is our bottom-line evaluation (and direct mapped is simpler to build), the preferred cache is direct mapped in this example. (See the fallacy on page 481 for more on this kind of trade-off.)
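The example's arithmetic, reproduced in C (all values from the text; the text's figures are these results rounded to one decimal place):

    #include <stdio.h>

    int main(void) {
        double clk1 = 20.0;          /* ns, direct-mapped clock cycle     */
        double clk2 = 20.0 * 1.085;  /* two-way cycle stretched by 8.5%   */
        double penalty = 200.0;      /* ns miss penalty, either cache     */
        double refs = 1.3, cpi = 1.5;

        double amat1 = clk1 + 0.039 * penalty;   /* 27.80 ns */
        double amat2 = clk2 + 0.030 * penalty;   /* 27.70 ns */

        double t1 = cpi * clk1 + refs * 0.039 * penalty;  /* 40.14 ns/instr */
        double t2 = cpi * clk2 + refs * 0.030 * penalty;  /* 40.35 ns/instr */

        printf("AMAT %.2f vs %.2f ns; CPU time %.2f vs %.2f ns/instr\n",
               amat1, amat2, t1, t2);
        return 0;
    }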
The Three Sources of Cache Misses: Compulsory, Capacity, and Conflicts

An intuitive model of cache behavior attributes all misses to one of three sources:

- Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.

- Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.

- Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called interference misses.
Figure 8.12 shows the relative frequency of cache misses, broken down by the "three Cs." To show the benefit of associativity, conflict misses are divided into misses caused by each decrease in associativity. The categories are labeled n-way, meaning the misses caused by going to the lower level of associativity from the next one above. Here are the four categories:

8-way: from fully associative (no conflicts) to 8-way associative

4-way: from 8-way associative to 4-way associative

2-way: from 4-way associative to 2-way associative

1-way: from 2-way associative to 1-way associative (direct mapped)

Figure 8.13 (page 422) presents the same data graphically. The top graph shows absolute miss rates; the bottom graph plots percentage of all the misses by cache size.

Having identified the three Cs, what can a computer designer do about them? Conceptually, conflicts are the easiest: Fully associative placement avoids all conflict misses. Associativity is expensive in hardware, however, and may slow access time (see the example above or the second fallacy in Section 8.10), leading to lower overall performance. There is little to be done about capacity except to buy larger memory chips. If the upper-level memory is much smaller than what is needed for a program, and a significant percentage of the time is spent moving data between two levels in the hierarchy, the memory hierarchy is said to thrash. Because so many replacements are required, thrashing means the machine runs close to the speed of the lower-level memory, or maybe even slower due to the miss overhead. Making blocks larger reduces the number of compulsory misses, but it can increase conflict misses.

The three Cs give insight into the cause of misses, but this simple model has its limits. For example, increasing cache size reduces conflict misses as well as capacity misses, since a larger cache spreads out references. Thus, a miss might move from one category to the other as parameters change. The three Cs ignore replacement policy, since it is difficult to model and since, in general, it is of less significance. In specific circumstances the replacement policy can actually lead to anomalous behavior, such as poorer miss rates for larger associativity, which is directly contradictory to the three Cs model.
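One common way to measure the three Cs in a simulator (this sketch is our own, not the book's methodology) is to count a miss as compulsory if the block was never referenced before, as capacity if a fully associative LRU cache of equal size would also have missed, and as conflict otherwise:

    #include <stdio.h>
    #include <string.h>

    #define BLOCKS 4      /* cache size in blocks    */
    #define MAXBLK 1024   /* block addresses tracked */

    static int seen[MAXBLK];   /* ever referenced?            */
    static int fa[BLOCKS];     /* fully associative LRU cache */
    static int fa_n;
    static int dm[BLOCKS];     /* the direct-mapped cache     */

    /* Access the fully associative LRU cache; returns 1 on a hit. */
    static int fa_access(int b) {
        for (int i = 0; i < fa_n; i++)
            if (fa[i] == b) {                     /* hit: move to front */
                memmove(fa + 1, fa, i * sizeof fa[0]);
                fa[0] = b;
                return 1;
            }
        if (fa_n < BLOCKS) fa_n++;                /* miss: evict the LRU */
        memmove(fa + 1, fa, (fa_n - 1) * sizeof fa[0]);
        fa[0] = b;
        return 0;
    }

    int main(void) {
        int trace[] = {0, 4, 0, 4, 1, 2, 3, 0};   /* made-up addresses */
        int comp = 0, cap = 0, conf = 0;
        memset(dm, -1, sizeof dm);

        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            int b = trace[i];
            int fa_hit = fa_access(b);
            if (dm[b % BLOCKS] == b) continue;    /* hit in the real cache */
            if (!seen[b])     comp++;             /* first reference       */
            else if (!fa_hit) cap++;              /* would miss anyway     */
            else              conf++;             /* placement conflict    */
            seen[b] = 1;
            dm[b % BLOCKS] = b;
        }
        printf("compulsory %d, capacity %d, conflict %d\n", comp, cap, conf);
        return 0;
    }

On this trace, blocks 0 and 4 collide in the direct-mapped cache, so their repeat misses count as conflicts, while the final miss on block 0 would have occurred even with full associativity and counts against capacity.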
`

`

`
`
`
`
`MemoryAHierarchy Design
`
`
`
`flq
`0.009
`0.009
`0.009
`
`0.000
`0.009
`0.009
`0.009
`
`0.009
`0.009
`0.009
`0.009
`
`
`as.
`8%:
`9%
`10%
`
`01-03
`0.073
`0.073
`0.073
`
`_“-1541_ 94_9:._
`0.103
`100%
`0.103
`84‘}?
`0.103
`9096'
`
`91%
`67%-
`7'1"}
`84%
`
`1102— M are.
`1091-
`0.052
`600’:
`13%
`0.052
`75%
`14%
`0.052
`8091
`
`1130:
`5791
`70".?
`76%
`
`H9 __14€ar. __ 0.032_
`0.009
`1411
`0.033
`0.009
`17%
`0.038
`0.009
`18%
`0.038
`
`m 192..
`0.009
`18%
`0.009
`22%
`0.009
`23%
`
`_0_l£3
`0.006
`0.054
`0.049
`
`0._04_8_
`0.050
`0.041
`0.038
`0.038
`0.039
`0.030
`0.028
`
`11009
`0.009
`0.009
`0.009
`
`_n.-OB_?W:2
`0.028
`5592'
`0.028
`68%
`0.028
`73%
`
`2401 __0.0'L 74%
`23%
`0.019
`500'.
`30%
`0.019
`65471-
`32%
`0.019
`68‘??-
`
`68%
`1691'
`21%
`25%
`27“}
`
`3221»— 0.0_19_
`34%
`0.004
`46%
`0.004
`55%
`0.004
`
`59%
`0004
`
`_0.0_09
`0.009
`0.009
`0.009
`
`0.009
`
`Cache size
`
`_ AB
`4K3
`4KB
`4K8
`4K3
`8K8
`SKB
`SKB
`
`11KB
`I6 KB
`if: KB
`111 KB
`.—
`1r. KB
`
`32 110
`32 KB
`3: KB
`
`3: KB
`
`54 KB
`1.: KB
`
