`
`SAMSUNG EXHIBIT 1009
`Samsung v. Image Processing Techs.
`
`
`
`
`
`Computer
`Architecture
`A
`Quantitative
`Approach
`
`
`David A. Patterson
`UNIVERSITY OF CALIFORNIA AT BERKELEY
`
`John L. Hennessy
`STANFORD UNIVERSITY
`
`With a Contribution by
`David Goldberg
`Xerox Palo Alto Research Center
`
`MORGAN KAUFMANN PUBLISHERS, INC.
`SAN MATEO, CALIFORNIA
`
`
`
`
`
`
`
`
`Sponsoring Editor Bruce Spatz
`Production Manager Shirley Jowell
`Technical Writer Walker Cunningham
`Text Design Gary Head
`Cover Design David Lance Goines
Copy Editor Linda Medoff
`Proofreader Paul Medoff
`Computer Typesetting and Graphics Fifth Street Computer Services
`
Library of Congress Cataloging-in-Publication Data
Patterson, David A.
    Computer architecture : a quantitative approach / David A.
Patterson, John L. Hennessy.
        p. cm.
    Includes bibliographical references.
    ISBN 1-55860-069-8
    1. Computer architecture.  I. Hennessy, John L.  II. Title.
QA76.9.A73P377 1990
004.2'2--dc20                                               89-85227
                                                                 CIP
`
`Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
`Order from: P.O. Box 50490, Palo Alto, CA 94303-9953
`©1990 by Morgan Kaufmann Publishers, Inc.
`All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means—electronic, mechanical, recording, or otherwise—without
the prior permission of the publisher.
All instruction sets and other design information of the DLX computer system contained
herein is copyrighted by the publisher and may not be incorporated in other publications
or distributed by media without formal acknowledgement and written consent from the
publisher. Use of the DLX in other publications for educational purposes is encouraged
and application for permission is welcomed.
ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or
intended for the authors should be addressed to the editorial offices of Morgan Kaufmann
Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged.
Any error sightings that are accepted for correction in subsequent printings will be
rewarded by the authors with a payment of $1.00 (U.S.) per correction upon availability
of the new printing. Electronic mail can be sent to bugs3@vsop.stanford.edu. (Please
include your full name and permanent mailing address.)
INSTRUCTOR SUPPORT: For information on classroom software and other instructor
materials available to adopters, please contact the editorial offices of Morgan Kaufmann
Publishers, Inc. (415) 578-9911.
`Fourth Printing
`
`
`
`
`
`
`
`To Andrea, Linda, and our four sons
`
`
`
`
Ideally one would desire an indefinitely large memory
capacity such that any particular ... word would be im-
mediately available. ... We are ... forced to recognize the
possibility of constructing a hierarchy of memories, each of
which has greater capacity than the preceding but which is
less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann,

Preliminary Discussion of the Logical Design
of an Electronic Computing Instrument (1946)
`
`
8.1  Introduction: Principle of Locality                           403
8.2  General Principles of Memory Hierarchy                        404
8.3  Caches                                                        408
8.4  Main Memory                                                   425
8.5  Virtual Memory                                                432
8.6  Protection and Examples of Virtual Memory                     438
8.7  More Optimizations Based on Program Behavior                  449
8.8  Advanced Topics—Improving Cache-Memory Performance            454
8.9  Putting It All Together: The VAX-11/780 Memory Hierarchy      475
8.10 Fallacies and Pitfalls                                        480
8.11 Concluding Remarks                                            484
8.12 Historical Perspective and References                         485
     Exercises                                                     490
`
`
`
`
8  Memory-Hierarchy Design

8.1  Introduction: Principle of Locality
`Computer pioneers correctly predicted that programmers would want unlimited
`amounts of fast memory. As the 90/10 rule in the first chapter predicts, most
`programs fortunately do not access all code or data uniformly (see Section 1.3,
`pages 8-12). The 90/10 rule can be restated as the principle of locality. This
`hypothesis, which holds that all programs favor a portion of their address space
`at any instant of time, has two dimensions:
`
■ Temporal locality (locality in time)—If an item is referenced, it will tend to
be referenced again soon.

■ Spatial locality (locality in space)—If an item is referenced, nearby items will
tend to be referenced soon.
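Both kinds of locality show up in even a tiny address trace. The sketch below is ours, not the book's, and its addresses are invented: it models a loop summing an array, where the loop counter's location recurs every iteration (temporal locality) and the array elements occupy adjacent addresses (spatial locality).

```python
def rereference_gaps(trace, addr):
    """Distances between successive references to addr; small gaps
    indicate temporal locality."""
    positions = [i for i, a in enumerate(trace) if a == addr]
    return [b - a for a, b in zip(positions, positions[1:])]

COUNTER = 1000                     # hypothetical address of the loop counter
trace = []
for i in range(4):                 # four loop iterations
    trace.append(COUNTER)          # reference the counter each time around
    trace.append(2000 + 4 * i)     # reference A[i]: consecutive word addresses

print(rereference_gaps(trace, COUNTER))  # [2, 2, 2] -- referenced again soon
print(trace[1::2])                 # [2000, 2004, 2008, 2012] -- nearby items
```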
`
A memory hierarchy is a natural reaction to locality and technology. The
principle of locality and the guideline that smaller hardware is faster yield the
concept of a hierarchy based on different speeds and sizes. Since slower memory
is cheaper, a memory hierarchy is organized into several levels—each smaller,
faster, and more expensive per byte than the level below. The levels of the
hierarchy subset one another; all data in one level is also found in the level
below, and all data in that lower level is found in the one below it, and so on
until we reach the bottom of the hierarchy.
`
`
`
`
`
This chapter includes a half-dozen examples that demonstrate how taking
advantage of the principle of locality can improve performance. All these
strategies map addresses from a larger memory to a smaller but faster memory.
As part of address mapping, the memory hierarchy is usually given the
responsibility of address checking; protection schemes used for doing this are
covered in this chapter. Later we will explore advanced memory-hierarchy topics
and trace a memory access through three levels of memory on the VAX-11/780.
`
8.2  General Principles of Memory Hierarchy
`
Before proceeding with examples of the memory hierarchy, let's define some
general terms applicable to all memory hierarchies. A memory hierarchy
normally consists of many levels, but it is managed between two adjacent levels
at a time. The upper level—the one closer to the processor—is smaller and faster
than the lower level (see Figure 8.1). The minimum unit of information that can
be either present or not present in the two-level hierarchy is called a block. The
size of a block may be either fixed or variable. If it is fixed, the memory size is a
multiple of that block size. Most of this chapter will be concerned with fixed
block sizes, although a variable block design is discussed in Section 8.6.
Success or failure of an access to the upper level is designated as a hit or a
miss: A hit is a memory access found in the upper level, while a miss means it is
not found in that level. Hit rate, or hit ratio—like a batting average—is the
fraction of memory accesses found in the upper level. This is sometimes
represented as a percentage. Miss rate (1.0 − hit rate) is the fraction of memory
accesses not found in the upper level.
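In code, these definitions are one line each. The access and hit counts below are invented for illustration:

```python
accesses = 1000        # total memory accesses (hypothetical figures)
hits = 950             # accesses found in the upper level

hit_rate = hits / accesses      # like a batting average
miss_rate = 1.0 - hit_rate      # fraction not found in the upper level

print(hit_rate)                 # 0.95
print(round(miss_rate, 2))      # 0.05
```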
`
`
`
FIGURE 8.1 Every pair of levels in the memory hierarchy can be thought of as
having an upper and lower level. Within each level the unit of information that is present
or not is called a block.
`
`
`
`
`
`
`
`
`
`
`
`
Since performance is the major reason for having a memory hierarchy, the
speed of hits and misses is important. Hit time is the time to access the upper
level of the memory hierarchy, which includes the time to determine whether the
access is a hit or a miss. Miss penalty is the time to replace a block in the upper
level with the corresponding block from the lower level, plus the time to deliver
this block to the requesting device (normally the CPU). The miss penalty is
further divided into two components: access time—the time to access the first
word of a block on a miss; and transfer time—the additional time to transfer the
remaining words in the block. Access time is related to the latency of the lower-
level memory, while transfer time is related to the bandwidth between the lower-
level and upper-level memories. (Sometimes access latency is used to mean
access time.)
The memory address is divided into pieces that access each part of the
hierarchy. The block-frame address is the higher-order piece of the address that
identifies a block at that level of the hierarchy (see Figure 8.2). The block-offset
address is the lower-order piece of the address and identifies an item within a
block. The size of the block-offset address is log2 (size of block); the size of the
block-frame address is then the size of the full address at this level less the size
of the block-offset address.
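This split is just a shift and a mask. A minimal sketch (the function name is ours):

```python
import math

def split_address(addr, block_size):
    """Return (block-frame address, block-offset address).
    block_size must be a power of two; the offset is log2(block_size) bits wide."""
    offset_bits = int(math.log2(block_size))
    return addr >> offset_bits, addr & (block_size - 1)

# The parameters of Figure 8.2: 512-byte blocks, so a 9-bit offset address.
frame, offset = split_address(305419896, 512)
print(frame, offset)   # 596523 120
```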
`
`
FIGURE 8.2 Example of the frame-address and offset-address portions of a 32-bit
lower-level memory address. In this case the block size is 512, making the size of the
offset address 9 bits and the size of the block-frame address 23 bits.
`
`Evaluating Performance of a Memory Hierarchy
`
Because instruction count is independent of the hardware, it is tempting to
evaluate CPU performance using that number. As we saw in Chapters 2 and 4,
however, such indirect performance measures have waylaid many a computer
designer. The corresponding temptation for evaluating memory-hierarchy
performance is to concentrate on miss rate, for it, too, is independent of the
speed of the hardware. As we shall see, miss rate can be just as misleading as
instruction count. A better measure of memory-hierarchy performance is the
average time to access memory:

Average memory-access time = Hit time + Miss rate × Miss penalty

The components of average access time can be measured either in absolute
time—say, 10 nanoseconds on a hit—or in the number of clock cycles that the
`
`
`
`
`
`
`
CPU waits for the memory—such as a miss penalty of 12 clock cycles.
Remember that average memory-access time is still an indirect measure of
performance; so while it is a better measure than miss rate, it is not a substitute
for execution time.
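The formula above is easy to apply directly. The numbers here are illustrative only (a 1-cycle hit, a 5% miss rate, and the 12-clock-cycle miss penalty mentioned above):

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = Hit time + Miss rate * Miss penalty."""
    return hit_time + miss_rate * miss_penalty

# All quantities in clock cycles; these values are assumed, not measured.
print(average_access_time(1, 0.05, 12))   # 1.6 clock cycles
```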
The relationship of block size to miss penalty and miss rate is shown
abstractly in Figure 8.3. These representations assume that the size of the upper-
level memory does not change. The access-time portion of the miss penalty is
not affected by block size, but the transfer time does increase with block size. If
access time is large, initially there will be little additional miss penalty relative
to access time as block size increases. However, increasing block size means
fewer blocks in the upper-level memory. Increasing block size lowers the miss
rate until the reduced misses of larger blocks (spatial locality) are outweighed
by the increased misses as the number of blocks shrinks (temporal locality).
`
`
`
`
`
FIGURE 8.3 Block size versus miss penalty and miss rate. The transfer-time portion of
the miss penalty obviously grows with increasing block size. For a fixed-size upper-level
memory, miss rates fall with increasing block size until so much of the block is not used that
it displaces useful information in the upper level, and miss rates begin to rise. The point on
the curve on the right where miss rates begin to rise with increasing block size is
sometimes called the pollution point.
`
`
`
`
FIGURE 8.4 The relationship between average memory-access time and block size.
`
`
`
`
`
`
`
The goal of a memory hierarchy is to reduce execution time, not misses.
Hence, computer designers favor a block size with the lowest average access
time rather than the lowest miss rate. This is related to the product of miss rate
and miss penalty, as Figure 8.4 shows abstractly. Of course, overall CPU
performance is the ultimate performance test, so care must be taken when
reducing average memory-access time to be sure that changes to clock cycle time
and CPI improve overall performance as well as average memory-access time.
`
`Implications of a Memory Hierarchy to the CPU
`
Processors designed without a memory hierarchy are simpler because memory
accesses always take the same amount of time. Misses in a memory hierarchy
mean that the CPU must be able to handle variable memory-access times. If the
miss penalty is on the order of tens of clock cycles, the processor normally waits
for the memory transfer to complete. If the miss penalty is thousands of
processor clock cycles, however, it is too wasteful to let the CPU sit idle; in
this case, the CPU is interrupted and used for another process during the miss
handling. Thus, avoiding the overhead of a long miss penalty means any
memory access can result in a CPU interrupt. This also means the CPU must be
able to recover any memory address that can cause such an interrupt, so that the
system can know what to transfer to satisfy the miss (see Section 5.6). When the
memory transfer is complete, the original process is restored, and the instruction
that missed is retried.
The processor must also have some mechanism to determine whether or not
information is in the top level of the memory hierarchy. This check happens on
every memory access and affects hit time; maintaining acceptable performance
usually requires the check to be implemented in hardware. The final implication
of a memory hierarchy is that the computer must have a mechanism to transfer
blocks between upper- and lower-level memory. If the block transfer is tens of
clock cycles, it is controlled by hardware; if it is thousands of clock cycles, it
can be controlled by software.
`
`Four Questions for Classifying Memory Hierarchies
`
The fundamental principles that drive all memory hierarchies allow us to use
terms that transcend the levels we are talking about. These same principles allow
us to pose four questions about any level of the hierarchy:

Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

These questions will help us gain an understanding of the different tradeoffs
demanded by the relationships of memories at different levels of a hierarchy.
`
`
`
`
`
`8.3 Caches
`
Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language,
Second College Edition (1976)

Cache is the name first chosen to represent the level of the memory hierarchy
between the CPU and main memory, and that is the dominant use of the term.
While the concept of caches is younger than the IBM 360 architecture, caches
appear today in every class of computer and in some computers more than once.
In fact, the word has become so popular that it has replaced "buffer" in many
computer-science circles.
The general terms defined in the prior section can be used for caches,
although the word line is often used instead of block. Figure 8.5 shows the
typical range of memory-hierarchy parameters for caches.
`
Block (line) size      4 – 128 bytes
Hit time               1 – 4 clock cycles (normally 1)
Miss penalty           8 – 32 clock cycles
  (Access time)        (6 – 10 clock cycles)
  (Transfer time)      (2 – 22 clock cycles)
Miss rate              1% – 20%
Cache size             1 KB – 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in 1990
workstations and minicomputers.
`
`Now let’s examine caches in more detail by answering the four memory-
`hierarchy questions.
`
Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache
organization:

■ If each block has only one place it can appear in the cache, the cache is said
to be direct mapped. The mapping is usually (block-frame address) modulo
(number of blocks in cache).

■ If a block can be placed anywhere in the cache, the cache is said to be fully
associative.
`
`
`
`
`
`
`
■ If a block can be placed in a restricted set of places in the cache, the cache is
said to be set associative. A set is a group of two or more blocks in the cache.
A block is first mapped onto a set, and then the block can be placed anywhere
within the set. The set is usually chosen by bit selection; that is, (block-frame
address) modulo (number of sets in cache). If there are n blocks in a set, the
cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a
continuum of levels of set associativity: Direct mapped is simply one-way set
associative and a fully associative cache with m blocks could be called m-way
set associative. Figure 8.6 shows where block 12 can be placed in a cache
according to the block-placement policy.
`
[Figure 8.6 diagram: fully associative—block 12 can go anywhere; direct
mapped—block 12 can go only into block 4 (12 mod 8); set associative—block 12
can go anywhere in set 0 (12 mod 4).]
`
FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-
associative organization has 4 sets with 2 blocks per set, called two-way set associative.
(Real caches contain hundreds of blocks and real memories contain hundreds of thousands
of blocks.) Assume that there is nothing in the cache and that the block-frame address in
question identifies lower-level block 12. The three options for caches are shown left to right.
In fully associative, block 12 from the lower level can go into any of the 8 blocks of the
cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set
associative, which has some of both features, allows the block to be placed anywhere in set
0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block
0 or block 1 of the cache.
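The three placement policies differ only in how many candidate blocks an address maps to. The sketch below is ours; it assumes the common numbering in which a set's blocks are contiguous, and reproduces Figure 8.6's example:

```python
def candidate_blocks(block_frame_addr, num_blocks, assoc):
    """Cache blocks where a block may be placed.
    assoc = 1 is direct mapped; assoc = num_blocks is fully associative."""
    num_sets = num_blocks // assoc
    set_no = block_frame_addr % num_sets          # bit selection
    return list(range(set_no * assoc, (set_no + 1) * assoc))

# Figure 8.6: an 8-block cache and block-frame address 12.
print(candidate_blocks(12, 8, 1))   # [4]     direct mapped (12 mod 8)
print(candidate_blocks(12, 8, 2))   # [0, 1]  two-way set associative (set 12 mod 4 = 0)
print(candidate_blocks(12, 8, 8))   # all of [0..7], fully associative
```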
`
`
`
`
`
Q2: How Is a Block Found If It Is in the Cache?
`
`
`
Caches include an address tag on each block that gives the block-frame address.
The tag of every cache block that might contain the desired information is
checked to see if it matches the block-frame address from the CPU. Figure 8.7
gives an example. Because speed is of the essence, all possible tags are searched
in parallel; serial search would make set associativity counterproductive.
`
`
`
There must be a way to know that a cache block does not have valid
information. The most common procedure is to add a valid bit to the tag to say
whether or not this entry contains a valid address. If the bit is not set, there
cannot be a match on this address.
A common omission in finding the cost of caches is to forget the cost of the
tag memory. One tag is required for each block. An advantage of increasing
block sizes is that the tag overhead per cache entry becomes a smaller fraction of
the total cost of the cache.
Before proceeding to the next question, let's explore the relationship of a
CPU address to the cache. Figure 8.8 shows how an address is divided into three
fields to find data in a set-associative cache: the block-offset field used to select
the desired data from the block, the index field used to select the set, and the tag
field used for the comparison. While the comparison could be made on more of
the address than the tag, there is no need:
`
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can
appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is
found in cache block 6 in this example. In direct-mapped placement there is only one cache
block where memory block 12 can be found. In set-associative placement, with 4 sets,
memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are
checked. In this case the data is found in cache block 1. Speed of cache access dictates
that searching must be performed in parallel for fully associative and set-associative
mappings.
`
`
`
`
`
`
`
`
`
`
■ Checking the index would be redundant, since it was used to select the set to
be checked (an address stored in set 0, for example, must have 0 in the index
field or it couldn't be stored in set 0).

■ The offset is unnecessary in the comparison because all block offsets match
and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of
blocks per set, thereby decreasing the size of the index and increasing the size of
the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with
increasing associativity.
`
`
`
`
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache.
The tag is used to check all the blocks in the set and the index is used to select the set. The
block offset is the address of the desired data within the block.
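The three fields can be extracted with shifts and masks, and the tag/index trade-off is visible directly: doubling associativity at a fixed total size moves one bit from the index to the tag. A sketch under the usual power-of-two assumptions (the function name and example address are ours):

```python
import math

def address_fields(addr, cache_size, block_size, assoc):
    """Return (tag, index, block offset) for a set-associative cache."""
    num_sets = cache_size // (block_size * assoc)
    offset_bits = int(math.log2(block_size))
    index_bits = int(math.log2(num_sets))
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# An 8-KB cache with 8-byte blocks: direct mapped has a 10-bit index;
# two-way set associative has a 9-bit index and a one-bit-larger tag.
print(address_fields(12345, 8192, 8, 1))   # (1, 519, 1)
print(address_fields(12345, 8192, 8, 2))   # (3, 7, 1)
```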
`
Q3: Which Block Should Be Replaced on a Cache Miss?
`
If the choice were between a block that has valid data and a block that doesn't,
then it would be easy to select which block to replace. Alas, the high hit rate of
caches means that the overwhelming decision is between blocks that have valid
data.
A benefit of direct-mapped placement is that hardware decisions are
simplified. In fact, so simple that there is no choice: Only one block is checked
for a hit, and only that block can be replaced. With fully associative or set-
associative placement, there are several blocks to choose from on a miss. There
are two primary strategies employed for selecting which block to replace:

■ Random—To spread allocation uniformly, candidate blocks are randomly
selected. Some systems use a scheme for spreading data across a set of blocks
in a pseudorandomized manner to get reproducible behavior, which is
particularly useful during hardware debugging.

■ Least-recently used (LRU)—To reduce the chance of throwing out informa-
tion that will be needed soon, accesses to blocks are recorded. The block
replaced is the one that has been unused for the longest time. This makes use
of a corollary of temporal locality: If recently used blocks are likely to be
used again, then the best candidate for disposal is the least recently used.
Figure 8.9 (page 412) shows which block is the least-recently used for a
sequence of block-frame addresses in a fully associative memory hierarchy.
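LRU can be modeled with a recency-ordered list. The sketch below is ours and is not Figure 8.9's exact trace: a 4-block fully associative cache that starts empty, with block 0 initially least recently used.

```python
def lru_after_each(addresses, num_blocks=4):
    """Fully associative cache with LRU replacement; returns the
    least-recently used cache block after each reference."""
    order = list(range(num_blocks))   # order[0] = least recently used
    where = {}                        # block-frame address -> cache block
    history = []
    for addr in addresses:
        if addr in where:
            blk = where[addr]                       # hit: just refresh recency
        else:
            blk = order[0]                          # miss: evict the LRU block
            where = {a: b for a, b in where.items() if b != blk}
            where[addr] = blk
        order.remove(blk)
        order.append(blk)                           # now most recently used
        history.append(order[0])
    return history

print(lru_after_each([0, 1, 3, 2, 1, 0]))   # [1, 2, 3, 0, 0, 2]
```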
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
A virtue of random is that it is simple to build in hardware. As the number of
blocks to keep track of increases, LRU becomes increasingly expensive and is
frequently only approximated. Figure 8.10 shows the difference in miss rates
between LRU and random replacement. Replacement policy plays a greater role
in smaller caches than in larger caches where there are more choices of what to
replace.

FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in
a fully associative memory hierarchy. This assumes that there are 4 blocks and that in
the beginning the LRU block is number 0. The LRU block number is shown below each
new block reference. Another policy, first-in-first-out (FIFO), simply discards the block that
was used N unique accesses before, independent of its reference pattern in the last N − 1
references. Random replacement generally outperforms FIFO and it is easier to implement.

Associativity:       2-way              4-way              8-way
Size             LRU     Random     LRU     Random     LRU     Random
16 KB            5.18%   5.69%      4.67%   5.29%      4.39%   4.96%
64 KB            1.88%   2.01%      1.54%   1.66%      1.39%   1.53%
256 KB           1.15%   1.17%      1.13%   1.13%      1.12%   1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement
for several sizes and associativities. This data was collected for a block size of 16 bytes
using one of the VAX traces containing user and operating system code (SAVE). This
trace is included in the software supplement for course use. There is little difference
between LRU and random for larger size caches in this trace.

Q4: What Happens on a Write?

Reads dominate cache accesses. All instruction accesses are reads, and most
instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9%
stores and 17% loads for four DLX programs, making writes less than 10% of
the memory traffic. Making the common case fast means optimizing caches for
reads, but Amdahl's Law reminds us that high-performance designs cannot
neglect the speed of writes.
Fortunately, the common case is also the easy case to make fast. The block
can be read at the same time that the tag is read and compared, so the block read
begins as soon as the block-frame address is available. If the read is a hit, the
block is passed on to the CPU immediately. If it is a miss, there is no benefit—
but also no harm.
`
`
`
Writes are not so simple. Only the written portion of a block changes, so in
general a write means a read-modify-write sequence on the block: read the
original block, modify one portion, and write the new block value.
Moreover, modifying a block cannot begin until the tag is checked to see if it is
a hit. Because tag checking cannot occur in parallel, then, writes normally take
longer than reads.
Thus, it is the write policies that distinguish many cache designs. There are
two basic options when writing to the cache:
`
■ Write through (or store through)—The information is written to both the
block in the cache and to the block in the lower-level memory.

■ Write back (also called copy back or store in)—The information is written
only to the block in the cache. The modified cache block is written to main
memory only when it is replaced.
`
Write-back cache blocks are called clean or dirty, depending on whether the
information in the cache differs from that in lower-level memory. To reduce the
frequency of writing back blocks on replacement, a feature called the dirty bit is
commonly used. This status bit indicates whether or not the block was modified
while in the cache. If it wasn't, the block is not written, since the lower level has
the same information as the cache.
Both write back and write through have their advantages. With write back,
writes occur at the speed of the cache memory, and multiple writes within a
block require only one write to the lower-level memory. Since every write
doesn't go to memory, write back uses less memory bandwidth, making write
back attractive in multiprocessors. With write through, read misses don't result
in writes to the lower level, and write through is easier to implement than write
back. Write through also has the advantage that main memory has the most
current copy of the data. This is important in multiprocessors and for I/O, which
we shall examine in Section 8.8. Hence, multiprocessors want write back to
reduce the memory traffic per processor and write through to keep the cache and
memory consistent.
When the CPU must wait for writes to complete during write throughs, the
CPU is said to write stall. A common optimization to reduce write stalls is a
write buffer, which allows the processor to continue while the memory is
updated. As we shall see in Section 8.8, write stalls can occur even with write
buffers.
There are two options on a write miss:
`
■ Write allocate (also called fetch on write)—The block is loaded, followed by
the write-hit actions above. This is similar to a read miss.

■ No write allocate (also called write around)—The block is modified in the
lower level and not loaded into the cache.
`
`
`
`
`
`
`
`While either write-miss policy could be used with write through or write back,
`generally write-back caches use write allocate (hoping that subsequent writes to
`that block will be captured by the cache) and write-through caches often use no
`write allocate (since subsequent writes to that block will still have to go to
`memory).
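The traffic consequences of these pairings can be counted with a toy one-block cache. The sketch and its write trace are ours; it pairs write back with write allocate and write through with no write allocate, as discussed above, and counts only CPU writes (no reads).

```python
def lower_level_writes(block_addrs, policy):
    """Count lower-level writes for a trace of CPU writes, using a
    single-block cache: 'through' (no write allocate) or 'back' (write allocate)."""
    mem_writes = 0
    cached = None          # block-frame address held by the one cache block
    dirty = False
    for addr in block_addrs:
        if policy == 'through':
            mem_writes += 1            # every write goes to the lower level
        else:                          # write back
            if cached != addr:         # write miss: allocate, evicting the old block
                if dirty:
                    mem_writes += 1    # write back the dirty victim first
                cached, dirty = addr, False
            dirty = True               # the write stays in the cache
    if policy == 'back' and dirty:
        mem_writes += 1                # final write-back when the block is replaced
    return mem_writes

trace = [100, 100, 100, 200, 200]      # hypothetical block addresses
print(lower_level_writes(trace, 'through'))  # 5 -- one per write
print(lower_level_writes(trace, 'back'))     # 2 -- one per dirty block replaced
```

Repeated writes to the same block are exactly where write back saves memory bandwidth, as the paragraph above notes.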
`
`An Example Cache: The VAX-11/780 Cache
`
To give substance to these ideas, Figure 8.11 shows the organization of the
cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte
blocks with two-way-set-associative placement, random replacement, write
through with a one-word write buffer, and no write allocate on a write miss.
Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11.
(The five steps are shown as circled numbers.) The address coming into the
cache is divided into two fields: the 29-bit block-frame address and 3-bit block
offset. The block-frame address is further divided into an address tag and cache
index. Step 1 shows this division.
The cache index selects the set to be tested to see if the block is in the cache.
(A set is one block from each bank in Figure 8.11.) The size of the index
depends on cache size, block size, and set associativity. In this case, a 9-bit
index results:

    Blocks/Bank = Cache size / (Block size × Set associativity) = 8192 / (8 × 2) = 512 = 2^9

In a two-way-set-associative cache, the index is sent to both banks. This is
step 2.
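The index calculation above is easy to check in code (the function name is ours):

```python
import math

def index_bits(cache_size, block_size, assoc):
    """Index width = log2(cache size / (block size * set associativity))."""
    sets = cache_size // (block_size * assoc)
    return int(math.log2(sets))

# The VAX-11/780 cache: 8192 bytes, 8-byte blocks, two-way set associative.
print(index_bits(8192, 8, 2))   # 9 -- a 9-bit index selects one of 512 sets
# With a 29-bit block-frame address, the tag is the remaining 29 - 9 = 20 bits.
```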
After reading an address tag from each bank, the tag portion of the block-
frame address is compared to the tags. This is step 3 in the figure. To be sure the
tag contains valid information, the valid bit must be set, or the results of the
comparison are ignored.
Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to
select the block from the matching set. Why can't both tags match? It is the job
of the replacement algorithm to make sure that an address appears in only one
block. To reduce the hit time, the data is read at the same time as the address
tags; thus, by the time the block multiplexer is ready, the data is also ready.
This step is needed in set-associative caches, but it can be omitted from
direct-mapped caches since there is no selection to be made. The multiplexer
used in this step can be on the critical timing path, endangering the clock cycle
time of the CPU. (The example on pages 418–419 and the fallacy on page 481
explore the trade-off of lower miss rates and higher clock cycle time.)
In the final step the word is sent to the CPU. All five steps occur within a
single CPU clock cycle.
What happens on a miss? The cache sends a stall signal to the CPU telling it
to wait, and two words (eight bytes) are read from memory. That takes 6 clock
cycles on the VAX-11/780 (ignoring bus interference). When the data arrives,
`
`
`
the cache must pick a block to replace: the VAX-11/780 selects one of the two
blocks at random. Replacing a block means updating the data, the address tag,
and the valid bit. Once this is done, the cache goes through a regular hit cycle
and returns the data to the CPU.
Writes are more complicated in the VAX-11/780, as they are in any cache. If
the word to be written is in the cache, the first four steps are the same. The next
step is to write the data in the block, then write the changed-data portion into the
`
`
`
`
`
`
`
`
`
`
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way
set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is
selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order
of occurrence, label this organization. The line from memory to the cache is used on a miss
to load the cache. Multiplexing as found in step 4 is not needed in a direct-mapped cache.
Note that the offset is connected to chip select of the data SRAMs to allow the proper
words to be sent to the 2:1 multiplexer.
`
`
`
`
`
`