`
Computer Architecture
A Quantitative Approach
`
`
`David A. Patterson
`UNIVERSITY OF CALIFORNIA AT BERKELEY
`
`John L. Hennessy
`STANFORD UNIVERSITY
`
`With a Contribution by
`David Goldberg
`Xerox Palo Alto Research Center
`
`MORGAN KAUFMANN PUBLISHERS, INC.
`SAN MATEO, CALIFORNIA
`
`
`
`
`
`
`
`
`Sponsoring Editor Bruce Spatz
`Production Manager Shirley Jowell
`Technical Writer Walker Cunningham
`Text Design Gary Head
`Cover Design David Lance Goines
`Copy Editor Linda Medoff
`Proofreader Paul Medoff
Computer Typesetting and Graphics Fifth Street Computer Services
`
Library of Congress Cataloging-in-Publication Data
Patterson, David A.
    Computer architecture : a quantitative approach / David A.
    Patterson, John L. Hennessy
        p.    cm.
    Includes bibliographical references
    ISBN 1-55860-069-8
    1. Computer architecture.  I. Hennessy, John L.  II. Title.
QA76.9.A73P377  1990
004.2'2--dc20                                           89-85227 CIP
`
`Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953
`
© 1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means—electronic, mechanical, recording, or otherwise—without
the prior permission of the publisher.
All instruction sets and other design information of the DLX computer system contained
herein is copyrighted by the publisher and may not be incorporated in other publications
or distributed by media without formal acknowledgement and written consent from the
publisher. Use of the DLX in other publications for educational purposes is encouraged
and application for permission is welcomed.
ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or
intended for the authors should be addressed to the editorial offices of Morgan Kaufmann
Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged.
Any error sightings that are accepted for correction in subsequent printings will be
rewarded by the authors with a payment of $1.00 (U.S.) per correction upon availability
of the new printing. Electronic mail can be sent to bugs3@vsop.stanford.edu. (Please
include your full name and permanent mailing address.)
INSTRUCTOR SUPPORT: For information on classroom software and other instructor
materials available to adopters, please contact the editorial offices of Morgan Kaufmann
Publishers, Inc. (415) 578-9911.
`Fourth Printing
`
`
`
`
`
`
`
`
To Andrea, Linda, and our four sons
`
`
`
`
`
`
`
`
Ideally one would desire an indefinitely large memory
capacity such that any particular . . . word would be
immediately available. . . . We are . . . forced to recognize the
possibility of constructing a hierarchy of memories, each of
which has greater capacity than the preceding but which is
less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann,
Preliminary Discussion of the Logical Design
of an Electronic Computing Instrument (1946)
`
`8.1
`
`Introduction: Principle of Locality
`
`8.2 GeneralPrinciplesofMemoryHierarchy
`
`8.3 Caches
`8.4 Main Memory
`8.5 Virtual Memory
`8.6
`Protection and Examples of Virtual Memory
`8.7 More Optimizations Based on Program Behavior
`8.8 Advanced Topics—Improving Cache-Memory
`Performance
`8.9
`Putting It All Together: The VAX-1 1/780 Memory
`Hierarchy
`8.10 Fallacies and Pitfalls
`8.1 1 Concluding Remarks
`8.1 2 Historical Perspective and References
`
`Exercises
`
`(
`
`403
`
`404
`
`408
`425
`432
`438
`449
`
`454
`
`475
`480
`484
`485
`
`490
`
`
8  Memory-Hierarchy Design
`
8.1  Introduction: Principle of Locality
`
Computer pioneers correctly predicted that programmers would want unlimited
amounts of fast memory. Fortunately, as the 90/10 rule in the first chapter
predicts, most programs do not access all code or data uniformly (see Section
1.3, pages 8–12). The 90/10 rule can be restated as the principle of locality.
This hypothesis, which holds that all programs favor a portion of their address
space at any instant of time, has two dimensions:
`
- Temporal locality (locality in time)—If an item is referenced, it will tend to
  be referenced again soon.

- Spatial locality (locality in space)—If an item is referenced, nearby items
  will tend to be referenced soon.
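To make the two dimensions concrete, consider the following small C fragment (our own illustration, not from the text). The scalar sum is reused on every iteration (temporal locality), while the elements of each row are read from consecutive memory locations (spatial locality):

    #include <stdio.h>

    #define N 256

    int main(void)
    {
        static double a[N][N];
        double sum = 0.0;           /* reused every iteration: temporal locality */

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];     /* consecutive words of a row: spatial locality */

        printf("sum = %f\n", sum);
        return 0;
    }

Traversing the array in row order matters: stepping down a column instead (a[j][i]) would touch a distant region of memory on every reference and forfeit much of the spatial locality.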
`
A memory hierarchy is a natural reaction to locality and technology. The
principle of locality and the guideline that smaller hardware is faster yield the
concept of a hierarchy based on different speeds and sizes. Since slower memory
is cheaper, a memory hierarchy is organized into several levels—each smaller,
faster, and more expensive per byte than the level below. The levels of the
hierarchy subset one another: all data in one level is also found in the level
below, and all data in that lower level is found in the one below it, and so on
until we reach the bottom of the hierarchy.
`
`
This chapter includes a half-dozen examples that demonstrate how taking
advantage of the principle of locality can improve performance. All these
strategies map addresses from a larger memory to a smaller but faster memory.
As part of address mapping, the memory hierarchy is usually given the
responsibility of address checking; protection schemes used for doing this are
covered in this chapter. Later we will explore advanced memory-hierarchy topics
and trace a memory access through three levels of memory on the VAX-11/780.
`
8.2  General Principles of Memory Hierarchy
`
Before proceeding with examples of the memory hierarchy, let's define some
general terms applicable to all memory hierarchies. A memory hierarchy
normally consists of many levels, but it is managed between two adjacent levels
at a time. The upper level—the one closer to the processor—is smaller and faster
than the lower level (see Figure 8.1). The minimum unit of information that can
be either present or not present in the two-level hierarchy is called a block. The
size of a block may be either fixed or variable. If it is fixed, the memory size is a
multiple of that block size. Most of this chapter will be concerned with fixed
block sizes, although a variable block design is discussed in Section 8.6.
Success or failure of an access to the upper level is designated as a hit or a
miss: A hit is a memory access found in the upper level, while a miss means it is
not found in that level. Hit rate, or hit ratio—like a batting average—is the
fraction of memory accesses found in the upper level. This is sometimes
represented as a percentage. Miss rate (1.0 − hit rate) is the fraction of memory
accesses not found in the upper level.
`
[Figure 8.1: the processor above an upper level and a lower level of the memory hierarchy.]

FIGURE 8.1 Every pair of levels in the memory hierarchy can be thought of as
having an upper and lower level. Within each level the unit of information that is present
or not is called a block.
`
`
The memory address is divided into pieces that access each part of the
hierarchy. The block-frame address is the higher-order piece of the address that
identifies a block at that level of the hierarchy (see Figure 8.2). The block-offset
address is the lower-order piece of the address and identifies an item within a
block. The size of the block-offset address is log2 (size of block); the size of the
block-frame address is then the size of the full address at this level less the size
of the block-offset address.
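As a minimal sketch of this division (our own, using the Figure 8.2 parameters of a 32-bit address and a 512-byte block), the two pieces can be extracted with a mask and a shift:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE  512u   /* bytes; a power of two */
    #define OFFSET_BITS 9      /* log2(BLOCK_SIZE)      */

    int main(void)
    {
        uint32_t addr   = 0xB441295Cu;              /* an arbitrary example address */
        uint32_t offset = addr & (BLOCK_SIZE - 1u); /* low-order 9 bits   */
        uint32_t frame  = addr >> OFFSET_BITS;      /* high-order 23 bits */

        printf("frame = 0x%06X, offset = 0x%03X\n",
               (unsigned)frame, (unsigned)offset);
        return 0;
    }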
`
`
`
`
`
Block-frame address        Block-offset address
01011010001000001001010    111001110

FIGURE 8.2 Example of the frame address and offset address portions of a 32-bit
lower-level memory address. In this case the block size is 512, making the size of the
offset address 9 bits and the size of the block-frame address 23 bits.
`
Since performance is the major reason for having a memory hierarchy, the
speed of hits and misses is important. Hit time is the time to access the upper
level of the memory hierarchy, which includes the time to determine whether the
access is a hit or a miss. The miss penalty is the time to replace a block in the
upper level with the corresponding block from the lower level, plus the time to
deliver this block to the requesting device (normally the CPU). The miss penalty
is further divided into two components: access time—the time to access the first
word of a block on a miss; and transfer time—the additional time to transfer the
remaining words in the block. Access time is related to the latency of the lower-
level memory, while transfer time is related to the bandwidth between the lower-
level and upper-level memories. (Sometimes access latency is used to mean
access time.)

Evaluating Performance of a Memory Hierarchy
`
Because instruction count is independent of the hardware, it is tempting to
evaluate CPU performance using that number. As we saw in Chapters 2 and 4,
however, such indirect performance measures have waylaid many a computer
designer. The corresponding temptation for evaluating memory-hierarchy
performance is to concentrate on miss rate, for it, too, is independent of the
speed of the hardware. As we shall see, miss rate can be just as misleading as
instruction count. A better measure of memory-hierarchy performance is the
average time to access memory:

    Average memory-access time = Hit time + Miss rate × Miss penalty

The components of average access time can be measured either in absolute
time—say, 10 nanoseconds on a hit—or in the number of clock cycles that the
CPU waits for the memory—such as a miss penalty of 12 clock cycles.
Remember that average memory-access time is still an indirect measure of
performance; so while it is a better measure than miss rate, it is not a substitute
for execution time.
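A small worked example (the numbers are ours, not measurements from the text): with a 1-clock hit time, the 12-clock miss penalty above, and an assumed 5% miss rate, the formula gives 1 + 0.05 × 12 = 1.6 clock cycles per access:

    #include <stdio.h>

    int main(void)
    {
        double hit_time     = 1.0;   /* clock cycles                   */
        double miss_rate    = 0.05;  /* fraction of accesses that miss */
        double miss_penalty = 12.0;  /* clock cycles                   */

        /* Average memory-access time = Hit time + Miss rate * Miss penalty */
        double avg = hit_time + miss_rate * miss_penalty;
        printf("average memory-access time = %.2f clock cycles\n", avg);
        return 0;
    }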
`
`The relationship of block size to miss penalty and miss rate is shown
`abstractly in Figure 8.3. These representations assume that the size of the upper-
`level memory does not change. The access-time portion of the miss penalty is
`not affected by block size, but the transfer time does increase with block size. If
`access time is large, initially there will be little additional miss penalty relative
`to access time as block size increases. However, increasing block size means
`fewer blocks in the upper-level memory. Increasing block size lowers the miss
`rate until
`the reduced misses of larger blocks (spatial locality) are outweighed
`by the increased misses as the number of blocks shrinks (temporal locality).
`
`
`
[Figure 8.3: two plots, miss penalty (access time plus transfer time) versus block size, and miss rate versus block size.]

FIGURE 8.3 Block size versus miss penalty and miss rate. The transfer-time portion of
the miss penalty obviously grows with increasing block size. For a fixed-size upper-level
memory, miss rates fall with increasing block size until so much of the block is not used that
it displaces useful information in the upper level, and miss rates begin to rise. The point on
the curve on the right where miss rates begin to rise with increasing block size is
sometimes called the pollution point.
`
`
`
[Figure 8.4: a plot of average memory-access time versus block size.]

FIGURE 8.4 The relationship between average memory-access time and block size.
`
`
The goal of a memory hierarchy is to reduce execution time, not misses.
Hence, computer designers favor a block size with the lowest average access
time rather than the lowest miss rate. This is related to the product of miss rate
and miss penalty, as Figure 8.4 shows abstractly. Of course, overall CPU
performance is the ultimate performance test, so care must be taken when
reducing average memory-access time to be sure that changes to clock cycle time
and CPI improve overall performance as well as average memory-access time.
`
`Implications of a Memory Hierarchy to the CPU
`
Processors designed without a memory hierarchy are simpler because memory
accesses always take the same amount of time. Misses in a memory hierarchy
mean that the CPU must be able to handle variable memory-access times. If the
miss penalty is on the order of tens of clock cycles, the processor normally waits
for the memory transfer to complete. On the other hand, if the miss penalty is
thousands of processor clock cycles, it is too wasteful to let the CPU sit idle; in
this case, the CPU is interrupted and used for another process during the miss
handling. Thus, avoiding the overhead of a long miss penalty means any
memory access can result in a CPU interrupt. This also means the CPU must be
able to recover any memory address that can cause such an interrupt, so that the
system can know what to transfer to satisfy the miss (see Section 5.6). When the
memory transfer is complete, the original process is restored, and the instruction
that missed is retried.
`
The processor must also have some mechanism to determine whether or not
information is in the top level of the memory hierarchy. This check happens on
every memory access and affects hit time; maintaining acceptable performance
usually requires the check to be implemented in hardware. The final implication
of a memory hierarchy is that the computer must have a mechanism to transfer
blocks between upper- and lower-level memory. If the block transfer requires
tens of clock cycles, it is controlled by hardware; if it requires thousands of
clock cycles, it can be controlled by software.
`
`Four Questions for Classifying Memory Hierarchies
`
`The fundamental principles that drive all memory hierarchies allow us to use
`terms that transcend the levels we are talking about. These same principles allow
`us to pose four questions about any level of the hierarchy:
`
`Q1: Where can a block be placed in the upper level? (Block placement)
`
`Q2: How is a block found if it is in the upper level? (Block identification)
`
`Q3: Which block should be replaced on a miss? (Block replacement)
`
`Q4: What happens on a write? (Write strategy)
`
`These questions will help us gain an understanding of the different tradeoffs
`demanded by the relationships of memories at different levels of a hierarchy.
`
`
`8.3 Caches
`
`Cache: a safe place for hiding or storing things.
`
`Webster’s New World Dictionary of the American Language,
`Second College Edition (1976)
`
`Cache is the name first chosen to represent the level of the memory hierarchy
`between the CPU and main memory, and that is the dominant use of the term.
`While the concept of caches is younger than the IBM 360 architecture, caches
`appear today in every class of computer and in some computers more than once.
`In fact, the word has become so popular that it has replaced “buffer” in many
`computer-science circles.
The general terms defined in the prior section can be used for caches,
although the word line is often used instead of block. Figure 8.5 shows the
typical range of memory-hierarchy parameters for caches.
`
Block (line) size      4 – 128 bytes
Hit time               1 – 4 clock cycles (normally 1)
Miss penalty           8 – 32 clock cycles
  (Access time)        (6 – 10 clock cycles)
  (Transfer time)      (2 – 22 clock cycles)
Miss rate              1% – 20%
Cache size             1 KB – 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in 1990
workstations and minicomputers.
`
Now let's examine caches in more detail by answering the four memory-
hierarchy questions.
`
Q1: Where Can a Block Be Placed in a Cache?
`
Restrictions on where a block is placed create three categories of cache
organization:

- If each block has only one place it can appear in the cache, the cache is said
  to be direct mapped. The mapping is usually (block-frame address) modulo
  (number of blocks in cache).

- If a block can be placed anywhere in the cache, the cache is said to be fully
  associative.
`
`
- If a block can be placed in a restricted set of places in the cache, the cache is
  said to be set associative. A set is a group of two or more blocks in the cache.
  A block is first mapped onto a set, and then the block can be placed anywhere
  within the set. The set is usually chosen by bit selection; that is, (block-frame
  address) modulo (number of sets in cache). If there are n blocks in a set, the
  cache placement is called n-way set associative.
`
The range of caches from direct mapped to fully associative is really a
continuum of levels of set associativity: Direct mapped is simply one-way set
associative, and a fully associative cache with m blocks could be called m-way
set associative. Figure 8.6 shows where block 12 can be placed in a cache
according to the block-placement policy.
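The continuum can be captured in a few lines of C (a sketch of our own; the constants match the 8-block, two-way cache of Figure 8.6). Setting ASSOC to 1 gives direct mapping, and setting it to NUM_BLOCKS gives full associativity:

    #include <stdint.h>

    #define NUM_BLOCKS 8
    #define ASSOC      2                     /* blocks per set         */
    #define NUM_SETS   (NUM_BLOCKS / ASSOC)  /* 4 sets in this example */

    /* Bit selection: block-frame address modulo the number of sets.
       For block-frame address 12 this returns 12 mod 4 = 0.           */
    static uint32_t set_index(uint32_t block_frame_addr)
    {
        return block_frame_addr % NUM_SETS;
    }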
`
[Figure 8.6: three cache diagrams above a 32-block memory. Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4).]

FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-
associative organization has 4 sets with 2 blocks per set, called two-way set associative.
(Real caches contain hundreds of blocks and real memories contain hundreds of thousands
of blocks.) Assume that there is nothing in the cache and that the block-frame address in
question identifies lower-level block 12. The three options for caches are shown left to right.
In fully associative, block 12 from the lower level can go into any of the 8 blocks of the
cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set
associative, which has some of both features, allows the block to be placed anywhere in set
0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block
0 or block 1 of the cache.
`
`
`Q2: How Is a Block Found If It Is in the Cache?
`
`Caches include an address tag on each block that gives the block-frame address.
`The tag of every cache block that might contain the desired information is
`checked to see if it matches the block-frame address from the CPU. Figure 8.7
`gives an example. Because speed is of the essence, all possible tags are searched
`in parallel; serial search would make set associativity counterproductive.
`
[Figure 8.7: search diagrams for fully associative (all 8 tags searched), direct mapped (one tag searched), and set associative (the 2 tags of set 0 searched).]

FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can
appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is
found in cache block 6 in this example. In direct-mapped placement there is only one cache
block where memory block 12 can be found. In set-associative placement, with 4 sets,
memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are
checked. In this case the data is found in cache block 1. Speed of cache access dictates
that searching must be performed in parallel for fully associative and set-associative
mappings.
`
There must be a way to know that a cache block does not have valid
information. The most common procedure is to add a valid bit to the tag to say
whether or not this entry contains a valid address. If the bit is not set, there
cannot be a match on this address.
`
A common omission in finding the cost of caches is to forget the cost of the
tag memory. One tag is required for each block. An advantage of larger block
sizes is that the tag overhead per cache entry becomes a smaller fraction of the
total cost of the cache.
Before answering the next question, let's explore the relationship of a
CPU address to the cache. Figure 8.8 shows how an address is divided into three
fields to find data in a set-associative cache: the block-offset field used to select
the desired data from the block, the index field used to select the set, and the tag
field used for the comparison. While the comparison could be made on more of
the address than the tag, there is no need:
`
`
`
`
`
`
`
`
- Checking the index would be redundant, since it was used to select the set to
  be checked (an address stored in set 0, for example, must have 0 in the index
  field or it couldn't be stored in set 0).

- The offset is unnecessary in the comparison because all block offsets match
  and the entire block is present or not.
`
`If the total size is kept the same, increasing associativity increases the number of
`blocks per set, thereby decreasing the size of the index and increasing the size of
`the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with
`increasing associativity.
`
`FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache.
`The tag is used to check all the blocks in the set and the index is used to select the set. The
`block offset is the address of the desired data within the block.
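A sketch of this three-way division in C (our own; the field widths are illustrative, not those of any particular machine: here 32-byte blocks give 5 offset bits and 128 sets give 7 index bits, leaving a 20-bit tag in a 32-bit address):

    #include <stdint.h>

    #define OFFSET_BITS 5   /* log2(block size)     */
    #define INDEX_BITS  7   /* log2(number of sets) */

    struct cache_addr {
        uint32_t tag, index, offset;
    };

    static struct cache_addr split_address(uint32_t addr)
    {
        struct cache_addr a;
        a.offset = addr & ((1u << OFFSET_BITS) - 1u);
        a.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
        a.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        return a;
    }

Doubling the associativity at a fixed total size halves the number of sets, so INDEX_BITS shrinks by one bit and the tag grows by one bit, moving the tag/index boundary to the right as described above.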
`
`Q3: Which Block Should Be Replaced on a Cache Miss?
`
If the choice were between a block that has valid data and a block that doesn't,
then it would be easy to select which block to replace. Alas, the high hit rate of
caches means that the overwhelming decision is between blocks that have valid
data.
`
A benefit of direct-mapped placement is that hardware decisions are
simplified. In fact, the decision is so simple that there is no choice: Only one
block is checked for a hit, and only that block can be replaced. With fully
associative or set-associative placement, there are several blocks to choose from
on a miss. There are two primary strategies employed for selecting which block
to replace:

- Random—To spread allocation uniformly, candidate blocks are randomly
  selected. Some systems use a scheme for spreading data across a set of blocks
  in a pseudorandomized manner to get reproducible behavior, which is
  particularly useful during hardware debugging.

- Least-recently used (LRU)—To reduce the chance of throwing out informa-
  tion that will be needed soon, accesses to blocks are recorded. The block
  replaced is the one that has been unused for the longest time. This makes use
  of a corollary of temporal locality: If recently used blocks are likely to be
  used again, then the best candidate for disposal is the least recently used.
  Figure 8.9 (page 412) shows which block is the least-recently used for a
  sequence of block-frame addresses in a fully associative memory hierarchy.
`
`A virtue of random is that it is simple to build in hardware. As the number of
`blocks to keep track of increases, LRU becomes increasingly expensive and is
`frequently only approximated. Figure 8.10 shows the difference in miss rates
`between LRU and random replacement. Replacement policy plays a greater role
`in smaller caches than in larger caches where there are more choices of what to
`replace.
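The two policies can be sketched as victim-selection routines for one set (our own illustration; real hardware typically approximates LRU with a few status bits rather than full timestamps):

    #include <stdlib.h>

    #define ASSOC 4   /* blocks per set in this sketch */

    /* Random: simple to build in hardware. */
    static int victim_random(void)
    {
        return rand() % ASSOC;
    }

    /* LRU: evict the block whose last access is oldest.
       last_use[i] holds the time of block i's most recent access. */
    static int victim_lru(const unsigned long last_use[ASSOC])
    {
        int lru = 0;
        for (int i = 1; i < ASSOC; i++)
            if (last_use[i] < last_use[lru])
                lru = i;
        return lru;
    }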
`,
`,
`.,
`,
`_
`§
`‘
`,
`l O
`
`
`
`3:Block-frame addresses ‘ to2 1 0 1} 0
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`00LRUblocknumber 0 0‘21 o 0 3‘3 3 1
`
`
`
`
`
`
`._.
`
`w
`
`w
`
`FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in
`a fully associative memory hierarchy. This assumes that there are 4 blocks and that in
`the beginning the LRU block is number 0. The LRU block number is shown below each
`new block reference. Another policy, First—in—first-out(FIFO), simply discards the block that
`was used N unique accesses before, independent of its reference pattern in the last N — 1
`references. Random replacement generally outperforms FIFO and it is easier to implement.
`
Associativity:      2-way              4-way              8-way
Size            LRU     Random     LRU     Random     LRU     Random
16 KB           5.18%   5.69%      4.67%   5.29%      4.39%   4.96%
64 KB           1.88%   2.01%      1.54%   1.66%      1.39%   1.53%
256 KB          1.15%   1.17%      1.13%   1.13%      1.12%   1.12%
`
`FIGURE 8.10 Miss rates comparing least-recently used versus random replacement
`for several sizes and associativities. This data was collected for a block size of 16 bytes
`using one of the VAX traces containing user and operating system code (SAVEO). This
`trace is included in the software supplement for course use. There is little difference
`between LRU and random for larger size caches in this trace.
`
Q4: What Happens on a Write?
`
Reads dominate cache accesses. All instruction accesses are reads, and most
instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9%
stores and 17% loads for four DLX programs, making writes less than 10% of
the memory traffic. Making the common case fast means optimizing caches for
reads, but Amdahl's Law reminds us that high-performance designs cannot
neglect the speed of writes.
Fortunately, the common case is also the easy case to make fast. The block
can be read at the same time that the tag is read and compared, so the block read
begins as soon as the block-frame address is available. If the read is a hit, the
block is passed on to the CPU immediately. If it is a miss, there is no benefit—
but also no harm.
`
`
`Such is not the case for writes. The processor specifies the size of the write,
`usually between 1 and 8 bytes; only that portion of a block can be changed. In
`general this means a read-modify-write sequence of operations on the block:
`read the original block, modify one portion, and write the new block value.
`Moreover, modifying a block cannot begin until the tag is checked to see if it is
`a hit. Because tag checking cannot occur in parallel, then, writes normally take
`longer than reads.
`
`Thus, it is the write policies that distinguish many cache designs. There are
`two basic options when writing to the cache:
`
- Write through (or store through)—The information is written to both the
  block in the cache and to the block in the lower-level memory.

- Write back (also called copy back or store in)—The information is written
  only to the block in the cache. The modified cache block is written to main
  memory only when it is replaced.
`
`Write-back cache blocks are called clean or dirty, depending on whether the
`information in the cache differs from that in lower-level memory. To reduce the
`frequency of writing back blocks on replacement, a feature called the dirty bit is
`commonly used. This status bit indicates whether or not the block was modified
`while in the cache. If it wasn’t, the block is not written, since the lower level has
`the same information as the cache.
`
Both write back and write through have their advantages. With write back,
writes occur at the speed of the cache memory, and multiple writes within a
block require only one write to the lower-level memory. Since every write
doesn't go to memory, write back uses less memory bandwidth, making write
back attractive in multiprocessors. With write through, read misses don't result
in writes to the lower level, and write through is easier to implement than write
back. Write through also has the advantage that main memory has the most
current copy of the data. This is important in multiprocessors and for I/O, which
we shall examine in Section 8.8. Hence, multiprocessors want write back to
reduce the memory traffic per processor and write through to keep the cache and
memory consistent.
When the CPU must wait for writes to complete during write throughs, the
CPU is said to write stall. A common optimization to reduce write stalls is a
write buffer, which allows the processor to continue while the memory is
updated. As we shall see in Section 8.8, write stalls can occur even with write
buffers.
`
`
`
`There are two options on a write miss:
`
- Write allocate (also called fetch on write)—The block is loaded, followed by
  the write-hit actions above. This is similar to a read miss.

- No write allocate (also called write around)—The block is modified in the
  lower level and not loaded into the cache.
`
While either write-miss policy could be used with write through or write back,
generally write-back caches use write allocate (hoping that subsequent writes to
that block will be captured by the cache) and write-through caches often use no
write allocate (since subsequent writes to that block will still have to go to
memory).
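The contrast between the two write-hit policies, and the role of the dirty bit, can be sketched for a single cached block (our own illustration, not the design of any machine in the text):

    #include <stdbool.h>

    struct block {
        unsigned char data;   /* the cached copy        */
        bool          dirty;  /* set only by write back */
    };

    static unsigned char lower_level;   /* stand-in for main memory */

    static void write_through(struct block *b, unsigned char v)
    {
        b->data     = v;   /* write the cache block...          */
        lower_level = v;   /* ...and the lower level every time */
    }

    static void write_back(struct block *b, unsigned char v)
    {
        b->data  = v;      /* write only the cache block     */
        b->dirty = true;   /* remember to copy it back later */
    }

    static void replace_block(struct block *b)
    {
        if (b->dirty)                /* the dirty bit avoids needless */
            lower_level = b->data;   /* write-backs on replacement    */
        b->dirty = false;
    }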
`
An Example Cache: The VAX-11/780 Cache
`
To give substance to these ideas, Figure 8.11 shows the organization of the
cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte
blocks with two-way-set-associative placement, random replacement, write
through with a one-word write buffer, and no write allocate on a write miss.
Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11.
(The five steps are shown as circled numbers.) The address coming into the
cache is divided into two fields: the 29-bit block-frame address and 3-bit block
offset. The block-frame address is further divided into an address tag and cache
index. Step 1 shows this division.
The cache index selects the set to be tested to see if the block is in the cache.
(A set is one block from each bank in Figure 8.11.) The size of the index
depends on cache size, block size, and set associativity. In this case, a 9-bit
index results:

    Blocks/Bank = Cache size / (Block size × Set associativity)
                = 8192 / (8 × 2) = 512 = 2^9
`
In a two-way-set-associative cache, the index is sent to both banks. This is
step 2.
After reading an address tag from each bank, the tag portion of the block-
frame address is compared to the tags. This is step 3 in the figure. To be sure the
tag contains valid information, the valid bit must be set, or the results of the
comparison are ignored.
Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to
select the block from the matching set. Why can't both tags match? It is the job
of the replacement algorithm to make sure that an address appears in only one
block. To reduce the hit time, the data is read at the same time as the address
tags; thus, by the time the block multiplexer is ready, the data is also ready.
`This step is needed in set-associative caches, but it can be omitted from
`direct-mapped caches since there is no selection to be made. The multiplexer
`used in this step can be on the critical
`timing path, endangering the clock cycle
`time of the CPU. (The example on pages 418—419 and the fallacy on page 481
`explore the trade-off of lower miss rates and higher clock cycle time.)
`In the final step the word is sent to the CPU. All five steps occur within a
`single CPU clock cycle.
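The five steps of a read hit can be sketched in C (our own model of a VAX-11/780-style cache, using the field widths from the text: 3-bit offset, 9-bit index, 20-bit tag; the data layout and helper names are ours):

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 512   /* sets, one block per bank per set  */
    #define WAYS 2     /* two banks: two-way set associative */

    struct line { bool valid; uint32_t tag; uint64_t data; };
    static struct line bank[WAYS][SETS];

    static bool read_hit(uint32_t addr, uint64_t *word)
    {
        uint32_t index = (addr >> 3) & 0x1FFu;  /* step 1: divide the address */
        uint32_t tag   = addr >> 12;            /* (block offset = addr & 0x7) */

        for (int w = 0; w < WAYS; w++) {        /* step 2: index both banks   */
            const struct line *l = &bank[w][index];
            if (l->valid && l->tag == tag) {    /* step 3: compare valid tags */
                *word = l->data;                /* step 4: block multiplexer  */
                return true;                    /* step 5: send word to CPU   */
            }
        }
        return false;                           /* miss: stall and fetch      */
    }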
`What happens on a miss? The cache sends a stall signal to the CPU telling it
`to wait, and two words (eight bytes) are read from memory. That takes 6 clock
cycles on the VAX-11/780 (ignoring bus interference). When the data arrives,
`
the cache must pick a block to replace: the VAX-11/780 selects one of the two
blocks at random. Replacing a block means updating the data, the address tag,
and the valid bit. Once this is done, the cache goes through a regular hit cycle
and returns the data to the CPU.
Writes are more complicated in the VAX-11/780, as they are in any cache. If
the word to be written is in the cache, the first four steps are the same. The next
step is to write the data in the block, then write the changed-data portion into the
`
`
`
`
`
`
[Figure 8.11: organization of the VAX-11/780 cache. The CPU address divides into a block-frame address, consisting of a 20-bit tag and a 9-bit cache index, plus a 3-bit block offset.]