Computer Architecture
A Quantitative Approach
David A. Patterson
UNIVERSITY OF CALIFORNIA AT BERKELEY

John L. Hennessy
STANFORD UNIVERSITY

With a Contribution by
David Goldberg
Xerox Palo Alto Research Center

MORGAN KAUFMANN PUBLISHERS, INC.
SAN MATEO, CALIFORNIA
Sponsoring Editor Bruce Spatz
Production Manager Shirley Jowell
Technical Writer Walker Cunningham
Text Design Gary Head
Cover Design David Lance Goines
Copy Editor Linda Medoff
Proofreader Paul Medoff
Computer Typesetting and Graphics Fifth Street Computer Services
Library of Congress Cataloging-in-Publication Data
Patterson, David A.
    Computer architecture : a quantitative approach / David A.
    Patterson, John L. Hennessy
        p. cm.
    Includes bibliographical references
    ISBN 1-55860-069-8
    1. Computer architecture. I. Hennessy, John L. II. Title.
    QA76.9.A73P377 1990
    004.2'2--dc20                                        89-85227 CIP
Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, recording, or otherwise—without the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained herein is copyrighted by the publisher and may not be incorporated in other publications or distributed by media without formal acknowledgement and written consent from the publisher. Use of the DLX in other publications for educational purposes is encouraged and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to the editorial offices of Morgan Kaufmann Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (U.S.) per correction upon availability of the new printing. Electronic mail can be sent to bugs3@vsop.stanford.edu. (Please include your full name and permanent mailing address.)

INSTRUCTOR SUPPORT: For information on classroom software and other instructor materials available to adopters, please contact the editorial offices of Morgan Kaufmann Publishers, Inc. (415) 578-9911.

Fourth Printing
To Andrea, Linda, and our four sons
Ideally one would desire an indefinitely large memory capacity such that any particular ... word would be immediately available. ... We are ... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann,
Preliminary Discussion of the Logical Design
of an Electronic Computing Instrument (1946)
8.1  Introduction: Principle of Locality  403
8.2  General Principles of Memory Hierarchy  404
8.3  Caches  408
8.4  Main Memory  425
8.5  Virtual Memory  432
8.6  Protection and Examples of Virtual Memory  438
8.7  More Optimizations Based on Program Behavior  449
8.8  Advanced Topics—Improving Cache-Memory Performance  454
8.9  Putting It All Together: The VAX-11/780 Memory Hierarchy  475
8.10 Fallacies and Pitfalls  480
8.11 Concluding Remarks  484
8.12 Historical Perspective and References  485
     Exercises  490
8 Memory-Hierarchy Design

8.1 Introduction: Principle of Locality
Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. As the 90/10 rule in the first chapter predicts, most programs fortunately do not access all code or data uniformly (see Section 1.3, pages 8–12). The 90/10 rule can be restated as the principle of locality. This hypothesis, which holds that all programs favor a portion of their address space at any instant of time, has two dimensions:
- Temporal locality (locality in time)—If an item is referenced, it will tend to be referenced again soon.

- Spatial locality (locality in space)—If an item is referenced, nearby items will tend to be referenced soon.
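A minimal sketch in C makes both dimensions concrete (the function and its name are illustrative, not from the text): the scalar sum is touched on every iteration, while the array is walked through consecutive addresses.

    #include <stddef.h>

    /* Illustrative only: summing an array exhibits both kinds of locality. */
    double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;          /* sum is referenced every iteration:
                                      temporal locality                   */
        for (size_t i = 0; i < n; i++)
            sum += a[i];           /* consecutive elements of a:
                                      spatial locality                    */
        return sum;
    }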
A memory hierarchy is a natural reaction to locality and technology. The principle of locality and the guideline that smaller hardware is faster yield the concept of a hierarchy based on different speeds and sizes. Since slower memory is cheaper, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the level below. The levels of the hierarchy subset one another; all data in one level is also found in the level below, and all data in that lower level is found in the one below it, and so on until we reach the bottom of the hierarchy.
This chapter includes a half-dozen examples that demonstrate how taking advantage of the principle of locality can improve performance. All these strategies map addresses from a larger memory to a smaller but faster memory. As part of address mapping, the memory hierarchy is usually given the responsibility of address checking; protection schemes used for doing this are covered in this chapter. Later we will explore advanced memory hierarchy topics and trace a memory access through three levels of memory on the VAX-11/780.
8.2 General Principles of Memory Hierarchy
Before proceeding with examples of the memory hierarchy, let's define some general terms applicable to all memory hierarchies. A memory hierarchy normally consists of many levels, but it is managed between two adjacent levels at a time. The upper level—the one closer to the processor—is smaller and faster than the lower level (see Figure 8.1). The minimum unit of information that can be either present or not present in the two-level hierarchy is called a block. The size of a block may be either fixed or variable. If it is fixed, the memory size is a multiple of that block size. Most of this chapter will be concerned with fixed block sizes, although a variable block design is discussed in Section 8.6.

Success or failure of an access to the upper level is designated as a hit or a miss: A hit is a memory access found in the upper level, while a miss means it is not found in that level. Hit rate, or hit ratio—like a batting average—is the fraction of memory accesses found in the upper level. This is sometimes represented as a percentage. Miss rate (1.0 - hit rate) is the fraction of memory accesses not found in the upper level.
[Figure: the processor above the upper level of the hierarchy, which sits above the lower level.]

FIGURE 8.1 Every pair of levels in the memory hierarchy can be thought of as having an upper and lower level. Within each level the unit of information that is present or not is called a block.
The memory address is divided into pieces that access each part of the hierarchy. The block-frame address is the higher-order piece of the address that identifies a block at that level of the hierarchy (see Figure 8.2). The block-offset address is the lower-order piece of the address and identifies an item within a block. The size of the block-offset address is log2 (size of block); the size of the block-frame address is then the size of the full address at this level less the size of the block-offset address.
Block-frame address: 01011010001000001001010    Block-offset address: 111001110

FIGURE 8.2 Example of the frame address and offset address portions of a 32-bit lower-level memory address. In this case the block size is 512, making the size of the offset address 9 bits and the size of the block-frame address 23 bits.
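The division in Figure 8.2 is a shift and a mask. A hedged C sketch, assuming the same 512-byte block and 32-bit addresses (the function names are invented for illustration):

    #include <stdint.h>

    #define BLOCK_SIZE  512u   /* bytes; a power of 2 */
    #define OFFSET_BITS 9u     /* log2(BLOCK_SIZE)    */

    /* Lower-order piece: identifies an item within a block. */
    static uint32_t block_offset(uint32_t addr)
    {
        return addr & (BLOCK_SIZE - 1);   /* low 9 bits   */
    }

    /* Higher-order piece: identifies a block at this level. */
    static uint32_t block_frame(uint32_t addr)
    {
        return addr >> OFFSET_BITS;       /* high 23 bits */
    }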
Since performance is the major reason for having a memory hierarchy, the speed of hits and misses is important. Hit time is the time to access the upper level of the memory hierarchy, which includes the time to determine whether the access is a hit or a miss. The miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the requesting device (normally the CPU). The miss penalty is further divided into two components: access time—the time to access the first word of a block on a miss; and transfer time—the additional time to transfer the remaining words in the block. Access time is related to the latency of the lower-level memory, while transfer time is related to the bandwidth between the lower-level and upper-level memories. (Sometimes access latency is used to mean access time.)

Evaluating Performance of a Memory Hierarchy
Because instruction count is independent of the hardware, it is tempting to evaluate CPU performance using that number. As we saw in Chapters 2 and 4, however, such indirect performance measures have waylaid many a computer designer. The corresponding temptation for evaluating memory-hierarchy performance is to concentrate on miss rate, for it, too, is independent of the speed of the hardware. As we shall see, miss rate can be just as misleading as instruction count. A better measure of memory-hierarchy performance is the average time to access memory:

Average memory-access time = Hit time + Miss rate * Miss penalty

The components of average access time can be measured either in absolute time—say, 10 nanoseconds on a hit—or in the number of clock cycles that the
CPU waits for the memory—such as a miss penalty of 12 clock cycles. Remember that average memory-access time is still an indirect measure of performance; so while it is a better measure than miss rate, it is not a substitute for execution time.
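The formula evaluates directly. A small C sketch with illustrative numbers (a 1-cycle hit time and a 5% miss rate are assumptions; the 12-cycle miss penalty echoes the example above):

    #include <stdio.h>

    /* Average memory-access time = Hit time + Miss rate * Miss penalty.
       The units (ns or clock cycles) only need to be consistent.       */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Illustrative numbers only: 1 + 0.05 * 12 = 1.60 cycles. */
        printf("%.2f clock cycles per access\n", amat(1.0, 0.05, 12.0));
        return 0;
    }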
The relationship of block size to miss penalty and miss rate is shown abstractly in Figure 8.3. These representations assume that the size of the upper-level memory does not change. The access-time portion of the miss penalty is not affected by block size, but the transfer time does increase with block size. If access time is large, initially there will be little additional miss penalty relative to access time as block size increases. However, increasing block size means fewer blocks in the upper-level memory. Increasing block size lowers the miss rate until the reduced misses of larger blocks (spatial locality) are outweighed by the increased misses as the number of blocks shrinks (temporal locality).
[Figure: two plots against block size. Left panel: miss penalty, with the transfer-time portion growing as blocks get larger. Right panel: miss rate, falling and then rising.]

FIGURE 8.3 Block size versus miss penalty and miss rate. The transfer-time portion of the miss penalty obviously grows with increasing block size. For a fixed-size upper-level memory, miss rates fall with increasing block size until so much of the block is not used that it displaces useful information in the upper level, and miss rates begin to rise. The point on the curve on the right where miss rates begin to rise with increasing block size is sometimes called the pollution point.
[Figure: average access time plotted against block size.]

FIGURE 8.4 The relationship between average memory-access time and block size.
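The shape of Figure 8.4 can be reproduced with a toy model. Everything in this sketch is an assumption, not data from the text: the miss penalty is modeled as access time plus block size divided by transfer bandwidth, and the miss rates are made-up values that fall and then rise past the pollution point.

    #include <stdio.h>

    #define ACCESS_TIME   6.0   /* cycles to reach the first word (assumed) */
    #define XFER_PER_BYTE 0.5   /* transfer cycles per byte (assumed)       */
    #define HIT_TIME      1.0   /* cycles (assumed)                         */

    int main(void)
    {
        /* Assumed miss rates: falling, then rising past the pollution point. */
        double bytes[] = {   4,    8,   16,   32,    64,  128 };
        double rate[]  = { .12,  .08,  .05,  .04,  .045,  .07 };

        for (int i = 0; i < 6; i++) {
            double penalty = ACCESS_TIME + XFER_PER_BYTE * bytes[i];
            printf("block %4.0f bytes: average access time %.2f cycles\n",
                   bytes[i], HIT_TIME + rate[i] * penalty);
        }
        return 0;
    }

With these assumed parameters the minimum average access time falls at an intermediate block size, even though the miss rate alone keeps falling for a while longer, which is exactly the distinction the next paragraph draws.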
The goal of a memory hierarchy is to reduce execution time, not misses. Hence, computer designers favor a block size with the lowest average access time rather than the lowest miss rate. This is related to the product of miss rate and miss penalty, as Figure 8.4 shows abstractly. Of course, overall CPU performance is the ultimate performance test, so care must be taken when reducing average memory-access time to be sure that changes to clock cycle time and CPI improve overall performance as well as average memory-access time.
Implications of a Memory Hierarchy to the CPU

Processors designed without a memory hierarchy are simpler because memory accesses always take the same amount of time. Misses in a memory hierarchy mean that the CPU must be able to handle variable memory-access times. If the miss penalty is on the order of tens of clock cycles, the processor normally waits for the memory transfer to complete. On the other hand, if the miss penalty is thousands of processor clock cycles, it is too wasteful to let the CPU sit idle; in this case, the CPU is interrupted and used for another process during the miss handling. Thus, avoiding the overhead of a long miss penalty means any memory access can result in a CPU interrupt. This also means the CPU must be able to recover any memory address that can cause such an interrupt, so that the system can know what to transfer to satisfy the miss (see Section 5.6). When the memory transfer is complete, the original process is restored, and the instruction that missed is retried.

The processor must also have some mechanism to determine whether or not information is in the top level of the memory hierarchy. This check happens on every memory access and affects hit time; maintaining acceptable performance usually requires the check to be implemented in hardware. The final implication of a memory hierarchy is that the computer must have a mechanism to transfer blocks between upper- and lower-level memory. If the block transfer takes tens of clock cycles, it is controlled by hardware; if it takes thousands of clock cycles, it can be controlled by software.
Four Questions for Classifying Memory Hierarchies

The fundamental principles that drive all memory hierarchies allow us to use terms that transcend the levels we are talking about. These same principles allow us to pose four questions about any level of the hierarchy:

Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

These questions will help us gain an understanding of the different tradeoffs demanded by the relationships of memories at different levels of a hierarchy.
8.3 Caches

Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language,
Second College Edition (1976)

Cache is the name first chosen to represent the level of the memory hierarchy between the CPU and main memory, and that is the dominant use of the term. While the concept of caches is younger than the IBM 360 architecture, caches appear today in every class of computer and in some computers more than once. In fact, the word has become so popular that it has replaced "buffer" in many computer-science circles.

The general terms defined in the prior section can be used for caches, although the word line is often used instead of block. Figure 8.5 shows the typical range of memory-hierarchy parameters for caches.
Block (line) size     4 - 128 bytes
Hit time              1 - 4 clock cycles (normally 1)
Miss penalty          8 - 32 clock cycles
  (Access time)       (6 - 10 clock cycles)
  (Transfer time)     (2 - 22 clock cycles)
Miss rate             1% - 20%
Cache size            1 KB - 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in 1990 workstations and minicomputers.
Now let's examine caches in more detail by answering the four memory-hierarchy questions.
Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache organization:

- If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (block-frame address) modulo (number of blocks in cache).

- If a block can be placed anywhere in the cache, the cache is said to be fully associative.
- If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of two or more blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within the set. The set is usually chosen by bit selection; that is, (block-frame address) modulo (number of sets in cache). If there are n blocks in a set, the cache placement is called n-way set associative.
The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative, and a fully associative cache with m blocks could be called m-way set associative. Figure 8.6 shows where block 12 can be placed in a cache according to the block-placement policy.
[Figure: three diagrams of an 8-block cache (blocks 0-7). Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4), with sets 0-3 of two blocks each.]

FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-associative organization has 4 sets with 2 blocks per set, called two-way set associative. (Real caches contain hundreds of blocks and real memories contain hundreds of thousands of blocks.) Assume that there is nothing in the cache and that the block-frame address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the 8 blocks of the cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.
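All three placement policies reduce to modulo arithmetic on the block-frame address. A sketch in C reproducing Figure 8.6's example (the program is illustrative; the parameters come from the figure):

    #include <stdio.h>

    #define CACHE_BLOCKS 8   /* Figure 8.6: a cache of 8 blocks          */
    #define SETS         4   /* two-way set associative: 8 blocks / 2    */

    int main(void)
    {
        unsigned frame = 12;                       /* block-frame address */

        printf("direct mapped:     block %u\n", frame % CACHE_BLOCKS); /* 4 */
        printf("set associative:   set %u\n",   frame % SETS);         /* 0 */
        printf("fully associative: any of the %d blocks\n", CACHE_BLOCKS);
        return 0;
    }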
Q2: How Is a Block Found If It Is in the Cache?

Caches include an address tag on each block that gives the block-frame address. The tag of every cache block that might contain the desired information is checked to see if it matches the block-frame address from the CPU. Figure 8.7 gives an example. Because speed is of the essence, all possible tags are searched in parallel; serial search would make set associativity counterproductive.
[Figure: the same three 8-block caches, showing which tags are searched for block-frame address 12. Fully associative: all 8 tags. Direct mapped: a single tag. Set associative: the two tags in set 0.]

FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is found in cache block 6 in this example. In direct-mapped placement there is only one cache block where memory block 12 can be found. In set-associative placement, with 4 sets, memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are checked. In this case the data is found in cache block 1. Speed of cache access dictates that searching must be performed in parallel for fully associative and set-associative mappings.
There must be a way to know that a cache block does not have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the tag memory. One tag is required for each block. An advantage of larger block sizes is that the tag overhead per cache entry becomes a smaller fraction of the total cost of the cache.

Before answering the next question, let's explore the relationship of a CPU address to the cache. Figure 8.8 shows how an address is divided into three fields to find data in a set-associative cache: the block-offset field used to select the desired data from the block, the index field used to select the set, and the tag field used for the comparison. While the comparison could be made on more of the address than the tag, there is no need:
- Checking the index would be redundant, since it was used to select the set to be checked (an address stored in set 0, for example, must have 0 in the index field or it couldn't be stored in set 0).

- The offset is unnecessary in the comparison because all block offsets match and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with increasing associativity.

[Figure: an address divided into three fields, left to right: tag, index, block offset.]

FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set and the index is used to select the set. The block offset is the address of the desired data within the block.
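The three fields of Figure 8.8 are carved out of an address with shifts and masks. A hedged C sketch, assuming illustrative parameters (16-byte blocks and 128 sets; neither comes from the text):

    #include <stdint.h>

    #define OFFSET_BITS 4u   /* log2(16-byte block), assumed */
    #define INDEX_BITS  7u   /* log2(128 sets), assumed      */

    struct fields { uint32_t tag, index, offset; };

    static struct fields split(uint32_t addr)
    {
        struct fields f;
        f.offset = addr & ((1u << OFFSET_BITS) - 1);    /* select data in block */
        f.index  = (addr >> OFFSET_BITS)
                 & ((1u << INDEX_BITS) - 1);            /* select the set       */
        f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* compared against the
                                                           stored tags          */
        return f;
    }

Doubling the associativity at the same total size halves the number of sets, so INDEX_BITS shrinks by one and the tag grows by one bit: the rightward movement of the tag/index boundary just described.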
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't, then it would be easy to select which block to replace. Alas, the high hit rate of caches means that the overwhelming decision is between blocks that have valid data.

A benefit of direct-mapped placement is that hardware decisions are simplified. In fact, so simple that there is no choice: Only one block is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are several blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:

- Random—To spread allocation uniformly, candidate blocks are randomly selected. Some systems use a scheme for spreading data across a set of blocks in a pseudorandomized manner to get reproducible behavior, which is particularly useful during hardware debugging.

- Least-recently used (LRU)—To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least recently used. Figure 8.9 (page 412) shows which block is the least-recently used for a sequence of block-frame addresses in a fully associative memory hierarchy.
A virtue of random is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 8.10 shows the difference in miss rates between LRU and random replacement. Replacement policy plays a greater role in smaller caches than in larger caches where there are more choices of what to replace.
[Figure: a sequence of block-frame addresses, with the LRU block number shown below each new block reference.]

FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in a fully associative memory hierarchy. This assumes that there are 4 blocks and that in the beginning the LRU block is number 0. The LRU block number is shown below each new block reference. Another policy, First-in-first-out (FIFO), simply discards the block that was used N unique accesses before, independent of its reference pattern in the last N - 1 references. Random replacement generally outperforms FIFO and it is easier to implement.
Associativity:      2-way               4-way               8-way
Size          LRU      Random     LRU      Random     LRU      Random
16 KB         5.18%    5.69%      4.67%    5.29%      4.39%    4.96%
64 KB         1.88%    2.01%      1.54%    1.66%      1.39%    1.53%
256 KB        1.15%    1.17%      1.13%    1.13%      1.12%    1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement for several sizes and associativities. This data was collected for a block size of 16 bytes using one of the VAX traces containing user and operating system code (SAVE0). This trace is included in the software supplement for course use. There is little difference between LRU and random for larger size caches in this trace.
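A hedged sketch of the two policies for one set, keeping LRU exact by storing a last-use timestamp per block (real hardware usually approximates this, as noted above; the structure and names are invented for illustration):

    #include <stdlib.h>

    #define WAYS 4

    typedef struct {
        unsigned long last_use[WAYS];  /* time of each block's latest access */
    } cache_set;

    /* Random: spread allocation uniformly across the set. */
    static int victim_random(void)
    {
        return rand() % WAYS;
    }

    /* LRU: evict the block unused for the longest time. */
    static int victim_lru(const cache_set *s)
    {
        int v = 0;
        for (int i = 1; i < WAYS; i++)
            if (s->last_use[i] < s->last_use[v])
                v = i;
        return v;
    }

Note the cost asymmetry the text describes: victim_random needs no per-block state at all, while victim_lru must record and scan a timestamp for every block in the set.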
Q4: What Happens on a Write?

Reads dominate cache accesses. All instruction accesses are reads, and most instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9% stores and 17% loads for four DLX programs, making writes less than 10% of the memory traffic. Making the common case fast means optimizing caches for reads, but Amdahl's Law reminds us that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read at the same time that the tag is read and compared, so the block read begins as soon as the block-frame address is available. If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there is no benefit—but also no harm.
Such is not the case for writes. The processor specifies the size of the write, usually between 1 and 8 bytes; only that portion of a block can be changed. In general this means a read-modify-write sequence of operations on the block: read the original block, modify one portion, and write the new block value. Moreover, modifying a block cannot begin until the tag is checked to see if it is a hit. Because tag checking cannot occur in parallel, writes normally take longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are two basic options when writing to the cache:

- Write through (or store through)—The information is written to both the block in the cache and to the block in the lower-level memory.

- Write back (also called copy back or store in)—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

Write-back cache blocks are called clean or dirty, depending on whether the information in the cache differs from that in lower-level memory. To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether or not the block was modified while in the cache. If it wasn't, the block is not written, since the lower level has the same information as the cache.

Both write back and write through have their advantages. With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since every write doesn't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses don't result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that main memory has the most current copy of the data. This is important in multiprocessors and for I/O, which we shall examine in Section 8.8. Hence, multiprocessors want write back to reduce the memory traffic per processor and write through to keep the cache and memory consistent.

When the CPU must wait for writes to complete during write through, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue while the memory is updated. As we shall see in Section 8.8, write stalls can occur even with write buffers.
There are two options on a write miss:

- Write allocate (also called fetch on write)—The block is loaded, followed by the write-hit actions above. This is similar to a read miss.

- No write allocate (also called write around)—The block is modified in the lower level and not loaded into the cache.
While either write-miss policy could be used with write through or write back, generally write-back caches use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no write allocate (since subsequent writes to that block will still have to go to memory).
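A sketch of how the four choices fit together on a single write, using hypothetical helper names (cache_lookup, cache_fill, lower_write) rather than any real interface:

    #include <stdbool.h>

    typedef struct { bool valid, dirty; /* tag, data, ... */ } line;

    /* Hypothetical helpers, assumed to exist elsewhere in a simulator. */
    extern line *cache_lookup(unsigned addr);            /* NULL on a miss   */
    extern line *cache_fill(unsigned addr);              /* load the block   */
    extern void  lower_write(unsigned addr, unsigned v); /* e.g., via buffer */

    void cache_write(unsigned addr, unsigned v,
                     bool write_back, bool write_allocate)
    {
        line *l = cache_lookup(addr);
        if (!l && write_allocate)
            l = cache_fill(addr);      /* write allocate: fetch on write     */

        if (l) {                       /* hit, or freshly allocated block    */
            /* ... modify only the addressed portion of the block ... */
            if (write_back)
                l->dirty = true;       /* goes to memory when replaced       */
            else
                lower_write(addr, v);  /* write through: update both levels  */
        } else {
            lower_write(addr, v);      /* no write allocate: write around    */
        }
    }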
An Example Cache: The VAX-11/780 Cache

To give substance to these ideas, Figure 8.11 shows the organization of the cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte blocks with two-way-set-associative placement, random replacement, write through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11. (The five steps are shown as circled numbers.) The address coming into the cache is divided into two fields: the 29-bit block-frame address and 3-bit block offset. The block-frame address is further divided into an address tag and cache index. Step 1 shows this division.
The cache index selects the set to be tested to see if the block is in the cache. (A set is one block from each bank in Figure 8.11.) The size of the index depends on cache size, block size, and set associativity. In this case, a 9-bit index results:

    Blocks/Bank = Cache size / (Block size * Set associativity)
                = 8192 / (8 * 2) = 512 = 2^9

In a two-way-set-associative cache, the index is sent to both banks. This is step 2.
After reading an address tag from each bank, the tag portion of the block-frame address is compared to the tags. This is step 3 in the figure. To be sure the tag contains valid information, the valid bit must be set, or the results of the comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to select the block from the matching set. Why can't both tags match? It is the job of the replacement algorithm to make sure that an address appears in only one block. To reduce the hit time, the data is read at the same time as the address tags; thus, by the time the block multiplexer is ready, the data is also ready. This step is needed in set-associative caches, but it can be omitted from direct-mapped caches since there is no selection to be made. The multiplexer used in this step can be on the critical timing path, endangering the clock cycle time of the CPU. (The example on pages 418-419 and the fallacy on page 481 explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a single CPU clock cycle.
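The five steps map onto code directly. A hedged C sketch of the read-hit path with the parameters given above (512 sets of two blocks, 8-byte blocks, a 20-bit tag for a 32-bit address); the data structures and names are invented for illustration, and real hardware does all of this in parallel rather than in a loop:

    #include <stdbool.h>
    #include <stdint.h>

    #define BANKS 2     /* two-way set associative                */
    #define SETS  512   /* 8192 bytes / (8-byte blocks * 2 banks) */

    typedef struct { bool valid; uint32_t tag; uint8_t data[8]; } vax_line;
    static vax_line bank[BANKS][SETS];

    /* Returns true on a hit and delivers the addressed byte. */
    bool vax_read(uint32_t addr, uint8_t *out)
    {
        uint32_t offset = addr & 0x7;          /* step 1: 3-bit block offset */
        uint32_t index  = (addr >> 3) & 0x1FF; /* step 1: 9-bit cache index  */
        uint32_t tag    = addr >> 12;          /* step 1: 20-bit address tag */

        for (int b = 0; b < BANKS; b++) {      /* step 2: index both banks   */
            vax_line *l = &bank[b][index];
            if (l->valid && l->tag == tag) {   /* step 3: compare valid tags */
                *out = l->data[offset];        /* step 4: select the matching
                                                  set; step 5: send to CPU   */
                return true;
            }
        }
        return false;   /* miss: stall the CPU and fetch the block */
    }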
What happens on a miss? The cache sends a stall signal to the CPU telling it to wait, and two words (eight bytes) are read from memory. That takes 6 clock cycles on the VAX-11/780 (ignoring bus interference). When the data arrives,
the cache must pick a block to replace: the VAX-11/780 selects one of the two blocks at random. Replacing a block means updating the data, the address tag, and the valid bit. Once this is done, the cache goes through a regular hit cycle and returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If the word to be written is in the cache, the first four steps are the same. The next step is to write the data in the block, then write the changed-data portion into the
[Figure 8.11: the organization of the VAX-11/780 cache. The CPU address is divided into a block-frame address (a 20-bit tag plus a 9-bit cache index) and a 3-bit block offset; the index selects one block from each of the two banks, whose tags are compared in parallel.]