Computer Architecture: A Quantitative Approach

David A. Patterson
UNIVERSITY OF CALIFORNIA AT BERKELEY

John L. Hennessy
STANFORD UNIVERSITY

With a Contribution by
David Goldberg
Xerox Palo Alto Research Center

MORGAN KAUFMANN PUBLISHERS, INC.
SAN MATEO, CALIFORNIA
Sponsoring Editor Bruce Spatz
Production Manager Shirley Jowell
Technical Writer Walker Cunningham
Text Design Gary Head
Cover Design David Lance Goines
Copy Editor Linda Medoff
Proofreader Paul Medoff
Computer Typesetting and Graphics Fifth Street Computer Services
Library of Congress Cataloging-in-Publication Data

Patterson, David A.
  Computer architecture : a quantitative approach / David A. Patterson, John L. Hennessy.
  p. cm.
  Includes bibliographical references.
  ISBN 1-55860-069-8
  1. Computer architecture. I. Hennessy, John L. II. Title.
  QA76.9.A73P377 1990
  004.2'2--dc20  89-85227  CIP
Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, recording, or otherwise—without the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained herein is copyrighted by the publisher and may not be incorporated in other publications or distributed by media without formal acknowledgement and written consent from the publisher. Use of the DLX in other publications for educational purposes is encouraged and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to the editorial offices of Morgan Kaufmann Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (U.S.) per correction upon availability of the new printing. Electronic mail can be sent to bugs3@vsop.stanford.edu. (Please include your full name and permanent mailing address.)

INSTRUCTOR SUPPORT: For information on classroom software and other instructor materials available to adopters, please contact the editorial offices of Morgan Kaufmann Publishers, Inc. (415) 578-9911.

Fourth Printing
To Andrea, Linda, and our four sons
Ideally one would desire an indefinitely large memory capacity such that any particular ... word would be immediately available. ... We are ... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann,
Preliminary Discussion of the Logical Design
of an Electronic Computing Instrument (1946)
8.1   Introduction: Principle of Locality                         403
8.2   General Principles of Memory Hierarchy                      404
8.3   Caches                                                      408
8.4   Main Memory                                                 425
8.5   Virtual Memory                                              432
8.6   Protection and Examples of Virtual Memory                   438
8.7   More Optimizations Based on Program Behavior                449
8.8   Advanced Topics—Improving Cache-Memory Performance          454
8.9   Putting It All Together: The VAX-11/780 Memory Hierarchy    475
8.10  Fallacies and Pitfalls                                      480
8.11  Concluding Remarks                                          484
8.12  Historical Perspective and References                       485
      Exercises                                                   490
8 Memory-Hierarchy Design
8.1 Introduction: Principle of Locality

Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. As the 90/10 rule in the first chapter predicts, most programs fortunately do not access all code or data uniformly (see Section 1.3, pages 8-12). The 90/10 rule can be restated as the principle of locality. This hypothesis, which holds that all programs favor a portion of their address space at any instant of time, has two dimensions:

■ Temporal locality (locality in time)—If an item is referenced, it will tend to be referenced again soon.

■ Spatial locality (locality in space)—If an item is referenced, nearby items will tend to be referenced soon.

A memory hierarchy is a natural reaction to locality and technology. The principle of locality and the guideline that smaller hardware is faster yield the concept of a hierarchy based on different speeds and sizes. Since slower memory is cheaper, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the level below. The levels of the hierarchy subset one another; all data in one level is also found in the level below, and all data in that lower level is found in the one below it, and so on until we reach the bottom of the hierarchy.
This chapter includes a half-dozen examples that demonstrate how taking advantage of the principle of locality can improve performance. All these strategies map addresses from a larger memory to a smaller but faster memory. As part of address mapping, the memory hierarchy is usually given the responsibility of address checking; protection schemes used for doing this are covered in this chapter. Later we will explore advanced memory hierarchy topics and trace a memory access through three levels of memory on the VAX-11/780.

8.2 General Principles of Memory Hierarchy
Before proceeding with examples of the memory hierarchy, let's define some general terms applicable to all memory hierarchies. A memory hierarchy normally consists of many levels, but it is managed between two adjacent levels at a time. The upper level—the one closer to the processor—is smaller and faster than the lower level (see Figure 8.1). The minimum unit of information that can be either present or not present in the two-level hierarchy is called a block. The size of a block may be either fixed or variable. If it is fixed, the memory size is a multiple of that block size. Most of this chapter will be concerned with fixed block sizes, although a variable block design is discussed in Section 8.6.

Success or failure of an access to the upper level is designated as a hit or a miss: A hit is a memory access found in the upper level, while a miss means it is not found in that level. Hit rate, or hit ratio—like a batting average—is the fraction of memory accesses found in the upper level. This is sometimes represented as a percentage. Miss rate (1.0 − hit rate) is the fraction of memory accesses not found in the upper level.
[Figure 8.1: a processor above a pair of hierarchy levels, with blocks as the unit of information moved between them.]

FIGURE 8.1 Every pair of levels in the memory hierarchy can be thought of as having an upper and lower level. Within each level the unit of information that is present or not is called a block.
Since performance is the major reason for having a memory hierarchy, the speed of hits and misses is important. Hit time is the time to access the upper level of the memory hierarchy, which includes the time to determine whether the access is a hit or a miss. Miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the requesting device (normally the CPU). The miss penalty is further divided into two components: access time—the time to access the first word of a block on a miss; and transfer time—the additional time to transfer the remaining words in the block. Access time is related to the latency of the lower-level memory, while transfer time is related to the bandwidth between the lower-level and upper-level memories. (Sometimes access latency is used to mean access time.)

The memory address is divided into pieces that access each part of the hierarchy. The block-frame address is the higher-order piece of the address that identifies a block at that level of the hierarchy (see Figure 8.2). The block-offset address is the lower-order piece of the address and identifies an item within a block. The size of the block-offset address is log2(size of block); the size of the block-frame address is then the size of the full address at this level less the size of the block-offset address.

Block-frame address         Block-offset address
01011010001000001001010     111001110

FIGURE 8.2 Example of the frame address and offset address portions of a 32-bit lower-level memory address. In this case the block size is 512, making the size of the offset address 9 bits and the size of the block-frame address 23 bits.
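To make the division concrete, the split can be sketched in C. This sketch is not from the text; the address value is arbitrary, and the constants follow the Figure 8.2 parameters (512-byte blocks within a 32-bit address):

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE  512u   /* bytes */
    #define OFFSET_BITS 9u     /* log2(BLOCK_SIZE) */

    int main(void) {
        uint32_t addr = 0xB441095Cu;                      /* arbitrary example */
        uint32_t block_offset = addr & (BLOCK_SIZE - 1u); /* low 9 bits   */
        uint32_t block_frame  = addr >> OFFSET_BITS;      /* high 23 bits */
        printf("block frame = 0x%06X, block offset = 0x%03X\n",
               block_frame, block_offset);
        return 0;
    }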
Evaluating Performance of a Memory Hierarchy

Because instruction count is independent of the hardware, it is tempting to evaluate CPU performance using that number. As we saw in Chapters 2 and 4, however, such indirect performance measures have waylaid many a computer designer. The corresponding temptation for evaluating memory-hierarchy performance is to concentrate on miss rate, for it, too, is independent of the speed of the hardware. As we shall see, miss rate can be just as misleading as instruction count. A better measure of memory-hierarchy performance is the average time to access memory:

Average memory-access time = Hit time + Miss rate * Miss penalty
The components of average access time can be measured either in absolute time—say, 10 nanoseconds on a hit—or in the number of clock cycles that the CPU waits for the memory—such as a miss penalty of 12 clock cycles. Remember that average memory-access time is still an indirect measure of performance; so while it is a better measure than miss rate, it is not a substitute for execution time.
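As an illustration (not from the text), the formula can be evaluated directly. The hit time and miss rate below are hypothetical; the 12-clock-cycle miss penalty echoes the example in the prose:

    #include <stdio.h>

    int main(void) {
        double hit_time     = 2.0;    /* clock cycles, hypothetical */
        double miss_rate    = 0.05;   /* 5% of accesses miss, hypothetical */
        double miss_penalty = 12.0;   /* clock cycles, as in the prose */
        printf("Average memory-access time = %.2f clock cycles\n",
               hit_time + miss_rate * miss_penalty);   /* prints 2.60 */
        return 0;
    }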
The relationship of block size to miss penalty and miss rate is shown abstractly in Figure 8.3. These representations assume that the size of the upper-level memory does not change. The access-time portion of the miss penalty is not affected by block size, but the transfer time does increase with block size. If access time is large, initially there will be little additional miss penalty relative to access time as block size increases. However, increasing block size means fewer blocks in the upper-level memory. Increasing block size lowers the miss rate until the reduced misses of larger blocks (spatial locality) are outweighed by the increased misses as the number of blocks shrinks (temporal locality).
[Figure 8.3: three sketches plotted against block size: transfer time, miss penalty, and miss rate.]

FIGURE 8.3 Block size versus miss penalty and miss rate. The transfer-time portion of the miss penalty obviously grows with increasing block size. For a fixed-size upper-level memory, miss rates fall with increasing block size until so much of the block is not used that it displaces useful information in the upper level, and miss rates begin to rise. The point on the curve on the right where miss rates begin to rise with increasing block size is sometimes called the pollution point.

[Figure 8.4: average access time plotted against block size.]

FIGURE 8.4 The relationship between average memory-access time and block size.
The goal of a memory hierarchy is to reduce execution time, not misses. Hence, computer designers favor a block size with the lowest average access time rather than the lowest miss rate. This is related to the product of miss rate and miss penalty, as Figure 8.4 shows abstractly. Of course, overall CPU performance is the ultimate performance test, so care must be taken when reducing average memory-access time to be sure that changes to clock cycle time and CPI improve overall performance as well as average memory-access time.
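The tradeoff can be sketched numerically. The following C fragment is illustrative only: the miss rates are invented to mimic the curves of Figure 8.3, and the miss penalty is modeled as a fixed access time plus one cycle per word transferred. With these made-up numbers the block size with the lowest average access time (4 words) is not the one with the lowest miss rate (8 words):

    #include <stdio.h>

    int main(void) {
        double hit_time = 1.0, access_time = 10.0, cycles_per_word = 1.0;
        int    words[]     = {1, 2, 4, 8, 16};                     /* block sizes */
        double miss_rate[] = {0.070, 0.045, 0.032, 0.028, 0.034};  /* invented    */

        for (int i = 0; i < 5; i++) {
            double penalty = access_time + cycles_per_word * words[i];
            double avg     = hit_time + miss_rate[i] * penalty;
            printf("block = %2d words: penalty %4.1f  avg %.3f\n",
                   words[i], penalty, avg);
        }
        return 0;
    }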
Implications of a Memory Hierarchy to the CPU

Processors designed without a memory hierarchy are simpler because memory accesses always take the same amount of time. Misses in a memory hierarchy mean that the CPU must be able to handle variable memory-access times. If the miss penalty is on the order of tens of clock cycles, the processor normally waits for the memory transfer to complete; if it is on the order of thousands of processor clock cycles, it is too wasteful to let the CPU sit idle, and in this case the CPU is interrupted and used for another process during the miss handling. Thus, avoiding the overhead of a long miss penalty means any memory access can result in a CPU interrupt. This also means the CPU must be able to recover any memory address that can cause such an interrupt, so that the system can know what to transfer to satisfy the miss (see Section 5.6). When the memory transfer is complete, the original process is restored, and the instruction that missed is retried.

The processor must also have some mechanism to determine whether or not information is in the top level of the memory hierarchy. This check happens on every memory access and affects hit time; maintaining acceptable performance usually requires the check to be implemented in hardware. The final implication of a memory hierarchy is that the computer must have a mechanism to transfer blocks between upper- and lower-level memory. If the block transfer is tens of clock cycles, it is controlled by hardware; if it is thousands of clock cycles, it can be controlled by software.
Four Questions for Classifying Memory Hierarchies

The fundamental principles that drive all memory hierarchies allow us to use terms that transcend the levels we are talking about. These same principles allow us to pose four questions about any level of the hierarchy:

Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

These questions will help us gain an understanding of the different tradeoffs demanded by the relationships of memories at different levels of a hierarchy.
8.3 Caches

Cache: a safe place for hiding or storing things.
Webster's New World Dictionary of the American Language,
Second College Edition (1976)

Cache is the name first chosen to represent the level of the memory hierarchy between the CPU and main memory, and that is the dominant use of the term. While the concept of caches is younger than the IBM 360 architecture, caches appear today in every class of computer and in some computers more than once. In fact, the word has become so popular that it has replaced "buffer" in many computer-science circles.

The general terms defined in the prior section can be used for caches, although the word line is often used instead of block. Figure 8.5 shows the typical range of memory-hierarchy parameters for caches.
Block (line) size    4 - 128 bytes
Hit time             1 - 4 clock cycles (normally 1)
Miss penalty         8 - 32 clock cycles
  (Access time)      (6 - 10 clock cycles)
  (Transfer time)    (2 - 22 clock cycles)
Miss rate            1% - 20%
Cache size           1 KB - 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in 1990 workstations and minicomputers.
Now let's examine caches in more detail by answering the four memory-hierarchy questions.
Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache organization:

■ If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (block-frame address) modulo (number of blocks in cache).

■ If a block can be placed anywhere in the cache, the cache is said to be fully associative.
■ If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of two or more blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within the set. The set is usually chosen by bit selection; that is, (block-frame address) modulo (number of sets in cache). If there are n blocks in a set, the cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative and a fully associative cache with m blocks could be called m-way set associative. Figure 8.6 shows where block 12 can be placed in a cache according to the block-placement policy.
[Figure 8.6: where block 12 may be placed in an 8-block cache under each organization: anywhere (fully associative), only block 4 (direct mapped, 12 mod 8), or anywhere in set 0 (set associative, 12 mod 4), above a 32-block memory.]

FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-associative organization has 4 sets with 2 blocks per set, called two-way set associative. (Real caches contain hundreds of blocks and real memories contain hundreds of thousands of blocks.) Assume that there is nothing in the cache and that the block-frame address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the 8 blocks of the cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.
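The three placement rules are easy to state in code. This C sketch is not from the text; it simply reproduces the Figure 8.6 arithmetic for lower-level block 12 and an 8-block cache:

    #include <stdio.h>

    int main(void) {
        int cache_blocks = 8;    /* as in Figure 8.6     */
        int block_frame  = 12;   /* lower-level block 12 */

        /* Direct mapped: exactly one candidate cache block. */
        printf("direct mapped: block %d\n", block_frame % cache_blocks); /* 4 */

        /* Two-way set associative: 4 sets of 2; anywhere within the set. */
        int sets = cache_blocks / 2;
        int set  = block_frame % sets;   /* 12 mod 4 = 0 */
        printf("2-way set associative: set %d (blocks %d or %d)\n",
               set, 2 * set, 2 * set + 1);

        /* Fully associative: any of the 8 cache blocks. */
        printf("fully associative: blocks 0 to %d\n", cache_blocks - 1);
        return 0;
    }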
Q2: How Is a Block Found If It Is in the Cache?

Caches include an address tag on each block that gives the block-frame address. The tag of every cache block that might contain the desired information is checked to see if it matches the block-frame address from the CPU. Figure 8.7 gives an example. Because speed is of the essence, all possible tags are searched in parallel; serial search would make set associativity counterproductive.

[Figure 8.7: the tags searched in an 8-block cache: all 8 blocks (fully associative), a single block (direct mapped), and the two blocks of one set (set associative).]

FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is found in cache block 6 in this example. In direct-mapped placement there is only one cache block where memory block 12 can be found. In set-associative placement, with 4 sets, memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are checked. In this case the data is found in cache block 1. Speed of cache access dictates that searching must be performed in parallel for fully associative and set-associative mappings.

There must be a way to know that a cache block does not have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the tag memory. One tag is required for each block. An advantage of increasing block sizes is that the tag overhead per cache entry becomes a smaller fraction of the total cost of the cache.

Before proceeding to the next question, let's explore the relationship of a CPU address to the cache. Figure 8.8 shows how an address is divided into three fields to find data in a set-associative cache: the block-offset field used to select the desired data from the block, the index field used to select the set, and the tag field used for the comparison. While the comparison could be made on more of the address than the tag, there is no need:
■ Checking the index would be redundant, since it was used to select the set to be checked (an address stored in set 0, for example, must have 0 in the index field or it couldn't be stored in set 0).

■ The offset is unnecessary in the comparison because all block offsets match and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with increasing associativity.

| Tag | Index | Block offset |

FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set and the index is used to select the set. The block offset is the address of the desired data within the block.
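As a concrete sketch (not from the text), the following C fragment divides an address into the three fields of Figure 8.8. The geometry is hypothetical: a 32-bit address, 64-byte blocks, and 256 sets, giving a 6-bit offset, an 8-bit index, and an 18-bit tag:

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 6u   /* log2(64-byte block) */
    #define INDEX_BITS  8u   /* log2(256 sets)      */

    int main(void) {
        uint32_t addr   = 0x01ABCDEFu;   /* arbitrary example address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1u);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag = 0x%X  index = 0x%X  offset = 0x%X\n", tag, index, offset);
        return 0;
    }

Doubling the associativity at the same total size would halve the number of sets, shrinking INDEX_BITS by one and growing the tag by one bit, which is the rightward movement of the tag/index boundary described above.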
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't, then it would be easy to select which block to replace. Alas, the high hit rate of caches means that the overwhelming decision is between blocks that have valid data.

A benefit of direct-mapped placement is that hardware decisions are simplified. In fact, so simple that there is no choice: Only one block is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are several blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:

■ Random—To spread allocation uniformly, candidate blocks are randomly selected. Some systems use a scheme for spreading data across a set of blocks in a pseudorandomized manner to get reproducible behavior, which is particularly useful during hardware debugging.

■ Least-recently used (LRU)—To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least recently used. Figure 8.9 (page 412) shows which block is the least-recently used for a sequence of block-frame addresses in a fully associative memory hierarchy; a short software sketch of the same bookkeeping follows.
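This C sketch of LRU bookkeeping is illustrative, not from the text (hardware implementations differ). The four-block fully associative memory matches the assumption of Figure 8.9, but the reference trace is made up:

    #include <stdio.h>

    #define NBLOCKS 4   /* fully associative, as in Figure 8.9 */

    int main(void) {
        int  frame[NBLOCKS];     /* block-frame address held by each block */
        long last_use[NBLOCKS];  /* "time" of each block's most recent use */
        for (int i = 0; i < NBLOCKS; i++) { frame[i] = -1; last_use[i] = -1; }

        int refs[] = {3, 2, 1, 0, 2, 3, 1, 0};   /* hypothetical trace */
        int nrefs = 8, misses = 0;

        for (long t = 0; t < nrefs; t++) {
            int hit = -1, lru = 0;
            for (int i = 0; i < NBLOCKS; i++) {
                if (frame[i] == refs[t]) hit = i;
                if (last_use[i] < last_use[lru]) lru = i;  /* oldest = LRU */
            }
            if (hit >= 0) {
                printf("ref %d: hit\n", refs[t]);
            } else {
                printf("ref %d: miss, replace LRU block %d\n", refs[t], lru);
                frame[lru] = refs[t];
                hit = lru;
                misses++;
            }
            last_use[hit] = t;   /* record the access for future decisions */
        }
        printf("%d misses\n", misses);
        return 0;
    }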
[Figure 8.9: a sequence of block-frame addresses with the current LRU block number shown below each reference; the individual values are not legible in this scan.]

FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in a fully associative memory hierarchy. This assumes that there are 4 blocks and that in the beginning the LRU block is number 0. The LRU block number is shown below each new block reference. Another policy, first-in-first-out (FIFO), simply discards the block that was used N unique accesses before, independent of its reference pattern in the last N − 1 references. Random replacement generally outperforms FIFO and it is easier to implement.

A virtue of random is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 8.10 shows the difference in miss rates between LRU and random replacement. Replacement policy plays a greater role in smaller caches than in larger caches where there are more choices of what to replace.

Associativity:    2-way               4-way               8-way
Size              LRU      Random     LRU      Random     LRU      Random
16 KB             5.18%    5.69%      4.67%    5.29%      4.39%    4.96%
64 KB             1.88%    2.01%      1.54%    1.66%      1.39%    1.53%
256 KB            1.15%    1.17%      1.13%    1.13%      1.12%    1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement for several sizes and associativities. This data was collected for a block size of 16 bytes using one of the VAX traces containing user and operating system code (SAVE). This trace is included in the software supplement for course use. There is little difference between LRU and random for larger size caches in this trace.

Q4: What Happens on a Write?

Reads dominate cache accesses. All instruction accesses are reads, and most instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9% stores and 17% loads for four DLX programs, making writes less than 10% of the memory traffic. Making the common case fast means optimizing caches for reads, but Amdahl's Law reminds us that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read at the same time that the tag is read and compared, so the block read begins as soon as the block-frame address is available. If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there is no benefit—but also no harm.

Such is not the case for writes: only the portion of a block specified by the write can be changed, which in general means a read-modify-write sequence on the block:

read the original block, modify one portion, and write the new block value. Moreover, modifying a block cannot begin until the tag is checked to see if it is a hit. Because tag checking cannot occur in parallel, then, writes normally take longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are two basic options when writing to the cache:
■ Write through (or store through)—The information is written to both the block in the cache and to the block in the lower-level memory.

■ Write back (also called copy back or store in)—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

Write-back cache blocks are called clean or dirty, depending on whether the information in the cache differs from that in lower-level memory. To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether or not the block was modified while in the cache. If it wasn't, the block is not written, since the lower level has the same information as the cache.

Both write back and write through have their advantages. With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since every write doesn't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses don't result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that main memory has the most current copy of the data. This is important in multiprocessors and for I/O, which we shall examine in Section 8.8. Hence, multiprocessors want write back to reduce the memory traffic per processor and write through to keep the cache and memory consistent.

When the CPU must wait for writes to complete during write throughs, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue while the memory is updated. As we shall see in Section 8.8, write stalls can occur even with write buffers.

There are two options on a write miss:

■ Write allocate (also called fetch on write)—The block is loaded, followed by the write-hit actions above. This is similar to a read miss.

■ No write allocate (also called write around)—The block is modified in the lower level and not loaded into the cache.
While either write-miss policy could be used with write through or write back, generally write-back caches use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no write allocate (since subsequent writes to that block will still have to go to memory).
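The two write policies can be contrasted with a small model. This C sketch is not from the text; counters stand in for lower-level memory traffic, and it shows why several writes to one block cost a single lower-level write under write back but one write each under write through:

    #include <stdio.h>
    #include <stdbool.h>

    /* One cache block; the counter stands in for lower-level write traffic. */
    typedef struct { bool dirty; long mem_writes; } Block;

    /* Write through: every write also goes to the lower-level memory. */
    void write_through(Block *b) { b->mem_writes++; }

    /* Write back: the write stays in the cache and sets the dirty bit ... */
    void write_back(Block *b) { b->dirty = true; }

    /* ... and the block goes to memory only when it is replaced. */
    void replace(Block *b) {
        if (b->dirty) { b->mem_writes++; b->dirty = false; }
    }

    int main(void) {
        Block wt = {false, 0}, wb = {false, 0};
        for (int i = 0; i < 4; i++) { write_through(&wt); write_back(&wb); }
        replace(&wb);  /* four writes to one block cost a single write back */
        printf("write through: %ld memory writes\n", wt.mem_writes);  /* 4 */
        printf("write back:    %ld memory writes\n", wb.mem_writes);  /* 1 */
        return 0;
    }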
An Example Cache: The VAX-11/780 Cache

To give substance to these ideas, Figure 8.11 shows the organization of the cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte blocks with two-way-set-associative placement, random replacement, write through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11. (The five steps are shown as circled numbers.) The address coming into the cache is divided into two fields: the 29-bit block-frame address and 3-bit block offset. The block-frame address is further divided into an address tag and cache index. Step 1 shows this division.

The cache index selects the set to be tested to see if the block is in the cache. (A set is one block from each bank in Figure 8.11.) The size of the index depends on cache size, block size, and set associativity. In this case, a 9-bit index results:
Blocks per bank = Cache size / (Block size * Set associativity) = 8192 / (8 * 2) = 512 = 2^9
In a two-way-set-associative cache, the index is sent to both banks. This is step 2.

After reading an address tag from each bank, the tag portion of the block-frame address is compared to the tags. This is step 3 in the figure. To be sure the tag contains valid information, the valid bit must be set, or the results of the comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to select the block from the matching set. Why can't both tags match? It is the job of the replacement algorithm to make sure that an address appears in only one block. To reduce the hit time, the data is read at the same time as the address tags; thus, by the time the block multiplexer is ready, the data is also ready. This step is needed in set-associative caches, but it can be omitted from direct-mapped caches since there is no selection to be made. The multiplexer used in this step can be on the critical timing path, endangering the clock cycle time of the CPU. (The example on pages 418-419 and the fallacy on page 481 explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a single CPU clock cycle.
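The five hit steps map naturally onto code. The following C sketch is illustrative, not a description of the actual VAX hardware (which performs the steps in parallel within one cycle); the structure and function names are invented:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define SETS 512u   /* 8 KB / (8-byte blocks * 2 ways) */

    typedef struct { bool valid; uint32_t tag; uint8_t data[8]; } Line;
    static Line bank[2][SETS];   /* one block per set in each of two banks */

    bool read_hit(uint32_t addr, unsigned *out) {
        uint32_t offset = addr & 0x7u;               /* step 1: split address */
        uint32_t index  = (addr >> 3) & (SETS - 1u); /* 9-bit index           */
        uint32_t tag    = addr >> 12;                /* 20-bit tag            */
        for (int b = 0; b < 2; b++) {                /* step 2: index banks   */
            Line *l = &bank[b][index];
            if (l->valid && l->tag == tag) {         /* step 3: compare tags  */
                *out = l->data[offset];              /* steps 4-5: select the
                                                        block, send the word  */
                return true;
            }
        }
        return false;   /* miss: stall the CPU, fetch 8 bytes from memory */
    }

    int main(void) {
        uint32_t addr = 0x12345678u;   /* arbitrary example address */
        /* Install a block as a miss would, then read it back. */
        bank[0][(addr >> 3) & (SETS - 1u)] =
            (Line){ true, addr >> 12, {10, 11, 12, 13, 14, 15, 16, 17} };
        unsigned byte = 0;
        if (read_hit(addr, &byte)) printf("hit: byte %u\n", byte);
        return 0;
    }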
What happens on a miss? The cache sends a stall signal to the CPU telling it to wait, and two words (eight bytes) are read from memory. That takes 6 clock cycles on the VAX-11/780 (ignoring bus interference). When the data arrives, the cache must pick a block to replace: the VAX-11/780 selects one of the two blocks at random. Replacing a block means updating the data, the address tag, and the valid bit. Once this is done, the cache goes through a regular hit cycle and returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If the word to be written is in the cache, the first four steps are the same. The next step is to write the data in the block, then write the changed-data portion into the
[Figure 8.11: the VAX-11/780 cache organization: the CPU address divided into tag, index, and block offset; two banks of valid/tag/data entries; tag comparison; and a 2:1 multiplexer on the path to the CPU, with the five read-hit steps numbered.]

FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order of occurrence, label this organization. The line from memory to the cache is used on a miss to load the cache. Multiplexing as found in step 4 is not needed in a direct-mapped cache. Note that the offset is connected to chip select of the data SRAMs to allow the proper words to be sent to the 2:1 multiplexer.