[Cover: COMPUTER ARCHITECTURE — A QUANTITATIVE APPROACH, Patterson and Hennessy]
Computer Architecture:
A Quantitative Approach

David A. Patterson
UNIVERSITY OF CALIFORNIA AT BERKELEY

John L. Hennessy
STANFORD UNIVERSITY

With a Contribution by
David Goldberg
Xerox Palo Alto Research Center

MORGAN KAUFMANN PUBLISHERS, INC.
SAN MATEO, CALIFORNIA
Sponsoring Editor Bruce Spatz
Production Manager Shirley Jowell
Technical Writer Walker Cunningham
Text Design Gary Head
Cover Design David Lance Goines
Copy Editor Linda Medoff
Proofreader Paul Medoff
Computer Typesetting and Graphics Fifth Street Computer Services

Library of Congress Cataloging-in-Publication Data

Patterson, David A.
    Computer architecture : a quantitative approach / David A.
    Patterson, John L. Hennessy
        p. cm.
    Includes bibliographical references
    ISBN 1-55860-069-8
    1. Computer architecture. 2. Electronic digital computers
    --Design and construction. I. Hennessy, John L. II. Title.
    QA76.9.A73P377 1990
    004.2'2--dc20                                        89-85227
                                                         CIP

Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

(c)1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means—electronic, mechanical, recording, or otherwise—without
the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained
herein is copyrighted by the publisher and may not be incorporated in other publications
or distributed by media without formal acknowledgement and written consent from the
publisher. Use of the DLX in other publications for educational purposes is encouraged
and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or
intended for the authors should be addressed to the editorial offices of Morgan Kaufmann
Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged.
Any error sightings that are accepted for correction in subsequent printings will be
rewarded by the authors with a payment of $1.00 (U.S.) per correction. Electronic mail
can be sent to bugs@vsop.stanford.edu.

INSTRUCTOR SUPPORT: For information on classroom software and other instructor
materials available to adopters, please contact the editorial offices of Morgan Kaufmann
Publishers, Inc.
Contents

Foreword by C. GORDON BELL

Preface

Acknowledgements

1  Fundamentals of Computer Design
   1.1   Introduction
   1.2   Definitions of Performance
   1.3   Quantitative Principles of Computer Design
   1.4   The Job of a Computer Designer
   1.5   Putting It All Together: The Concept of Memory Hierarchy
   1.6   Fallacies and Pitfalls
   1.7   Concluding Remarks
   1.8   Historical Perspective and References
         Exercises

2  Performance and Cost
   2.1   Introduction
   2.2   Performance
   2.3   Cost
   2.4   Putting It All Together: Price/Performance of Three Machines
   2.5   Fallacies and Pitfalls
   2.6   Concluding Remarks
   2.7   Historical Perspective and References
         Exercises

3  Instruction Set Design: Alternatives and Principles
   3.1   Introduction
   3.2   Classifying Instruction Set Architectures
   3.3   Operand Storage in Memory: Classifying General-Purpose Register Machines
   3.4   Memory Addressing
   3.5   Operations in the Instruction Set
   3.6   Type and Size of Operands
   3.7   The Role of High-Level Languages and Compilers
   3.8   Putting It All Together: How Programs Use Instruction Sets
   3.9   Fallacies and Pitfalls
   3.10  Concluding Remarks
   3.11  Historical Perspective and References
         Exercises
4  Instruction Set Examples and Measurements of Use
   4.1   Instruction Set Measurements: What and Why
   4.2   The VAX Architecture
   4.3   The 360/370 Architecture
   4.4   The 8086 Architecture
   4.5   The DLX Architecture
   4.6   Putting It All Together: Measurements of Instruction Set Usage
   4.7   Fallacies and Pitfalls
   4.8   Concluding Remarks
   4.9   Historical Perspective and References
         Exercises

5  Basic Processor Implementation Techniques
   5.1   Introduction
   5.2   Processor Datapath
   5.3   Basic Steps of Execution
   5.4   Hardwired Control
   5.5   Microprogrammed Control
   5.6   Interrupts and Other Entanglements
   5.7   Putting It All Together: Control for DLX
   5.8   Fallacies and Pitfalls
   5.9   Concluding Remarks
   5.10  Historical Perspective and References
         Exercises

6  Pipelining
   6.1   What Is Pipelining?
   6.2   The Basic Pipeline for DLX
   6.3   Making the Pipeline Work
   6.4   The Major Hurdle of Pipelining—Pipeline Hazards
   6.5   What Makes Pipelining Hard to Implement
   6.6   Extending the DLX Pipeline to Handle Multicycle Operations
   6.7   Advanced Pipelining—Dynamic Scheduling in Pipelines
   6.8   Advanced Pipelining—Taking Advantage of More Instruction-Level Parallelism
   6.9   Putting It All Together: A Pipelined VAX
   6.10  Fallacies and Pitfalls
   6.11  Concluding Remarks
   6.12  Historical Perspective and References
         Exercises
7  Vector Processors
   7.1   Why Vector Machines?
   7.2   Basic Vector Architecture
   7.3   Two Real-World Issues: Vector Length and Stride
   7.4   A Simple Model for Vector Performance
   7.5   Compiler Technology for Vector Machines
   7.6   Enhancing Vector Performance
   7.7   Putting It All Together: Evaluating the Performance of Vector Processors
   7.8   Fallacies and Pitfalls
   7.9   Concluding Remarks
   7.10  Historical Perspective and References
         Exercises

8  Memory-Hierarchy Design
   8.1   Introduction: Principle of Locality
   8.2   General Principles of Memory Hierarchy
   8.3   Caches
   8.4   Main Memory
   8.5   Virtual Memory
   8.6   Protection and Examples of Virtual Memory
   8.7   More Optimizations Based on Program Behavior
   8.8   Advanced Topics—Improving Cache-Memory Performance
   8.9   Putting It All Together: The VAX-11/780 Memory Hierarchy
   8.10  Fallacies and Pitfalls
   8.11  Concluding Remarks
   8.12  Historical Perspective and References
         Exercises

9  Input/Output
   9.1   Introduction
   9.2   Predicting System Performance
   9.3   I/O Performance Measures
   9.4   Types of I/O Devices
   9.5   Buses—Connecting I/O Devices to CPU/Memory
   9.6   Interfacing to the CPU
   9.7   Interfacing to an Operating System
   9.8   Designing an I/O System
   9.9   Putting It All Together: The IBM 3990 Storage Subsystem
   9.10  Fallacies and Pitfalls
   9.11  Concluding Remarks
   9.12  Historical Perspective and References
         Exercises

10 Future Directions
   10.1  Introduction
   10.2  Flynn Classification of Computers
   10.3  SIMD Computers—Single Instruction Stream, Multiple Data Streams
   10.4  MIMD Computers—Multiple Instruction Streams, Multiple Data Streams
   10.5  The Roads to El Dorado
   10.6  Special-Purpose Processors
   10.7  Future Directions for Compilers
   10.8  Putting It All Together: The Sequent Symmetry Multiprocessor
   10.9  Fallacies and Pitfalls
   10.10 Concluding Remarks—Evolution Versus Revolution in Computer Architecture
   10.11 Historical Perspective and References
         Exercises

Appendix A: Computer Arithmetic
by DAVID GOLDBERG, Xerox Palo Alto Research Center
   A.1   Introduction
   A.2   Basic Techniques of Integer Arithmetic
   A.3   Floating Point
   A.4   Floating-Point Addition
   A.5   Floating-Point Multiplication
   A.6   Division and Remainder
   A.7   Precisions and Exception Handling
   A.8   Speeding Up Integer Addition
   A.9   Speeding Up Integer Multiplication and Division
   A.10  Putting It All Together
   A.11  Fallacies and Pitfalls
   A.12  Historical Perspective and References
         Exercises

Appendix B: Complete Instruction Set Tables
   B.1   VAX User Instruction Set
   B.2   System/360 Instruction Set
   B.3   8086 Instruction Set

Appendix C: Detailed Instruction Set Measurements
   C.1   VAX Detailed Measurements
   C.2   360 Detailed Measurements
   C.3   Intel 8086 Detailed Measurements
   C.4   DLX Detailed Instruction Set Measurements

Appendix D: Time Versus Frequency Measurements
   D.1   Time Distribution on the VAX-11/780
   D.2   Time Distribution on the IBM 370/168
   D.3   Time Distribution on an 8086 in an IBM PC
   D.4   Time Distribution on a DLX Relative

Appendix E: Survey of RISC Architectures
   E.1   Introduction
   E.2   Addressing Modes and Instruction Formats
   E.3   Instructions: The DLX Subset
   E.4   Instructions: Common Extensions to DLX
   E.5   Instructions Unique to MIPS
   E.6   Instructions Unique to SPARC
   E.7   Instructions Unique to M88000
   E.8   Instructions Unique to i860
   E.9   Concluding Remarks
   E.10  References

References

Index
Ideally one would desire an indefinitely large memory capacity such that any particular . . . word would be immediately available. . . . We are . . . forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann,
Preliminary Discussion of the Logical Design
of an Electronic Computing Instrument (1946)
8.1   Introduction: Principle of Locality
8.2   General Principles of Memory Hierarchy
8.3   Caches
8.4   Main Memory
8.5   Virtual Memory
8.6   Protection and Examples of Virtual Memory
8.7   More Optimizations Based on Program Behavior
8.8   Advanced Topics—Improving Cache-Memory Performance
8.9   Putting It All Together: The VAX-11/780 Memory Hierarchy
8.10  Fallacies and Pitfalls
8.11  Concluding Remarks
8.12  Historical Perspective and References
      Exercises
8  Memory-Hierarchy Design
8.1  Introduction: Principle of Locality

Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. As the 90/10 rule in the first chapter predicts, most programs fortunately do not access all code or data uniformly (see Section 1.3, pages 8-12). The 90/10 rule can be restated as the principle of locality. This hypothesis, which holds that all programs favor a portion of their address space at any instant of time, has two dimensions:

■ Temporal locality (locality in time)—If an item is referenced, it will tend to be referenced again soon.

■ Spatial locality (locality in space)—If an item is referenced, nearby items will tend to be referenced soon.

A memory hierarchy is a natural reaction to locality and technology. The principle of locality and the guideline that smaller hardware is faster yield the concept of a hierarchy based on different speeds and sizes. Since slower memory is cheaper, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the level below. The levels of the hierarchy subset one another; all data in one level is also found in the level below, and all data in that lower level is found in the one below it, and so on until we reach the bottom of the hierarchy.
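As a small illustration of the two dimensions (our example, not the book's), consider how even a trivial loop exhibits both:

    #include <stdio.h>

    int main(void) {
        int a[1024], sum = 0;

        for (int i = 0; i < 1024; i++)
            a[i] = i;        /* spatial locality: a[i] and a[i+1] are adjacent in memory */

        for (int i = 0; i < 1024; i++)
            sum += a[i];     /* temporal locality: sum and i are reused every iteration */

        printf("%d\n", sum);
        return 0;
    }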
This chapter includes a half-dozen examples that demonstrate how taking advantage of the principle of locality can improve performance. All these strategies map addresses from a larger memory to a smaller but faster memory. As part of address mapping, the memory hierarchy is usually given the responsibility of address checking; protection schemes used for doing this are covered in this chapter. Later we will explore advanced memory-hierarchy topics and trace a memory access through three levels of memory on the VAX-11/780.
8.2  General Principles of Memory Hierarchy

Before proceeding with examples of the memory hierarchy, let's define some general terms applicable to all memory hierarchies. A memory hierarchy normally consists of many levels, but it is managed between two adjacent levels at a time. The upper level—the one closer to the processor—is smaller and faster than the lower level (see Figure 8.1). The minimum unit of information that can be either present or not present in the two-level hierarchy is called a block. The size of a block may be either fixed or variable. If it is fixed, the memory size is a multiple of that block size. Most of this chapter will be concerned with fixed block sizes, although a variable block design is discussed in Section 8.6.

Success or failure of an access to the upper level is designated as a hit or a miss: A hit is a memory access found in the upper level, while a miss means it is not found in that level. Hit rate, or hit ratio—like a batting average—is the fraction of memory accesses found in the upper level. This is sometimes represented as a percentage. Miss rate (1.0 - hit rate) is the fraction of memory accesses not found in the upper level.
FIGURE 8.1 Every pair of levels in the memory hierarchy can be thought of as having an upper and lower level. Within each level the unit of information that is present or not is called a block.
Since performance is the major reason for having a memory hierarchy, the speed of hits and misses is important. Hit time is the time to access the upper level of the memory hierarchy, which includes the time to determine whether the access is a hit or a miss. Miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the requesting device (normally the CPU). The miss penalty is further divided into two components: access time—the time to access the first word of a block on a miss; and transfer time—the additional time to transfer the remaining words in the block. Access time is related to the latency of the lower-level memory, while transfer time is related to the bandwidth between the lower-level and upper-level memories. (Sometimes access latency is used to mean access time.)

The memory address is divided into pieces that access each part of the hierarchy. The block-frame address is the higher-order piece of the address that identifies a block at that level of the hierarchy (see Figure 8.2). The block-offset address is the lower-order piece of the address and identifies an item within a block. The size of the block-offset address is log2(size of block); the size of the block-frame address is then the size of the full address at this level less the size of the block-offset address.
FIGURE 8.2 Example of the frame address and offset address portions of a 32-bit lower-level memory address. In this case the block size is 512, making the size of the offset address 9 bits and the size of the block-frame address 23 bits.
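To make the arithmetic concrete, here is a minimal sketch of this split for the Figure 8.2 parameters (512-byte blocks, so 9 offset bits; the helper names are ours, not the book's):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE   512u    /* bytes per block (Figure 8.2) */
    #define OFFSET_BITS  9u      /* log2(512) = 9 */

    /* Split a 32-bit address into its block-frame and block-offset pieces. */
    static uint32_t block_frame(uint32_t addr)  { return addr >> OFFSET_BITS; }
    static uint32_t block_offset(uint32_t addr) { return addr & (BLOCK_SIZE - 1); }

    int main(void) {
        uint32_t addr = 0x12345678;
        printf("frame = 0x%06X, offset = 0x%03X\n",   /* frame is 23 bits, offset 9 */
               (unsigned)block_frame(addr), (unsigned)block_offset(addr));
        return 0;
    }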
Evaluating Performance of a Memory Hierarchy

Because instruction count is independent of the hardware, it is tempting to evaluate CPU performance using that number. As we saw in Chapters 2 and 4, however, such indirect performance measures have waylaid many a computer designer. The corresponding temptation for evaluating memory-hierarchy performance is to concentrate on miss rate, for it, too, is independent of the speed of the hardware. As we shall see, miss rate can be just as misleading as instruction count. A better measure of memory-hierarchy performance is the average time to access memory:

Average memory-access time = Hit time + Miss rate × Miss penalty

The components of average access time can be measured either in absolute time—say, 10 nanoseconds on a hit—or in the number of clock cycles that the CPU waits for the memory—such as a miss penalty of 12 clock cycles. Remember that average memory-access time is still an indirect measure of performance; so while it is a better measure than miss rate, it is not a substitute for execution time.
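As a sketch of the formula in use (the numbers are illustrative assumptions, not measurements from the book):

    #include <stdio.h>

    /* Average memory-access time = Hit time + Miss rate * Miss penalty */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Illustrative values: 10 ns hit time, 5% miss rate, 100 ns miss penalty. */
        printf("AMAT = %.1f ns\n", amat(10.0, 0.05, 100.0));  /* 10 + 0.05*100 = 15 ns */
        return 0;
    }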
The relationship of block size to miss penalty and miss rate is shown abstractly in Figure 8.3. These representations assume that the size of the upper-level memory does not change. The access-time portion of the miss penalty is not affected by block size, but the transfer time does increase with block size. If access time is large, initially there will be little additional miss penalty relative to access time as block size increases. However, increasing block size means fewer blocks in the upper-level memory. Increasing block size lowers the miss rate until the reduced misses of larger blocks (spatial locality) are outweighed by the increased misses as the number of blocks shrinks (temporal locality).
FIGURE 8.3 Block size versus miss penalty and miss rate. The transfer-time portion of the miss penalty obviously grows with increasing block size. For a fixed-size upper-level memory, miss rates fall with increasing block size until so much of the block is not used that it displaces useful information in the upper level, and miss rates begin to rise. The point on the curve on the right where miss rates begin to rise with increasing block size is sometimes called the pollution point.
FIGURE 8.4 The relationship between average memory-access time and block size.
The goal of a memory hierarchy is to reduce execution time, not misses. Hence, computer designers favor a block size with the lowest average access time rather than the lowest miss rate. This is related to the product of miss rate and miss penalty, as Figure 8.4 shows abstractly. Of course, overall CPU performance is the ultimate performance test, so care must be taken when reducing average memory-access time to be sure that changes to clock cycle time and CPI improve overall performance as well as average memory-access time.
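The tradeoff can be sketched in code. The miss-rate curve below is a made-up toy model of the behavior in Figure 8.3, not data from the book; only its shape matters here:

    #include <stdio.h>

    /* Sweep block size and report average access time, assuming:
     *   miss penalty = access time + block size / bandwidth  (transfer grows with block)
     *   miss rate    = a U-shaped toy model: spatial locality helps, pollution hurts.
     */
    int main(void) {
        const double hit_time = 1.0;    /* clock cycles (assumed) */
        const double access   = 10.0;   /* cycles to first word (assumed) */
        const double bw       = 4.0;    /* bytes transferred per cycle (assumed) */

        for (int b = 4; b <= 256; b *= 2) {
            double miss_penalty = access + (double)b / bw;
            double miss_rate    = 0.05 * (16.0 / b + b / 256.0);   /* toy model */
            double amat         = hit_time + miss_rate * miss_penalty;
            printf("block %3d B: miss rate %.3f, penalty %5.1f, AMAT %5.2f cycles\n",
                   b, miss_rate, miss_penalty, amat);
        }
        return 0;
    }

Running the sweep shows the minimum of average access time falling at an intermediate block size, which is the point a designer would favor over the block size with the lowest miss rate.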
Implications of a Memory Hierarchy to the CPU

Processors designed without a memory hierarchy are simpler because memory accesses always take the same amount of time. Misses in a memory hierarchy mean that the CPU must be able to handle variable memory-access times. If the miss penalty is on the order of tens of clock cycles, the processor normally waits for the memory transfer to complete. On the other hand, if the miss penalty is thousands of processor clock cycles, it is too wasteful to let the CPU sit idle; in this case, the CPU is interrupted and used for another process during the miss handling. Thus, avoiding the overhead of a long miss penalty means any memory access can result in a CPU interrupt. This also means the CPU must be able to recover any memory address that can cause such an interrupt, so that the system can know what to transfer to satisfy the miss (see Section 5.6). When the memory transfer is complete, the original process is restored, and the instruction that missed is retried.

The processor must also have some mechanism to determine whether or not information is in the top level of the memory hierarchy. This check happens on every memory access and affects hit time; maintaining acceptable performance usually requires the check to be implemented in hardware. The final implication of a memory hierarchy is that the computer must have a mechanism to transfer blocks between upper- and lower-level memory. If the block transfer is tens of clock cycles, it is controlled by hardware; if it is thousands of clock cycles, it can be controlled by software.
Four Questions for Classifying Memory Hierarchies

The fundamental principles that drive all memory hierarchies allow us to use terms that transcend the levels we are talking about. These same principles allow us to pose four questions about any level of the hierarchy:

Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

These questions will help us gain an understanding of the different tradeoffs demanded by the relationships of memories at different levels of a hierarchy.
8.3  Caches
Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language, Second College Edition (1976)
Cache is the name first chosen to represent the level of the memory hierarchy between the CPU and main memory, and that is the dominant use of the term. While the concept of caches is younger than the IBM 360 architecture, caches appear today in every class of computer and in some computers more than once. In fact, the word has become so popular that it has replaced "buffer" in many computer-science circles.

The general terms defined in the prior section can be used for caches, although the word line is often used instead of block. Figure 8.5 shows the typical range of memory-hierarchy parameters for caches.
    Block (line) size       4 - 128 bytes
    Hit time                1 - 4 clock cycles (normally 1)
    Miss penalty            8 - 32 clock cycles
      (Access time)         (6 - 10 clock cycles)
      (Transfer time)       (2 - 22 clock cycles)
    Miss rate               1% - 20%
    Cache size              1 KB - 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in 1990 workstations and minicomputers.
Now let's examine caches in more detail by answering the four memory-hierarchy questions.

Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache organization:

■ If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (block-frame address) modulo (number of blocks in cache).

■ If a block can be placed anywhere in the cache, the cache is said to be fully associative.

■ If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of two or more blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within the set. The set is usually chosen by bit selection; that is, (block-frame address) modulo (number of sets in cache). If there are n blocks in a set, the cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative, and a fully associative cache with m blocks could be called m-way set associative. Figure 8.6 shows where block 12 can be placed in a cache according to the block-placement policy.
FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-associative organization has 4 sets with 2 blocks per set, called two-way set associative. (Real caches contain hundreds of blocks and real memories contain hundreds of thousands of blocks.) Assume that there is nothing in the cache and that the block-frame address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the 8 blocks of the cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.
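A minimal sketch of the three placement rules for the Figure 8.6 configuration (8 cache blocks, 4 sets; the program is our illustration, not the book's):

    #include <stdio.h>

    #define CACHE_BLOCKS 8   /* blocks in the cache (Figure 8.6) */
    #define SETS         4   /* sets for the two-way set-associative case */

    int main(void) {
        int block = 12;      /* lower-level block-frame address in question */

        printf("direct mapped:     cache block %d (12 mod 8)\n", block % CACHE_BLOCKS);
        printf("set associative:   set %d (12 mod 4), either block of that set\n",
               block % SETS);
        printf("fully associative: any of the %d cache blocks\n", CACHE_BLOCKS);
        return 0;
    }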
Q2: How Is a Block Found If It Is in the Cache?

Caches include an address tag on each block that gives the block-frame address. The tag of every cache block that might contain the desired information is checked to see if it matches the block-frame address from the CPU. Figure 8.7 gives an example. Because speed is of the essence, all possible tags are searched in parallel; serial search would make set associativity counterproductive.
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is found in cache block 6 in this example. In direct-mapped placement there is only one cache block where memory block 12 can be found. In set-associative placement, with 4 sets, memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are checked. In this case the data is found in cache block 1. Speed of cache access dictates that searching must be performed in parallel for fully associative and set-associative mappings.
There must be a way to know that a cache block does not have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the tag memory. One tag is required for each block. An advantage of increasing block sizes is that the tag overhead per cache entry becomes a smaller fraction of the total cost of the cache.
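As a sketch of the tag check with a valid bit, modeled on one set of the two-way cache of Figure 8.7 (the types and names are ours; hardware compares all tags in the set at once, which the loop below only models serially):

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 2   /* two-way set associative, as in Figure 8.7 */

    typedef struct {
        uint32_t tag[WAYS];
        bool     valid[WAYS];
    } tag_array_set;

    /* Return the matching way, or -1 on a miss. */
    static int lookup(const tag_array_set *s, uint32_t tag) {
        for (int w = 0; w < WAYS; w++)
            if (s->valid[w] && s->tag[w] == tag)   /* no match if valid bit is clear */
                return w;
        return -1;
    }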
Before proceeding to the next question, let's explore the relationship of a CPU address to the cache. Figure 8.8 shows how an address is divided into three fields to find data in a set-associative cache: the block-offset field used to select the desired data from the block, the index field used to select the set, and the tag field used for the comparison. While the comparison could be made on more of the address than the tag, there is no need:

■ Checking the index would be redundant, since it was used to select the set to be checked (an address stored in set 0, for example, must have 0 in the index field or it couldn't be stored in set 0).

■ The offset is unnecessary in the comparison because all block offsets match and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with increasing associativity.

FIGURE 8.8 The three portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set and the index is used to select the set. The block offset is the address of the desired data within the block.
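A minimal sketch of the three-field split, assuming 16-byte blocks and 128 sets (parameters chosen by us only to make the field widths concrete):

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 16-byte blocks -> 4 offset bits; 128 sets -> 7 index bits. */
    #define OFFSET_BITS 4u
    #define INDEX_BITS  7u

    typedef struct { uint32_t tag, index, offset; } cache_addr;

    static cache_addr split(uint32_t addr) {
        cache_addr a;
        a.offset = addr & ((1u << OFFSET_BITS) - 1);
        a.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        a.tag    = addr >> (OFFSET_BITS + INDEX_BITS);   /* everything above the index */
        return a;
    }

    int main(void) {
        cache_addr a = split(0x12345678);
        printf("tag=0x%X index=%u offset=%u\n",
               (unsigned)a.tag, (unsigned)a.index, (unsigned)a.offset);
        return 0;
    }

Doubling the associativity at the same total size halves the number of sets, so INDEX_BITS shrinks by one and the tag grows by one bit, which is the rightward movement of the tag/index boundary described above.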
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't, then it would be easy to select which block to replace. Alas, the high hit rate of caches means that the overwhelming decision is between blocks that have valid data.

A benefit of direct-mapped placement is that hardware decisions are simplified. In fact, so simple that there is no choice: Only one block is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are several blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:

■ Random—To spread allocation uniformly, candidate blocks are randomly selected. Some systems use a scheme for spreading data across a set of blocks in a pseudorandomized manner to get reproducible behavior, which is particularly useful during hardware debugging.

■ Least-recently used (LRU)—To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least recently used. Figure 8.9 (page 412) shows which block is the least-recently used for a sequence of block-frame addresses in a fully associative memory hierarchy.
A virtue of random is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 8.10 shows the difference in miss rates between LRU and random replacement. Replacement policy plays a greater role in smaller caches than in larger caches, where there are more choices of what to replace.
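A minimal sketch of true LRU for one set, using a timestamp per block (real hardware usually approximates this; the code is our illustration):

    #include <stdint.h>

    #define WAYS 4   /* blocks per set, matching Figure 8.9's 4-block example */

    typedef struct {
        uint32_t tag[WAYS];
        uint64_t last_used[WAYS];   /* timestamp of the most recent access */
        int      valid[WAYS];
    } lru_set;

    /* Return the way to replace: an invalid way if any, else the least-recently used. */
    static int victim(const lru_set *s) {
        int v = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!s->valid[w]) return w;                    /* free block: no choice needed */
            if (s->last_used[w] < s->last_used[v]) v = w;  /* older timestamp, better victim */
        }
        return v;
    }

The cost LRU adds is visible here: every hit must update last_used, and every miss must scan the whole set, which is why hardware with many ways per set usually settles for an approximation.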
FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in a fully associative memory hierarchy. This assumes that there are 4 blocks and that in the beginning the LRU block is number 0. The LRU block number is shown below each new block reference. Another policy, first-in-first-out (FIFO), simply discards the block that was used N unique accesses before, independent of its reference pattern in the last N - 1 references. Random replacement generally outperforms FIFO and it is easier to implement.
    Associativity:       2-way              4-way              8-way
    Size             LRU      Random    LRU      Random    LRU      Random
    16 KB            5.18%    5.69%     4.67%    5.29%     4.39%    4.96%
    64 KB            1.88%    2.01%     1.54%    1.66%     1.39%    1.53%
    256 KB           1.15%    1.17%     1.13%    1.13%     1.12%    1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement for several sizes and associativities. This data was collected for a block size of 16 bytes using one of the VAX traces containing user and operating system code (SAVED). This trace is included in the software supplement for course use. There is little difference between LRU and random for larger caches in this trace.
Q4: What Happens on a Write?

Reads dominate cache accesses. All instruction accesses are reads, and most instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9% stores and 17% loads for four DLX programs, making writes less than 10% of the memory traffic. Making the common case fast means optimizing caches for reads, but Amdahl's Law reminds us that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read at the same time that the tag is read and compared, so the block read begins as soon as the block-frame address is available. If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there is no benefit—but also no harm.
Such is not the case for writes. The processor specifies the size of the write, usually between 1 and 8 bytes; only that portion of a block can be changed. In general this means a read-modify-write sequence of operations on the block: read the original block, modify one portion, and write the new block value. Moreover, modifying a block cannot begin until the tag is checked to see if it is a hit. Because tag checking cannot occur in parallel, then, writes normally take longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are two basic options when writing to the cache:

■ Write through (or store through)—The information is written to both the block in the cache and to the block in the lower-level memory.

■ Write back (also called copy back or store in)—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

Write-back cache blocks are called clean or dirty, depending on whether the information in the cache differs from that in lower-level memory. To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether or not the block was modified while in the cache. If it wasn't, the block is not written, since the lower level has the same information as the cache.
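A hedged sketch of the two policies and the dirty bit (the structures and the memory helper are our illustration; a real cache controller does this in hardware):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t tag;
        bool     valid;
        bool     dirty;                        /* meaningful only for write back */
        uint8_t  data[16];
    } cache_block;

    /* Stub standing in for the lower-level memory interface (assumed). */
    static void memory_write_block(uint32_t tag, const uint8_t *data) {
        (void)tag; (void)data;                 /* a real system writes the block to memory */
    }

    /* Write one byte under each policy (the block is assumed to be a hit). */
    void write_through(cache_block *b, unsigned off, uint8_t v) {
        b->data[off] = v;
        memory_write_block(b->tag, b->data);   /* every write also goes to memory */
    }

    void write_back(cache_block *b, unsigned off, uint8_t v) {
        b->data[off] = v;
        b->dirty = true;                       /* memory is updated only on replacement */
    }

    /* On replacement, a write-back cache must flush a dirty victim first. */
    void evict(cache_block *b) {
        if (b->valid && b->dirty)
            memory_write_block(b->tag, b->data);
        b->valid = false;
    }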
Both write back and write through have their advantages. With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since every write doesn't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses don't result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that main memory has the most current copy of the data. This is important in multiprocessors and for I/O, which we shall examine in Section 8.8. Hence, multiprocessors want write back to reduce the memory traffic.