`
Computer Architecture:
A Quantitative Approach

David A. Patterson
UNIVERSITY OF CALIFORNIA AT BERKELEY

John L. Hennessy
STANFORD UNIVERSITY

With a Contribution by
David Goldberg
Xerox Palo Alto Research Center

MORGAN KAUFMANN PUBLISHERS, INC.
SAN MATEO, CALIFORNIA
`
`
`
`Sponsoring Editor Bruce Spatz
`Production Manager Shirley Jowell
`Technical Writer Walker Cunningham
`Text Design Gary Head
`Cover Design David Lance Goines
`Copy Editor Linda Medoff
`Proofreader Paul Medoff
`Computer Typesetting and Graphics Fifth Street Computer Services
`
Library of Congress Cataloging-in-Publication Data

Patterson, David A.
    Computer architecture : a quantitative approach / David A.
  Patterson, John L. Hennessy.
      p.  cm.
    Includes bibliographical references.
    ISBN 1-55860-069-8
    1. Computer architecture.  2. Electronic digital computers
  —Design and construction.  I. Hennessy, John L.  II. Title.
  QA76.9.A73P377  1990
  004.2'2—dc20                                             89-85227
                                                                CIP

Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
`
`No part of this publication may be reproduced, stored in a retrieval system, or transmitted
`in any form or by any means—electronic, mechanical, recording, or otherwise—without
`the prior permission of the publisher.
`
`All instruction sets and other design information of the DLX computer system contained
`herein is copyrighted by the publisher and may not be incorporated in other publications
`or distributed by media without formal acknowledgement and written consent from the
`publisher. Use of the DLX in other publications for educational purposes is encouraged
`and application for permission is welcomed.
`ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or
`intended for the authors should be addressed to the editorial offices of Morgan Kaufmann
`Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged.
`Any error sightings that are accepted for correction in subsequent printings will be
rewarded by the authors with a payment of $1.00 (U.S.) per correction. Electronic mail
can be sent to bugs@vsop.stanford.edu.
`INSTRUCTOR SUPPORT: For information on classroom software and other instructor
`materials available to adopters, please contact the editorial offices of Morgan Kaufmann
`Publishers, Inc.
`
`3
`
`
`
`Computer Architecture: A Quantitative Approach
`
`.
`
`_
`
Contents
`
`Foreword
`by C. GORDON BELL
`
`Preface
`
`Acknowledgements
`
1  Fundamentals of Computer Design
   1.1  Introduction
   1.2  Definitions of Performance
   1.3  Quantitative Principles of Computer Design
   1.4  The Job of a Computer Designer
   1.5  Putting It All Together: The Concept of Memory Hierarchy
   1.6  Fallacies and Pitfalls
   1.7  Concluding Remarks
   1.8  Historical Perspective and References
        Exercises
2  Performance and Cost
   2.1  Introduction
   2.2  Performance
   2.3  Cost
   2.4  Putting It All Together: Price/Performance of Three Machines
   2.5  Fallacies and Pitfalls
   2.6  Concluding Remarks
   2.7  Historical Perspective and References
        Exercises
`
3  Instruction Set Design: Alternatives and Principles
   3.1  Introduction
   3.2  Classifying Instruction Set Architectures
   3.3  Operand Storage in Memory: Classifying General-Purpose Register Machines
   3.4  Memory Addressing
   3.5  Operations in the Instruction Set
   3.6  Type and Size of Operands
   3.7  The Role of High-Level Languages and Compilers
   3.8  Putting It All Together: How Programs Use Instruction Sets
   3.9  Fallacies and Pitfalls
   3.10 Concluding Remarks
   3.11 Historical Perspective and References
        Exercises
`
`4
`
`
`
4  Instruction Set Examples and Measurements of Use
   4.1  Instruction Set Measurements: What and Why
   4.2  The VAX Architecture
   4.3  The 360/370 Architecture
   4.4  The 8086 Architecture
   4.5  The DLX Architecture
   4.6  Putting It All Together: Measurements of Instruction Set Usage
   4.7  Fallacies and Pitfalls
   4.8  Concluding Remarks
   4.9  Historical Perspective and References
        Exercises

5  Basic Processor Implementation Techniques
   5.1  Introduction
   5.2  Processor Datapath
   5.3  Basic Steps of Execution
   5.4  Hardwired Control
   5.5  Microprogrammed Control
   5.6  Interrupts and Other Entanglements
   5.7  Putting It All Together: Control for DLX
   5.8  Fallacies and Pitfalls
   5.9  Concluding Remarks
   5.10 Historical Perspective and References
        Exercises
`
6  Pipelining
   6.1  What Is Pipelining?
   6.2  The Basic Pipeline for DLX
   6.3  Making the Pipeline Work
   6.4  The Major Hurdle of Pipelining—Pipeline Hazards
   6.5  What Makes Pipelining Hard to Implement
   6.6  Extending the DLX Pipeline to Handle Multicycle Operations
   6.7  Advanced Pipelining—Dynamic Scheduling in Pipelines
   6.8  Advanced Pipelining—Taking Advantage of More Instruction-Level Parallelism
   6.9  Putting It All Together: A Pipelined VAX
   6.10 Fallacies and Pitfalls
   6.11 Concluding Remarks
   6.12 Historical Perspective and References
        Exercises
`
`5
`
`
`
`Computer Architecture: A Quantitative Approach
`
7  Vector Processors
   7.1  Why Vector Machines?
   7.2  Basic Vector Architecture
   7.3  Two Real-World Issues: Vector Length and Stride
   7.4  A Simple Model for Vector Performance
   7.5  Compiler Technology for Vector Machines
   7.6  Enhancing Vector Performance
   7.7  Putting It All Together: Evaluating the Performance of Vector Processors
   7.8  Fallacies and Pitfalls
   7.9  Concluding Remarks
   7.10 Historical Perspective and References
        Exercises
`
8  Memory-Hierarchy Design
   8.1  Introduction: Principle of Locality
   8.2  General Principles of Memory Hierarchy
   8.3  Caches
   8.4  Main Memory
   8.5  Virtual Memory
   8.6  Protection and Examples of Virtual Memory
   8.7  More Optimizations Based on Program Behavior
   8.8  Advanced Topics—Improving Cache-Memory Performance
   8.9  Putting It All Together: The VAX-11/780 Memory Hierarchy
   8.10 Fallacies and Pitfalls
   8.11 Concluding Remarks
   8.12 Historical Perspective and References
        Exercises
`
9  Input/Output
   9.1  Introduction
   9.2  Predicting System Performance
   9.3  I/O Performance Measures
   9.4  Types of I/O Devices
   9.5  Buses—Connecting I/O Devices to CPU/Memory
   9.6  Interfacing to the CPU
   9.7  Interfacing to an Operating System
   9.8  Designing an I/O System
   9.9  Putting It All Together: The IBM 3990 Storage Subsystem
   9.10 Fallacies and Pitfalls
   9.11 Concluding Remarks
   9.12 Historical Perspective and References
        Exercises
`
`6
`
`
`
10 Future Directions
   10.1  Introduction
   10.2  Flynn Classification of Computers
   10.3  SIMD Computers—Single Instruction Stream, Multiple Data Streams
   10.4  MIMD Computers—Multiple Instruction Streams, Multiple Data Streams
   10.5  The Roads to El Dorado
   10.6  Special-Purpose Processors
   10.7  Future Directions for Compilers
   10.8  Putting It All Together: The Sequent Symmetry Multiprocessor
   10.9  Fallacies and Pitfalls
   10.10 Concluding Remarks—Evolution Versus Revolution in Computer Architecture
   10.11 Historical Perspective and References
         Exercises
`
Appendix A: Computer Arithmetic
by DAVID GOLDBERG, Xerox Palo Alto Research Center
   A.1  Introduction
   A.2  Basic Techniques of Integer Arithmetic
   A.3  Floating Point
   A.4  Floating-Point Addition
   A.5  Floating-Point Multiplication
   A.6  Division and Remainder
   A.7  Precisions and Exception Handling
   A.8  Speeding Up Integer Addition
   A.9  Speeding Up Integer Multiplication and Division
   A.10 Putting It All Together
   A.11 Fallacies and Pitfalls
   A.12 Historical Perspective and References
        Exercises
`
Appendix B: Complete Instruction Set Tables
   B.1  VAX User Instruction Set
   B.2  System/360 Instruction Set
   B.3  8086 Instruction Set
`
Appendix C: Detailed Instruction Set Measurements
   C.1  VAX Detailed Measurements
   C.2  360 Detailed Measurements
   C.3  Intel 8086 Detailed Measurements
   C.4  DLX Detailed Instruction Set Measurements
`
`7
`
`
`
`Computer Architecture: A Quantitative Approach
`
Appendix D: Time Versus Frequency Measurements
   D.1  Time Distribution on the VAX-11/780
   D.2  Time Distribution on the IBM 370/168
   D.3  Time Distribution on an 8086 in an IBM PC
   D.4  Time Distribution on a DLX Relative
`
Appendix E: Survey of RISC Architectures
   E.1  Introduction
   E.2  Addressing Modes and Instruction Formats
   E.3  Instructions: The DLX Subset
   E.4  Instructions: Common Extensions to DLX
   E.5  Instructions Unique to MIPS
   E.6  Instructions Unique to SPARC
   E.7  Instructions Unique to M88000
   E.8  Instructions Unique to i860
   E.9  Concluding Remarks
   E.10 References

References

Index
`
`8
`
`
`
`
`
Ideally one would desire an indefinitely large memory capacity such that any
particular ... word would be immediately available. ... We are ... forced to
recognize the possibility of constructing a hierarchy of memories, each of
which has greater capacity than the preceding but which is less quickly
accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann,
Preliminary Discussion of the Logical Design
of an Electronic Computing Instrument (1946)
`
`
`
8.1  Introduction: Principle of Locality
8.2  General Principles of Memory Hierarchy
8.3  Caches
8.4  Main Memory
8.5  Virtual Memory
8.6  Protection and Examples of Virtual Memory
8.7  More Optimizations Based on Program Behavior
8.8  Advanced Topics—Improving Cache-Memory Performance
8.9  Putting It All Together: The VAX-11/780 Memory Hierarchy
8.10 Fallacies and Pitfalls
8.11 Concluding Remarks
8.12 Historical Perspective and References
     Exercises
`
`9
`
`
`
`.
`
`i Memory-Hierarchy
`
`Design
`
`,
`
`8.1 \ Introduction: Principle of Locality
`Computer pioneers correctly predicted that programmers would want unlimited
`amounts of fast memory. As the 90/10 rule in the first chapter predicts, most
`programs fortunately do not access all code or data uniformly (see Section 1.3,
`pages 8-12). The 90/10 rule can be restated as the‘principle of locality. This
`hypothesis, which holds that all programs favor a portion of their address space
`at any instant of time, has two dimensions:
`- Temporal locality (locality in time)—If an item is referenced, it will tend to
`be referenced again soon.
`‘
`u Spatial locality (locality in space)——_If an item is referenced, nearby items will
`tend to be referenced soon.
`_
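Both kinds of locality are visible in even the simplest code. The sketch below
is ours, not the book's; it marks where each dimension appears in an ordinary
summation loop.

    /* Illustrative sketch of locality in a loop (our example). */
    double sum_array(const double *a, int n)
    {
        double sum = 0.0;             /* sum is reused every iteration:  */
        for (int i = 0; i < n; i++)   /*   temporal locality.            */
            sum += a[i];              /* a[0], a[1], ... are adjacent:   */
        return sum;                   /*   spatial locality.             */
    }

The loop's instructions are themselves fetched over and over, so the code
stream shows temporal locality as well.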
`A memory hierarchy is a natural reaction to locality and technology. The
`principle of locality and the guideline that smaller hardware is faster yield the
`concept of a hierarchy based on different speeds and sizes. Since slower memory
`is cheaper, a memory hierarchy is organized into several levels—each smaller,
`faster, and more expensive per byte than the level below. The levels of the
`hierarchy subset one another; all data in one level is also found in the level
`below, and all data in that lower level is found in the one below it, and so on
`until we reach the bottom of the hierarchy.
`
`10
`
`
`
`8.1
`
`introduction: Principle of Locality
`
`'1
`
This chapter includes a half-dozen examples that demonstrate how taking
advantage of the principle of locality can improve performance. All these
strategies map addresses from a larger memory to a smaller but faster memory.
As part of address mapping, the memory hierarchy is usually given the
responsibility of address checking; protection schemes used for doing this are
covered in this chapter. Later we will explore advanced memory-hierarchy
topics and trace a memory access through three levels of memory on the
VAX-11/780.
`
8.2  General Principles of Memory Hierarchy
`
Before proceeding with examples of the memory hierarchy, let's define some
general terms applicable to all memory hierarchies. A memory hierarchy
normally consists of many levels, but it is managed between two adjacent
levels at a time. The upper level—the one closer to the processor—is smaller
and faster than the lower level (see Figure 8.1). The minimum unit of
information that can be either present or not present in the two-level
hierarchy is called a block. The size of a block may be either fixed or
variable. If it is fixed, the memory size is a multiple of that block size.
Most of this chapter will be concerned with fixed block sizes, although a
variable block design is discussed in Section 8.6.

Success or failure of an access to the upper level is designated as a hit or
a miss: A hit is a memory access found in the upper level, while a miss means
it is not found in that level. Hit rate, or hit ratio—like a batting
average—is the fraction of memory accesses found in the upper level. This is
sometimes represented as a percentage. Miss rate (1.0 − hit rate) is the
fraction of memory accesses not found in the upper level.
`
[Figure 8.1 shows a processor above a pair of memory-hierarchy levels.]

FIGURE 8.1 Every pair of levels in the memory hierarchy can be thought of as
having an upper and lower level. Within each level the unit of information
that is present or not is called a block.
`
`
`
`
`
Since performance is the major reason for having a memory hierarchy, the
speed of hits and misses is important. Hit time is the time to access the
upper level of the memory hierarchy, which includes the time to determine
whether the access is a hit or a miss. Miss penalty is the time to replace a
block in the upper level with the corresponding block from the lower level,
plus the time to deliver this block to the requesting device (normally the
CPU). The miss penalty is further divided into two components: access
time—the time to access the first word of a block on a miss; and transfer
time—the additional time to transfer the remaining words in the block. Access
time is related to the latency of the lower-level memory, while transfer time
is related to the bandwidth between the lower-level and upper-level memories.
(Sometimes access latency is used to mean access time.)

The memory address is divided into pieces that access each part of the
hierarchy. The block-frame address is the higher-order piece of the address
that identifies a block at that level of the hierarchy (see Figure 8.2). The
block-offset address is the lower-order piece of the address and identifies
an item within a block. The size of the block-offset address is log2(size of
block); the size of the block-frame address is then the size of the full
address at this level less the size of the block-offset address.
`
[Figure 8.2 shows a 32-bit address split into a 23-bit block-frame address
and a 9-bit block-offset address.]

FIGURE 8.2 Example of the block-frame address and block-offset address
portions of a 32-bit lower-level memory address. In this case the block size
is 512, making the size of the offset address 9 bits and the size of the
block-frame address 23 bits.
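For a power-of-two block size, the split in Figure 8.2 is just a shift and a
mask. A minimal sketch, with names of our own choosing:

    #include <stdint.h>

    #define BLOCK_SIZE  512u   /* bytes; must be a power of two */
    #define OFFSET_BITS 9u     /* log2(BLOCK_SIZE)              */

    /* Split a 32-bit address into its two pieces. */
    static uint32_t block_offset(uint32_t addr) { return addr & (BLOCK_SIZE - 1u); }
    static uint32_t block_frame(uint32_t addr)  { return addr >> OFFSET_BITS; }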
`
`Evaluating Performance of a Memory Hierarchy
`
Because instruction count is independent of the hardware, it is tempting to
evaluate CPU performance using that number. As we saw in Chapters 2 and 4,
however, such indirect performance measures have waylaid many a computer
designer. The corresponding temptation for evaluating memory-hierarchy
performance is to concentrate on miss rate, for it, too, is independent of the
speed of the hardware. As we shall see, miss rate can be just as misleading as
instruction count. A better measure of memory-hierarchy performance is the
average time to access memory:

    Average memory-access time = Hit time + Miss rate × Miss penalty

The components of average access time can be measured either in absolute
time—say, 10 nanoseconds on a hit—or in the number of clock cycles that the
`
`12
`
`
`
`8.2 General Principles of Memory Hierarchy
`
`1
`CPU waits for the memory——such as a miss penalty of 12 clock cycles
`Remember that average memory-access time is still an indirect measure of
`performance; so while it is a better measure than miss rate, it is not a substitute
`for execution time.
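The formula is easy to exercise. A small sketch in C, with numbers chosen by
us only for illustration:

    /* Average memory-access time, in whatever units hit_time and
       miss_penalty share (seconds or clock cycles). */
    double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    /* Example: amat(1.0, 0.05, 20.0) == 2.0 clock cycles. */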
`
The relationship of block size to miss penalty and miss rate is shown
abstractly in Figure 8.3. These representations assume that the size of the
upper-level memory does not change. The access-time portion of the miss
penalty is not affected by block size, but the transfer time does increase
with block size. If access time is large, initially there will be little
additional miss penalty relative to access time as block size increases.
However, increasing block size means fewer blocks in the upper-level memory.
Increasing block size lowers the miss rate until the reduced misses of larger
blocks (spatial locality) are outweighed by the increased misses as the
number of blocks shrinks (temporal locality).
`
[Figure 8.3 plots miss penalty (access time plus transfer time) and miss rate
against block size.]

FIGURE 8.3 Block size versus miss penalty and miss rate. The transfer-time
portion of the miss penalty obviously grows with increasing block size. For a
fixed-size upper-level memory, miss rates fall with increasing block size
until so much of the block is not used that it displaces useful information
in the upper level, and miss rates begin to rise. The point on the curve on
the right where miss rates begin to rise with increasing block size is
sometimes called the pollution point.
`
[Figure 8.4 plots average access time against block size.]

FIGURE 8.4 The relationship between average memory-access time and block size.
`
`13
`
`
`
`Memory-Hierarchy Design
`
The goal of a memory hierarchy is to reduce execution time, not misses.
Hence, computer designers favor a block size with the lowest average access
time rather than the lowest miss rate. This is related to the product of miss
rate and miss penalty, as Figure 8.4 shows abstractly. Of course, overall CPU
performance is the ultimate performance test, so care must be taken when
reducing average memory-access time to be sure that changes to clock cycle
time and CPI improve overall performance as well as average memory-access
time.
`
`Implications of a Memory Hierarchy to the CPU
`
`Processors designed without a memory hierarchy are simpler because memory
`accesses always take the same amount of time. Misses in a memory hierarchy
`mean that the CPU must be able to handle variable memory-access times. If the
`miss penalty is on the order of tens of clock cycles, the processor normally waits
`for the memory transfer to complete. On the other hand, if the miss penalty is
`thousands of processor clock cycles, it is too wasteful to let the CPU sit idle; in
`this case, the CPU is interrupted and used for another process during the miss
`handling. Thus, avoiding the overhead of a long miss penalty means any
`memory access can result in a CPU interrupt. This also means the CPU must be
`able to recover any memory address that can cause such an interrupt, so that the
`system can know what to transfer to satisfy the miss (see Section 5.6). When the
`memory transfer is complete, the original process is restored, and the instruction
`that missed is retried.
The processor must also have some mechanism to determine whether or not
information is in the top level of the memory hierarchy. This check happens
on every memory access and affects hit time; maintaining acceptable
performance usually requires the check to be implemented in hardware. The
final implication of a memory hierarchy is that the computer must have a
mechanism to transfer blocks between upper- and lower-level memory. If the
block transfer is tens of clock cycles, it is controlled by hardware; if it
is thousands of clock cycles, it can be controlled by software.
`
Four Questions for Classifying Memory Hierarchies
`
`The fundamental principles that drive all memory hierarchies allow us to use
`terms that transcend the levels we are talking about. These same principles allow
`us to pose four questions about any level of the hierarchy:
`
`Q1: Where can a block be placed in the upper level? (Block placement)
`
`Q2: How is a block found if it is in the upper level? (Block identification)
`
`Q3: Which block should be replaced on a miss? (Block replacement)
`
`Q4: What happens on a write? (Write strategy)
`
`These questions will help us gain an understanding of the different tradeoffs
`demanded by the relationships of memories at different levels of a hierarchy.
`
`14
`
`
`
`8.3 Caches
`
`caches
`
`Cache: a safe place for hiding or storing things.
`
`fI
`
`Webster’s New World Dictionary of the American Language
`Second College Edition (197e-
`
Cache is the name first chosen to represent the level of the memory hierarchy
between the CPU and main memory, and that is the dominant use of the term.
While the concept of caches is younger than the IBM 360 architecture, caches
appear today in every class of computer and in some computers more than once.
In fact, the word has become so popular that it has replaced "buffer" in many
computer-science circles.

The general terms defined in the prior section can be used for caches,
although the word line is often used instead of block. Figure 8.5 shows the
typical range of memory-hierarchy parameters for caches.
`
Block (line) size     4 – 128 bytes
Hit time              1 – 4 clock cycles (normally 1)
Miss penalty          8 – 32 clock cycles
  (Access time)       (6 – 10 clock cycles)
  (Transfer time)     (2 – 22 clock cycles)
Miss rate             1% – 20%
Cache size            1 KB – 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in
1990 workstations and minicomputers.
`
Now let's examine caches in more detail by answering the four
memory-hierarchy questions.
`
`Q1: Where Can a Block Be Placed in a Cache?
`
Restrictions on where a block is placed create three categories of cache
organization:

- If each block has only one place it can appear in the cache, the cache is
  said to be direct mapped. The mapping is usually (block-frame address)
  modulo (number of blocks in cache).

- If a block can be placed anywhere in the cache, the cache is said to be
  fully associative.

- If a block can be placed in a restricted set of places in the cache, the
  cache is said to be set associative. A set is a group of two or more blocks
  in the cache. A block is first mapped onto a set, and then the block can be
  placed anywhere within the set. The set is usually chosen by bit selection;
  that is, (block-frame address) modulo (number of sets in cache). If there
  are n blocks in a set, the cache placement is called n-way set associative.
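In code, bit selection is a single modulo operation (one mask when the number
of sets is a power of two). A sketch under our own naming:

    /* Which set may hold a given block? n_sets == number of blocks gives
       direct mapped; n_sets == 1 gives fully associative. */
    unsigned cache_set(unsigned block_frame_addr, unsigned n_sets)
    {
        return block_frame_addr % n_sets;
    }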
`
`The range of caches from direct mapped to fully associative is really a
`continuum of levels of set associativity: Direct mapped is simply one-way set
`associative and a fully associative cache with m blocks could be called m-way
`set associative. Figure 8.6 shows where block 12 can be placed in a cache
`according to the block-placement policy.
`
[Figure 8.6: Fully associative—block 12 can go anywhere. Direct mapped—block
12 can go only into block 4 (12 mod 8). Set associative—block 12 can go
anywhere in set 0 (12 mod 4).]

FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The
set-associative organization has 4 sets with 2 blocks per set, called two-way
set associative. (Real caches contain hundreds of blocks and real memories
contain hundreds of thousands of blocks.) Assume that there is nothing in the
cache and that the block-frame address in question identifies lower-level
block 12. The three options for caches are shown left to right. In fully
associative, block 12 from the lower level can go into any of the 8 blocks of
the cache. With direct mapped, block 12 can only be placed into block 4 (12
modulo 8). Set associative, which has some of both features, allows the block
to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this
means block 12 can be placed either in block 0 or block 1 of the cache.
`
`16
`
`
`
`8.3 Caches
`
`T‘!
`
`Q2: How Is a Block Found If It Is in the Cache?
`
Caches include an address tag on each block that gives the block-frame
address. The tag of every cache block that might contain the desired
information is checked to see if it matches the block-frame address from the
CPU. Figure 8.7 gives an example. Because speed is of the essence, all
possible tags are searched in parallel; serial search would make set
associativity counterproductive.
`
[Figure 8.7 shows the tag search in fully associative, direct-mapped, and
set-associative caches of 8 blocks.]

FIGURE 8.7 In fully associative placement, the block for block-frame address
12 can appear in any of the 8 blocks; thus, all 8 tags must be searched. The
desired data is found in cache block 6 in this example. In direct-mapped
placement there is only one cache block where memory block 12 can be found.
In set-associative placement, with 4 sets, memory block 12 must be in set 0
(12 mod 4); thus, the tags of cache blocks 0 and 1 are checked. In this case
the data is found in cache block 1. Speed of cache access dictates that
searching must be performed in parallel for fully associative and
set-associative mappings.
`
There must be a way to know that a cache block does not have valid
information. The most common procedure is to add a valid bit to the tag to
say whether or not this entry contains a valid address. If the bit is not
set, there cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the
tag memory. One tag is required for each block. An advantage of increasing
block sizes is that the tag overhead per cache entry becomes a smaller
fraction of the total cost of the cache.
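A hypothetical example of ours makes the dilution concrete: a direct-mapped
64-KB cache with 32-bit addresses and 16-byte blocks has 4096 blocks, so each
128-bit block carries a 16-bit tag, about 12% overhead. Growing the blocks to
64 bytes leaves a 16-bit tag on each 512-bit block, cutting the overhead to
about 3%.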
`
Before proceeding to the next question, let's explore the relationship of a
CPU address to the cache. Figure 8.8 shows how an address is divided into
three fields to find data in a set-associative cache: the block-offset field
used to select the desired data from the block, the index field used to
select the set, and the tag field used for the comparison. While the
comparison could be made on more of the address than the tag, there is no
need:
`
`17
`
`
`
`Memory—Hierarchy Design
`
- Checking the index would be redundant, since it was used to select the set
  to be checked (an address stored in set 0, for example, must have 0 in the
  index field or it couldn't be stored in set 0).

- The offset is unnecessary in the comparison because all block offsets match
  and the entire block is present or not.
`
`If the total size is kept the same, increasing associativity increases the number of
`blocks per set, thereby decreasing the size of the index and increasing the size of
`the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with
`increasing associativity.
`
FIGURE 8.8 The three portions (tag, index, and block offset) of an address in
a set-associative or direct-mapped cache. The tag is used to check all the
blocks in the set and the index is used to select the set. The block offset
is the address of the desired data within the block.
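Carving out the three fields is again shifts and masks. A sketch with a
hypothetical geometry of our own choosing (4 sets, 16-byte blocks):

    #include <stdint.h>

    #define OFFSET_BITS 4u   /* log2(16-byte block) */
    #define INDEX_BITS  2u   /* log2(4 sets)        */

    typedef struct { uint32_t tag, index, offset; } addr_fields;

    /* Split a CPU address into the three fields of Figure 8.8. */
    static addr_fields split_address(uint32_t addr)
    {
        addr_fields f;
        f.offset = addr & ((1u << OFFSET_BITS) - 1u);
        f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
        f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        return f;
    }

Note how the tag/index boundary moving right with associativity falls out of
this arithmetic: fewer sets means fewer index bits and more tag bits.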
`
`Q3: Which Block Should Be Replaced on a Cache Miss?
`
`If the choice were between a block that has valid data and a block that doesn’t,
`then it would be easy to select which block to replace. Alas, the high hit rate of
`caches means that the overwhelming decision is between blocks that have valid
`data.
`
A benefit of direct-mapped placement is that hardware decisions are
simplified. In fact, so simple that there is no choice: Only one block is
checked for a hit, and only that block can be replaced. With fully
associative or set-associative placement, there are several blocks to choose
from on a miss. There are two primary strategies employed for selecting which
block to replace:

- Random—To spread allocation uniformly, candidate blocks are randomly
  selected. Some systems use a scheme for spreading data across a set of
  blocks in a pseudorandomized manner to get reproducible behavior, which is
  particularly useful during hardware debugging.

- Least-recently used (LRU)—To reduce the chance of throwing out information
  that will be needed soon, accesses to blocks are recorded. The block
  replaced is the one that has been unused for the longest time. This makes
  use of a corollary of temporal locality: If recently used blocks are likely
  to be used again, then the best candidate for disposal is the least
  recently used. Figure 8.9 (page 412) shows which block is the
  least-recently used for a sequence of block-frame addresses in a fully
  associative memory hierarchy.
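One minimal way to realize LRU for a small, fully associative group of blocks
is to stamp every block on each access and evict the oldest stamp. This
sketch is ours, not a hardware design from the text:

    #define NBLOCKS 4

    static unsigned last_used[NBLOCKS];   /* access stamp per block      */
    static unsigned now;                  /* bumped on every access      */

    void touch(int block)                 /* record an access to a block */
    {
        last_used[block] = ++now;
    }

    int lru_victim(void)                  /* unused longest => replace   */
    {
        int victim = 0;
        for (int b = 1; b < NBLOCKS; b++)
            if (last_used[b] < last_used[victim])
                victim = b;
        return victim;
    }

Exactly this kind of per-block bookkeeping is why LRU grows expensive as the
number of blocks increases, as the next paragraph notes.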
`
`18
`
`
`
`8.3 Caches
`
`A virtue of random is that it is simple to build in hardware. As the number of
`blocks to keep track of increases, LRU becomes increasingly expensive and is
`frequently only approximated. Figure 8.10 shows the difference in miss rates
`between LRU and random replacement. Replacement policy plays a greater role
`in smaller caches than in larger caches where there are more choices of what to
`replace.
`
[Figure 8.9 lists a sequence of block-frame addresses; the LRU block number
is shown below each new block reference.]

FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses
in a fully associative memory hierarchy. This assumes that there are 4 blocks
and that in the beginning the LRU block is number 0. The LRU block number is
shown below each new block reference. Another policy, first-in-first-out
(FIFO), simply discards the block that was used N unique accesses before,
independent of its reference pattern in the last N − 1 references. Random
replacement generally outperforms FIFO and it is easier to implement.
`
`.
`
`.
`
`Associativity:
`Size
`
`2-way
`Random .LRU
`
`4-way
`Random
`
`LRU
`
`8-way
`Random
`
`LRU
`
`16KB
`
`64KB
`
`5.18% 5.69%
`
`4.67% 5.29%
`
`4.39% 4.96%
`
`1.88% 2.01%
`
`1.54% 1.66%
`
`1.39%
`
`1.53%
`
`256KB
`
`1.15%
`
`1.17% P
`
`1.13% 1.13%
`
`1.12%‘
`
`1.12%
`
`,
`
`FIGURE 8.10 Miss rates comparing least-recently used versus random replacement
`for several sizes and associativitles. This data was collected for a block size of 16 bytes
`using one of the VAX traces containing user and operating system code (SAVED). This
`trace is included in the software supplement for course use.. There is little difference
`between LFILI and random for larger size caches in this trace.
`
`Q4: What Happens on a Write?
`
Reads dominate cache accesses. All instruction accesses are reads, and most
instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of
9% stores and 17% loads for four DLX programs, making writes less than 10% of
the memory traffic. Making the common case fast means optimizing caches for
reads, but Amdahl's Law reminds us that high-performance designs cannot
neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block
can be read at the same time that the tag is read and compared, so the block
read begins as soon as the block-frame address is available. If the read is a
hit, the block is passed on to the CPU immediately. If it is a miss, there is
no benefit—but also no harm.
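That read-path overlap can be sketched in a few lines of C. The structure and
names here are ours, simplified to a single direct-mapped block:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {        /* hypothetical cache block (line) */
        uint32_t tag;
        bool     valid;
        uint8_t  data[16];
    } line;

    /* The data read proceeds alongside the tag compare; on a miss the
       speculative read is simply discarded—no benefit, but no harm. */
    bool cache_read(const line *b, uint32_t tag, uint32_t offset,
                    uint8_t *out, size_t size)
    {
        bool hit = b->valid && b->tag == tag;
        if (hit)
            memcpy(out, &b->data[offset], size);
        return hit;
    }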
`
`19
`
`
`
`Memory—Hierarchy Design
`
Such is not the case for writes. The processor specifies the size of the
write, usually between 1 and 8 bytes; only that portion of a block can be
changed. In general this means a read-modify-write sequence of operations on
the block: read the original block, modify one portion, and write the new
block value. Moreover, modifying a block cannot begin until the tag is
checked to see if it is a hit. Because tag checking cannot occur in parallel,
then, writes normally take longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are
two basic options when writing to the cache:

- Write through (or store through)—The information is written to both the
  block in the cache and to the block in the lower-level memory.

- Write back (also called copy back or store in)—The information is written
  only to the block in the cache. The modified cache block is written to main
  memory only when it is replaced.
`
Write-back cache blocks are called clean or dirty, depending on whether the
information in the cache differs from that in lower-level memory. To reduce
the frequency of writing back blocks on replacement, a feature called the
dirty bit is commonly used. This status bit indicates whether or not the
block was modified while in the cache. If it wasn't, the block is not
written, since the lower level has the same information as the cache.
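The dirty-bit bookkeeping amounts to very little state. A sketch, again with
types and names of our own invention:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t tag;
        bool     valid;
        bool     dirty;      /* set on writes in a write-back cache */
        uint8_t  data[16];
    } wb_line;

    /* On a write hit, a write-back cache only marks the block dirty;
       a write-through cache would also store to the lower level here. */
    void write_hit(wb_line *b)
    {
        b->dirty = true;
    }

    /* At replacement, write the block out only if it was modified. */
    bool must_write_back(const wb_line *b)
    {
        return b->valid && b->dirty;
    }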
`
Both write back and write through have their advantages. With write back,
writes occur at the speed of the cache memory, and multiple writes within a
block require only one write to the lower-level memory. Since every write
doesn't go to memory, write back uses less memory bandwidth, making write
back attractive in multiprocessors. With write through, read misses don't
result in writes to the lower level, and write through is easier to implement
than write back. Write through also has the advantage that main memory has
the most current copy of the data. This is important in multiprocessors and
for I/O, which we shall examine in Section 8.8. Hence, multiprocessors want
write back to reduce the memory tra