`Computer Organization and Design
`
`THE HARDWARE/SOFTWARE INTERFACE
`
`David A. Patterson
`
`University of California, Berkeley
`
`John L. Hennessy
`
`Stanford University
`
`With a contribution by
`Peter I. Ashenden
`Ashenden Designs Pty Ltd
`
James R. Larus
`Microsoft Research
`
Daniel J. Sorin
`Duke University
`
`
`
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
`
MORGAN KAUFMANN PUBLISHERS

Morgan Kaufmann is an imprint of Elsevier
`
`
Senior Editor                      Denise E. M. Penrose
Publishing Services Manager        Simon Crump
Editorial Assistant                Summer Block
Cover Design                       Ross Caron Design
Cover and Chapter Illustration     Chris Asimoudis
Text Design                        GGS Book Services
Composition                        Nancy Logan and Dartmouth Publishing, Inc.
Technical Illustration             Dartmouth Publishing, Inc.
Copyeditor                         Ken DellaPenta
Proofreader                        Jacqui Brownstein
Indexer                            Linda Buskus
Interior printer                   Courier
Cover printer                      Courier
`
`Morgan Kaufmann Publishers is an imprint of Elsevier.
`500 Sansome Street, Suite 400, San Francisco, CA 94111
`
This book is printed on acid-free paper.
`
`© 2005 by Elsevier Inc. All rights reserved.
`
`
`Designations used by companies to distinguish their products are often claimed as trademarks or registered
`trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names
`appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies
`for more complete information regarding trademarks and registration.
`
`
`
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.com.uk. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com) by selecting "Customer Support" and then "Obtaining Permissions."
`
`Library of Congress Cataloging—in—Publication Data
`Application submitted
ISBN: 1-55860-604-1
`
`For information on all Morgan Kaufmann publications,
visit our Web site at www.mkp.com.
`
`Printed in the United States of America
04 05 06 07 08    5 4 3 2 1
`
`
`
`
`
`
7.2  The Basics of Caches
`
`
[Figure 7.3 diagram: the levels in the memory hierarchy, with the CPU at the top; distance from the CPU in access time increases downward, as does the size of the memory at each level.]
`
`
FIGURE 7.3 This diagram shows the structure of a memory hierarchy: as the distance from the processor increases, so does the size. This structure with the appropriate operating mechanisms allows the processor to have an access time that is determined primarily by level 1 of the hierarchy and yet have a memory as large as level n. Maintaining this illusion is the subject of this chapter. Although the local disk is normally the bottom of the hierarchy, some systems use tape or a file server over a local area network as the next levels of the hierarchy.
`
`The Basics of Caches
`
In our library example, the desk acted as a cache: a safe place to store things (books) that we needed to examine. Cache was the name chosen to represent the level of the memory hierarchy between the processor and main memory in the first commercial computer to have this extra level. Today, although this remains the dominant use of the word cache, the term is also used to refer to any storage managed to take advantage of locality of access. Caches first appeared in research computers in the early 1960s and in production computers later in that same decade; every general-purpose computer built today, from servers to low-power embedded processors, includes caches.

In this section, we begin by looking at a very simple cache in which the processor requests are each one word and the blocks also consist of a single word. (Readers already familiar with cache basics may want to skip to Section 7.3 on page 492.)
`
`
`
`
`
Cache: a safe place for hiding or storing things.
Webster's New World Dictionary of the American Language, Third College Edition (1988)
`
`
`
`
Chapter 7  Large and Fast: Exploiting Memory Hierarchy
`
a. Before the reference to Xn          b. After the reference to Xn

FIGURE 7.4 The cache just before and just after a reference to a word Xn that is not initially in the cache. This reference causes a miss that forces the cache to fetch Xn from memory and insert it into the cache.
`
Figure 7.4 shows such a simple cache, before and after requesting a data item that is not initially in the cache. Before the request, the cache contains a collection of recent references X1, X2, ..., Xn−1, and the processor requests a word Xn that is not in the cache. This request results in a miss, and the word Xn is brought from memory into the cache.

In looking at the scenario in Figure 7.4, there are two questions to answer: How do we know if a data item is in the cache? Moreover, if it is, how do we find it? The answers to these two questions are related. If each word can go in exactly one place in the cache, then it is straightforward to find the word if it is in the cache. The simplest way to assign a location in the cache for each word in memory is to assign the cache location based on the address of the word in memory. This cache structure is called direct mapped, since each memory location is mapped directly to exactly one location in the cache. The typical mapping between addresses and cache locations for a direct-mapped cache is usually simple. For example, almost all direct-mapped caches use the mapping
`
(Block address) modulo (Number of blocks in the cache)
`
This mapping is attractive because if the number of entries in the cache is a power of two, then modulo can be computed simply by using the low-order log2 (cache size in blocks) bits of the address; hence the cache may be accessed directly with the low-order bits. For example, Figure 7.5 shows how the memory addresses
`
direct-mapped cache A cache structure in which each memory location is mapped to exactly one location in the cache.
`
`
`
`
`
[Figure 7.5 diagram: an eight-entry cache with indices 000 through 111, and a memory whose addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 are shown mapping into it.]
`
FIGURE 7.5 A direct-mapped cache with eight entries showing the addresses of memory words between 0 and 31 that map to the same cache locations. Because there are eight words in the cache, an address X maps to the cache word X modulo 8. That is, the low-order log2(8) = 3 bits are used as the cache index. Thus, addresses 00001two, 01001two, 10001two, and 11001two all map to entry 001two of the cache, while addresses 00101two, 01101two, 10101two, and 11101two all map to entry 101two of the cache.
`
`
`
between 1ten (00001two) and 29ten (11101two) map to locations 1ten (001two) and 5ten (101two) in a direct-mapped cache of eight words.
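This low-order-bits trick can be illustrated with a short fragment (a sketch of ours, not code from the text; the eight-block cache size and function name are assumptions, with the size taken from Figure 7.5):

```python
# Direct-mapped placement: index = (block address) modulo (number of blocks).
# When the number of blocks is a power of two, the modulo operation reduces
# to keeping the low-order bits of the address.
NUM_BLOCKS = 8  # eight-block cache, as in Figure 7.5

def cache_index(block_address):
    # Equivalent to block_address % NUM_BLOCKS for a power-of-two NUM_BLOCKS.
    return block_address & (NUM_BLOCKS - 1)

print(cache_index(0b00001))  # address 1 maps to entry 1 (001two)
print(cache_index(0b11101))  # address 29 maps to entry 5 (101two)
```

Hardware exploits the same property: the low-order index bits of the address feed directly into the cache's row decoder, so no division circuitry is needed.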
Because each cache location can contain the contents of a number of different memory locations, how do we know whether the data in the cache corresponds to a requested word? That is, how do we know whether a requested word is in the cache or not? We answer this question by adding a set of tags to the cache. The tags contain the address information required to identify whether a word in the cache corresponds to the requested word. The tag needs only to contain the upper portion of the address, corresponding to the bits that are not used as an index into the cache. For example, in Figure 7.5 we need only have the upper 2 of the 5 address bits in the tag, since the lower 3-bit index field of the address selects the block. We exclude the index bits because they are redundant, since by definition the index field of every address that maps to a given cache block must have the same value.
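The tag/index split for Figure 7.5's parameters can be sketched the same way (an illustration of ours, not the book's code; the names are assumptions, and the 5-bit addresses and 3-bit index come from Figure 7.5):

```python
INDEX_BITS = 3  # log2(8) for an eight-block cache, as in Figure 7.5

def split_address(addr):
    index = addr & ((1 << INDEX_BITS) - 1)  # low-order bits select the cache block
    tag = addr >> INDEX_BITS                # remaining upper bits are stored as the tag
    return tag, index

tag, index = split_address(0b11101)  # address 29: tag 11two, index 101two
print(tag, index)  # 3 5
```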
We also need a way to recognize that a cache block does not have valid information. For instance, when a processor starts up, the cache does not have good
tag A field in a table used for a memory hierarchy that contains the address information required to identify whether the associated block in the hierarchy corresponds to a requested word.
`
`
`
`
`
valid bit A field in the tables of a memory hierarchy that indicates that the associated block in the hierarchy contains valid data.
`
data, and the tag fields will be meaningless. Even after executing many instructions, some of the cache entries may still be empty, as in Figure 7.4. Thus, we need to know that the tag should be ignored for such entries. The most common method is to add a valid bit to indicate whether an entry contains a valid address. If the bit is not set, there cannot be a match for this block.

For the rest of this section, we will focus on explaining how reads work in a cache and how the cache control works for reads. In general, handling reads is a little simpler than handling writes, since reads do not have to change the contents of the cache. After seeing the basics of how reads work and how cache misses can be handled, we'll examine the cache designs for real computers and detail how these caches handle writes.
`
Accessing a Cache
`
Figure 7.6 shows the contents of an eight-word direct-mapped cache as it responds to a series of requests from the processor. Since there are eight blocks in the cache, the low-order 3 bits of an address give the block number. Here is the action for each reference:
`
`
`
Decimal address    Binary address    Hit or miss    Assigned cache block
of reference       of reference      in cache       (where found or placed)

22                 10110two          miss (7.6b)    (10110two mod 8) = 110two
26                 11010two          miss (7.6c)    (11010two mod 8) = 010two
22                 10110two          hit            (10110two mod 8) = 110two
26                 11010two          hit            (11010two mod 8) = 010two
16                 10000two          miss (7.6d)    (10000two mod 8) = 000two
3                  00011two          miss (7.6e)    (00011two mod 8) = 011two
16                 10000two          hit            (10000two mod 8) = 000two
18                 10010two          miss (7.6f)    (10010two mod 8) = 010two
`
When the word at address 18 (10010two) is brought into cache block 2 (010two), the word at address 26 (11010two), which was in cache block 2 (010two), must be replaced by the newly requested data. This behavior allows a cache to take advantage of temporal locality: recently accessed words replace less recently referenced words. This situation is directly analogous to needing a book from the shelves and having no more space on your desk: some book already on your desk must be returned to the shelves. In a direct-mapped cache, there is only one place to put the newly requested item and hence only one choice of what to replace.
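The reference sequence above can be replayed with a small model of a direct-mapped cache (a sketch of ours, not the book's code; it combines the valid bit, tag, and index introduced earlier in this section, and all names are assumptions):

```python
NUM_BLOCKS = 8
INDEX_BITS = 3  # log2(NUM_BLOCKS)

valid = [False] * NUM_BLOCKS  # valid bit per cache entry
tags = [0] * NUM_BLOCKS       # tag stored with each cache entry

def access(addr):
    index = addr & (NUM_BLOCKS - 1)
    tag = addr >> INDEX_BITS
    if valid[index] and tags[index] == tag:
        return "hit"
    # Miss: fetch the word, replacing whatever occupied this entry.
    valid[index], tags[index] = True, tag
    return "miss"

trace = [22, 26, 22, 26, 16, 3, 16, 18]
results = [access(a) for a in trace]
print(results)  # ['miss', 'miss', 'hit', 'hit', 'miss', 'miss', 'hit', 'miss']
```

The final miss on address 18 is exactly the replacement just described: 18 and 26 both index entry 010two, so the tag comparison fails and 26 is evicted.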
`
`
`
`
`
`
9.6  Network Topologies
`
Chapter 8 reviewed off-the-shelf, switched, local area networks that are the foundation of clusters. In this section we describe proprietary networks used in multiprocessors.

Network costs include the number of switches, the number of links on a switch to connect to the network, the width (number of bits) per link, and length of the links when the network is mapped into a physical machine. For example, on a machine that scales between tens and hundreds of processors, some links may be metal rectangles within a chip that are a few millimeters long, and others may be cables that must stretch several meters from one cabinet to another. Network performance is multifaceted as well. It includes the latency on an unloaded network to send and receive a message, the throughput in terms of the maximum number of messages that can be transmitted in a given time period, delays caused by contention for a portion of the network, and variable performance depending on the pattern of communication. Another obligation of the network may be fault tolerance, since very large systems may be required to operate in the presence of broken components.

Networks are normally drawn as graphs, with each arc of the graph representing a link of the communication network. The processor-memory node is shown as a black square, and the switch is shown as a colored circle. In this section, all links are bidirectional; that is, information can flow in either direction. All networks consist of switches whose links go to processor-memory nodes and to other switches. The first improvement over a bus is a network that connects a sequence of nodes together:
`
This topology is called a ring. Since some nodes are not directly connected, some messages will have to hop along intermediate nodes until they arrive at the final destination.

Unlike a bus, a ring is capable of many simultaneous transfers. Because there are numerous topologies to choose from, performance metrics are needed to distinguish these designs. Two are popular. The first is total network bandwidth, which is the bandwidth of each link multiplied by the number of links. This represents the very best case. For the ring network above with P processors, the total network bandwidth would be P times the bandwidth of one link; the total network bandwidth of a bus is just the bandwidth of that bus, or two times the bandwidth of that link.
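The total-bandwidth comparison can be made concrete with a few lines (an illustrative sketch of ours; the processor count, per-link bandwidth, and function name are all assumed values, not from the text):

```python
def total_network_bandwidth(num_links, link_bandwidth):
    # Best-case metric: the bandwidth of each link times the number of links.
    return num_links * link_bandwidth

P = 8                 # processors in the ring; a ring of P nodes has P links
link_bandwidth = 1.0  # bandwidth of one link, in arbitrary units

print(total_network_bandwidth(P, link_bandwidth))  # ring: P x link bandwidth
print(total_network_bandwidth(1, link_bandwidth))  # bus: the single shared medium
```

The metric deliberately ignores contention: it assumes every link can transfer at full rate simultaneously, which is why the text calls it the very best case.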
`
network bandwidth Informally, the peak transfer rate of a network; can refer to the speed of a single link or the collective transfer rate of all links in the network.
`