Parallel Computer Architecture
`
`A Hardware/Software Approach
`
`David E. Culler
`
Jaswinder Pal Singh
`
`with Anoop Gupta
`
`
Morgan Kaufmann Publishers, Inc.
San Francisco, California
`
`
`
Senior Editor: Denise E.M. Penrose
Director of Production and Manufacturing: Yonie Overton
Senior Production Editor: Elisabeth Beller
Editorial Coordinator: Meghan Keeffe
Cover Design: Martin Heirakuji Graphic Design
Cover Photo: Image copyright © 1998 PhotoDisc, Inc.
Text Design: Mark Ong, Side by Side Studios
Copyeditor: Jennifer McClain
Proofreaders: Jennifer McClain, Jeff Van Beuren, Ken DellaPenta, Christine Sabooni
Compositor: Nancy Logan
Illustrators: Nancy Logan, Dartmouth Publishing, Inc., Cherie Plumlee
Indexer: Steve Rath
Printer: Quebecor Printing
`
`Designations used by companies to distinguish their products are often claimed as trademarks or regis-
`tered trademarks. In all instances where Morgan Kaufmann Publishers, Inc. is aware of a claim, the prod-
`uct names appear in initial capital or all capital letters. Readers, however, should contact the appropriate
`companies for more complete information regarding trademarks and registration.
`
`Morgan Kaufmann Publishers, Inc.
`Editorial and Sales Office
`340 Pine Street, Sixth Floor
`San Francisco, CA 94104-3205
`
Telephone: 415/392-2665
Facsimile: 415/982-2665
Email: mkp@mkp.com
WWW: http://www.mkp.com
Order toll free: 800/745-7323
`
`Advice, Praise, and Errors: Any correspondence related to this publication or intended for the authors
`should be addressed to the Editorial and Sales Office of Morgan Kaufmann Publishers, Inc., Dept. PCA
APE or sent electronically to pca@mkp.com. Please report errors by email to pcabugs@mkp.com. Please
`check the errata page at http://www.mkp.com/pca to see if the bug has already been reported and fixed.
`
`© 1999 Morgan Kaufmann Publishers, Inc.
`All rights reserved
`Printed in the United States of America
`
`02
`
`01
`
`00
`
`99
`
`5
`
`4
`
`3
`
`2
`
`1
`
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or
by any means—electronic, mechanical, photocopying, recording, or otherwise—without the prior written
permission of the publisher.
`
`Library of Congress Cataloging-in-Publication Data
`Culler, David E.
`Parallel computer architecture: a hardware/software approach /
`David E. Culler, Jaswinder Pal Singh, with Anoop Gupta.
p.    cm.
Includes bibliographical references and index.
ISBN 1-55860-343-3
`
`
`
The Dragon protocol consists of four states: exclusive-clean (E), shared-clean (Sc), shared-modified (Sm), and modified (M). Exclusive-clean (or exclusive) has the same meaning and the same motivation as before: only one cache (this cache) has a copy of the block, and it has not been modified (i.e., the main memory is up-to-date). Shared-clean means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date. Shared-modified means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache (i.e., this cache is the owner). A block may be in Sm state in only one cache at a time. However, it is quite possible that one cache has the block in Sm state, while others have it in Sc state. Or it may be that no cache has it in Sm state, but some have it in Sc state. This is why, when a cache has the block in Sc state, memory may or may not be up-to-date; it depends on whether some other cache has it in Sm state. M signifies exclusive ownership as before: the block is modified (dirty) and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement. Note that there is no explicit invalid (I) state as in the previous protocols. This is because Dragon is an update-based protocol; the protocol always keeps the blocks in the cache up-to-date, so it is always okay to use the data present in the cache if the tag match succeeds. However, if a block is not present in a cache at all, it can be imagined in a special invalid or not-present state.4
The processor requests, bus transactions, and actions for the Dragon protocol are similar to the Illinois MESI protocol. The processor is still assumed to issue only read (PrRd) and write (PrWr) requests. However, since we do not have an invalid state, to specify actions on a tag mismatch we add two more request types: processor read miss (PrRdMiss) and write miss (PrWrMiss). As for bus transactions, we have bus read (BusRd), bus write back (BusWB), and a new transaction called bus update (BusUpd). The BusRd and BusWB transactions have the usual semantics. The BusUpd transaction takes the specific word (or bytes) written by the processor and broadcasts it on the bus so that all other processors' caches can update themselves. By broadcasting only the contents of the specific modified word rather than the whole cache block, it is hoped that the bus bandwidth is more efficiently utilized. (See Exercise 5.4 for reasons why this may not always be the case.) As in the MESI protocol, to support the E state, a shared signal (S) is available to the cache controller. Finally, the only new capability needed is for the cache controller to update a locally cached memory block (labeled an Update action) with the contents that are being broadcast on the bus by a relevant BusUpd transaction.
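The per-block behavior described above can be sketched as a small state machine. The sketch below is illustrative only: the class and method names are invented, and bus transactions are represented solely by the state changes they cause.

```python
# Illustrative sketch of Dragon per-block state transitions.
# States: E (exclusive-clean), Sc (shared-clean), Sm (shared-modified),
# M (modified), plus the implicit NP (not-present) bootstrap state.
class DragonLine:
    def __init__(self):
        self.state = "NP"  # Dragon has no explicit invalid state

    # --- processor-side requests; `shared` is the bus's S signal ---
    def pr_rd_miss(self, shared):
        # BusRd is issued; S decides between Sc and E
        self.state = "Sc" if shared else "E"

    def pr_wr_miss(self, shared):
        # BusRd is issued; if sharers exist, a BusUpd follows and we own the block
        self.state = "Sm" if shared else "M"

    def pr_wr(self, shared):
        if self.state == "E":
            self.state = "M"  # silent upgrade: no bus transaction needed
        elif self.state in ("Sc", "Sm"):
            # BusUpd is broadcast; S tells us whether any sharers remain
            self.state = "Sm" if shared else "M"
        # in M, a write hit needs no action

    # --- snooped transactions from other caches ---
    def snoop_bus_rd(self):
        if self.state == "M":
            self.state = "Sm"  # supply the data and remain owner
        elif self.state == "E":
            self.state = "Sc"

    def snoop_bus_upd(self):
        if self.state == "Sm":
            self.state = "Sc"  # the updating cache takes over ownership
        # in Sc, the Update action refreshes the word; state is unchanged
```

A real controller would also handle replacement write-backs (BusWB when evicting a block in Sm or M), which this sketch omits.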
`
`
`
`4. Logically, there is another state as well, but it is rather crude and is used to bootstrap the protocol. A
`“miss mode” bit is provided with each cache line to force a miss when that block is accessed. Initializa-
`tion software reads data into every line in the cache with the miss mode bit turned on to ensure that the
`processor will miss the first time it references a block that maps to that line. After this first miss, the miss
`mode bit is turned off and the cache operates normally.
`
`
`
`it depends on the timing parameters used and on the burstiness of the traffic, which
`is not captured by the frequency measurements. Contention, timing, and hence per-
`formance are also affected by lower-level interactions with hardware structures (like
`queues and buffers) and policies.
`The simulations used in this section do not model contention. Instead, they use a
`simple PRAM cost model: all memory operations are assumed to complete in the
`same amount of time (here a single cycle) regardless of whether they hit or miss in
`the cache. There are three main reasons for this. First, the focus is on understanding
`inherent protocol behavior and trade-offs in terms of event frequencies, not so much
`on performance. Second, since we are experimenting with different cache block sizes
`and organizations, we would like the interleaving of references from application pro-
`cesses on the simulator to be the same regardless of these choices; that is, all proto-
cols and block sizes should see the same trace of references. With the execution-driven
rather than trace-driven simulation we use, this is only possible if we make the
`cost of every memory operation the same in the simulations. Otherwise, if a reference
`misses with a small cache block but hits with a larger one, for example, then it will be
`delayed by different amounts in the interleaving in the two cases. It would therefore
`be difficult to determine which effects are inherently due to the protocol and which
`are due to the particular parameter values chosen. Third, realistic simulations that
`model contention take much more time. The disadvantage of using this simple model
`even to measure frequencies is that the timing model may affect some of the frequen-
cies we observe; however, this effect is small for the applications we study.
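The second reason can be made concrete with a toy model. The sketch below (function names, addresses, and latencies are all invented for illustration) merges per-process reference streams in completion-time order: with a uniform one-cycle cost the interleaving is independent of which references hit or miss, whereas a latency that depends on misses reorders the merged trace.

```python
# Toy illustration: a uniform memory cost keeps the reference interleaving
# identical across cache configurations; a variable cost does not.
def interleave(streams, latency):
    """Merge per-process reference streams in completion-time order."""
    events = []
    for pid, refs in enumerate(streams):
        t = 0
        for addr in refs:
            t += latency(addr)          # cycles charged to this reference
            events.append((t, pid, addr))
    return [(pid, addr) for t, pid, addr in sorted(events)]

streams = [["a", "b"], ["c", "d"]]

# PRAM-style model: every reference costs one cycle, hit or miss.
pram_order = interleave(streams, lambda addr: 1)

# Model where "a" misses (say, with a small block size) and costs 10 cycles.
miss_order = interleave(streams, lambda addr: 10 if addr == "a" else 1)
```

Under the uniform model the merged order is the same no matter which references would hit or miss, which is exactly what lets all protocols and block sizes see the same trace.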
`The illustrative workloads we use are the six parallel programs (from the
`SPLASH-2 suite) and one multiprogrammed workload described in Chapters 3 and
`4. The parallel programs run in batch mode with exclusive access to the machine
`and do not include operating system activity in the simulations, whereas the multi-
programmed workload includes operating system activity. The number of applica-
`tions used is relatively small, but the applications are primarily for illustration as
`discussed in Chapter 4; the emphasis here is on choosing programs that represent
`important classes of computation and with widely varying characteristics. The fre-
`quencies of basic operations for the applications appear in Table 4.1. We now study
`them in more detail to assess design trade-offs in cache coherency protocols.
`
`5.4.2
`
`Bandwidth Requirement under the MESI Protocol
`
`We begin by using the default 1-MB, single-level caches per processor, as discussed
`in Chapter 4. These are large enough to hold the important working sets for the
`default problem sizes, which is a realistic scenario for all applications. We use four-
`way set associativity (with LRU replacement) to reduce conflict misses and a 64-byte
`cache block size for realism. Driving the workloads through a cache simulator that
`models the Illinois MESI protocol generates the state transition frequencies shown
`in Table 5.1. The data is presented as the number of state transitions of a particular
type per 1,000 references issued by the processors. Note in the table that a new state, NP (not present), is introduced to represent blocks that are not currently in the cache.
`
`
`
`
`
`5
`
`0.0362
`
`0.1856
`
`0.0002
`
`97.1712
`
`M
`
`0.0035
`
`0.0010
`
`0.0010
`
`0.1253
`
`0.1277
`
`902.782
`
`0.6593
`
`0.0002
`
`0.0004
`
`302.702
`
`0.0011
`
`0.0003
`
`0.2164
`
`0.0000
`
`0.2164
`
`697.129
`
`0.9565
`
`1.8676
`
`1.6787
`
`0.0015
`
`To
`
`E
`
`0.0011
`
`0.0001
`
`0.0153
`
`0.0000
`
`0.4454
`
`.2484
`
`0 0 1
`
`0
`
`0 0
`
`.0000
`
`0.2130
`
`0.0010
`
`0.0001
`
`0.0007
`
`NP
`
`0 0
`
`.0201
`
`0.0000
`
`0.0029
`
`0.0013
`
`0 0
`
`.0000
`
`0.0000
`
`0.0339
`
`0.0001
`
`0 0
`
`.6362
`
`Application
`Barnes—Hut
`
`LU
`
`Ocean
`
`From
`
`From
`
`NP
`
`:1./1I'|'I
`
`NP
`
`§mn-I
`
`NP
`
`14.0040
`
`0.0240
`
`134.716
`
`0.9955
`
`2.2392
`
`2.2996
`
`843.565
`
`0.2581
`
`0.5766
`
`0.0001
`
`162.569
`
`0.0354
`
`0.0324
`
`0.0060
`
`0.2768
`
`0.3125
`
`839.507
`
`5.4051
`
`1.7050
`
`0 0
`
`.3051
`
`1.3153
`
`0.4119
`
`0.0001
`
`84.6671
`
`1.4982
`
`906.945
`
`continued
`
`0 0 0
`
`.0068
`
`0.0241
`
`0.0030
`
`0.0284
`
`2.4994
`
`0.0015
`
`0.0003
`
`0.7264
`
`0.0305
`
`0.0008
`
`0.4156
`
`4.2886
`
`0.2040
`
`0.4175
`
`2.6259
`
`0 0
`
`.0262
`
`0 0
`
`.0092
`
`0.0219
`
`0 0
`
`.0485
`
`0.0006
`
`0.0109
`
`0.0173
`
`From
`
`From
`
`From
`
`ZL/'\F'l'l
`
`NP
`
`ZL/1r1'I
`
`NP
`
`§U1|'I'I
`
`Radiosity
`
`Radix
`
`
`
`To
`
`E
`
`1.3358
`0.0000
`
`29.0187
`0
`0
`
`0.1675
`0.0007
`
`11.6629
`0
`0
`
`3.2709
`0
`
`46.7898
`0
`0
`
`1.0241
`0.0079
`
`55.7680
`0
`0
`
`2.1799
`0
`
`5.2156
`0
`
`0
`
`S
`
`0.15486
`0.3403
`
`0.0026
`0.0000
`
`0.3639
`310.949
`0.2970
`
`0.0175
`0.2898
`661.011
`
`0.5253
`0.0072
`
`0.1843
`0.0013
`
`0.0221
`214.6523
`0.3732
`
`0.0680
`0.2570
`772.7819
`
`15.7722
`0
`
`1.8961
`981.2618
`0
`
`0
`0
`
`0
`0
`0
`
`1.7209
`1.1495
`
`4.0793
`0.1153
`
`0.0999
`393.5066
`2.0732
`
`0.3352
`1.7800
`542.4318
`
`26.5124
`0
`
`1.2223
`1,075.2158
`
`0
`
`0
`0
`
`0
`O
`
`0
`
`I
`
`0
`0
`
`0
`0.3740
`0.0001
`
`0
`0
`
`0.0008
`0.2787
`0.1196
`
`0
`0
`
`0
`0
`0
`
`0
`0
`
`0.0063
`2.0514
`0.3551
`
`0
`0
`
`0
`0
`
`0
`
`NP
`
`0
`0.0242
`
`0.8663
`1.1181
`0.0559
`
`0
`0.2619
`
`0.0729
`0.3062
`0.2134
`
`0
`0
`
`1.3029
`16.9032
`0
`
`0
`1.3950
`
`0.5511
`1.2740
`3.1827
`
`0
`0
`
`0.8829
`24.6963
`
`0
`
`NP
`I
`
`E
`5
`M
`
`NP
`1
`
`E
`5
`M
`
`NP
`I
`
`E
`5
`M
`
`NP
`1
`
`E
`5
`M
`
`NP
`1
`
`E
`5
`
`M
`
`E
`LL
`
`E9
`
`LL
`
`E
`9
`LL
`
`EE
`
`LL
`
`E
`LI.
`9
`
`Application
`
`Raytrace
`
`Multiprog
`User Data
`References
`
`Multiprog
`Use’
`.
`Instruction
`References
`
`l\/lultiprog
`Kama‘ Data
`References
`
`Multiprog
`K‘a'”a'
`.
`Instruction
`References
`
The data assumes 16 processors, 1-MB four-way set-associative caches, 64-byte cache blocks, and the Illinois MESI coherence protocol.
`
`
`
A reference may cause a block to be replaced (a transition from one of I, E, S, or M to NP) and a new block to be brought in (creating a transition from NP to one of I, E, S, or M). The sum of state transitions can be greater than 1,000 even though we are presenting averages per 1,000 references because some references cause multiple state transitions. For example, a write miss can cause two transitions in the local processor's cache (e.g., S → NP for the old block and NP → M for the incoming block), in addition to transitions in other caches due to invalidations (I/E/S/M → I).5 This state transition frequency data is very useful for answering "what if" questions. Example 5.8 shows how we can determine the bandwidth requirement these workloads would place on the memory system.
`
EXAMPLE 5.8 Suppose that the integer-intensive applications run at a sustained 200 MIPS per processor and the floating-point-intensive applications at 200 MFLOPS per processor. Assuming that cache block transfers move 64 bytes on the data bus lines and that each bus transaction involves 6 bytes of command and address on the address lines, what is the traffic generated per processor?
`
Answer The first step is to calculate the amount of traffic per instruction. We determine what bus action is taken for each of the possible state transitions and therefore how much traffic is associated with each transaction. For example, an M → NP transition indicates that, due to a miss, a modified cache block needs to be written back. Similarly, an S → M transition indicates that an upgrade request must be issued on the bus. Flushing a modified block in response to a bus transaction (e.g., the M → S or M → I transition) leads to a BusWB transaction as well. The bus transactions for all possible transitions are shown in Table 5.2. All transactions generate 6 bytes of address bus traffic and 64 bytes of data traffic, except BusUpgr, which only generates address traffic. ■
`
`We can now compute the traffic generated. Using Table 5.2, we can convert the
`state transitions per 1,000 memory references in Table 5.1 to bus transactions per
`1,000 memory references and convert this to address and data traffic by multiplying
`by the traffic per transaction. Then, using the frequency of memory accesses in
Table 4.1, we can convert this to traffic per instruction or per FLOP. Finally, multi-
`plying by the assumed processing rate, we get the address and data bandwidth
`requirement for each application. The result of this calculation is shown by the left-
`most bar for each application in Figure 5.18.
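The calculation in this example can be sketched as follows. The transaction counts and the references-per-instruction figure below are invented placeholders (the real inputs come from Tables 5.1, 5.2, and 4.1); only the 6-byte address cost, the 64-byte data cost, and the 200 MIPS rate come from the example.

```python
# Sketch of the Example 5.8 traffic calculation with made-up inputs.
ADDR_BYTES, DATA_BYTES = 6, 64          # per bus transaction (from the example)

# Bus transactions per 1,000 memory references (hypothetical values,
# standing in for the Table 5.1 -> Table 5.2 conversion).
per_1000_refs = {
    "BusRd": 30.0,     # read misses
    "BusRdX": 10.0,    # write misses
    "BusWB": 8.0,      # write-backs and flushes
    "BusUpgr": 5.0,    # S -> M upgrades: address traffic only, no data
}

refs_per_instr = 1.4                    # assumed; really from Table 4.1
mips = 200e6                            # 200 MIPS sustained, from the example

bytes_per_1000_refs = sum(
    n * (ADDR_BYTES if kind == "BusUpgr" else ADDR_BYTES + DATA_BYTES)
    for kind, n in per_1000_refs.items()
)
bytes_per_instr = bytes_per_1000_refs / 1000 * refs_per_instr
bandwidth_mb_per_s = bytes_per_instr * mips / 1e6   # MB/s per processor
```

With these placeholder inputs, 3,390 bytes cross the bus per 1,000 references, or about 4.75 bytes per instruction, giving roughly 949 MB/s per processor; the real per-application numbers are what Figure 5.18 plots.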
`
5. For the Multiprog workload, to speed up the simulations, a 32-KB instruction cache is used as a filter before passing the instruction references to the 1-MB unified instruction and data cache. The state transition frequencies for the instruction references are computed based only on those references that missed in the L1 instruction cache. This filtering does not affect the bus traffic data that we will compute using these numbers. In addition, for Multiprog we present data separately for kernel instructions, kernel data references, user instructions, and user data references. A given reference may produce transitions of multiple types for user and kernel data. For example, if a kernel instruction miss causes a modified user data block to be written back, then we will have one transition for kernel instructions from NP → E/S and another transition for the user data reference category from M → NP.