Parallel Computer Architecture
A Hardware/Software Approach

David E. Culler
Jaswinder Pal Singh
with Anoop Gupta

MORGAN KAUFMANN PUBLISHERS, INC.
San Francisco, California

Senior Editor: Denise E.M. Penrose
Director of Production and Manufacturing: Yonie Overton
Senior Production Editor: Elisabeth Beller
Editorial Coordinator: Meghan Keeffe
Cover Design: Martin Heirakuji Graphic Design
Cover Photo: Image copyright © 1998 PhotoDisc, Inc.
Text Design: Mark Ong, Side by Side Studios
Copyeditor: Jennifer McClain
Proofreaders: Jennifer McClain, Jeff Van Beuren, Ken DellaPenta, Christine Sabooni
Compositor: Nancy Logan
Illustrators: Nancy Logan, Dartmouth Publishing, Inc., Cherie Plumlee
Indexer: Steve Rath
Printer: Quebecor Printing

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances where Morgan Kaufmann Publishers, Inc. is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Morgan Kaufmann Publishers, Inc.
Editorial and Sales Office
340 Pine Street, Sixth Floor
San Francisco, CA 94104-3205

Telephone: 415/392-2665
Facsimile: 415/982-2665
Email: mkp@mkp.com
WWW: http://www.mkp.com
Order toll free: 800/745-7323

Advice, Praise, and Errors: Any correspondence related to this publication or intended for the authors should be addressed to the Editorial and Sales Office of Morgan Kaufmann Publishers, Inc., Dept. PCA APE or sent electronically to pca@mkp.com. Please report errors by email to pcabugs@mkp.com. Please check the errata page at http://www.mkp.com/pca to see if the bug has already been reported and fixed.

© 1999 Morgan Kaufmann Publishers, Inc.
All rights reserved
Printed in the United States of America

02 01 00 99    5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher.

Library of Congress Cataloging-in-Publication Data
Culler, David E.
    Parallel computer architecture: a hardware/software approach /
    David E. Culler, Jaswinder Pal Singh, with Anoop Gupta.
        p. cm.
    Includes bibliographical references and index.
    ISBN 1-55860-343-3

The Dragon protocol consists of four states: exclusive-clean (E), shared-clean (Sc), shared-modified (Sm), and modified (M). Exclusive-clean (or exclusive) has the same meaning and the same motivation as before: only one cache (this cache) has a copy of the block, and it has not been modified (i.e., the main memory is up-to-date). Shared-clean means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date. Shared-modified means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache (i.e., this cache is the owner). A block may be in Sm state in only one cache at a time. However, it is quite possible that one cache has the block in Sm state while others have it in Sc state. Or it may be that no cache has it in Sm state, but some have it in Sc state. This is why, when a cache has the block in Sc state, memory may or may not be up-to-date; it depends on whether some other cache has it in Sm state. M signifies exclusive ownership as before: the block is modified (dirty) and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement. Note that there is no explicit invalid (I) state as in the previous protocols. This is because Dragon is an update-based protocol; the protocol always keeps the blocks in the cache up-to-date, so it is always okay to use the data present in the cache if the tag match succeeds. However, if a block is not present in a cache at all, it can be imagined in a special invalid or not-present state.4
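
Written out as code, the state set is small. The following C enumeration is a minimal sketch of our own (not from the book or any real controller), simply restating the four Dragon states plus a pseudo-state for blocks that are not present:

    typedef enum {
        NOT_PRESENT,  /* no explicit invalid state in Dragon; the block
                         is simply not in this cache                     */
        E,            /* exclusive-clean: sole copy, memory up-to-date   */
        SC,           /* shared-clean: other copies may exist; memory
                         may or may not be up-to-date                    */
        SM,           /* shared-modified: other copies may exist; this
                         cache owns the block and must write it back on
                         replacement; memory is stale                    */
        M             /* modified: sole copy, dirty; this cache owns it  */
    } dragon_state_t;
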
The processor requests, bus transactions, and actions for the Dragon protocol are similar to those of the Illinois MESI protocol. The processor is still assumed to issue only read (PrRd) and write (PrWr) requests. However, since we do not have an invalid state, to specify actions on a tag mismatch we add two more request types: processor read miss (PrRdMiss) and processor write miss (PrWrMiss). As for bus transactions, we have bus read (BusRd), bus write back (BusWB), and a new transaction called bus update (BusUpd). The BusRd and BusWB transactions have the usual semantics. The BusUpd transaction takes the specific word (or bytes) written by the processor and broadcasts it on the bus so that all other processors' caches can update themselves. By broadcasting only the contents of the specific modified word rather than the whole cache block, it is hoped that bus bandwidth is utilized more efficiently. (See Exercise 5.4 for reasons why this may not always be the case.) As in the MESI protocol, to support the E state, a shared signal (S) is available to the cache controller. Finally, the only new capability needed is for the cache controller to update a locally cached memory block (labeled an Update action) with the contents that are being broadcast on the bus by a relevant BusUpd transaction.
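
To make the Update action concrete, here is a hedged C sketch of what the snoop side of a controller might do on a relevant BusUpd. The types and names (cache_line_t, snoop_bus_update) are our illustrative inventions, not the Dragon hardware interface. Note that only Sc and Sm copies can actually observe another cache's BusUpd, since an E or M copy would be the only copy in the system.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 64

    typedef enum { NOT_PRESENT, E, SC, SM, M } dragon_state_t;  /* as above */

    typedef struct {
        uint64_t       tag;
        dragon_state_t state;
        uint8_t        data[BLOCK_SIZE];
    } cache_line_t;

    /* Snooped BusUpd: `payload` holds the `nbytes` the writer modified,
       starting at byte `offset` within the block (offset + nbytes must
       not exceed BLOCK_SIZE). `line` is NULL if the block is not cached. */
    void snoop_bus_update(cache_line_t *line, size_t offset,
                          const uint8_t *payload, size_t nbytes) {
        if (line == NULL)
            return;                    /* not present: nothing to update */
        memcpy(&line->data[offset], payload, nbytes);  /* Update action  */
        /* The writer becomes the new owner (Sm), so whether this copy
           was Sc or Sm before, it drops to Sc in either case.           */
        line->state = SC;
    }
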
4. Logically, there is another state as well, but it is rather crude and is used to bootstrap the protocol. A "miss mode" bit is provided with each cache line to force a miss when that block is accessed. Initialization software reads data into every line in the cache with the miss mode bit turned on to ensure that the processor will miss the first time it references a block that maps to that line. After this first miss, the miss mode bit is turned off and the cache operates normally.
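
For illustration only, the bootstrap trick described in this footnote might look like the sketch below; the boot_line_t type, the miss_mode field, and the helper names are all hypothetical.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t tag;
        int      miss_mode;  /* set at initialization, cleared on first miss */
    } boot_line_t;

    /* Initialization software leaves every line with miss mode on. */
    void init_cache(boot_line_t *lines, size_t nlines) {
        for (size_t i = 0; i < nlines; i++)
            lines[i].miss_mode = 1;
    }

    /* Lookup: a set miss-mode bit forces the first access to miss; after
       that the bit is cleared and the line behaves normally.             */
    int lookup_hits(boot_line_t *line, uint64_t tag) {
        if (line->miss_mode) {
            line->miss_mode = 0;
            return 0;                  /* forced first miss */
        }
        return line->tag == tag;
    }
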
Whether contention actually occurs depends on the timing parameters used and on the burstiness of the traffic, which is not captured by the frequency measurements. Contention, timing, and hence performance are also affected by lower-level interactions with hardware structures (like queues and buffers) and policies.

The simulations used in this section do not model contention. Instead, they use a simple PRAM cost model: all memory operations are assumed to complete in the same amount of time (here a single cycle) regardless of whether they hit or miss in the cache. There are three main reasons for this. First, the focus is on understanding inherent protocol behavior and trade-offs in terms of event frequencies, not so much on performance. Second, since we are experimenting with different cache block sizes and organizations, we would like the interleaving of references from application processes on the simulator to be the same regardless of these choices; that is, all protocols and block sizes should see the same trace of references. With the execution-driven rather than trace-driven simulation we use, this is only possible if we make the cost of every memory operation the same in the simulations. Otherwise, if a reference misses with a small cache block but hits with a larger one, for example, then it will be delayed by different amounts in the interleaving in the two cases. It would therefore be difficult to determine which effects are inherently due to the protocol and which are due to the particular parameter values chosen. Third, realistic simulations that model contention take much more time. The disadvantage of using this simple model even to measure frequencies is that the timing model may affect some of the frequencies we observe; however, this effect is small for the applications we study.
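
The essence of this cost model fits in a few lines of C. The sketch below is our illustration with invented names: every reference advances the simulated clock by exactly one cycle while protocol events are merely counted, which is what keeps the reference interleaving identical across protocols and block sizes.

    #include <stdint.h>

    typedef struct {
        uint64_t cycle;   /* this processor's simulated clock */
    } proc_t;

    /* Tally protocol events for this reference; the protocol and block
       size affect only these counters, never the timing (bookkeeping
       elided in this sketch).                                           */
    static void record_state_transitions(proc_t *p, uint64_t addr,
                                         int is_write) {
        (void)p; (void)addr; (void)is_write;
    }

    /* PRAM cost model: every memory operation completes in one cycle,
       hit or miss, so all protocols and block sizes see the very same
       interleaving of references from the application processes.        */
    void simulate_memory_ref(proc_t *p, uint64_t addr, int is_write) {
        record_state_transitions(p, addr, is_write);
        p->cycle += 1;
    }
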
The illustrative workloads we use are the six parallel programs (from the SPLASH-2 suite) and one multiprogrammed workload described in Chapters 3 and 4. The parallel programs run in batch mode with exclusive access to the machine and do not include operating system activity in the simulations, whereas the multiprogrammed workload includes operating system activity. The number of applications used is relatively small, but the applications are primarily for illustration, as discussed in Chapter 4; the emphasis here is on choosing programs that represent important classes of computation and have widely varying characteristics. The frequencies of basic operations for the applications appear in Table 4.1. We now study them in more detail to assess design trade-offs in cache coherence protocols.

5.4.2 Bandwidth Requirement under the MESI Protocol

We begin by using the default 1-MB, single-level caches per processor, as discussed in Chapter 4. These are large enough to hold the important working sets for the default problem sizes, which is a realistic scenario for all applications. We use four-way set associativity (with LRU replacement) to reduce conflict misses and a 64-byte cache block size for realism. Driving the workloads through a cache simulator that models the Illinois MESI protocol generates the state transition frequencies shown in Table 5.1. The data is presented as the number of state transitions of a particular type per 1,000 references issued by the processors. Note that the table uses a new state, NP (not present), to represent the case when a block is not present in that cache.
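
As a quick sanity check on these parameters, the following C sketch (ours, not from the book) works out the resulting cache geometry: with 64-byte blocks and four ways, a 1-MB cache has 4,096 sets, hence 12 index bits and 6 block-offset bits.

    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_BYTES (1u << 20)                           /* 1 MB       */
    #define BLOCK_BYTES 64u                                  /* 6 offset bits */
    #define WAYS        4u
    #define SETS        (CACHE_BYTES / (BLOCK_BYTES * WAYS)) /* 4,096 sets */

    /* Set index of an address: strip the offset bits, keep 12 index bits. */
    static uint32_t set_index(uint64_t addr) {
        return (uint32_t)((addr / BLOCK_BYTES) % SETS);
    }

    int main(void) {
        printf("sets: %u\n", SETS);                          /* prints 4096 */
        printf("set_index(0x12345678) = %u\n", set_index(0x12345678u));
        return 0;
    }
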
[Table 5.1 State transitions per 1,000 memory references issued by the applications. For each application (Barnes-Hut, LU, Ocean, Radiosity, Radix, Raytrace, and the Multiprog user data, user instruction, kernel data, and kernel instruction reference categories), the table gives the frequency of transitions from each of the states NP, I, E, S, and M to each of the others. The data assumes 16 processors, 1-MB four-way set-associative caches, 64-byte cache blocks, and the Illinois MESI coherence protocol. The individual numeric entries are not reproduced here.]

When a block is replaced on a miss, there is a transition from one of I, E, S, or M to NP for the victim block, and a new block is brought in (creating a transition from NP to one of I, E, S, or M). The sum of state transitions can be greater than 1,000 even though we are presenting averages per 1,000 references because some references cause multiple state transitions. For example, a write miss can cause two transitions in the local processor's cache (e.g., S → NP for the old block and NP → M for the incoming block), in addition to transitions in other caches due to invalidations (I/E/S/M → I).5 This state transition frequency data is very useful for answering "what if" questions. Example 5.8 shows how we can determine the bandwidth requirement these workloads would place on the memory system.

EXAMPLE 5.8 Suppose that the integer-intensive applications run at a sustained 200 MIPS per processor and the floating-point-intensive applications at 200 MFLOPS per processor. Assuming that cache block transfers move 64 bytes on the data bus lines and that each bus transaction involves 6 bytes of command and address on the address lines, what is the traffic generated per processor?

Answer The first step is to calculate the amount of traffic per instruction. We determine what bus action is taken for each of the possible state transitions and therefore how much traffic is associated with each transaction. For example, an M → NP transition indicates that, due to a miss, a modified cache block needs to be written back. Similarly, an S → M transition indicates that an upgrade request must be issued on the bus. Flushing a modified block in response to a bus transaction (e.g., an M → S or M → I transition) leads to a BusWB transaction as well. The bus transactions for all possible transitions are shown in Table 5.2. All transactions generate 6 bytes of address bus traffic and 64 bytes of data traffic, except BusUpgr, which generates only address traffic. ■

We can now compute the traffic generated. Using Table 5.2, we can convert the state transitions per 1,000 memory references in Table 5.1 to bus transactions per 1,000 memory references and convert this to address and data traffic by multiplying by the traffic per transaction. Then, using the frequency of memory accesses in Table 4.1, we can convert this to traffic per instruction or per FLOP. Finally, multiplying by the assumed processing rate, we get the address and data bandwidth requirement for each application. The result of this calculation is shown by the leftmost bar for each application in Figure 5.18.
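
The arithmetic of this conversion is easy to sketch in C. In the sketch below, the per-1,000-reference transaction counts and the references-per-instruction figure are illustrative placeholders rather than the actual values from Tables 5.1 and 4.1; only the 6-byte address and 64-byte data costs come from the example.

    #include <stdio.h>

    #define ADDR_BYTES 6    /* command + address per bus transaction      */
    #define DATA_BYTES 64   /* block payload (BusUpgr carries no data)    */

    int main(void) {
        /* Hypothetical inputs, per 1,000 memory references: */
        double busrd_per_1k   = 30.0;   /* BusRd: misses fetching a block */
        double buswb_per_1k   = 10.0;   /* BusWB: dirty blocks written back */
        double busupgr_per_1k =  5.0;   /* BusUpgr: S -> M upgrades       */
        double refs_per_instr =  0.4;   /* stand-in for a Table 4.1 value */
        double instr_per_sec  = 200e6;  /* sustained 200 MIPS             */

        double addr_bytes =
            (busrd_per_1k + buswb_per_1k + busupgr_per_1k) * ADDR_BYTES;
        double data_bytes =
            (busrd_per_1k + buswb_per_1k) * DATA_BYTES; /* upgrades: none */

        double bytes_per_ref = (addr_bytes + data_bytes) / 1000.0;
        double bytes_per_sec = bytes_per_ref * refs_per_instr * instr_per_sec;

        printf("traffic per processor: %.1f MB/s\n", bytes_per_sec / 1e6);
        return 0;
    }

With these placeholder inputs the sketch prints roughly 226 MB/s per processor; substituting each application's real transition and reference rates yields the bars in Figure 5.18.
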
5. For the Multiprog workload, to speed up the simulations, a 32-KB instruction cache is used as a filter before passing the instruction references to the 1-MB unified instruction and data cache. The state transition frequencies for the instruction references are computed based only on those references that missed in the L1 instruction cache. This filtering does not affect the bus traffic data that we will compute using these numbers. In addition, for Multiprog we present data separately for kernel instructions, kernel data references, user instructions, and user data references. A given reference may produce transitions of multiple types for user and kernel data. For example, if a kernel instruction miss causes a modified user data block to be written back, then we will have one transition for kernel instructions from NP → E/S and another transition for the user data reference category from M → NP.
