Cache Coherence in Large-Scale Multiprocessors
`
`David Chaiken, Craig Fields, Kiyoshi Kurihara,
`and Anant Agarwal
`Massachusetts Institute of Technology
`
In a shared-memory multiprocessor, the memory system provides access to the data to be processed and mechanisms for interprocess communication.
`The bandwidth of the memory system
`limits the speed of computation in current
`high-performance multiprocessors due to
`the uneven growth of processor and mem-
`ory speeds. Caches are fast local memories
`that moderate a multiprocessor’s memory-
`bandwidth demands by holding copies of
`recently used data, and provide a low-
latency access path to the processor. Because of locality in the memory access patterns of multiprocessors, the cache sat-
`isfies a large fraction of the processor
`accesses, thereby reducing both the aver-
`age memory latency and the communica-
`tion bandwidth requirements imposed on
`the system’s interconnection network.
Caches in a multiprocessing environment introduce the cache-coherence problem. When multiple processors maintain locally cached copies of a unique shared memory location, any local modification of the location can result in a globally inconsistent view of memory. Cache-coherence schemes prevent this problem by maintaining a uniform state for each cached block of data.

This article addresses the usefulness of shared-data caches in large-scale multiprocessors, the relative merits of different coherence schemes, and system-level methods for improving directory efficiency.
`Several of today’s commercially avail-
`able multiprocessors use bus-based mem-
ory systems. A bus is a convenient device for ensuring cache coherence because it allows all processors in the system to observe ongoing memory transactions. If a bus transaction threatens the consistent state of a locally cached object, the cache controller can take such appropriate action as invalidating the local copy. Protocols that use this mechanism to ensure coherence are called snoopy protocols because each cache snoops on the transactions of other caches.1
`Unfortunately, buses simply don’t have
`the bandwidth to support a large number of
`processors. Bus cycle times are restricted
`by signal transmission times in multidrop
`environments and must be long enough to
`allow the bus to “ring out,” typically a few
`signal propagation delays over the length
`of the bus. As processor speeds increase,
`the relative disparity between bus and
`processor clocks will simply become more
`evident.
`Consequently, scalable multiprocessor
`systems interconnect processors using
`short point-to-point wires in direct or
multistage networks. Communication along impedance-matched transmission line channels can occur at high speeds, providing communication bandwidth that
scales with the number of processors. Unlike buses, the bandwidth of these networks increases as more processors are added to the system. Unfortunately, such networks don't have a convenient snooping mechanism and don't provide an efficient broadcast capability.

In the absence of a systemwide broadcast mechanism, the cache-coherence problem can be solved with interconnection networks using some variant of directory schemes. This article reviews and analyzes this class of cache-coherence protocols. We use a hybrid of trace-driven simulation and analytical methods to evaluate the performance of these schemes for several parallel applications.

The research presented in this article is part of our effort to build a high-performance large-scale multiprocessor. To that end, we are studying entire multiprocessor systems, including parallel algorithms, compilers, runtime systems, processors, caches, shared memory, and interconnection networks. We find that the best solutions to the cache-coherence problem result from a synergy between a multiprocessor's software and hardware components.

Classification of directory schemes

A cache-coherence protocol consists of the set of possible states in the local caches, the states in the shared memory, and the state transitions caused by the messages transported through the interconnection network to keep memory coherent. To simplify the protocol and the analysis, our data block size is the same for coherence and cache fetch.

A cache-coherence protocol that does not use broadcasts must store the locations of all cached copies of each block of shared data. This list of cached locations, whether centralized or distributed, is called a directory. A directory entry for each block of data contains a number of pointers to specify the locations of copies of the block. Each directory entry also contains a dirty bit to specify whether or not a unique cache has permission to write the associated block of data.

The different flavors of directory protocols fall under three primary categories: full-map directories, limited directories, and chained directories. Full-map directories2 store enough state associated with each block in global memory so that every cache in the system can simultaneously store a copy of any block of data. That is, each directory entry contains N pointers, where N is the number of processors in the system. Such directories can be optimized to use a single bit per pointer. Limited directories3 differ from full-map directories in that they have a fixed number of pointers per entry, regardless of the number of processors in the system. Chained directories4 emulate the full-map schemes by distributing the directory among the caches.

To analyze these directory schemes, we chose at least one protocol from each category. In each case, we tried to pick the protocol that was the least complex to implement in terms of the required hardware overhead. Our method for simplifying a protocol was to minimize the number of cache states, memory states, and types of protocol messages. All of our protocols guarantee sequential consistency, which Lamport defined to ensure the correct execution of multiprocess programs.

Full-map directories. The full-map protocol uses directory entries with one bit per processor and a dirty bit. Each bit represents the status of the block in the corresponding processor's cache (present or absent). If the dirty bit is set, then one and only one processor's bit is set, and that processor has permission to write into the block. A cache maintains two bits of state per block. One bit indicates whether a block is valid; the other bit indicates whether a valid block may be written. The cache-coherence protocol must keep the state bits in the memory directory and those in the caches consistent.

Figure 1a illustrates three different states of a full-map directory. In the first state, location X is missing in all of the caches in the system. The second state results from three caches (C1, C2, and C3) requesting copies of location X. Three pointers (processor bits) are set in the entry to indicate the caches that have copies of the block of data. In the first two states, the dirty bit, on the left side of the directory entry, is set to clean (C), indicating that no processor has permission to write to the block of data. The third state results from cache C3 requesting write permission for the block. In this final state, the dirty bit is set to dirty (D), and there is a single pointer to the block of data in cache C3.

It is worth examining the transition from the second state to the third state in more detail. Once processor P3 issues the write to cache C3, the following events transpire:
`
(1) Cache C3 detects that the block containing location X is valid but that the processor does not have permission to write to the block, indicated by the block's write-permission bit in the cache.

(2) Cache C3 issues a write request to the memory module containing location X and stalls processor P3.

(3) The memory module issues invalidate requests to caches C1 and C2.

(4) Cache C1 and cache C2 receive the invalidate requests, set the appropriate bit to indicate that the block containing location X is invalid, and send acknowledgments back to the memory module.

(5) The memory module receives the acknowledgments, sets the dirty bit, clears the pointers to caches C1 and C2, and sends write permission to cache C3.

(6) Cache C3 receives the write permission message, updates the state in the cache, and reactivates processor P3.
`
`Note that the memory module waits to
`receive the acknowledgments before al-
`lowing processor P3 to complete its write
`transaction. By waiting for acknowledg-
`ments, the protocol guarantees that the
`memory system ensures sequential consis-
`tency.
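A rough sketch in C illustrates the bookkeeping behind this sequence. It is an illustration only, not a description of any particular hardware implementation: the structure layout, the stub routines standing in for network messages, and the cache count are all assumptions made for the example.

    #include <stdio.h>
    #include <stdbool.h>

    #define N 8   /* number of caches; a small value chosen for illustration */

    typedef struct {
        bool present[N];   /* one presence bit per cache */
        bool dirty;        /* set when exactly one cache holds a writable copy */
    } dir_entry_t;

    /* Stubs standing in for network and memory-module actions. */
    static void send_invalidate(int c, long block) { printf("invalidate cache %d, block %ld\n", c, block); }
    static void wait_for_acks(int n)               { printf("wait for %d acknowledgments\n", n); }
    static void grant_write(int c, long block)     { printf("grant write to cache %d, block %ld\n", c, block); }

    /* Full-map handling of a write request from cache `writer` for `block`. */
    static void handle_write_request(dir_entry_t *e, int writer, long block)
    {
        int outstanding = 0;
        for (int c = 0; c < N; c++) {
            if (c != writer && e->present[c]) {   /* invalidate every other cached copy */
                send_invalidate(c, block);
                e->present[c] = false;
                outstanding++;
            }
        }
        wait_for_acks(outstanding);   /* sequential consistency: wait before granting */
        e->dirty = true;
        e->present[writer] = true;
        grant_write(writer, block);
    }

    int main(void)
    {
        dir_entry_t e = { .present = { [1] = true, [2] = true, [3] = true } };
        handle_write_request(&e, 3, 0x100);   /* C3 requests write permission for X */
        return 0;
    }

The essential point is that the dirty bit is set and write permission is granted only after every other presence bit has been cleared and acknowledged.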
`The full-map protocol provides a useful
`upper bound for the performance of cen-
`tralized directory-based cache coherence.
`However, it is not scalable with respect to
`memory overhead. Assume that the
`amount of distributed shared memory in-
`creases linearly with the number of
`processors N. Because the size of the direc-
`tory entry associated with each block of
`memory is proportional to the number of
`processors, the memory consumed by the
`directory is proportional to the size of
`memory (O(N)) multiplied by the size of
the directory entry (O(N)). Thus, the total memory overhead scales as the square of the number of processors (O(N2)).
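A rough, hypothetical configuration (ours, for illustration only) shows what this scaling means in practice: with N = 1,024 processors and 16-byte (128-bit) memory blocks, each full-map entry needs 1,024 presence bits plus a dirty bit, or roughly 1,025 directory bits per 128 bits of data, an overhead of about 800 percent, and the ratio keeps growing as processors are added.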
`
Limited directories. Limited directory protocols are designed to solve the directory size problem. Restricting the number of simultaneously cached copies of any particular block of data limits the growth of the directory to a constant factor. For our analysis, we selected the limited directory protocol proposed in Agarwal et al.3

A directory protocol can be classified as Dir_i X using the notation from Agarwal et al.3 The symbol i stands for the number of pointers, and X is NB for a scheme with no broadcast and B for one with broadcast. A full-map scheme without broadcast is represented as Dir_N NB. A limited directory
protocol that uses i < N pointers is denoted Dir_i NB. The limited directory protocol is
`similar to the full-map directory, except in
`the case when more than i caches request
`read copies of a particular block of data.
Figure 1b shows the situation when three caches request read copies in a memory system with a Dir_2 NB protocol. In this
`case, we can view the two-pointer direc-
`tory as a two-way set-associative cache of
`pointers to shared copies. When cache C3
`requests a copy of location X, the memory
`module must invalidate the copy in either
`cache Cl or cache C2. This process of
`pointer replacement is sometimes called
`eviction. Since the directory acts as a set-
`associative cache, it must have a pointer
`replacement policy. Our protocol uses an
`easily implemented pseudorandom evic-
`tion policy that requires no extra memory
overhead. In Figure 1b, the pointer to cache C3 replaces the pointer to cache C2.
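A minimal sketch of this pointer replacement, assuming a Dir_2 NB entry and treating the invalidation message as a stub, might look as follows; the structure and constants are illustrative, not taken from any particular implementation.

    #include <stdio.h>
    #include <stdlib.h>

    #define I_PTRS 2   /* pointers per entry (Dir_2 NB); assumed value */

    typedef struct {
        int ptr[I_PTRS];   /* identifiers of caches holding read copies */
        int count;         /* number of valid pointers */
        int dirty;         /* dirty bit, unused in this read-only example */
    } lim_entry_t;

    static void send_invalidate(int c) { printf("invalidate cache %d\n", c); }

    /* Record a new read copy; if all i pointers are in use, evict one at
       pseudorandom, mirroring the eviction policy described above. */
    static void add_reader(lim_entry_t *e, int cache_id)
    {
        if (e->count == I_PTRS) {
            int victim = rand() % I_PTRS;   /* pseudorandom replacement */
            send_invalidate(e->ptr[victim]);
            e->ptr[victim] = cache_id;      /* reuse the evicted pointer slot */
        } else {
            e->ptr[e->count++] = cache_id;
        }
    }

    int main(void)
    {
        lim_entry_t e = { .count = 0 };
        add_reader(&e, 1);   /* C1 reads X */
        add_reader(&e, 2);   /* C2 reads X */
        add_reader(&e, 3);   /* C3 reads X: one of C1 or C2 is evicted */
        return 0;
    }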
`Why might limited directories succeed?
`If the multiprocessor exhibits processor
`locality in the sense that in any given inter-
`val of time only a small subset of all the
`processors access a given memory word,
`then a limited directory is sufficient to
`capture this small “worker-set” of proces-
`sors.
Directory pointers in a Dir_i NB protocol encode binary processor identifiers, so each pointer requires log2 N bits of memory, where N is the number of processors in the system. Given the same assumptions as for the full-map protocol, the memory overhead of limited directory schemes grows as O(N log N). These protocols are considered scalable with respect to memory overhead because the resources required to implement them grow approximately linearly with the number of processors in the system.
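As an illustrative calculation (ours, not drawn from the article's data): with N = 1,024 processors, each pointer occupies log2 1,024 = 10 bits, so a limited directory with, say, four pointers needs about 41 bits per entry including the dirty bit, regardless of how large the machine grows, compared with 1,025 bits for a full-map entry in the same system.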
Dir_i B protocols allow more than i copies of each block of data to exist, but they resort to a broadcast mechanism when more than i cached copies of a block need to be invalidated. However, interconnection networks with point-to-point wires do not provide an efficient systemwide broadcast capability. In such networks, it is also difficult to determine the completion of a broadcast to ensure sequential consistency. While it is possible to limit some Dir_i B broadcasts to a subset of the system (see Agarwal et al.3), we restrict our evaluation of limited directories to the Dir_i NB protocols.
`
Figure 1. Three types of directory protocols: (a) three states of a full-map directory; (b) eviction in a limited directory; and (c) chained directory.

Chained directories. Chained directories, the third option for cache-coherence schemes that do not utilize a broadcast mechanism, realize the scalability of limited directories without restricting the number of shared copies of data blocks.4 This type of cache-coherence scheme is called a chained scheme because it keeps track of shared copies of data by maintaining a chain of directory pointers. We investigated two chained directory schemes.

The simpler of the two schemes implements a singly linked chain, which is best described by example (see Figure 1c). Suppose there are no shared copies of location X. If processor P1 reads location X, the memory sends a copy to cache C1, along with a chain termination (CT) pointer. The memory also keeps a pointer to cache C1. Subsequently, when processor P2 reads location X, the memory sends a copy to cache C2, along with the pointer to cache C1. The memory then keeps a pointer to cache C2. By repeating this step, all of the caches can cache a copy of location X. If processor P3 writes to location X, it is necessary to send a data invalidation message down the chain. To ensure sequential consistency, the memory module denies processor P3 write permission until the processor with the chain termination pointer acknowledges the invalidation of the chain. Perhaps this scheme should be called a gossip protocol (as opposed to a snoopy protocol) because information is passed from individual to individual, rather than being spread by covert observation.
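The chain manipulation can be sketched in a few lines of C. The fragment below models only the pointer updates for a read miss and for an invalidation walk down the chain; the data types, the fixed cache count, and the sequential loop standing in for message traffic are simplifying assumptions for illustration.

    #include <stdio.h>

    #define CT (-1)   /* chain-termination pointer */

    /* Per-cache state for one block; the `next` field forms the chain. */
    typedef struct { int valid; int next; } cache_line_t;

    typedef struct {
        int head;                 /* memory's pointer to the most recent reader */
        cache_line_t cache[8];    /* one line per cache, indexed by cache id */
    } chained_block_t;

    /* A read miss by cache `c`: memory hands out its old pointer and records
       `c` as the new head of the chain. */
    static void read_miss(chained_block_t *b, int c)
    {
        b->cache[c].valid = 1;
        b->cache[c].next  = b->head;   /* CT if this is the first copy */
        b->head = c;
    }

    /* A write by cache `writer`: walk the chain, invalidating each copy.
       Write permission is granted only after the holder of the CT pointer
       acknowledges; the sequential walk models that ordering implicitly. */
    static void write_request(chained_block_t *b, int writer)
    {
        for (int c = b->head; c != CT; ) {
            int next = b->cache[c].next;
            if (c != writer) {
                b->cache[c].valid = 0;
                printf("invalidated cache %d\n", c);
            }
            c = next;
        }
        b->head = writer;
        b->cache[writer].valid = 1;
        b->cache[writer].next  = CT;
        printf("write permission granted to cache %d\n", writer);
    }

    int main(void)
    {
        chained_block_t b = { .head = CT };
        read_miss(&b, 1);        /* P1 reads X */
        read_miss(&b, 2);        /* P2 reads X */
        write_request(&b, 3);    /* P3 writes X: invalidations travel down the chain */
        return 0;
    }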
The possibility of cache-block replacement complicates chained directory protocols. Suppose that cache C1 through cache Cn all have copies of location X and that location X and location Y map to the same (direct-mapped) cache line. If processor Pi reads location Y, it must first evict location X from its cache. In this situation, two possibilities exist:
`
(1) Send a message down the chain to cache Ci-1 with a pointer to cache Ci+1 and splice Ci out of the chain, or

(2) Invalidate location X in cache Ci+1 through cache Cn.

For our evaluation, we chose the second scheme because it can be implemented by a less complex protocol than the first. In either case, sequential consistency is maintained by locking the memory location while invalidations are in progress.

Another solution to the replacement problem is to use a doubly linked chain. This scheme maintains forward and backward chain pointers for each cached copy so that the protocol does not have to traverse the chain when there is a cache replacement. The doubly linked directory optimizes the replacement condition at the cost of a larger average message block size (due to the transmission of extra directory pointers), twice the pointer memory in the caches, and a more complex coherence protocol.

Although the chained protocols are more complex than the limited directory protocols, they are still scalable in terms of the amount of memory used for the directories. The pointer sizes grow as the logarithm of the number of processors, and the number of pointers per cache or memory block is independent of the number of processors.

Caching only private data. Up to this point, we have assumed that caches are allowed to store local copies of shared variables, thus leading to the cache-coherence problem. An alternative shared-memory method avoids the cache-coherence problem by disallowing caching of shared data. In our analysis, we designate this scheme by saying it only caches private data. This scheme caches private data, shared data that is read-only, and instructions, while references to modifiable shared data bypass the cache. In practice, shared variables must be statically identified to use this scheme.

Methodology

What is a good performance metric for comparing the various cache-coherence schemes? To evaluate the performance of the memory system, which includes the cache, the memory, and the interconnection network, we determine the contribution of the memory system to the time needed to run a program on the system. Our analysis computes the processor utilization, or the fraction of time that each processor does useful work. One minus the utilization yields the fraction of processor cycles wasted due to memory system delays. The actual system speedup equals the number of processors multiplied by the processor utilization. This metric has been used in other studies of multiprocessor cache and network performance.6

In a multiprocessor, processor utilization (and therefore system speedup) is affected by the frequency of memory references and the latency of the memory system. The latency (T) of a message through the interconnection network depends on several factors, including the network topology and speed, the number of processors in the system, the frequency and size of the messages, and the memory access latency. The cache-coherence protocol determines the request rate, message size, and memory latency. To compute processor utilization, we need to use detailed models of cache-coherence protocols and interconnection networks.

Figure 2 shows an overview of our analysis process. Multiprocessor address traces generated using three tracing methods at Stanford University, IBM, and MIT are run on a cache and directory simulator that counts the occurrences of different types of protocol transactions. A cost is assigned to each of these transaction types to compute the average processor request rate, the average network message block size, and the average memory latency per transaction. From these parameters, a model of a packet-switched, pipelined, multistage interconnection network calculates the average processor utilization.

Figure 2. Diagram of methodology.

Table 1. Summary of trace statistics, with length values in millions of references to memory.

Source                 Language   Processors   Application   Length
VAX T-bit              C          16           P-Thor        7.09
                                               MP3D          7.38
                                               LocusRoute    7.05
                                               SA-TSP        7.11
Postmortem scheduler   Fortran    64           FFT           7.44
                                               Weather       31.76
                                               Simple        27.03
T-Mul-T                Mul-T      64           Speech        11.77
`
Getting multiprocessor address trace data. The address traces represent a wide range of parallel algorithms written in three different programming languages. The programs traced at Stanford were written in C; at IBM, in Fortran; and at MIT, in Mul-T, a variant of Multilisp. The implementation of the trace collector differs for each of the programming environments. Each tracing system can theoretically obtain address traces for an arbitrary number of processors, enabling a study of the behavior of cache-coherent machines much larger than any built to date. Table 1 summarizes general characteristics of the traces. We will compare the relative performance of the various coherence schemes individually for each application.

The SA-TSP, MP3D, P-Thor, and LocusRoute traces were gathered via the trap-bit method using 16 processors. SA-TSP uses simulated annealing to solve the traveling salesman problem. MP3D is a 3D particle simulator for rarefied flow. P-Thor is a parallel logic simulator. LocusRoute is a global router for VLSI standard cells. Weber and Gupta provide a detailed description of the applications.

Trap-bit (T-bit) tracing for multiprocessors is an extension of single-processor trap-bit tracing. In the single-processor implementation, the processor traps after each instruction if the trap bit is set, allowing interpretation of the trapped instruction and emission of the corresponding memory addresses. Multiprocessor T-bit tracing extends this method by scheduling a new process on every trapped instruction. Once a process undergoes a trap, the trace mechanism performs several tasks. It records the corresponding memory addresses, saves the processor state of the trapped process, and schedules another process from its list of processes, typically in a round-robin fashion.

The Weather, Simple, and fast Fourier transform traces were derived using the postmortem scheduling method at IBM. The Weather application partitions the atmosphere around the globe into a three-dimensional grid and uses finite-difference methods to solve a set of partial differential equations describing the state of the system. Simple models the behavior of fluids and employs finite-difference methods to solve equations describing hydrodynamic behavior. FFT is a radix-2 fast Fourier transform.

Postmortem scheduling is a technique that generates a parallel trace from a uniprocessor execution trace of a parallel application. The uniprocessor trace is a task trace with embedded synchronization information that can be scheduled, after execution (postmortem), into a parallel trace that obeys the synchronization constraints. This type of trace generation uses only one processor to produce the trace and to perform the postmortem scheduling. So, the number of processes is limited only by the application's synchronization constraints and by the number of parallel tasks in the single-processor trace.

The Speech trace was generated by a compiler-aided tracing scheme. The application comprises the lexical decoding stage of a phonetically based spoken-language understanding system developed by the MIT Spoken Language Systems Group. The Speech application uses a dictionary of about 300 words represented by a 3,500-node directed graph. The input to the lexical decoder is another directed graph representing possible sequences of phonemes in the given utterance. The application uses a modified Viterbi search algorithm to find the best match between paths through the two graphs.

In a compiler-based tracing scheme, code inserted into the instruction stream of a program at compile time records the addresses of memory references as a side effect of normal execution. Our compiler-aided multiprocessor trace implementation is T-Mul-T, a modification of the Mul-T programming environment that can be used to generate memory address traces for programs running on an arbitrary number of processors. Instructions are not currently traced in T-Mul-T. We assume that all instructions hit in the cache and, for the processor utilization computation, an instruction reference is associated with each data reference. We make these assumptions only for the Speech application, because the other traces include instructions.
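As a loose illustration of the compiler-aided idea (not the T-Mul-T mechanism itself), such instrumentation amounts to having the compiler emit a recording call before each load and store. The sketch below shows, in C, what the inserted code might expand to; the function name and trace format are invented for the example.

    #include <stdio.h>

    /* Hypothetical recording routine called before each memory reference. */
    static void trace_ref(int cpu, const void *addr, char type)
    {
        printf("%d %c %p\n", cpu, type, addr);
    }

    int main(void)
    {
        int x = 0, y;

        /* What the compiler-inserted instrumentation might expand to: */
        trace_ref(0, &x, 'W');  x = 42;      /* write reference */
        trace_ref(0, &x, 'R');  y = x + 1;   /* read reference  */

        return y == 43 ? 0 : 1;
    }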
The trace-gathering techniques also differ in their treatment of private data locations, which must be identified for the scheme that only caches private data. The private references are identified statically (at compile time) in the Fortran traces and are identified dynamically by postprocessing the other traces. Since static methods must be more conservative than dynamic methods when partitioning private and shared data, the performance that we predict for the private-data caching scheme on the C and Mul-T applications is slightly optimistic. In practice, the nontrivial problem of static data partitioning makes it difficult to implement schemes that cache only private data.

Simulating a cache-coherence strategy. For each memory reference in a trace, our cache and directory simulator determines the effects on the state of the corresponding block in the cache and the shared memory. This state consists of the cache tags and directory pointers used to maintain cache coherence. In the simulation, the network provides no feedback to the cache or memory modules. We assume that all side effects from each memory transaction (entry in the trace) are stored simultaneously. While this simulation strategy does not accurately model the state of the memory system on a cycle-by-cycle basis,
it does produce accurate counts of each type of protocol transaction over the length of a correct execution of a parallel program.

However, since we assume that all side effects of any transaction occur simultaneously, we do not model the difference between sequential and concurrent operations. This inaccuracy particularly affects the analysis of chained directory schemes. Specifically, when a shared write is performed in a system that uses a chained directory scheme, the copies of the written location must be invalidated in sequence, while a centralized directory scheme may send the invalidations in parallel and keep track of the number of outstanding acknowledgments. Thus, the minimum latency for shared writes to clean cache blocks is greater for the distributed schemes than for the centralized schemes.

Analyzing the trade-offs between centralized and distributed schemes requires a much more detailed simulation. While it is possible to accurately model the memory system on a cycle-by-cycle basis, such a simulation requires much higher overhead than our simulations in terms of both programming time and simulation runtime. Our MIT research group is running experiments on a simulator for an entire multiprocessor system. Simulations of the entire system run approximately 100 times slower than the trace-driven simulations used for this article. Variants of coherence schemes are harder to implement in the detailed simulator than in the trace-driven environment. To investigate a wide range of applications and cache-coherence protocols, we avoided the high overhead of such detailed simulations by performing trace-driven simulations.

In a trace-driven simulation, a memory transaction consists of a processor-to-memory reference and its effect on the state of the memory system. Any transaction that causes a message to be sent out over the network contributes to the average request rate, average message size, and average memory latency. Each type of transaction is assigned a cost in terms of the number of messages that must be sent over the network (including both the requests and the responses), the latency encountered at the memory modules, and the total number of words (including routing information) transported through the network. Given a trace and a particular cache-coherence protocol, the cache and directory simulator determines the percentage of each transaction type in the trace. The percentage of the transaction type, multiplied by its cost, gives the contribution of the transaction to each of the three parameters listed above.

In addition to the cache-coherence strategy, other parameters affect the performance of the memory system. We chose values for these parameters (listed in Table 2) based on the technology used for contemporary multiprocessors. Although we chose a 256-kilobyte cache, the results of our analysis do not differ substantially for cache sizes from 256 kilobytes down to 16 kilobytes because the working sets for the applications are small when partitioned over a large number of processors. The effect of other parameters, including the cache-block size, has been explored in several studies (see Eggers and Katz9 and references therein).

Table 2. Simulation parameter defaults for the cache, directory, and network.

Type of Parameter   Name                               Default Value
Cache/Directory     Cache size                         256 Kbytes
                    Cache-block size                   16 bytes
                    Cache associativity                Direct mapped
                    Cache-update policy                Write back
                    Directory pointer replace policy   Random
Network             Network message header size        16 bits
                    Network switch size                4 x 4
                    Network channel width              16 bits
                    Processor cycle time               2 x network switch cycle time
                    Memory address size                32 bits
                    Base memory access time            6 x network switch cycle time

The interconnection network model. The directory schemes that we analyze transmit messages over an interconnection network to maintain cache coherence. They distribute shared memory and associated directories over the processing nodes. Our analysis uses a packet-switched, buffered, multistage interconnection network that belongs to the general class of Omega networks. The network switches are pipelined so that a message header can leave a switch even while the rest of the message is still being serviced. A protocol message travels through n network switch stages to the destination node and takes M cycles for the memory access. The network is buffered and guarantees sequenced delivery of messages between any two nodes on the network.

Computation of the processor utilization is based on the analysis method that Patel used. The network model yields the average latency T of a protocol message through the network with n stages, k x k size switches, and average memory delay M. We derive processor utilization U from a set of three equations:

    U = 1 / (1 + mT)

    ρ = UmB

    T = (n + B + M - 1) + n x (contention delay per switch)

where m is the probability a message is generated on a given processor cycle, with corresponding network latency T. The channel utilization (ρ) is the product of the effective network request rate (Um) and the average message size B. The latency equation uses the packet-switched network model by Kruskal and Snir. The first term in the equation (n + B + M - 1) gives the latency through an unloaded network. The second term gives the increase in latency due to network contention, which is the product of the contention delay through one switch and the number of stages. We verified the model in the context of our research by comparing its predictions to the performance of a packet-switched network simulator that transmitted messages generated by a Poisson process.

Table 2 shows the default network parameters we used in our analysis. While this article presents results for a packet-switched multistage network, it is possible to derive results for other types of networks by varying the network model used in the final stage of the analysis. In fact, we repeated our analysis for the direct, two-dimensional mesh network that we plan to use in our own machine. With the direct network model, the cache-coherence schemes showed the same relative behavior as they did with the network model described above. The ability to use the results from one set of directory simulations to derive statistics for a range of network or bus types displays the power of this modeling method.
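The three equations are coupled, since the latency T depends on the channel utilization, which in turn depends on U, so they must be solved together. A small C sketch shows one way to iterate them to a fixed point; the input values and the per-switch contention expression used here (a simple rho / (1 - rho) placeholder rather than the Kruskal and Snir term) are assumptions for illustration only.

    #include <stdio.h>

    /* Iterate U = 1/(1 + mT), rho = U*m*B, T = (n + B + M - 1) + n*d(rho)
       to a fixed point.  The per-switch contention delay d(rho) below is a
       simple placeholder, rho/(1 - rho), chosen only for illustration. */
    int main(void)
    {
        double m = 0.05;   /* messages per processor cycle (assumed) */
        double B = 4.0;    /* average message size in network words (assumed) */
        double n = 6.0;    /* network switch stages (assumed) */
        double M = 6.0;    /* memory access cycles (assumed) */

        double T = n + B + M - 1.0;   /* start from the unloaded latency */
        for (int i = 0; i < 100; i++) {
            double U   = 1.0 / (1.0 + m * T);   /* processor utilization */
            double rho = U * m * B;             /* channel utilization */
            T = (n + B + M - 1.0) + n * rho / (1.0 - rho);
        }
        printf("processor utilization = %.3f\n", 1.0 / (1.0 + m * T));
        return 0;
    }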
`
Analysis of directory schemes

`The graphs presented below plot various
`combinations of applications and cache-
`coherence schemes on the vertical axis and
`processor utilization on the horizontal
`axis. Since the data reference characteris-
`tics vary significantly between applica-
`tions and trace gathering methods, we do
`not average results from the different
`traces. The results presented here concen-
`trate on the Weather, Speech, and P-Thor
`applications. We discuss other applica-
`tions when they exhibit significantly dif-
ferent behavior.

Are caches useful for shared data?
`Figure 3 shows the processor utilizations
`realized for the Weather, Speech, and P-
`Thor applications using each of the coher-
`ence schemes we evaluated. The long bar
`at the bottom of each graph gives the value
`for “no cache coherence.” This number is
`derived by considering all addresses in
each trace to be not shared. Processor utilization with no cache coherence gives, in a
`sense, the effect of the native hit/miss rate
`for the application. The number is artificial
`because it does not represent the behavior
`of a correctly operating system. However,
`the number does give an upper bound on
`the performance of any coherence scheme
`and allows us to focus on the component of
`processor utilization lost due to sharing
`betwe