Directory-Based Cache Coherence in Large-Scale Multiprocessors

David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal
Massachusetts Institute of Technology
This article addresses the usefulness of shared-data caches in large-scale multiprocessors, the relative merits of different coherence schemes, and system-level methods for improving directory efficiency.

In a shared-memory multiprocessor, the memory system provides access to the data to be processed and mechanisms for interprocess communication. The bandwidth of the memory system limits the speed of computation in current high-performance multiprocessors due to the uneven growth of processor and memory speeds. Caches are fast local memories that moderate a multiprocessor's memory-bandwidth demands by holding copies of recently used data, and provide a low-latency access path to the processor. Because of locality in the memory access patterns of multiprocessors, the cache satisfies a large fraction of the processor accesses, thereby reducing both the average memory latency and the communication bandwidth requirements imposed on the system's interconnection network.

Caches in a multiprocessing environment introduce the cache-coherence problem. When multiple processors maintain locally cached copies of a unique shared memory location, any local modification of the location can result in a globally inconsistent view of memory. Cache-coherence schemes prevent this problem by maintaining a uniform state for each cached block of data.
Several of today's commercially available multiprocessors use bus-based memory systems. A bus is a convenient device for ensuring cache coherence because it allows all processors in the system to observe ongoing memory transactions. If a bus transaction threatens the consistent state of a locally cached object, the cache controller can take such appropriate action as invalidating the local copy. Protocols that use this mechanism to ensure coherence are called snoopy protocols because each cache snoops on the transactions of other caches.¹

Unfortunately, buses simply don't have the bandwidth to support a large number of processors. Bus cycle times are restricted by signal transmission times in multidrop environments and must be long enough to allow the bus to "ring out," typically a few signal propagation delays over the length of the bus. As processor speeds increase, the relative disparity between bus and processor clocks will simply become more evident.

Consequently, scalable multiprocessor systems interconnect processors using short point-to-point wires in direct or multistage networks. Communication along impedance-matched transmission line channels can occur at high speeds, providing communication bandwidth that scales with the number of processors.
Unlike buses, the bandwidth of these networks increases as more processors are added to the system. Unfortunately, such networks don't have a convenient snooping mechanism and don't provide an efficient broadcast capability.

In the absence of a systemwide broadcast mechanism, the cache-coherence problem can be solved with interconnection networks using some variant of directory schemes.² This article reviews and analyzes this class of cache-coherence protocols. We use a hybrid of trace-driven simulation and analytical methods to evaluate the performance of these schemes for several parallel applications.

The research presented in this article is part of our effort to build a high-performance large-scale multiprocessor. To that end, we are studying entire multiprocessor systems, including parallel algorithms, compilers, runtime systems, processors, caches, shared memory, and interconnection networks. We find that the best solutions to the cache-coherence problem result from a synergy between a multiprocessor's software and hardware components.

Classification of directory schemes

A cache-coherence protocol consists of the set of possible states in the local caches, the states in the shared memory, and the state transitions caused by the messages transported through the interconnection network to keep memory coherent. To simplify the protocol and the analysis, our data block size is the same for coherence and cache fetch.

A cache-coherence protocol that does not use broadcasts must store the locations of all cached copies of each block of shared data. This list of cached locations, whether centralized or distributed, is called a directory. A directory entry for each block of data contains a number of pointers to specify the locations of copies of the block. Each directory entry also contains a dirty bit to specify whether or not a unique cache has permission to write the associated block of data.

The different flavors of directory protocols fall under three primary categories: full-map directories, limited directories, and chained directories. Full-map directories² store enough state associated with each block in global memory so that every cache in the system can simultaneously store a copy of any block of data. That is, each directory entry contains N pointers, where N is the number of processors in the system. Such directories can be optimized to use a single bit per pointer. Limited directories³ differ from full-map directories in that they have a fixed number of pointers per entry, regardless of the number of processors in the system. Chained directories⁴ emulate the full-map schemes by distributing the directory among the caches.

To analyze these directory schemes, we chose at least one protocol from each category. In each case, we tried to pick the protocol that was the least complex to implement in terms of the required hardware overhead. Our method for simplifying a protocol was to minimize the number of cache states, memory states, and types of protocol messages. All of our protocols guarantee sequential consistency, which Lamport defined to ensure the correct execution of multiprocess programs.

Full-map directories. The full-map protocol uses directory entries with one bit per processor and a dirty bit. Each bit represents the status of the block in the corresponding processor's cache (present or absent). If the dirty bit is set, then one and only one processor's bit is set, and that processor has permission to write into the block. A cache maintains two bits of state per block. One bit indicates whether a block is valid; the other bit indicates whether a valid block may be written. The cache-coherence protocol must keep the state bits in the memory directory and those in the caches consistent.

Figure 1a illustrates three different states of a full-map directory. In the first state, location X is missing in all of the caches in the system. The second state results from three caches (C1, C2, and C3) requesting copies of location X. Three pointers (processor bits) are set in the entry to indicate the caches that have copies of the block of data. In the first two states, the dirty bit, on the left side of the directory entry, is set to clean (C), indicating that no processor has permission to write to the block of data. The third state results from cache C3 requesting write permission for the block. In this final state, the dirty bit is set to dirty (D), and there is a single pointer to the block of data in cache C3.

It is worth examining the transition from the second state to the third state in more detail. Once processor P3 issues the write to cache C3, the following events transpire:
(1) Cache C3 detects that the block containing location X is valid but that the processor does not have permission to write to the block, indicated by the block's write-permission bit in the cache.
(2) Cache C3 issues a write request to the memory module containing location X and stalls processor P3.
(3) The memory module issues invalidate requests to caches C1 and C2.
(4) Cache C1 and cache C2 receive the invalidate requests, set the appropriate bit to indicate that the block containing location X is invalid, and send acknowledgments back to the memory module.
(5) The memory module receives the acknowledgments, sets the dirty bit, clears the pointers to caches C1 and C2, and sends write permission to cache C3.
(6) Cache C3 receives the write permission message, updates the state in the cache, and reactivates processor P3.
Note that the memory module waits to receive the acknowledgments before allowing processor P3 to complete its write transaction. By waiting for acknowledgments, the protocol guarantees that the memory system ensures sequential consistency.
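As a concrete illustration of steps 3 through 6 above, the following is a minimal software sketch of how a memory module's directory logic might process the write request. It is only a model of the protocol as described in this article, not the authors' hardware; the structure names, the printed message stubs, and the fixed N = 64 are assumptions introduced for the example.

#include <stdio.h>
#include <stdbool.h>

#define N 64                       /* assumed number of processors/caches */

/* One full-map directory entry: one presence bit per processor plus a dirty bit. */
struct dir_entry {
    bool present[N];
    bool dirty;
};

/* Stand-ins for the network interface; a real memory module would send and
 * receive protocol messages here. */
static void send_invalidate(int cache_id)   { printf("invalidate -> C%d\n", cache_id); }
static void wait_for_acks(int expected)     { printf("wait for %d ack(s)\n", expected); }
static void grant_write(int cache_id)       { printf("write permission -> C%d\n", cache_id); }

/* Steps 3 through 5 of the transition: invalidate the other copies, wait for
 * every acknowledgment, then record the single dirty copy and grant the write. */
static void handle_write_request(struct dir_entry *e, int writer)
{
    int outstanding = 0;

    for (int c = 0; c < N; c++) {
        if (e->present[c] && c != writer) {
            send_invalidate(c);            /* step 3 */
            e->present[c] = false;         /* step 5: clear the pointer */
            outstanding++;
        }
    }

    wait_for_acks(outstanding);            /* sequential consistency: no grant
                                              until all acknowledgments arrive */
    e->dirty = true;                       /* exactly one writable copy remains */
    e->present[writer] = true;
    grant_write(writer);                   /* step 6 completes at the cache */
}

int main(void)
{
    struct dir_entry x = { .present = { [1] = true, [2] = true, [3] = true } };
    handle_write_request(&x, 3);           /* cache C3 asks to write location X */
    return 0;
}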
The full-map protocol provides a useful upper bound for the performance of centralized directory-based cache coherence. However, it is not scalable with respect to memory overhead. Assume that the amount of distributed shared memory increases linearly with the number of processors N. Because the size of the directory entry associated with each block of memory is proportional to the number of processors, the memory consumed by the directory is proportional to the size of memory (O(N)) multiplied by the size of the directory entry (O(N)). Thus, the total memory overhead scales as the square of the number of processors (O(N²)).
Limited directories. Limited directory protocols are designed to solve the directory size problem. Restricting the number of simultaneously cached copies of any particular block of data limits the growth of the directory to a constant factor. For our analysis, we selected the limited directory protocol proposed in Agarwal et al.³

A directory protocol can be classified as Dir_i X using the notation from Agarwal et al.³ The symbol i stands for the number of pointers, and X is NB for a scheme with no broadcast and B for one with broadcast. A full-map scheme without broadcast is represented as Dir_N NB. A limited directory protocol that uses i < N pointers is denoted Dir_i NB. The limited directory protocol is similar to the full-map directory, except in the case when more than i caches request read copies of a particular block of data.

Figure 1b shows the situation when three caches request read copies in a memory system with a Dir_2 NB protocol. In this case, we can view the two-pointer directory as a two-way set-associative cache of pointers to shared copies. When cache C3 requests a copy of location X, the memory module must invalidate the copy in either cache C1 or cache C2. This process of pointer replacement is sometimes called eviction. Since the directory acts as a set-associative cache, it must have a pointer replacement policy. Our protocol uses an easily implemented pseudorandom eviction policy that requires no extra memory overhead. In Figure 1b, the pointer to cache C3 replaces the pointer to cache C2.

Why might limited directories succeed? If the multiprocessor exhibits processor locality in the sense that in any given interval of time only a small subset of all the processors access a given memory word, then a limited directory is sufficient to capture this small "worker-set" of processors.
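The eviction just described can be sketched in a few lines of code. This is only an illustration of a Dir_2 NB entry represented in software; the field names, the use of rand for the pseudorandom policy, and the convention of returning the evicted cache to the caller (which would then send the invalidation) are illustrative choices, not the article's hardware.

#include <stdio.h>
#include <stdlib.h>

#define I_POINTERS 2               /* i = 2 pointers per entry, as in the Dir_2 NB example */

/* A limited directory entry: i processor identifiers plus a count and a dirty bit. */
struct limited_entry {
    int ptr[I_POINTERS];
    int count;
    int dirty;
};

/* Record a new read copy. Returns the cache whose copy must be invalidated
 * (pointer eviction), or -1 if a free pointer was available. */
static int add_read_copy(struct limited_entry *e, int requester)
{
    if (e->count < I_POINTERS) {
        e->ptr[e->count++] = requester;
        return -1;
    }
    int victim = rand() % I_POINTERS;      /* pseudorandom replacement policy */
    int evicted = e->ptr[victim];
    e->ptr[victim] = requester;            /* the new pointer displaces the old one */
    return evicted;
}

int main(void)
{
    struct limited_entry x = { {0}, 0, 0 };

    add_read_copy(&x, 1);                  /* C1 and C2 fill the two pointers */
    add_read_copy(&x, 2);
    int evicted = add_read_copy(&x, 3);    /* C3's read request forces an eviction */
    printf("invalidate the copy in cache C%d before adding C3\n", evicted);
    return 0;
}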
Directory pointers in a Dir_i NB protocol encode binary processor identifiers, so each pointer requires log2 N bits of memory, where N is the number of processors in the system. Given the same assumptions as for the full-map protocol, the memory overhead of limited directory schemes grows as O(N log N). These protocols are considered scalable with respect to memory overhead because the resources required to implement them grow approximately linearly with the number of processors in the system.
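A quick calculation makes the two growth rates concrete. The sketch below compares directory bits per memory block for a full-map entry (N presence bits plus a dirty bit) with a limited Dir_4 NB entry (four pointers of log2 N bits plus a dirty bit). The 16-byte block size is taken from Table 2 later in the article; the choice of four pointers and the processor counts are assumptions for illustration.

#include <stdio.h>
#include <math.h>

/* Directory bits per block for a full-map entry: one presence bit per processor plus a dirty bit. */
static double full_map_bits(int n_procs)
{
    return n_procs + 1.0;
}

/* Directory bits per block for a Dir_i NB entry: i pointers of ceil(log2 N) bits plus a dirty bit. */
static double limited_bits(int n_procs, int i_ptrs)
{
    return i_ptrs * ceil(log2(n_procs)) + 1.0;
}

int main(void)
{
    const double block_bits = 16 * 8;   /* 16-byte cache blocks, as in Table 2 */

    for (int n = 64; n <= 1024; n *= 4) {
        printf("N = %4d: full map %6.1f%% overhead, Dir_4 NB %5.1f%% overhead\n",
               n,
               100.0 * full_map_bits(n) / block_bits,
               100.0 * limited_bits(n, 4) / block_bits);
    }
    return 0;
}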
Dir_i B protocols allow more than i copies of each block of data to exist, but they resort to a broadcast mechanism when more than i cached copies of a block need to be invalidated. However, interconnection networks with point-to-point wires do not provide an efficient systemwide broadcast capability. In such networks, it is also difficult to determine the completion of a broadcast to ensure sequential consistency. While it is possible to limit some Dir_i B broadcasts to a subset of the system (see Agarwal et al.³), we restrict our evaluation of limited directories to the Dir_i NB protocols.
Figure 1. Three types of directory protocols: (a) three states of a full-map directory; (b) eviction in a limited directory; and (c) chained directory.

Chained directories. Chained directories, the third option for cache-coherence schemes that do not utilize a broadcast mechanism, realize the scalability of limited directories without restricting the number of shared copies of data blocks.⁴ This type of cache-coherence scheme is called a chained scheme because it keeps track of shared copies of data by maintaining a chain of directory pointers. We investigated two chained directory schemes.

The simpler of the two schemes implements a singly linked chain, which is best described by example (see Figure 1c). Suppose there are no shared copies of location X. If processor P1 reads location X, the memory sends a copy to cache C1, along with a chain termination (CT) pointer. The memory also keeps a pointer to cache C1. Subsequently, when processor P2 reads location X, the memory sends a copy to cache C2, along with the pointer to cache C1. The memory then keeps a pointer to cache C2. By repeating this step, all of the caches can cache a copy of location X. If processor P3 writes to location X, it is necessary to send a data invalidation message down the chain. To ensure sequential consistency, the memory module denies processor P3 write permission until the processor with the chain termination pointer acknowledges the invalidation of the chain. Perhaps this scheme should be called a gossip protocol (as opposed to a snoopy protocol) because information is passed from individual to individual, rather than being spread by covert observation.
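The chain walk described above can be sketched as follows. This is only a software illustration of the singly linked variant, with an assumed array-based representation and printed lines standing in for protocol messages; it shows the head pointer kept at the memory and the invalidation that proceeds until the chain-termination (CT) holder would acknowledge.

#include <stdio.h>

#define CT (-1)                       /* chain-termination marker */
#define N  8                          /* assumed number of caches for the example */

/* Per-cache forward pointer for one block: the next holder of a copy, or CT. */
static int next_cache[N];
static int head = CT;                 /* pointer kept at the memory module */

/* A processor reads the block: its cache is linked in at the head of the chain. */
static void read_copy(int cache_id)
{
    next_cache[cache_id] = head;      /* the new copy points at the previous head */
    head = cache_id;                  /* the memory now points at the new copy */
}

/* A processor writes the block: invalidate every other copy by walking the chain.
 * Write permission is granted only after the CT holder's acknowledgment returns. */
static void write_block(int writer)
{
    int c = head;
    while (c != CT) {
        int next = next_cache[c];
        if (c != writer)
            printf("invalidate copy in cache C%d\n", c);
        c = next;
    }
    head = writer;                    /* the writer becomes the only (dirty) copy */
    next_cache[writer] = CT;
    printf("grant write permission to cache C%d\n", writer);
}

int main(void)
{
    read_copy(1);                     /* P1, P2, then P3 read location X */
    read_copy(2);
    read_copy(3);
    write_block(3);                   /* P3 writes: invalidations walk C2, then C1 */
    return 0;
}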
The possibility of cache-block replacement complicates chained directory protocols. Suppose that cache C1 through cache Cn all have copies of location X and that location X and location Y map to the same (direct-mapped) cache line. If processor Pi reads location Y, it must first evict location X from its cache. In this situation, two possibilities exist:

(1) Send a message down the chain to cache Ci-1 with a pointer to cache Ci+1 and splice Ci out of the chain, or
(2) Invalidate location X in cache Ci+1 through cache Cn.

For our evaluation, we chose the second scheme because it can be implemented by a less complex protocol than the first. In either case, sequential consistency is maintained by locking the memory location while invalidations are in progress.

Another solution to the replacement problem is to use a doubly linked chain. This scheme maintains forward and backward chain pointers for each cached copy so that the protocol does not have to traverse the chain when there is a cache replacement. The doubly linked directory optimizes the replacement condition at the cost of a larger average message block size (due to the transmission of extra directory pointers), twice the pointer memory in the caches, and a more complex coherence protocol.

Although the chained protocols are more complex than the limited directory protocols, they are still scalable in terms of the amount of memory used for the directories. The pointer sizes grow as the logarithm of the number of processors, and the number of pointers per cache or memory block is independent of the number of processors.
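The splice that the doubly linked variant makes cheap can be sketched as follows. The structure and field names are assumptions for illustration, and a real protocol would lock the memory location while the splice messages are in progress; the sketch only shows why no traversal of the other copies is needed when a cache replaces its block.

#include <stdio.h>
#include <stddef.h>

/* Directory state kept with each cached copy in a doubly linked chain. */
struct chain_node {
    struct chain_node *fwd;     /* next cache holding a copy (toward the CT end) */
    struct chain_node *bwd;     /* previous cache, or NULL at the memory-side head */
};

/* Replacement in a doubly linked chain: the evicting cache splices itself out
 * using only its own two pointers, with no traversal of the other copies. */
static void splice_out(struct chain_node *self)
{
    if (self->bwd != NULL)
        self->bwd->fwd = self->fwd;     /* the predecessor now skips this cache */
    if (self->fwd != NULL)
        self->fwd->bwd = self->bwd;     /* the successor points back at the predecessor */
    self->fwd = self->bwd = NULL;       /* this cache no longer holds the block */
}

int main(void)
{
    struct chain_node c1 = { NULL, NULL }, c2 = { NULL, NULL }, c3 = { NULL, NULL };

    c3.fwd = &c2; c2.bwd = &c3;         /* chain as built by reads: C3 -> C2 -> C1 */
    c2.fwd = &c1; c1.bwd = &c2;

    splice_out(&c2);                    /* C2 replaces the block: two pointer updates */
    printf("C3 now points %s to C1\n", (c3.fwd == &c1) ? "directly" : "indirectly");
    return 0;
}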
Caching only private data. Up to this point, we have assumed that caches are allowed to store local copies of shared variables, thus leading to the cache-consistency problem. An alternative shared memory method avoids the cache-coherence problem by disallowing caching of shared data. In our analysis, we designate this scheme by saying it only caches private data. This scheme caches private data, shared data that is read-only, and instructions, while references to modifiable shared data bypass the cache. In practice, shared variables must be statically identified to use this scheme.

Figure 2. Diagram of methodology.

Methodology

What is a good performance metric for comparing the various cache-coherence schemes? To evaluate the performance of the memory system, which includes the cache, the memory, and the interconnection network, we determine the contribution of the memory system to the time needed to run a program on the system. Our analysis computes the processor utilization, or the fraction of time that each processor does useful work. One minus the utilization yields the fraction of processor cycles wasted due to memory system delays. The actual system speedup equals the number of processors multiplied by the processor utilization. This metric has been used in other studies of multiprocessor cache and network performance.⁶

In a multiprocessor, processor utilization (and therefore system speedup) is affected by the frequency of memory references and the latency of the memory system. The latency (T) of a message through the interconnection network depends on several factors, including the network topology and speed, the number of processors in the system, the frequency and size of the messages, and the memory access latency. The cache-coherence protocol determines the request rate, message size, and memory latency. To compute processor utilization, we need to use detailed models of cache-coherence protocols and interconnection networks.
Figure 2 shows an overview of our analysis process. Multiprocessor address traces generated using three tracing methods at Stanford University, IBM, and MIT are run on a cache and directory simulator that counts the occurrences of different types of protocol transactions. A cost is assigned to each of these transaction types to compute the average processor request rate, the average network message block size, and the average memory latency per transaction. From these parameters, a model of a packet-switched, pipelined, multistage interconnection network calculates the average processor utilization.

Table 1. Summary of trace statistics, with length values in millions of references to memory.

Source                  Language    Processors    Application    Length
VAX T-bit               C           16            P-Thor         7.09
                                                  MP3D           7.38
                                                  LocusRoute     7.05
                                                  SA-TSP         7.11
Postmortem scheduler    Fortran     64            FFT            7.44
                                                  Weather        31.76
                                                  Simple         27.03
T-Mul-T                 Mul-T       64            Speech         11.77
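The cost-weighting step described just before Table 1 amounts to a weighted sum over transaction types. The sketch below shows that accumulation; the transaction categories, their costs, and the trace mix are invented placeholders, since the article's actual cost tables are not reproduced here, and the sketch only illustrates how per-transaction counts become the three average parameters fed to the network model.

#include <stdio.h>

/* Assumed cost of one transaction type: network messages sent, words moved
 * (including routing information), and latency at the memory module. */
struct txn_cost {
    const char *name;
    double      messages;
    double      words;
    double      mem_latency;
};

int main(void)
{
    struct txn_cost cost[] = {
        { "cache hit",            0.0, 0.0, 0.0 },
        { "read miss",            2.0, 6.0, 6.0 },
        { "write to clean block", 4.0, 8.0, 6.0 },
    };
    double fraction[] = { 0.94, 0.04, 0.02 };   /* illustrative mix measured from a trace */

    double txn_sum = 0.0, msg_sum = 0.0, word_sum = 0.0, lat_sum = 0.0;
    for (int i = 0; i < 3; i++) {
        if (cost[i].messages > 0.0)
            txn_sum += fraction[i];             /* references that reach the network */
        msg_sum  += fraction[i] * cost[i].messages;
        word_sum += fraction[i] * cost[i].words;
        lat_sum  += fraction[i] * cost[i].mem_latency;
    }

    double m = txn_sum;                                       /* average request rate */
    double B = (msg_sum > 0.0) ? word_sum / msg_sum : 0.0;    /* average message size */
    double M = (txn_sum > 0.0) ? lat_sum / txn_sum : 0.0;     /* average memory latency */

    printf("m = %.3f requests/reference, B = %.2f words, M = %.2f cycles\n", m, B, M);
    return 0;
}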
Getting multiprocessor address trace data. The address traces represent a wide range of parallel algorithms written in three different programming languages. The programs traced at Stanford were written in C; at IBM, in Fortran; and at MIT, in Mul-T, a variant of Multilisp. The implementation of the trace collector differs for each of the programming environments. Each tracing system can theoretically obtain address traces for an arbitrary number of processors, enabling a study of the behavior of cache-coherent machines much larger than any built to date. Table 1 summarizes general characteristics of the traces. We will compare the relative performance of the various coherence schemes individually for each application.

Trap-bit (T-bit) tracing for multiprocessors is an extension of single-processor trap-bit tracing. In the single processor implementation, the processor traps after each instruction if the trap bit is set, allowing interpretation of the trapped instruction and emission of the corresponding memory addresses. Multiprocessor T-bit tracing extends this method by scheduling a new process on every trapped instruction. Once a process undergoes a trap, the trace mechanism performs several tasks. It records the corresponding memory addresses, saves the processor state of the trapped process, and schedules another process from its list of processes, typically in a round-robin fashion.

The SA-TSP, MP3D, P-Thor, and LocusRoute traces were gathered via the Trap-Bit method using 16 processors. SA-TSP uses simulated annealing to solve the traveling salesman problem. MP3D is a 3D particle simulator for rarified flow. P-Thor is a parallel logic simulator. LocusRoute is a global router for VLSI standard cells. Weber and Gupta provide a detailed description of the applications.

Postmortem scheduling is a technique that generates a parallel trace from a uniprocessor execution trace of a parallel application. The uniprocessor trace is a task trace with embedded synchronization information that can be scheduled, after execution (postmortem), into a parallel trace that obeys the synchronization constraints. This type of trace generation uses only one processor to produce the trace and to perform the postmortem scheduling. So, the number of processes is limited only by the application's synchronization constraints and by the number of parallel tasks in the single processor trace.

The Weather, Simple, and fast Fourier transform traces were derived using the postmortem scheduling method at IBM. The Weather application partitions the atmosphere around the globe into a three-dimensional grid and uses finite-difference methods to solve a set of partial differential equations describing the state of the system. Simple models the behavior of fluids and employs finite difference methods to solve equations describing hydrodynamic behavior. FFT is a radix-2 fast Fourier transform.

In a compiler-based tracing scheme, code inserted into the instruction stream of a program at compile time records the addresses of memory references as a side effect of normal execution. Our compiler-aided multiprocessor trace implementation is T-Mul-T, a modification of the Mul-T programming environment that can be used to generate memory address traces for programs running on an arbitrary number of processors. Instructions are not currently traced in T-Mul-T. We assume that all instructions hit in the cache and, for processor utilization computation, an instruction reference is associated with each data reference. We make these assumptions only for the Speech application, because the other traces include instructions.

The Speech trace was generated by a compiler-aided tracing scheme. The application comprises the lexical decoding stage of a phonetically based spoken language understanding system developed by the MIT Spoken Language Systems Group. The Speech application uses a dictionary of about 300 words represented by a 3,500-node directed graph. The input to the lexical decoder is another directed graph representing possible sequences of phonemes in the given utterance. The application uses a modified Viterbi search algorithm to find the best match between paths through the two graphs.

The trace gathering techniques also differ in their treatment of private data locations, which must be identified for the scheme that only caches private data. The private references are identified statically (at compile time) in the Fortran traces and are identified dynamically by postprocessing the other traces. Since static methods must be more conservative than dynamic methods when partitioning private and shared data, the performance that we predict for the private data caching scheme on the C and Mul-T applications is slightly optimistic. In practice, the nontrivial problem of static data partitioning makes it difficult to implement schemes that cache only private data.
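As a rough picture of the compiler-based approach described above, the fragment below shows the kind of recording call a compiler could insert around memory references. The record layout, helper routine, and output file name are assumptions for illustration only, and C stands in for Mul-T, which is what T-Mul-T actually instruments.

#include <stdio.h>

/* One trace record: which simulated processor touched which address, read or write. */
struct trace_record {
    int           processor;
    unsigned long address;
    char          is_write;
};

static FILE *trace_file;

/* The call a compiler-based tracer would insert before each load or store,
 * so addresses are recorded as a side effect of normal execution. */
static void record_reference(int proc, void *addr, char is_write)
{
    struct trace_record r = { proc, (unsigned long)addr, is_write };
    fwrite(&r, sizeof r, 1, trace_file);
}

/* An instrumented version of a simple shared update, as the compiler might emit it. */
static void instrumented_increment(int proc, int *shared_counter)
{
    record_reference(proc, shared_counter, 0);   /* inserted before the load */
    int tmp = *shared_counter;
    record_reference(proc, shared_counter, 1);   /* inserted before the store */
    *shared_counter = tmp + 1;
}

int main(void)
{
    static int counter;
    trace_file = fopen("trace.out", "wb");
    if (trace_file == NULL)
        return 1;
    instrumented_increment(0, &counter);         /* simulated processor 0 touches the counter */
    fclose(trace_file);
    return 0;
}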
Simulating a cache-coherence strategy. For each memory reference in a trace, our cache and directory simulator determines the effects on the state of the corresponding block in the cache and the shared memory. This state consists of the cache tags and directory pointers used to maintain cache coherence. In the simulation, the network provides no feedback to the cache or memory modules. All side effects from each memory transaction (entry in the trace) are assumed to be stored simultaneously.
While this simulation strategy does not accurately model the state of the memory system on a cycle-by-cycle basis, it does produce accurate counts of each type of protocol transaction over the length of a correct execution of a parallel program.

However, since we assume that all side effects of any transaction occur simultaneously, we do not model the difference between sequential and concurrent operations. This inaccuracy particularly affects the analysis of chained directory schemes. Specifically, when a shared write is performed in a system that uses a chained directory scheme, the copies of the written location must be invalidated in sequence, while a centralized directory scheme may send the invalidations in parallel and keep track of the number of outstanding acknowledgments. Thus, the minimum latency for shared writes to clean cache blocks is greater for the distributed schemes than for the centralized schemes.

Analyzing the trade-offs between centralized and distributed schemes requires a much more detailed simulation. While it is possible to accurately model the memory system on a cycle-by-cycle basis, such a simulation requires much higher overhead than our simulations in terms of both programming time and simulation runtime. Our MIT research group is running experiments on a simulator for an entire multiprocessor system. Simulations of the entire system run approximately 100 times slower than the trace-driven simulations used for this article. Variants of coherence schemes are harder to implement in the detailed simulator than in the trace-driven environment. To investigate a wide range of applications and cache-coherence protocols, we avoided the high overhead of such detailed simulations by performing trace-driven simulations.

In a trace-driven simulation, a memory transaction consists of a processor-to-memory reference and its effect on the state of the memory system. Any transaction that causes a message to be sent out over the network contributes to the average request rate, average message size, and average memory latency. Each type of transaction is assigned a cost in terms of the number of messages that must be sent over the network (including both the requests and the responses), the latency encountered at the memory modules, and the total number of words (including routing information) transported through the network. Given a trace and a particular cache-coherence protocol, the cache and directory simulator determines the percentage of each transaction type in the trace. The percentage of the transaction type, multiplied by its cost, gives the contribution of the transaction to each of the three parameters listed above.

In addition to the cache-coherence strategy, other parameters affect the performance of the memory system. We chose values for these parameters (listed in Table 2) based on the technology used for contemporary multiprocessors. Although we chose a 256-kilobyte cache, the results of our analysis do not differ substantially for cache sizes from 256 kilobytes down to 16 kilobytes because the working sets for the applications are small when partitioned over a large number of processors. The effect of other parameters, including the cache-block size, has been explored in several studies (see Eggers and Katz⁹ and references therein).

Table 2. Simulation parameter defaults for the cache, directory, and network.

Type               Parameter Name                        Default Value
Cache/Directory    Cache size                            256 Kbytes
                   Cache-block size                      16 bytes
                   Cache associativity                   Direct mapped
                   Cache-update policy                   Write back
                   Directory pointer replacement policy  Random
Network            Network message header size           16 bits
                   Network switch size                   4 x 4
                   Network channel width                 16 bits
                   Processor cycle time                  2 x network switch cycle time
                   Memory address size                   32 bits
                   Base memory access time               6 x network switch cycle time

The interconnection network model. The directory schemes that we analyze transmit messages over an interconnection network to maintain cache coherence. They distribute shared memory and associated directories over the processing nodes. Our analysis uses a packet-switched, buffered, multistage interconnection network that belongs to the general class of Omega networks. The network switches are pipelined so that a message header can leave a switch even while the rest of the message is still being serviced. A protocol message travels through n network switch stages to the destination node and takes M cycles for the memory access. The network is buffered and guarantees sequenced delivery of messages between any two nodes on the network.

Computation of the processor utilization is based on the analysis method that Patel used. The network model yields the average latency T of a protocol message through the network with n stages, k x k size switches, and average memory delay M. We derive processor utilization U from a set of three equations. The first two are

    U = 1 / (1 + mT)

    ρ = UmB

where m is the probability a message is generated on a given processor cycle, with corresponding network latency T. The channel utilization (ρ) is the product of the effective network request rate (Um) and the average message size B. The third equation, for the latency, uses the packet-switched network model by Kruskal and Snir. The first term in that equation (n + B + M - 1) gives the latency through an unloaded network. The second term gives the increase in latency due to network contention, which is the product of the contention delay through one switch and the number of stages. We verified the model in the context of our research by comparing its predictions to the performance of a packet-switched network simulator that transmitted messages generated by a Poisson process.

Table 2 shows the default network parameters we used in our analysis. While this article presents results for a packet-switched multistage network, it is possible to derive results for other types of networks by varying the network model used in the final stage of the analysis. In fact, we repeated our analysis for the direct, two-dimensional mesh network that we plan to use in our own machine. With the direct network model, the cache-coherence schemes showed the same relative behavior as they did with the network model described above. The ability to use the results from one set of directory simulations to derive statistics for a range of network or bus types displays the power of this modeling method.
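To show how the two equations given above behave, the sketch below evaluates them for sample parameter values. It uses only the unloaded latency n + B + M - 1 in place of the full Kruskal and Snir latency equation, whose contention term is not reproduced in this article, so the parameter values and the resulting numbers are purely illustrative.

#include <stdio.h>

/* Illustrative use of the utilization and channel-utilization equations.
 * Contention is deliberately ignored: T is the unloaded latency n + B + M - 1. */
int main(void)
{
    const double n = 5.0;     /* network stages: 4 x 4 switches, 1,024 nodes (assumed) */
    const double B = 4.0;     /* average message size in network words (assumed) */
    const double M = 6.0;     /* base memory access time in switch cycles, as in Table 2 */
    const double m = 0.05;    /* probability a processor issues a message per cycle (assumed) */

    double T   = n + B + M - 1.0;          /* unloaded network latency */
    double U   = 1.0 / (1.0 + m * T);      /* processor utilization */
    double rho = U * m * B;                /* channel utilization */

    printf("T = %.1f cycles, U = %.2f, rho = %.2f\n", T, U, rho);
    printf("speedup for 1,024 processors ~ %.0f\n", 1024.0 * U);
    return 0;
}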
Analysis of directory schemes

The graphs presented below plot various combinations of applications and cache-coherence schemes on the vertical axis and processor utilization on the horizontal axis. Since the data reference characteristics vary significantly between applications and trace gathering methods, we do not average results from the different traces. The results presented here concentrate on the Weather, Speech, and P-Thor applications. We discuss other applications when they exhibit significantly different behavior.

Are caches useful for shared data? Figure 3 shows the processor utilizations realized for the Weather, Speech, and P-Thor applications using each of the coherence schemes we evaluated. The long bar at the bottom of each graph gives the value for "no cache coherence." This number is derived by considering all addresses in each trace to be not shared. Processor utilization with no cache coherence gives, in a sense, the effect of the native hit/miss rate for the application. The number is artificial because it does not represent the behavior of a correctly operating system. However, the number does give an upper bound on the performance of any coherence scheme and allows us to focus on the component of processor utilization lost due to sharing between processors.