`
An Evaluation of Directory Schemes for Cache Coherence

Anant Agarwal,* Richard Simoni, John Hennessy, and Mark Horowitz
Computer Systems Laboratory
Stanford University, CA 94305

*Anant Agarwal is currently with the Laboratory for Computer Science (NE43-418), MIT, Cambridge, MA 02139.
`
Abstract

The problem of cache coherence in shared-memory multiprocessors has been addressed using two basic approaches: directory schemes and snoopy cache schemes. Directory schemes have been given less attention in the past several years, while snoopy cache methods have become extremely popular. Directory schemes for cache coherence are potentially attractive in large multiprocessor systems that are beyond the scaling limits of the snoopy cache schemes. Slight modifications to directory schemes can make them competitive in performance with snoopy cache schemes for small multiprocessors. Trace-driven simulation, using data collected from several real multiprocessor applications, is used to compare the performance of standard directory schemes, modifications to these schemes, and snoopy cache protocols.
`
1 Introduction

In the past several years, shared-memory multiprocessors have gained wide-spread attention due to the simplicity of the shared-memory parallel programming model. However, allowing the processors to share memory complicates the design of the memory hierarchy. The most prominent example of this is the cache coherency or cache consistency problem, which is introduced if the system includes caches for each processor. A system of caches is said to be coherent if all copies of a main memory location in multiple caches remain consistent when the contents of that memory location are modified [1]. A cache coherency protocol is the mechanism by which the coherency of the caches is maintained. Maintaining coherency entails taking special action when one processor writes to a block of data that exists in other caches. The data in the other caches, which is now stale, must be either invalidated or updated with the new value, depending on the protocol. Similarly, if a read miss occurs on a shared data item and memory has not been updated with the most recent value (as would happen in a copy-back cache), that most recent value must be found and supplied to the cache that missed. These two actions are the essence of all cache coherency protocols. The protocols differ primarily in how they determine whether the block is shared, how they find out where block copies reside, and how they invalidate or update copies.

Most of the consistency schemes that have been or are being implemented in multiprocessors are called snoopy cache protocols [2,3,4,5,6,7] because each cache in the system must watch all coherency transactions to determine when consistency-related actions should take place for shared data.
Snoopy cache schemes store the state of each block of cached data in the cache directories; the information about the state of the cached data is distributed.
Another class of coherency protocols is directory-based [8,9,10,11]. Directory-based protocols keep a separate directory associated with main memory that stores the state of each block of main memory. Each entry in this centralized directory may contain several fields depending on the protocol, for example, a dirty bit, a bit indicating whether or not the block is cached, pointers to the caches that contain the block, etc.
How do snoopy cache protocols work? A typical scheme enforces consistency by allowing multiple readers but only one writer. The state associated with a block's cached copy denotes whether the block is, for example, (i) invalid, (ii) valid (possibly shared), or (iii) dirty (exclusive copy). When a cache miss occurs, the address is broadcast on the shared bus. If another cache has the block in state dirty, the state is changed to valid and the block is supplied to the requesting cache. In addition, for write misses all copies of the block are invalidated. Similarly, on a write hit to a clean block, the address is broadcast and each cache must invalidate its copy. In general, all cache transactions that may require a data transfer or state change in other caches must be broadcast over the bus.
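
To make these transitions concrete, the following sketch models the three-state snoopy protocol just described. The state names, classes, and functions are illustrative; this is not code from any of the cited implementations.

```python
# Minimal sketch of a three-state snoopy invalidation protocol.
from enum import Enum

class State(Enum):
    INVALID = 0   # no copy in this cache
    VALID = 1     # clean copy, possibly shared
    DIRTY = 2     # exclusive modified copy

class Cache:
    def __init__(self):
        self.blocks = {}  # block address -> State

def bus_miss(caches, requester, addr, is_write):
    """A miss broadcasts the address; every other cache snoops the bus."""
    for c in caches:
        if c is requester:
            continue
        if c.blocks.get(addr) is State.DIRTY:
            c.blocks[addr] = State.VALID   # dirty owner supplies the block
        if is_write and addr in c.blocks:
            del c.blocks[addr]             # write miss invalidates all copies
    requester.blocks[addr] = State.DIRTY if is_write else State.VALID

def write_hit_clean(caches, requester, addr):
    """A write hit to a clean block is broadcast so other copies invalidate."""
    for c in caches:
        if c is not requester and addr in c.blocks:
            del c.blocks[addr]
    requester.blocks[addr] = State.DIRTY
```
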
Snoopy cache schemes are popular because small-scale multiprocessors can live within the bandwidth constraints imposed by a single, shared bus to memory. This shared bus makes the implementation of the broadcast actions straightforward. However, snoopy cache schemes will not scale beyond the range of the number of processors that can be accommodated on a bus (probably no more than 20). Attempts to scale them by replacing the bus with a higher bandwidth communication network will not be successful since the consistency protocol relies on low-latency broadcasts to maintain coherency. For this reason, shared-memory multiprocessors with large numbers of processors, such as the RP3 [12], do not provide cache coherency support in hardware.
These snoopy cache schemes also interfere with the processor-cache connection. Because the caches of all processors are examined on each coherency transaction, interference between the processor and its cache is unavoidable. This interference can be reduced by duplicating the tags and snooping on the duplicate tags. However, the processor must write both sets of tags, and thus arbitration is required on the duplicate tags. This impacts the cache write time, which may slow down the overall cycle time, especially in a high performance machine. Attempts to reduce the bus traffic generated by cache coherency requests in a snoopy cache scheme result in fairly complex protocols. These may impact either the cache access time or the coherency transaction time.
In this paper we propose that directory-based schemes are better suited to building large-scale, cache-coherent multiprocessors, where a single bus is unsuitable as a communication mechanism.
This paper is a first step in evaluating directory schemes using traces from real multiprocessor applications. Although we do not have sufficient data to demonstrate quantitatively that the directory schemes are effective in a large-scale multiprocessor, we do discuss how these directory schemes can be scaled and we demonstrate that their performance in a small-scale multiprocessor is acceptable.
We use trace-driven simulation, with traces obtained from real multiprocessor applications, to evaluate a basic directory-based coherency protocol that uses bus broadcasts and verify that its performance approaches that of snoopy cache schemes. We then obviate broadcasts by including a valid bit per cache in each directory entry, allowing sequential invalidation of multiple cached copies. Performance is not significantly degraded by this modification, and in most cases (over 85% of writes to previously-clean blocks) no more than one sequential invalidation request is necessary. Unfortunately, the need for a valid bit per cache restricts the ability to add on to an existing multiprocessor without modifying parts of the existing system. This motivates a scheme that can perform up to some small number of sequential invalidates to handle the most frequent case, and that resorts to some form of "limited broadcast" otherwise.
The paper first reviews previous directory schemes and discusses how they overcome the limitations created by snoopy cache schemes. It also proposes a general classification of these techniques, and identifies a few that seem most interesting for performance and implementation reasons. Section 3 outlines the schemes that we evaluate. We describe our evaluation method and the characteristics of our multiprocessor address traces in Section 4. Section 5 evaluates basic directory and snoopy cache schemes and discusses their performance. Section 6 then extends the discussion to include more scalable directory protocols, and Section 7 concludes the paper.
`
2 Directory Schemes for Cache Consistency

The major problems that snoopy cache schemes possess are limited scalability and interference with the processor-cache write path. How do directory schemes address these problems? The major advantage directory schemes have over snooping protocols is that the location of the caches that have a copy of a shared data item is known. This means that a broadcast is not required to find all the shared copies. Instead, individual messages can be sent to the caches with copies when an invalidate occurs. Since these messages are directed (i.e., not broadcast), they can easily be sent over any arbitrary interconnection network, as opposed to just a bus. The absence of broadcasts eliminates the major limitation on scaling cache coherent multiprocessors to a large number of processors.
Because we no longer need to examine every cache for a copy of the data, the duplicate tags can be eliminated. Instead, we store pointers in main memory to the caches where the data is known to reside and invalidate their copies. The protocols are also simpler than the distributed snoopy algorithms because of the centralization of the information about each datum.
Several directory-based consistency schemes have been proposed in the literature. Tang's method [8] allows clean blocks to exist in many caches, but disallows dirty blocks from residing in more than one cache (most snoopy cache coherency schemes use the same policy). In this scheme, each cache maintains a dirty bit for each of its blocks, and the central directory kept at memory contains a copy of all the tags and dirty bits in each cache. On a read miss, the central directory is checked to see if the block is dirty in another cache. If so, consistency is maintained by copying the dirty block back to memory before supplying the data; if the directory indicates the data is not dirty in another cache, then it supplies the data from memory. The directory is then updated to indicate that the requesting cache now has a clean copy of the data. The central directory is also checked on a write miss. In this case, if the block is dirty in another cache then the block is first flushed from that cache back to memory before supplying the data; if the block is clean in other caches then it is invalidated in those caches (i.e., removed from the caches). The data is then supplied to the requesting cache and the directory modified to show that the cache has a dirty copy of the block. On a write hit, the cache's dirty bit is checked. If the block is already dirty, there is no need to check the central directory, so the write can proceed immediately. If the block is clean, then the cache notifies the central directory, which must invalidate the block in all of the other caches where it resides.
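
These read-miss, write-miss, and write-hit actions reduce to a small amount of central-directory logic. The following sketch is our reconstruction under an assumed representation (a set of holders plus a dirty-owner field); it is not Tang's implementation, and write_back and invalidate are placeholders for the corresponding bus operations.

```python
# Sketch of Tang-style central-directory actions; representation is assumed.
directory = {}  # block -> {"holders": set of cache ids, "dirty_in": id or None}

def write_back(block, cache): ...   # placeholder: copy dirty block to memory
def invalidate(block, cache): ...   # placeholder: remove block from a cache

def entry(block):
    return directory.setdefault(block, {"holders": set(), "dirty_in": None})

def read_miss(block, cache):
    e = entry(block)
    if e["dirty_in"] is not None:        # dirty elsewhere: copy back first
        write_back(block, e["dirty_in"])
        e["dirty_in"] = None
    e["holders"].add(cache)              # requester now holds a clean copy

def write_miss(block, cache):
    e = entry(block)
    if e["dirty_in"] is not None:        # flush the single dirty copy
        write_back(block, e["dirty_in"])
    for c in e["holders"] - {cache}:     # remove clean copies elsewhere
        invalidate(block, c)
    e["holders"], e["dirty_in"] = {cache}, cache

def write_hit(block, cache, block_dirty):
    if block_dirty:
        return                           # already dirty: no directory check
    e = entry(block)
    for c in e["holders"] - {cache}:     # directory invalidates other copies
        invalidate(block, c)
    e["holders"], e["dirty_in"] = {cache}, cache
```
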
Censier and Feautrier [9] proposed a similar consistency mechanism that performs the same actions as the Tang scheme but organizes the central directory differently. Tang duplicates each of the individual cache directories as his main directory. To find out which caches contain a block, Tang's scheme must search each of these duplicate directories. In the Censier and Feautrier central directory, a dirty bit and a number of valid (or "present") bits equal to the number of caches are associated with each block in main memory. This organization provides the same information as the duplicate cache directory method but allows this information to be accessed directly using the address supplied to the central directory by the requesting cache. Each valid bit is set if the corresponding cache contains a valid copy of the block. Since a dirty block can only exist in at most one cache, no more than one of a block's valid bits may be set if the dirty bit is set.
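
Concretely, such an entry amounts to a dirty bit plus a bit vector indexed by cache number, so the holders of a block are found by direct lookup rather than by searching duplicate tag stores. A small sketch, where the system size and names are our assumptions:

```python
# Sketch of a Censier-and-Feautrier-style directory entry.
NUM_CACHES = 16   # assumed system size, not a value from the paper

class DirEntry:
    def __init__(self):
        self.dirty = False   # set when exactly one cache holds a modified copy
        self.valid = 0       # bit i set => cache i has a valid copy

    def add_holder(self, cache_id):
        self.valid |= 1 << cache_id

    def holders(self):
        return [i for i in range(NUM_CACHES) if (self.valid >> i) & 1]

e = DirEntry()
e.add_holder(3)
e.add_holder(7)
assert e.holders() == [3, 7]
# Invariant from the text: if e.dirty is set, at most one valid bit may be set.
```
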
Yen and Fu [11] suggest a small refinement to the Censier and Feautrier consistency technique. The central directory is unchanged, but in addition to the valid and dirty bits, a flag called the single bit is associated with each block in the caches. A cache block's single bit is set if and only if that cache is the only one in the system that contains the block. This saves having to complete a directory access before writing to a clean block that is not cached elsewhere. The major drawback of this scheme is that extra bus bandwidth is consumed to keep the single bits updated in all the caches. Thus, the scheme saves central directory accesses, but does not reduce the number of bus accesses versus the Censier and Feautrier protocol.
Archibald and Baer present a directory-based consistency mechanism [10] with a different organization for the central directory that reduces the amount of storage space in the directory, and also makes it easier to add more caches to the system. The directory saves only two bits with each block in main memory. These bits encode one of four possible states: block not cached, block clean in exactly one cache, block clean in an unknown number of caches, and block dirty in exactly one cache. The directory therefore contains no information to indicate which caches contain a block; the scheme relies on broadcasts to perform invalidates and write-back requests. The block clean in exactly one cache state obviates the need for a broadcast when writing to a clean block that is not contained in any other caches.
Two clear differences are present among these directory schemes: the number of processor indices contained in the directories and the presence of a broadcast bit. We can thus classify the schemes as Dir_i X, where i is the number of indices kept in the directory and X is either B or NB for Broadcast or No Broadcast. In a no-broadcast scheme the number of processors that have copies of a datum must always be less than or equal to i, the number of indices kept in the directory. If the scheme allows broadcast then the number of processors can be larger, and when it is (indicated by a bit in the directory) a broadcast is used to invalidate the cached data. The one case that does not make sense is Dir_0 NB, since there is no way to obtain exclusive access.
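
A Dir_i B entry can thus be pictured as up to i cache pointers plus a broadcast bit. The sketch below is our illustration of the overflow-to-broadcast behavior that the classification implies, not a scheme from the literature surveyed above.

```python
# Sketch of Dir_i B bookkeeping: up to i pointers, then fall back to broadcast.
class DirIB:
    def __init__(self, i):
        self.i = i
        self.pointers = set()   # ids of caches known to hold the block
        self.broadcast = False  # set once more than i caches hold the block

    def record_copy(self, cache_id):
        if cache_id in self.pointers:
            return
        if len(self.pointers) < self.i:
            self.pointers.add(cache_id)
        else:
            self.broadcast = True           # pointer overflow

    def invalidation_targets(self):
        """Directed messages when the pointers suffice, broadcast otherwise."""
        return "broadcast" if self.broadcast else sorted(self.pointers)
```
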
In this terminology, the Tang scheme is classified as Dir_n NB, the Censier and Feautrier scheme is Dir_n NB also, and the Archibald and Baer scheme is Dir_0 B. Our evaluation concentrates on a couple of key points in the design space: Dir_1 NB and Dir_0 B. We will also present results for Dir_n NB.
There are two potential difficulties that prevent scalability of the directory schemes. First, if the scheme always or frequently requires broadcast, then it will do no better than the snoopy schemes. Variations in the directory schemes (e.g., increasing the value of i in a Dir_i B scheme) decrease the frequency of broadcast. We must also examine the dynamic number of caches that contain a shared datum to evaluate the actual frequency of occurrence. Second, the access to the directory is a potential bottleneck. However, we will show that the directory is not much more of a bottleneck than main memory, and the bandwidth to both can be increased by having a distributed memory hierarchy rather than a centralized one. That is, memory is distributed together with individual processors. In addition to certain advantages in providing scalable bandwidth to the memories from the local processor, this organization distributes the directory, associating it with the individual memory modules.
`
3 Schemes Evaluated

We will evaluate two directory schemes (called Dir_1 NB and Dir_0 B), and two snoopy cache schemes (Write-Through-With-Invalidate and Dragon) for comparison purposes. These particular snoopy cache techniques were selected because they represent two extremes of performance and complexity. The two directory schemes are also extremes in the number of simultaneous cached copies allowed. The following is a description of these four protocols.
The most restrictive of the four schemes is Dir_1 NB in that a given block is allowed to reside in no more than one cache at a time; therefore, there can be no data inconsistency across caches. The directory entry for each block consists of a pointer to the cache that contains the block. On a cache miss, the directory is accessed to find out which cache contains the block; that cache is notified to invalidate the block and write it back to memory if dirty, and the data is then supplied to the requesting cache. Dir_1 NB is included in the evaluation because it is perhaps the simplest directory-based consistency scheme and is easily scaled to support a large number of processors.
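
As a sketch, the entire Dir_1 NB miss path can be written as a single-pointer lookup. The action tuples below are our notation for the resulting bus operations, not the paper's.

```python
# Sketch of Dir_1 NB miss handling: the directory entry is one pointer, so at
# most one cache ever holds a block.
owner = {}  # block -> id of the single cache holding it

def miss(block, requester, owner_copy_dirty):
    """Return the bus actions for a read or write miss under Dir_1 NB."""
    actions = []
    prev = owner.get(block)
    if prev is not None and prev != requester:
        if owner_copy_dirty:
            actions.append(("write_back", prev, block))  # dirty copy to memory
        actions.append(("invalidate", prev, block))      # evict the old holder
    owner[block] = requester
    actions.append(("supply", "memory", requester, block))
    return actions
```
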
The Dir_0 B scheme is the Archibald and Baer scheme [10] outlined in the previous section. Like many consistency protocols, it allows a clean block to reside in many caches, while a dirty block may exist in exactly one cache. Invalidations are accomplished with broadcasts; a similar scheme that uses sequential invalidates in place of broadcasts (Dir_n NB) will later be shown to have nearly the same performance. For the initial evaluation, broadcasts are used in both the directory and snooping schemes because this results in a simpler cost model and allows a fair comparison of the two.
Write-Through-With-Invalidate (WTI) is a simple snoopy cache protocol that relies on a write-through (as opposed to copy-back) cache policy and is used in several commercial multiprocessors. All writes to cache blocks are transmitted to main memory. Other caches snooping on the bus check to see if they have the block that is being written; if so, they invalidate that block in their own cache. When a different processor accesses the block, a cache miss will occur and the current data will be read from memory. Like Dir_0 B, multiple cached copies of clean blocks can exist simultaneously. Because of the high level of bus traffic caused by the write-through strategy, WTI is generally considered to be one of the lowest-performance snooping cache consistency protocols.
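
The WTI write path is simple enough to fit in a few lines. A sketch under assumed data structures (one dictionary per cache):

```python
# Sketch of a WTI write: memory is always updated, snoopers invalidate.
def wti_write(caches, writer_idx, addr, value, memory):
    memory[addr] = value                 # write-through to main memory
    for i, cache in enumerate(caches):
        if i != writer_idx:
            cache.pop(addr, None)        # snooping cache invalidates its copy
        else:
            cache[addr] = value          # writer's copy stays valid and clean
```
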
While the three previous schemes are all invalidation protocols, Dragon is an update protocol, i.e., it maintains consistency by updating stale cached data with the new value rather than by invalidating the stale data [13]. The cache keeps state with each block to indicate whether or not the block is shared; all writes to shared blocks must be broadcast on the bus so that the other copies can be updated. Dragon uses a special "shared" line to determine whether a block is currently being shared or not. Each cache snoops on the bus and pulls the shared line whenever it sees an address for which it has a cached copy of the data. Dragon is often considered to have the best performance among snoopy cache schemes.
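
A sketch of the Dragon write-hit path described here, with the shared line modeled as a wired-OR over the other caches; this is our simplification of the full protocol, not its complete state machine.

```python
# Sketch of a Dragon-style write hit: shared copies are updated, not invalidated.
def dragon_write_hit(caches, writer_idx, addr, value):
    others = [c for i, c in enumerate(caches) if i != writer_idx]
    shared_line = any(addr in c for c in others)   # wired-OR "shared" signal
    if shared_line:                                # broadcast the new value
        for c in others:
            if addr in c:
                c[addr] = value                    # update in place
    caches[writer_idx][addr] = value
    return shared_line   # writer records whether the block is still shared
```
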
`
4 Evaluation Methodology

Simulation using multiprocessor address traces is our method of evaluation. Most previous studies that evaluated directory schemes used analytical models [14,9], and those that used simulation had to make rough assumptions about the characteristics of shared-memory references [10]. Because the performance of cache coherence schemes is very sensitive to the shared-memory reference patterns, both of these previous methods have the drawback that the results are highly dependent on the assumptions made. Trace-driven simulation has the drawback that the same trace is used to evaluate all consistency protocols, while in reality the reference pattern would be different for each of the schemes due to their timing differences. But the traces represent at least one possible run of a real program, and can accurately distinguish the performance of various schemes for that run.
This paper deals with the inherent cost of sharing in multiprocessors and the memory traffic required to maintain cache consistency. We therefore exclude the misses caused by the first reference to a block in the trace because these occur in a uniprocessor infinite cache as well. The additional overhead due to multiprocessing then consists of (i) the extra misses that occur due to fetching the block into multiple caches and (ii) the cache consistency-related operations. Our results represent exactly this overhead.
`
`282
`
`3
`
`
`
`incurred
`traffic
`the
`to isolate and measure only
`We wish
`in maintaining
`a coherent shared memory system
`in a multi-
`processor. To this end our simulations
`use infinite
`caches to
`eliminate
`the
`traffic caused by
`interference
`in finite caches.
`The performance
`of an infinite
`cache is also a good approxi-
`mation
`to that of a very
`large cache, where
`the miss rate
`is
`essentially
`the cost of first-time
`fetches. Moreover,
`the per-
`formance of a system with smaller caches can be estimated
`to first order by adding
`the costs due to the finite cache size.
`Typical
`cache miss rates are reported
`in [15,16].
`
4.1 Performance Measures

To determine the absolute performance of a multiprocessor system using total processor utilizations, a simulation must be carried out for every hardware model desired. A problem with this approach is that the sharing characteristics may change because the simulation model is different from the hardware used for gathering data.
We would like a metric for performance that is not tied to any particular processor or interconnection network architecture. We use the communication cost per memory reference as our basic metric. This cost is simply the average number of cycles that the bus (or network) is busy during a data transfer from one cache to another cache, from cache to directory, and from cache to or from main memory. We refer to this metric simply as bus cycles per memory reference. This metric abstracts away details of how the directories are implemented, either as centralized or distributed. It also requires no assumptions about the relative speeds of local and non-local memories, local and non-local buses, or the processor and the bus.
Since the snoopy cache schemes require a bus-based architecture, we often talk of a bus in our directory models. However, the directory schemes we discuss are general enough to work in any network architecture. While the bus cycles metric allows us to compare the relative merits of various cache consistency schemes, it cannot accurately indicate the absolute performance of a multiprocessor. However, in lightly loaded systems, multiprocessor performance could still be approximated to first order from the number of bus cycles used per memory reference.
The bus cycles per reference for a given cache consistency scheme are computed as follows. First we measure event frequencies for the various schemes by simulating multiple infinite caches, where events are different types of memory references. The simulator reads a reference from a trace and takes a set of actions depending on the type of the reference, the state of the referenced block, and the given cache consistency protocol.

The event frequencies are then weighted by their respective costs in bus cycles to give the aggregate number of bus cycles used per reference. For example, a cache miss event might require 5 bus cycles of communication cost (1 cycle to send the address, and 4 cycles to get 4 words of data back). If the rate of cache misses is, say, 1%, then the bus cycles used up by cache misses per reference is 0.05. In like manner, the costs due to other events are added to get the aggregate cost per reference. Since the choice of the hardware model (i.e., cost per event) is independent of the event frequencies, we need just one simulation run per protocol to compute the event frequencies, and we can then vary costs for different hardware models.
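
The worked example amounts to a dot product of frequencies and costs. In the sketch below, only the 5-cycle miss cost and 1% miss rate come from the text; the other entries are made-up placeholders.

```python
# Aggregate bus cycles per reference = sum(frequency * cost) over event types.
costs = {"miss": 5, "write_back": 4, "invalidate": 1}   # bus cycles per event
freqs = {"miss": 0.01, "write_back": 0.004, "invalidate": 0.002}  # per reference

bus_cycles_per_ref = sum(freqs[e] * costs[e] for e in costs)
# The miss term alone is 0.01 * 5 = 0.05 bus cycles per reference, matching
# the 1% miss-rate example above.
```
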
Details of the traces used in the simulations are given in Section 4.4. The block size used throughout this paper is 4 words (16 bytes). In all the schemes we assume that instructions do not cause any cache consistency related traffic. In addition, we do not include the bus traffic caused by instruction misses in our performance estimations.
`
4.2 Event Frequencies

The event types of interest in a particular scheme are those that may result in a bus transaction. All the schemes require the frequency of read and write misses (read-miss or rm and write-miss or wm). Depending on the scheme, some other event rates are also needed:

- The Dragon events include the fraction of references to blocks that are clean or dirty in another cache on a read or write miss (rm-blk-cln, rm-blk-drty, wm-blk-cln, and wm-blk-drty). The clean and dirty numbers indicate when a block is supplied by another cache as opposed to from main memory. In addition, we need the frequency of write updates to blocks present in multiple caches on a write hit (wh-distrib).

- The write-through scheme requires the frequency of writes (write) because all writes are transmitted to main memory.

- In the Dir_1 NB scheme, we need the fractions of read and write references that miss in the cache, but are present in a dirty or clean state in another cache (rm-blk-cln, rm-blk-drty, wm-blk-cln, and wm-blk-drty). These events indicate when invalidation requests must be sent to another cache and when dirty blocks have to be written back to main memory.

- In the Dir_0 B scheme, in addition to the four events for the Dir_1 NB scheme, we need the proportion of write hits to a clean block (wh-blk-cln). This event represents queries to the directory to check whether the block resides in any other cache and has to be invalidated. We also measure the distribution of the number of caches the block resides in during a possible invalidation situation to determine the impact of various invalidation methods. The various invalidation methods include full broadcast, limited broadcast, and sequential invalidation messages to each cache.
`
4.3 Bus Models

The bus cycle costs for the various events depend on the sophistication of the bus and main memory. The examples given in this paper use the bus timing depicted in Table 1. From this basic bus model, and some assumptions about the sophistication of the bus, we can estimate the cost in bus cycles for each of the events that cause bus traffic. Because the costs can differ depending on the type of bus or interconnection network used, we will use two bus types of widely diverse complexity to give an idea of how the schemes will perform over a range of bus and memory organizations. On the sophisticated end of the spectrum, we use a pipelined bus model that has separate data and address paths. At the other end we use a non-pipelined bus that has to multiplex the address and data on the same bus lines. The data transfer width of both buses is assumed to be one word (32 bits).

Table 1: Timing for fundamental bus operations.
  Send address: 1 bus cycle

Table 2: Summary of bus cycle costs.

For the pipelined bus with separate lines for address and data, memory or non-local cache accesses cost 5 cycles (1 cycle to send the address and 4 cycles to get the data). The bus is not held during the access. Write-backs cost 4 cycles: the first cycle sends the address and the first data word; the remaining 3 words are sent in the next three cycles. When the data is transferred to memory during a write-back, the requesting cache also receives it. The bus cycles used for data transfer are then counted under the write-back category. A write-through to memory or a write update to another cache is 1 cycle. A directory check uses 1 cycle to send the address, and invalidates are also 1 cycle.

In the non-pipelined bus model, the bus has to be held during the memory or non-local cache access. Here a memory access costs 7 cycles: 1 cycle to send the address, 2 cycles to wait for the memory access, and 4 cycles to get the data. An access from another cache is 6 cycles, and takes a cycle less than the memory access because the cache access wait is only one cycle. Write-backs still cost 4 cycles; the waiting for memory is counted under the memory access category, and the bus need not be held while the write into memory completes.
`
`
`
4.4 Trace Characteristics

Table 3: Summary of trace characteristics. All numbers are in thousands.

  Trace   Refs   Instr   DRd    DWrt   User   Sys
  POPS    3142   1624    1257   261    2817   325
  THOR    3222   1456    1398   368    2727   495
  PERO    3508   1834    1266   409    3242   266
`
An address trace contains the interleaved address streams of the four processors. CPU numbers and process identifiers of the active processes are also included in the trace so that any address in the trace can be identified as coming from a given CPU and given process. A current limitation of ATUM traces is that only four-CPU traces can be obtained. We are currently developing a multiprocessor simulator that builds on top of the VAX T-bit mechanism and can provide accurate simulated traces of a much larger number of processors.
The traces show some amount of sharing between processors that is induced solely by process migration. The characteristics of migration-induced sharing are significantly different from the sharing present in the application processes [18]. We would like to exclude this form of sharing from our study since a large multiprocessor would probably try to minimize process migration. Therefore, for this study, we consider sharing between processes (as opposed to sharing between processors), which means that a block is considered shared only if it is accessed by more than one process. Because the time sequence of the references in the trace is strictly maintained, the temporal ordering of various synchronization activities in the trace, such as getting or releasing a synchronization lock, is still retained. As a check on this model, we collected all our statistics based on both process sharing and processor sharing and found that the numbers were not significantly different. The similarity is due to the few instances of process migration in our traces.
We currently use three traces for this study. The traces are of parallel applications running under the MACH operating system [19]. Table 3 describes the characteristics of the traces used for this study. POPS [20] is a parallel implementation of OPS5, which is a rule-based programming language. THOR is a parallel implementation of a logic simulator done by Larry Soule at Stanford University. PERO is a parallel VLSI router written by Jonathan Rose at Stanford. All traces include operating system activity, which comprises roughly 10% of the traces.

The traces show a larger-than-usual read-to-write reference ratio due to spins on locks in POPS and THOR. The spins correspond to the first test in a test-and-test-and-set synchronization primitive. These appear as reads of a data word. Roughly one-third of all the reads correspond to reads due to spinning on a lock. We will look at how the number of spins on a lock affects the performance of cache consistency schemes in Section 5.2. The ratio of reads to writes in PERO is also high, but this reference behavior is a result of the algorithm used in the program.
`