The Directory-Based Cache Coherence Protocol
for the DASH Multiprocessor

Daniel Lenoski, James Laudon, Kourosh Gharachorloo,
Anoop Gupta, and John Hennessy

Computer Systems Laboratory
Stanford University, CA 94305

Abstract

DASH is a scalable shared-memory multiprocessor currently being developed at Stanford's Computer Systems Laboratory. The architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a scalable interconnection network. A key feature of DASH is its distributed directory-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-point messages sent between the processors and memories to keep caches consistent. Furthermore, the DASH system does not contain any single serialization or control point. While these features provide the basis for scalability, they also force a reevaluation of many fundamental issues involved in the design of a protocol. These include the issues of correctness, performance and protocol complexity. In this paper, we present the design of the DASH coherence protocol and discuss how it addresses the above issues. We also discuss our strategy for verifying the correctness of the protocol and briefly compare our protocol to the IEEE Scalable Coherent Interface protocol.

1 Introduction

The limitations of current uniprocessor speeds and the ability to replicate low-cost, high-performance processors and VLSI components have provided the impetus for the design of multiprocessors which are capable of scaling to a large number of processors. Two major paradigms for these multiprocessor architectures have developed: message-passing and shared-memory. In a message-passing multiprocessor, each processor has a local memory, which is only accessible to that processor. Interprocessor communication occurs only through explicit message passing. In a shared-memory multiprocessor, all memory is accessible to each processor. The shared-memory paradigm has the advantage that the programmer is not burdened with the issues of data partitioning, and accessibility of data from all processors simplifies the task of dynamic load distribution. The primary advantage of the message-passing systems is the ease with which they scale to support a large number of processors. For shared-memory machines, providing such scalability has traditionally proved difficult to achieve.
We are currently building a prototype of a scalable shared-memory multiprocessor. The system provides high processor performance and scalability through the use of coherent caches and a directory-based coherence protocol. The high-level organization of the prototype, called DASH (Directory Architecture for SHared memory) [17], is shown in Figure 1. The architecture consists of a number of processing nodes connected through a high-bandwidth low-latency interconnection network. The physical memory in the machine is distributed among the nodes of the multiprocessor, with all memory accessible to each node. Each processing node, or cluster, consists of a small number of high-performance processors with their individual caches, a portion of the shared-memory, a common cache for pending remote accesses, and a directory controller interfacing the cluster to the network. A bus-based snoopy scheme is used to keep caches coherent within a cluster, while inter-node cache consistency is maintained using a distributed directory-based coherence protocol.

Figure 1: General architecture of DASH.

The concept of directory-based cache coherence was first proposed by Tang [20] and Censier and Feautrier [6]. Subsequently, it has been investigated by others ([1], [2] and [23]). Building on this earlier work, we have developed a new directory-based cache-coherence protocol which works with distributed directories and the hierarchical cluster configuration of DASH. The protocol also integrates support for efficient synchronization operations using the directory. Furthermore, in designing the machine we have addressed many of the issues left unresolved by earlier work.
In DASH, each processing node has a directory memory corresponding to its portion of the shared physical memory. For each memory block, the directory memory stores the identities of all remote nodes caching that block. Using the directory memory, a node writing a location can send point-to-point invalidation or update messages to those processors that are actually caching that block. This is in contrast to the invalidating broadcast required by the snoopy protocol. The scalability of DASH depends on this ability to avoid broadcasts. Another important attribute of the directory-based protocol is that it does not depend on any specific interconnection network topology. As a result, we can readily use any of the low-latency scalable networks, such as meshes or hypercubes, that were originally developed for message-passing machines [7].
While the design of bus-based snoopy coherence protocols is reasonably well understood, this is not true of distributed directory-based protocols. Unlike snoopy protocols, directory-based schemes do not have a single serialization point for all memory transactions. While this feature is responsible for their scalability, it also makes them more complex and forces one to rethink how the protocol should address the fundamental issues of correctness, system performance, and complexity.

The next section outlines the important issues in designing a cache coherence protocol. Section 3 gives an overview of the DASH hardware architecture. Section 4 describes the design of the DASH coherence protocol, relating it to the issues raised in Section 2. Section 5 outlines some of the additional operations supported beyond the base protocol, while Section 6 discusses scaling the directory structure. Section 7 briefly describes our approach to verifying the correctness of the protocol. Section 8 compares the DASH protocol with the proposed IEEE-SCI (Scalable Coherent Interface) protocol for distributed directory-based cache coherence. Finally, Section 9 presents conclusions and summarizes the current status of the design effort.

2 Design Issues for Distributed Coherence Protocols

The issues that arise in the design of any cache coherence protocol and, in particular, a distributed directory-based protocol, can be divided into three categories: those that deal with correctness, those that deal with performance, and those related to the distributed control of the protocol.

2.1 Correctness

The foremost issue that any multiprocessor cache coherence protocol must address is correctness. This translates into requirements in three areas:

Memory Consistency Model: For a uniprocessor, the model of a correct memory system is well defined. Load operations return the last value written to a given memory location. Likewise, store operations bind the value returned by subsequent loads of the location until the next store. For multiprocessors, however, the issue is more complex because the definitions of "last value written", "subsequent loads" and "next store" become less clear as there may be multiple processors reading and writing a location. To resolve this difficulty a number of memory consistency models have been proposed in the literature, most notably the sequential and weak consistency models [8]. Weaker consistency models attempt to loosen the constraints on the coherence protocol while still providing a reasonable programming model for the user. Although most existing systems utilize a relatively strong consistency model, the larger latencies found in a distributed system favor the less constrained models.
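
To make the ambiguity concrete, consider the common flag-and-data idiom. The sketch below is ours, written with C11 atomics (which postdate DASH) purely to illustrate the spectrum of models: with release/acquire ordering the consumer is guaranteed to observe the payload written before the flag, whereas a more relaxed ordering would allow a weakly ordered machine to make the flag visible before the data.

```c
/* Illustrative flag/data example (not from the paper): under a strong
 * model the consumer always sees payload == 42 once flag == 1; under a
 * weaker model an explicit ordering constraint (here, release/acquire)
 * is needed to guarantee it. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;                 /* ordinary shared data              */
static atomic_int flag = 0;         /* synchronization flag              */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                           /* write data */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* publish    */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                           /* spin until the flag is observed    */
    printf("payload = %d\n", payload);  /* prints 42 under release/acquire */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```
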
Deadlock: A protocol must also be deadlock free. Given the arbitrary communication patterns and finite buffering within the memory system there are numerous opportunities for deadlock. For example, a deadlock can occur if a set of transactions holds network and buffer resources in a circular manner, and the consumption of one request requires the generation of another request. Similarly, lack of flow control in nodes can cause requests to back up into the network, blocking the flow of other messages that may be able to release the congestion.

Error Handling: Another issue related to correctness is support for data integrity and fault tolerance. Any large system will exhibit failures, and it is generally unacceptable if these failures result in corrupted data or incorrect results without a failure indication. This is especially true for parallel applications where algorithms are more complex and may contain some non-determinism which limits repeatability. Unfortunately, support for data integrity and fault-tolerance within a complex protocol that attempts to minimize latency and is executed directly by hardware is difficult. The protocol must attempt to balance the level of data integrity with the increase in latency and hardware complexity. At a minimum, the protocol should be able to flag all detectable failures, and convey this information to the processors affected.

2.2 Performance

Given a protocol that is correct, performance becomes the next important design criterion. The two key metrics of memory system performance are latency and bandwidth.

Latency: Performance is primarily determined by the latency experienced by memory requests. In DASH, support for cachable shared data provides the major reduction in latency. The latency of write misses is reduced by using write buffers and by the support of the release consistency model. Hiding the latency for read misses is usually more critical since the processor is stalled until data is returned. To reduce the latency for read misses, the protocol must minimize the number of intercluster messages needed to service a miss and the delay associated with each such message.

Bandwidth: Providing high memory bandwidth that scales with the number of processors is key to any large system. Caches and distributed memory form the basis for a scalable, high-bandwidth memory system in DASH. Even with distributed memory, however, bandwidth is limited by the serialization of requests in the memory system and the amount of traffic generated by each memory request.

Servicing a memory request in a distributed system often requires several messages to be transmitted. For example, a message to access a remote location generates a reply message containing the data, and possibly other messages invalidating remote caches. The component with the largest serialization in this chain limits the maximum throughput of requests. Serialization affects performance by increasing the queuing delays, and thus the latency, of memory requests. Queueing delays can become critical for locations that exhibit a large degree of sharing. A protocol should attempt to minimize the service time at all queuing centers. In particular, in a distributed system no central resources within a node should be blocked while inter-node communication is taking place to service a request. In this way serialization is limited only by the time of local, intra-node operations.
The amount of traffic generated per request also limits the effective throughput of the memory system. Traffic seen by the global interconnect and memory subsystem increases the queueing for these shared resources. DASH reduces traffic by providing coherent caches and by distributing memory among the processors. Caches filter many of the requests for shared data, while grouping memory with processors removes private references if the corresponding memory is allocated within the local cluster. At the protocol level, the number of messages required to service different types of memory requests should be minimized, unless the extra messages directly contribute to reduced latency or serialization.

2.3 Distributed Control and Complexity

A coherence protocol designed to address the above issues must be partitioned among the distributed components of the multiprocessor. These components include the processors and their caches, the directory and main memory controllers, and the interconnection network. The lack of a single serialization point, such as a bus, complicates the control since transactions do not complete atomically. Furthermore, multiple paths within the memory system and the lack of a single arbitration point within the system allow some operations to complete out of order. The result is that there is a rich set of interactions that can take place between different memory and coherence transactions. Partitioning the control of the protocol requires a delicate balance between the performance of the system and the complexity of the components. Too much complexity may affect the ability to implement the protocol or ensure that the protocol is correct.

3 Overview of DASH

Figure 2 shows a high-level picture of the DASH prototype we are building at Stanford. In order to manage the size of the prototype design effort, a commercial bus-based multiprocessor was chosen as the processing node. Each node (or cluster) is a Silicon Graphics POWER Station 4D/240 [4]. The 4D/240 system consists of four high-performance processors, each connected to a 64 Kbyte first-level instruction cache and a 64 Kbyte write-through data cache. The 64 Kbyte data cache interfaces to a 256 Kbyte second-level write-back cache through a read buffer and a 4-word-deep write buffer. The main purpose of this second-level cache is to convert the write-through policy of the first level to a write-back policy, and to provide the extra cache tags for bus snooping. Both the first- and second-level caches are direct-mapped.

In the 4D/240, the second-level caches are responsible for bus snooping and maintaining consistency among the caches in the cluster. Consistency is maintained using the Illinois coherence protocol [19], which is an invalidation-based ownership protocol. Before a processor can write to a cache line, it must first acquire exclusive ownership of that line by requesting that all other caches invalidate their copy of that line. Once a processor has exclusive ownership of a cache line, it may write to that line without consuming further bus cycles.
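
As a rough sketch (our own simplification, not the 4D/240 hardware or the full Illinois protocol), the ownership rule can be summarized as: a write first invalidates all other cached copies with a bus transaction, after which further writes to the line hit locally. The types and function names below are illustrative only.

```c
/* Simplified model of the invalidation-based ownership rule described
 * above; each cache is reduced to a single line for brevity. */
typedef enum { LINE_INVALID, LINE_SHARED, LINE_DIRTY } line_state_t;

typedef struct {
    unsigned long tag;      /* address tag of the cached block            */
    line_state_t  state;
} cache_line_t;

/* Bus read-exclusive: every other snooping cache invalidates its copy. */
void bus_read_exclusive(cache_line_t *caches, int ncaches,
                        int requester, unsigned long tag) {
    for (int i = 0; i < ncaches; i++)
        if (i != requester && caches[i].tag == tag)
            caches[i].state = LINE_INVALID;
}

/* Processor write: acquire exclusive ownership once, then write without
 * consuming further bus cycles. */
void processor_write(cache_line_t *caches, int ncaches,
                     int self, unsigned long tag) {
    cache_line_t *line = &caches[self];
    if (line->tag != tag || line->state != LINE_DIRTY) {
        bus_read_exclusive(caches, ncaches, self, tag);
        line->tag   = tag;
        line->state = LINE_DIRTY;      /* now exclusively owned           */
    }
    /* ... the data is written into the exclusively owned line here ...   */
}
```
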
The memory bus (MPBUS) of the 4D/240 is a pipelined synchronous bus, supporting memory-to-cache and cache-to-cache transfers of 16 bytes every 4 bus clocks with a latency of 6 bus clocks. While the MPBUS is pipelined, it is not a split-transaction bus. Consequently, it is not possible to efficiently interleave long-duration remote transactions with the short-duration local transactions. Since this ability is critical to DASH, we have extended the MPBUS protocol to support a retry mechanism. Remote requests are signaled to retry while the inter-cluster messages are being processed. To avoid unnecessary retries the processor is masked from arbitration until the response from the remote request has been received. When the response arrives, the requesting processor is unmasked, retries the request on the bus, and is supplied the remote data.

Figure 2: Block diagram of sample 2 x 2 DASH system.

A DASH system consists of a number of modified 4D/240 systems that have been supplemented with a directory controller board. This directory controller board is responsible for maintaining the cache coherence across the nodes and serving as the interface to the interconnection network.

The directory board is implemented on a single printed circuit board and consists of five major subsystems, as shown in Figure 3. The directory controller (DC) contains the directory memory corresponding to the portion of main memory present within the cluster. It also initiates out-bound network requests and replies. The pseudo-CPU (PCPU) is responsible for buffering incoming requests and issuing such requests on the cluster bus. It mimics a CPU on this bus on behalf of remote processors except that responses from the bus are sent out by the directory controller. The reply controller (RC) tracks outstanding requests made by the local processors and receives and buffers the corresponding replies from remote clusters. It acts as memory when the local processors are allowed to retry their remote requests. The network interface and the local portion of the network itself reside on the directory card. The interconnection network consists of a pair of meshes. One mesh is dedicated to the request messages while the other handles replies. These meshes utilize wormhole routing [9] to minimize latency. Finally, the board contains hardware monitoring logic and miscellaneous control and status registers. The monitoring logic samples a variety of directory board and bus events from which usage and performance statistics can be derived.
[Figure 3: Directory board block diagram — reply and request network interfaces; the RC with its remote access cache (RAC), which snoops on the bus; the PCPU, which forwards remote CPU requests to the local MPBUS and issues cache line invalidations and lock grants; and the MPBUS address/control interface.]

The directory memory is organized as an array of directory entries. There is one entry for each memory block. The directory entries used in the prototype are identical to those originally proposed in [6]. They are composed of a single state bit together with a bit vector of pointers to clusters. The state bit indicates whether the clusters have a read (shared) or read/write (dirty) copy of the data. The bit vector contains a bit for each of the sixteen clusters supported in the prototype. Associating the directory with main memory allows the directory to be built with the same DRAM technology as main memory. The DC accesses the directory memory on each MPBUS transaction along with the access to main memory. The directory information is combined with the type of bus operation, address, and result of the snooping within the cluster to determine what network messages and bus controls the DC will generate.
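
A directory entry of this form can be sketched as follows (our illustration; the field and function names are not DASH's): the presence bit vector lets a write generate point-to-point invalidations only for the clusters that actually hold a copy.

```c
/* Sketch of the prototype's per-block directory entry: one state bit
 * plus a 16-bit presence vector, as described above. */
#include <stdint.h>

#define NUM_CLUSTERS 16

typedef struct {
    uint16_t presence;  /* bit i set => cluster i caches the block        */
    uint8_t  dirty;     /* 0: clusters hold read (shared) copies,
                           1: one cluster holds a read/write (dirty) copy */
} dir_entry_t;

/* Collect the clusters that must receive an invalidation when `writer`
 * wants exclusive access; only the marked clusters are messaged, in
 * contrast to a snoopy broadcast. */
int invalidation_targets(const dir_entry_t *e, int writer,
                         int targets[NUM_CLUSTERS]) {
    int n = 0;
    for (int c = 0; c < NUM_CLUSTERS; c++)
        if ((e->presence & (1u << c)) && c != writer)
            targets[n++] = c;
    return n;   /* number of point-to-point invalidation messages */
}
```
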
The RC maintains its state in the remote access cache (RAC). The functions of the RAC include maintaining the state of currently outstanding requests, buffering replies from the network and supplementing the functionality of the processors' caches. The RAC is organized as a snoopy cache with augmented state information. The RAC's state machines allow accesses from both the network and the cluster bus. Replies from the network are buffered in the RAC and cause the waiting processor to be released for bus arbitration. When the released processor retries the access, the RAC supplies the data via a cache-to-cache transfer.
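
The bookkeeping the RAC performs for an outstanding remote request can be pictured with a structure along these lines; this is a hypothetical sketch, and the state names and field layout are ours, not the actual RAC encoding.

```c
/* Hypothetical RAC entry: augmented state for one outstanding remote
 * request, plus a buffer for the reply that the retried bus access will
 * be served from via a cache-to-cache transfer. */
#include <stdint.h>

typedef enum {
    RAC_FREE,           /* entry not in use                                */
    RAC_WAIT_REPLY,     /* remote request sent; reply not yet received     */
    RAC_REPLY_READY,    /* reply buffered; waiting processor unmasked      */
    RAC_OWNED_DIRTY     /* RAC holds ownership of the line for the cluster */
} rac_state_t;

typedef struct {
    rac_state_t state;
    uint32_t    block_addr;     /* block this entry is a placeholder for   */
    uint8_t     waiting_cpus;   /* bit per local CPU merged onto the request */
    uint8_t     data[16];       /* buffered reply (one 16-byte bus transfer) */
} rac_entry_t;
```
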
3.1 Memory Consistency in DASH

As stated in Section 2, the correctness of the coherence protocol is a function of the memory consistency model adopted by the architecture. There is a whole spectrum of choices for the level of consistency to support directly in hardware. At one end is the sequential consistency model [16], which requires the execution of the parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. As one moves towards weaker models of consistency, performance gains are made at the cost of a more complex programming model for the user.

The base model of consistency provided by the DASH hardware is called release consistency. Release consistency [10] is an extension of the weak consistency model first proposed by Dubois, Scheurich and Briggs [8]. The distinguishing characteristic of release consistency is that it allows memory operations issued by a given processor to be observed and complete out of order with respect to the other processors. The ordering of operations is only preserved before "releasing" synchronization operations or explicit ordering operations. Release consistency takes advantage of the fact that while in a critical region a programmer has already assured that no other processor is accessing the protected variables. Thus, updates to these variables can be observed by other processors in arbitrary order. Only before the lock release at the end of the region does the hardware need to guarantee that all operations have completed. While release consistency does complicate programming and the coherence protocol, it can hide much of the overhead of write operations.
Support for release consistency puts several requirements on the system. First, the hardware must support a primitive which guarantees the ordering of memory operations at specific points in a program. Such fence [5, 10] primitives can then be placed by software before releasing synchronization points in order to implement release consistency. DASH supports two explicit fence mechanisms. A full-fence operation stalls the processor until all of its pending operations have been completed, while a write-fence simply delays subsequent write operations. A higher performance implementation of release consistency includes implicit fence operations within the releasing synchronization operations themselves. DASH supports such synchronization operations, yielding release consistency as its base consistency model. The explicit fence operations in DASH then allow the user or compiler to synthesize stricter consistency models if needed.
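
A runnable analogue of the fence placement, written with C11 atomics rather than the DASH primitives (the lock and variable names are ours): the release fence before the unlock plays the role of a write-fence, letting the updates inside the critical region complete in any order while guaranteeing they are visible before the release is.

```c
/* Illustration only: a C11 stand-in for placing a write-fence before a
 * releasing synchronization operation. */
#include <stdatomic.h>

static int shared_a, shared_b;                 /* protected by `lock`      */
static atomic_flag lock = ATOMIC_FLAG_INIT;

void critical_update(void) {
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;                                      /* acquire the lock         */

    shared_a = 1;                              /* updates inside the region */
    shared_b = 2;                              /* may complete out of order */

    atomic_thread_fence(memory_order_release); /* the "write-fence"         */
    atomic_flag_clear_explicit(&lock, memory_order_relaxed);  /* release    */
}
```
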
The release consistency model also places constraints on the base coherence protocol. First, the system must respect the local dependencies generated by the memory operations of a single processor. Second, all coherence operations, especially operations related to writes, must be acknowledged so that the issuing processor can determine when a fence can proceed. Third, any cache line owned with pending invalidations against it cannot be shared between processors. This prevents the new processor from improperly passing a fence. If sharing is allowed, then the receiving processor must be informed when all of the pending invalidates have been acknowledged. Lastly, any operations that a processor issues after a fence operation may not become visible to any other processor until all operations preceding the fence have completed.
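
One way to picture the acknowledgement requirement is a per-processor count of outstanding operations that a fence simply drains. The sketch below is our own illustration of that bookkeeping, not the DASH hardware; all names are hypothetical.

```c
/* Hypothetical per-processor fence bookkeeping: writes and the
 * invalidations they cause are counted when issued and decremented when
 * acknowledged; a fence waits for the counts to reach zero. */
#include <stdatomic.h>

typedef struct {
    atomic_int pending_writes;   /* writes not yet acknowledged            */
    atomic_int pending_invals;   /* invalidation acks still outstanding    */
} fence_state_t;

void write_issued(fence_state_t *f)        { atomic_fetch_add(&f->pending_writes, 1); }
void write_acked(fence_state_t *f)         { atomic_fetch_sub(&f->pending_writes, 1); }
void invalidations_sent(fence_state_t *f, int n) { atomic_fetch_add(&f->pending_invals, n); }
void invalidation_acked(fence_state_t *f)  { atomic_fetch_sub(&f->pending_invals, 1); }

/* A fence may proceed only when every preceding operation has been
 * acknowledged. */
void wait_for_fence(fence_state_t *f) {
    while (atomic_load(&f->pending_writes) > 0 ||
           atomic_load(&f->pending_invals) > 0)
        ;   /* spin until all acknowledgements have arrived */
}
```
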
4 The DASH Cache Coherence Protocol

In our discussion of the coherence protocol, we use the following naming conventions for the various clusters and memories involved in any given transaction. A local cluster is a cluster that contains the processor originating a given request, while the home cluster is the cluster which contains the main memory and directory for a given physical memory address. A remote cluster is any other cluster. Likewise, local memory refers to the main memory associated with the local cluster while remote memory is any memory whose home is not the local cluster.

The DASH coherence protocol is an invalidation-based ownership protocol. A memory block can be in one of three states as indicated by the associated directory entry: (i) uncached-remote, that is, not cached by any remote cluster; (ii) shared-remote, that is, cached in an unmodified state by one or more remote clusters; or (iii) dirty-remote, that is, cached in a modified state by a single remote cluster. The directory does not maintain information concerning whether the home cluster itself is caching a memory block because all transactions that change the state of a memory block are issued on the bus of the home cluster, and the snoopy bus protocol keeps the home cluster coherent. While we could have chosen not to issue all transactions on the home cluster's bus, this would have yielded an insignificant performance improvement since most requests to the home also require an access to main memory to retrieve the actual data.
The protocol maintains the notion of an owning cluster for each memory block. The owning cluster is nominally the home cluster. However, in the case that a memory block is present in the dirty state in a remote cluster, that cluster is the owner. Only the owning cluster can complete a remote reference for a given block and update the directory state. While the directory entry is always maintained in the home cluster, a dirty cluster initiates all changes to the directory state of a block when it is the owner (such update messages also indicate that the dirty cluster is giving up ownership). The order that operations reach the owning cluster determines their global order.

As with memory blocks, a cache block in a processor's cache may also be in one of three states: invalid, shared, and dirty. The shared state implies that there may be other processors caching that location. The dirty state implies that this cache contains an exclusive copy of the memory block, and the block has been modified.
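
The two state spaces and the ownership rule can be summarized in a short sketch (types and names are ours, not DASH's): requests are globally ordered by the order in which they reach the owning cluster, which is the home cluster unless the block is dirty in a remote cluster.

```c
/* Illustrative encoding of the directory (memory block) and cache block
 * states described above, plus the owning-cluster rule. */
typedef enum { UNCACHED_REMOTE, SHARED_REMOTE, DIRTY_REMOTE } mem_state_t;
typedef enum { CACHE_INVALID, CACHE_SHARED, CACHE_DIRTY }     cache_state_t;

typedef struct {
    mem_state_t state;
    int         home_cluster;    /* holds the memory and directory entry   */
    int         dirty_cluster;   /* meaningful only when DIRTY_REMOTE      */
} block_info_t;

/* The owning cluster is the home cluster unless a remote cluster holds
 * the block dirty, in which case that cluster owns it. */
int owning_cluster(const block_info_t *b) {
    return (b->state == DIRTY_REMOTE) ? b->dirty_cluster : b->home_cluster;
}
```
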
The following sections outline the three primitive operations supported by the base DASH coherence protocol: read, read-exclusive and write-back. We also discuss how the protocol responds to the issues that were brought up in Section 2 and some of the alternative design choices that were considered. We describe only the normal flow for the memory transactions in the following sections; exception cases are covered in Section 4.6.

4.1 Read Requests

Memory read requests are initiated by processor load instructions. If the location is present in the processor's first-level cache, the cache simply supplies the data. If not present, then a cache fill operation must bring the required block into the first-level cache. A fill operation first attempts to find the cache line in the processor's second-level cache, and if unsuccessful, the processor issues a read request on the bus. This read request either completes locally or is signaled to retry while the directory board interacts with the other clusters to retrieve the required cache line. The detailed flow for a read request is given in Figure 7 in the appendix.

The protocol tries to minimize latency by using cache-to-cache transfers. The local bus can satisfy a remote read if the given line is held in another processor's cache or the remote access cache (RAC). The four processor caches together with the RAC form a five-way set associative (1.25 Mbyte) cluster cache. The effective size of this cache is smaller than a true set associative cache because the entries in the caches need not be distinct. The check for a local copy is initiated by the normal snooping when the read is issued on the bus. If the cache line is present in the shared state, then the data is simply transferred over the bus to the requesting processor and no access to the remote home cluster is needed. If the cache line is held in a dirty state by a local processor, however, something must be done with the ownership of the cache line since the processor supplying the data goes to a shared state in the Illinois protocol used on the cluster bus. The two options considered were to: (i) have the directory do a sharing write-back to the home cluster; and (ii) have the RAC take ownership of the cache line. We chose the second option because it permits the processors within a cluster to read and write a shared location without causing traffic in the network or home cluster.
If a read request cannot be satisfied by the local cluster, the processor is forced to retry the bus operation, and a request message is sent to the home cluster. At the same time the processor is masked from arbitration so that it does not tie up the local bus. Whenever a remote request is sent by a cluster, a RAC entry is allocated to act as a placeholder for the reply to this request. The RAC entry also permits merging of requests made by the different processors within the same cluster. If another request to the same memory block is made, a new request will not be sent to the home cluster; this reduces both traffic and latency. On the other hand, an access to a different memory block, which happens to map to a RAC entry already in use, must be delayed until the pending operation is complete. Given that the number of active RAC entries is small, the benefit of merging should outweigh the potential for contention.
When the read request reaches the home cluster, it is issued on that cluster's bus. This causes the directory to look up the status of that memory block. If the block is in an uncached-remote or shared-remote state, the directory controller sends the data over the reply network to the requesting cluster. It also records the fact that the requesting cluster now has a copy of the memory block. If the block is in the dirty-remote state, however, the read request is forwarded to the owning, dirty cluster. The owning cluster sends out two messages in response to the read. A message containing the data is sent directly to the requesting cluster, and a sharing writeback request is sent to the home cluster. The sharing writeback request writes the cache block back to memory and also updates the directory. The flow of messages for this case is shown in Figure 4.

Figure 4: Flow of Read Request to remote memory with directory in dirty-remote state.
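
The normal read flow just described can be condensed into the following sketch. The message helpers are hypothetical stand-ins for traffic on the request and reply meshes, not the DC/PCPU hardware interface, and the structure layout is ours.

```c
/* Condensed sketch of the normal read-request flow: the home replies
 * directly for uncached-remote/shared-remote blocks and forwards the
 * request to the dirty cluster otherwise; the dirty cluster then answers
 * the requester and sends a sharing writeback to the home. */
#include <stdio.h>

typedef enum { UNCACHED_REMOTE, SHARED_REMOTE, DIRTY_REMOTE } dir_state_t;

typedef struct {
    dir_state_t state;
    unsigned    presence;       /* presence bit vector                     */
    int         dirty_cluster;  /* valid when state == DIRTY_REMOTE        */
} dir_entry_t;

/* Stand-ins for messages on the reply and request meshes. */
void send_read_reply(int to, unsigned long block) {
    printf("reply: data for block %#lx -> cluster %d\n", block, to);
}
void forward_read_request(int to, int requester, unsigned long block) {
    printf("forward: read of %#lx for cluster %d -> dirty cluster %d\n",
           block, requester, to);
}
void send_sharing_writeback(int home, unsigned long block) {
    printf("sharing writeback of %#lx -> home cluster %d\n", block, home);
}

/* Home cluster: directory lookup on the bus transaction issued by the PCPU. */
void home_handle_read(dir_entry_t *e, int requester, unsigned long block) {
    if (e->state == DIRTY_REMOTE) {
        /* Forward instead of buffering, so no home resource is held while
         * inter-cluster messages are in flight. */
        forward_read_request(e->dirty_cluster, requester, block);
    } else {
        send_read_reply(requester, block);     /* data over the reply mesh */
        e->presence |= 1u << requester;        /* record the new sharer    */
        e->state = SHARED_REMOTE;
    }
}

/* Dirty cluster, on a forwarded read: answer the requester directly and
 * give up ownership with a sharing writeback to the home. */
void dirty_handle_forwarded_read(int requester, int home, unsigned long block) {
    send_read_reply(requester, block);
    send_sharing_writeback(home, block);
}

int main(void) {
    dir_entry_t e = { DIRTY_REMOTE, 1u << 3, 3 };
    home_handle_read(&e, /*requester=*/1, 0x1000);      /* gets forwarded  */
    dirty_handle_forwarded_read(1, /*home=*/0, 0x1000);
    return 0;
}
```
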
As shown in Figure 4, any request not satisfied in the home cluster is forwarded to the remote cluster that has a dirty copy of the data. This reduces latency by permitting the dirty cluster to respond directly to the requesting cluster. In addition, this forwarding strategy allows the directory controller to simultaneously process many requests (i.e., to be multithreaded) without the added complexity of maintaining the state of outstanding requests. Serialization is reduced to the time of a single intra-cluster bus transaction. The only resource held while inter-cluster messages are being sent is a single entry in the originating cluster's RAC.
The downside of the forwarding strategy is that it can result in additional latency when simultaneous accesses are made to the same block. For example, if two read requests from different clusters are received close together for a line that is dirty-remote, both will be forwarded to the dirty cluster. However, only the first one will be satisfied since this request will force the dirty cluster to lose ownership by doing a sharing writeback and changing its local state to read-only. The second request will not find the dirty data and will be returned with a negative acknowledge (NAK) to its originating cluster. This NAK will force the cluster to retry its access. An alternative to the forwarding approach used by our protocol would have been to buffer the read request at the home cluster, have the home send a flush request to the owning cluster, and then have the home send the data back to the originating cluster. We did not adopt this approach because it would have increased the latency for such reads by adding an extra network and bus transaction. Additionally, it would have required buffers in the directory to hold the pending transaction, or blocking subsequent accesses to the directory until the first request had been satisfied.

Figure 5: Flow of Read-Exclusive Request to remote memory with directory in shared-remote state.
4.2 Read-Exclusive Requests

Write operations are initiated by processor store instructions. Data is written through the first-level cache and is buffered in a four-word-deep write buffer. The second-level cache can retire the write if it has ownership of the line. Otherwise, a read-exclusive request is issued to the bus to acquire sole ownership of the line and retrieve the other words in the cache block. Obtaining ownership does not block the processor directly; only the write-buffer output is stalled. As in the case of read requests, cache coherence operations begin when the read-exclusive request is issued on the bus. The detailed flow of a read-exclusive request is given in the appendix in Figure 9 and is summarized below.
The flow of a read-exclusive request is similar to that of a read request.
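
As a speculative sketch only, assuming that for a shared-remote block the home replies to the requester and sends point-to-point invalidations to the sharing clusters (the case Figure 5 depicts), the home-side handling might look like this. All names, and the acknowledgement-count detail, are our assumptions rather than text from the paper.

```c
/* Speculative sketch (see lead-in): read-exclusive handling at the home
 * for a shared-remote block -- reply to the requester and send an
 * invalidation request to each sharing cluster. */
#include <stdio.h>

#define NUM_CLUSTERS 16

typedef struct {
    unsigned presence;          /* presence bit vector of sharing clusters */
} dir_entry_t;

void send_exclusive_reply(int to, unsigned long block, int inval_count) {
    printf("exclusive reply for %#lx -> cluster %d (%d invalidations pending)\n",
           block, to, inval_count);
}
void send_invalidation(int to, unsigned long block) {
    printf("invalidate %#lx -> cluster %d\n", block, to);
}

void home_handle_read_exclusive(dir_entry_t *e, int requester,
                                unsigned long block) {
    int invals = 0;
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        if ((e->presence & (1u << c)) && c != requester) {
            send_invalidation(c, block);
            invals++;
        }
    }
    /* Assumed detail: the requester is told how many invalidation
     * acknowledgements to expect before the write can be considered
     * complete with respect to a fence. */
    send_exclusive_reply(requester, block, invals);
    e->presence = 1u << requester;   /* requester is now the sole holder */
}

int main(void) {
    dir_entry_t e = { (1u << 2) | (1u << 5) };
    home_handle_read_exclusive(&e, /*requester=*/1, 0x2000);
    return 0;
}
```
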
