`With Asynchronous Caches
`
`Fong Pong, Michel Dubois*, Ken Lee
`Computer Systems and Technology Laboratory
`HP Laboratories Palo Alto
`HPL-1999-149
`November, 1999
`
`E-mail: fpong@hpl.hp.com
` kenlee@exch.hpl.hp.com
` dubois@paris.usc.edu
`
`
`
Keywords: asynchronous, cache coherence, shared-memory multiprocessor, SMP

We propose and evaluate a cache-coherent symmetric multiprocessor system (SMP) based on asynchronous caches, in which processors and memory controllers may observe the same coherence request at different points in time. All protocol transactions are unidirectional and processors do not report snoop results. The need for an extensive interlocking protocol between processor nodes and the memory controller, which is characteristic of snooping buses, is thus removed.
`
`
`
This design overcomes some of the scalability problems of a multi-drop
`shared-bus by employing high-speed point-to-point links, whose scalability
`prospects are much better than for shared buses. Memory and processors
`communicate through a set of queues. This highly pipelined memory system
`design is a better match to emerging ILP processors than bus-based
`snooping. Simulation results for ILP processors show that the shared-bus
`design is limited by its bandwidth. By contrast the parallel link design has
ample bandwidth and yields large performance gains for the transaction
`processing and scientific benchmarks that we have considered.
`
Besides higher performance, the asynchronous design we propose considerably simplifies the behavior expected from the hardware. This is
`important because snooping bus protocols are so complex today that their
`verification has become a major challenge.
`
* Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, California

© Copyright Hewlett-Packard Company 1999
`
`
`1 Introduction
`
`Because of their simplicity, shared-bus SMPs are a dominant architecture for small-scale systems.
`
`By taking advantage of the broadcast nature of a shared-bus, snooping protocols are the de facto
`
`schemes for achieving cache coherence. Figure 1 shows the basic configuration to support a four-
`
state MESI protocol [22], which is widely used in cache-coherent shared-memory multiprocessor
`
`systems. In such a system, every processor is associated with a bus watcher (a “snooper”) which
`
`monitors all bus activities. When a processor initiates a coherence transaction such as a load miss
`
on the bus, all snoopers in all processors latch in the request. The snoopers then consult their local caches, take the necessary actions, and respond with the appropriate snoop results. Each protocol
`
`transaction on the bus is deemed complete when all caches have reported their snoop result.
`
`Although the snooping bus-based design is a classic, well-understood design and offers many
`
`good features, it is becoming harder to design a shared bus-based SMP that keeps pace with
`
`emerging ILP processor technology.
`
`First and foremost, the multi-drop bus architecture is reaching its speed limit. When the clocking
`
`speed was low, the electrical length of the bus was short enough that distributed behavior of the
`
`bus could be ignored. However, as bus speeds increase, the processor boards connected to the bus
`
`behave as stubs resulting in reflections and ringing of bus signals. There exist several schemes for
`
terminations and signaling to reduce reflections, but none solves the fundamental problem of
`
stubs. Because of design constraints such as heat dissipation, the space needed between stubs is
`
`longer at high speeds. This limits the operating speed of buses to 150MHz in current systems.
`
[Figure: processors P0-P3 with caches C0-C3, each with a bus-watcher, attached to a shared Control+Address bus carrying Shared and Dirty signal lines.]

Figure 1 A Bus-based Cache System with Shared/Dirty Signals.
`
`For future processor designs with deep speculation, multiple cores, and/or multithreading [13, 14,
`
`2
`
`
`
`23], the shared-bus will no doubt become a major bottleneck even in small multiprocessor config-
`
`urations. An alternative is to have multiple short bus segments bridged by the memory controller
`
`[15, 18]. This approach has its own limitations. It is difficult to connect more than two full busses
`
`to the memory controller without an inordinate number of I/O pins. Furthermore, all bus transac-
`
tions that occur in one bus segment must be propagated to other bus segments unless the memory
`
`controller is equipped with coherence filters. A coherence filter essentially keeps track of memory
`
`blocks that are cached in the bus segment. Regardless of these possible extensions, the fundamen-
`
`tal problem of a shared-bus design remains. For instance, every request must start with an arbitra-
`
`tion cycle and spend one cycle for bus turnaround at the end. The protocol sets an upper bound on
`
`the maximum attainable throughput.
`
`Secondly, the bus snooping scheme requires all processors (and all snooping agents such as I/O
`
`bridge chips with I/O caches) to synchronize their responses. Generally speaking, the snoop results
`
`serve three purposes: 1) they indicate the safe completion of the snoop request in the cache hierar-
`
`chy of the local processor, 2) they provide sharing information, and 3) they identify which entity
`
should respond with the missing data block, i.e., either another processor or memory. For the pur-
`
`pose of illustration (see Figure 1), assume that the snooping results are propagated to all proces-
`
`sors via two bus lines Shared and Dirty. For a load miss, the processor may load the data block into
`
`the exclusive (E) or shared (S) state depending on whether the Shared or Dirty signal is asserted. In
`
`the case where a processor has the most recently modified copy of the requested data, it asserts the
`
`Dirty signal preventing the memory from responding with the data.
`
[Figure: snoop paths through the D$, I$, L2$, write-back cache, prefetching cache and L3$, concentrated into a single snoop result.]

Figure 2 Illustration of Snooping Paths in Modern Multiprocessors.
`
`The common approach is to require that all caches connected to the bus generate their snoop
`
`3
`
`
`
`results in exactly the same bus cycle. This requirement imposes a fixed latency time constraint
`
`between receiving each bus request and producing the snoop result and this fixed snooping delay
`
must accommodate the worst possible case. This constraint presents a number of challenges for
`
`the designers of highly-integrated processors with cache hierarchies. As illustrated in Figure 2,
`
`many modules must be snooped and the final results must be concentrated. In order to meet the
`
`fixed latency constraint, the processors may require ultra-fast snooping logic paths. The processor
`
`may have to adopt a priority scheme assigning a higher priority to snoop requests than to requests
`
`from the processor’s execution unit.
`
`More relaxed designs such as Intel’s P6 bus [16] allow processors and memory controllers to insert
`
`wait states when they are slow to respond. This scheme complicates the logic design because
`
`every processor must closely monitor the activities of other processors on the bus in order to re-
`
`generate snooping results when wait states are observed. In the case of Intel’s bus, for instance, the
`
`processor must repeat its snooping result two cycles after observing a wait state cycle.
`
`Yet another approach to synchronizing snoop results is the use of a shared wired-or Inhibit signal
`
`on the bus as was implemented in the SGI Challenge [12]. Processors may snoop at different
`
`speeds but must report the end of their snoop cycle on this new bus line. The transaction remains
`
`“open” for as long as any processor has not pulled down its Inhibit line. Again this interlocking of
`
`processor and memory signals on the bus results in complex bus protocols and bookkeeping of
`
`pending transactions in the interface of each processor, which in fact limits the number of concur-
`
`rent transactions on the bus. The design still relies on a shared line and on bus watching and is
`
complex to verify. This complexity increases with the number of concurrent transactions tolerated
`
`on the bus. Current designs are so complex that verification has become a major development cost.
`
`In this paper, we propose an efficient cache-coherent shared-memory multiprocessor system based
`
`on an asynchronous-snooping scheme. In this paper asynchronous refers to a model of cache-
`
`coherent systems first theoretically introduced and proven correct in [2]. It does not refer to vari-
`
`able-delay snooping using an inhibit line as described above. In fact it does not require reporting
`
`snoop results or synchronizing the snoops in any way. Because of this simplification, fast, high-
`
`bandwidth point-to-point links can be used to communicate requests and data between processors
`
`and memory. Snooping requests to different processors are propagated independently through
`
`queues. The number of pending and concurrent protocol transactions is only limited by the size of
`
`4
`
`
`
these FIFO queues. However, by emulating a shared-bus broadcast protocol, the topology of the
`
`point-to-point interconnection is made transparent to the processors and the memory.
`
`The various design aspects of the new system are given in Sections 2 and 3. Section 4 is dedicated
`
`to the evaluation methodology and system simulation models. We compare the effectiveness of
`
`various bus-based configurations with our parallel link design. These results are presented and dis-
`
`cussed in Section 5 and our final comments conclude the paper in Section 6.
`
`2 A New Asynchronous Cache System Design
`
`2.1 The Architectural Organization
`
`We advocate an asynchronous cache system such as the one shown in Figure 3. In this design, pro-
`
`cessors and memory controller communicate via unidirectional high-speed links [10, 25]. A set of
`
queues buffer requests and data blocks in case of access conflicts and also serve as adaptors between data paths of different widths.
`
[Figure: processors 0-3, each with a scheduling window/memory disambiguation unit, D$, I$ and L2$, connected by parallel request and data links and incoming buffers to the memory controller, which contains the ReqQ, SnoopQ, MemQ and DataQ queues, an internal address bus and data bus, and four memory banks (bank0-bank3) holding state and data for memory blocks.]

Figure 3 The Proposed Asynchronous Cache System.
`
`On the memory controller side, received requests are stored in a request queue. Through a high-
`
`5
`
`
`
`speed address bus internal to the memory controller, requests are routed to the snoop queues of all
`
`processors. Emerging from the snoop queue, request packets are transmitted through the links to
`
`the incoming buffers of processors. When they reach their destination, packets may snoop various
`
`caches in the node and a data response may be generated as a point-to-point packet to the request-
`
`ing processor through a data bus internal to the memory controller chip.
`
Note that the on-chip address and data busses may both be implemented as multiplexers driving the data lines that feed all processors' incoming queues. This path can be made very fast, depending on the distance and the RC delay of the process technology.
`
`2.2 Memory Disambiguation and Access Ordering
`
All the queues in this design are FIFOs (First-In-First-Out). In terms of the ordering of memory
`
`accesses, the behavior of this system has been shown to be indistinguishable from a platform with
`
`a shared bus [2, 19]. The address bus internal to the memory controller serves as a global ordering
`
`point for all memory accesses.
`
`For correctness, there are certain rules the system must support:
`
`1. The processor keeps track of all pending instructions in the scheduling window or the memory
`
`disambiguation buffer. Instructions are retired in program order.
`
2. Loads that have been executed but have yet to retire from the memory disambiguation buffer are subject to snooping requests. A write-invalidation arriving before the load retires causes the load to be re-issued and all instructions depending on the load to be replayed.
`
`3. Values of stores are written to the cache/memory system only when they are retired from the
`
`scheduling window, although the fetch of an exclusive cache line can proceed speculatively.
`
`Specifically, when a store operation is inserted into the scheduling window, a speculative access
`
may be issued to the cache system to obtain a local writable copy of the block. However, the
`
`store is actually executed only when it emerges at the head of the scheduling window.
`
`4. When the memory controller receives a request, the request is routed via the internal address
`
`bus to the input snoop queues of all processors, including the processor that made the
`
`request. At this point, a global ordering point has been established so that all processors
`
`observe the same sequence of events.
`
`5. Before processor pi can retire a request from its incoming snoop buffer, all requests received
`
`6
`
`
`
`before it in the buffer must have been completed.
`
A total order on all stores can thus be constructed in this system, and all processors can only observe the stores in that order. Thus the system is sequentially consistent, the strongest form of memory consistency.
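To make these rules concrete, the following minimal Python sketch (with names of our own choosing, not part of the actual design) models the internal address bus as the global ordering point: the memory controller removes one request per cycle from its request queue and appends it to the snoop FIFO of every processor, including the requester, and each processor drains its snoop FIFO strictly in order. All processors therefore observe the same sequence of coherence events, even though they observe each event at different times.

from collections import deque

NUM_PROCS = 4

class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.snoop_q = deque()      # incoming snoop FIFO, drained in order (rule 5)
        self.observed = []          # order in which coherence requests are seen

    def cycle(self):
        # Retire at most one snoop request per cycle, strictly in FIFO order.
        if self.snoop_q:
            self.observed.append(self.snoop_q.popleft())

class MemoryController:
    # The internal address bus is the global ordering point (rule 4): each request
    # is broadcast to every processor's snoop FIFO, including the requester's.
    def __init__(self, procs):
        self.procs = procs
        self.req_q = deque()        # ReqQ: requests arriving over the parallel links

    def cycle(self):
        if self.req_q:
            req = self.req_q.popleft()
            for p in self.procs:
                p.snoop_q.append(req)

procs = [Processor(i) for i in range(NUM_PROCS)]
mc = MemoryController(procs)

# Requests (kind, block address, requesting processor) are injected at arbitrary,
# asynchronous times by the processors ...
mc.req_q.extend([("write-miss", 0x40, 0), ("read-miss", 0x40, 1), ("write-miss", 0x80, 2)])

for _ in range(10):                 # simulate a few cycles
    mc.cycle()
    for p in procs:
        p.cycle()

# ... yet every processor ends up observing the same global order of events.
assert all(p.observed == procs[0].observed for p in procs)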
`
`2.3 The Cache Protocol
`
`Because we have no Shared, Dirty or Inhibit bus lines, we need to adapt parts of the protocol that
`
`rely on such lines. First, we cannot implement the E (Exclusive) state of the MESI protocol. This is
`
`easily done by using a simpler, MSI protocol. Second we need to make sure that one and only one
`
`cache or memory responds with the data on a cache miss. A classical solution to this problem is to
`
`maintain one Valid bit per block in memory. A memory block can be in one of two states: memory-
`
`invalid (MI) and memory-valid (MV). If a memory block is in the MI state, one of the processors
`
`must have a modified copy and supply the copy on a miss. If a memory block is in the MV state,
`
`some processor caches may have a shared block that must be consistent with the memory copy and
`
the memory delivers the copy. Because of this valid bit, processors and the memory controller do not need to report who provides a data copy. Third, the inhibit line or the fixed-latency snoop delay becomes useless because the protocol is now asynchronous and does not rely on broadcasting snoop results.
`
The protocol uses only the MSI states, a derivative of the MESI protocol. By definition, we
`
`have:
`
1. Modified (M). When a cache block is in the modified state, that cache has the most recent copy. This cache is
`
`also responsible for responding to a subsequent miss. The memory state must be invalid (MI).
`
2. Shared (S). The block may be shared among processors. The memory copy is valid (MV) and the memory responds to a subsequent miss.
`
`3. Invalid (I). The cache does not have the memory block.
`
`The protocol works as follows:
`
`7
`
`
`
`1. Read miss. If another cache has a modified copy, it responds to the requesting cache with a data
`
`copy, and updates the main memory as well. Otherwise, the memory must have a valid copy
`
`which is directly supplied to the requesting cache. At the end, all cached copies are in the
`
`shared state, and the memory is in the memory-valid state.
`
`2. Write hit. If the block is in the modified state, the write can proceed without delay. Otherwise,
`
`the cache block is always re-loaded, as in a write miss1.
`
`3. Write miss. A write miss request is sent to the memory and to all other processors. If the mem-
`
`ory has a valid copy, the memory will provide its copy to the requesting processor. Otherwise,
`
`the current owner will respond to the request. At the end, the requesting processor will load the
`
`block in the modified state. All other processors with a valid block set their state to invalid. The
`
`memory state is set to memory-invalid so that the memory will ignore subsequent misses.
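As an illustration, here is a minimal Python sketch of these transitions for a single memory block; the function and variable names are ours, and the sketch abstracts away all timing and queueing; it is not the hardware implementation.

M, S, I = "M", "S", "I"              # cache states
MV, MI = "MV", "MI"                  # memory states (the per-block valid bit)

def handle_miss(kind, requester, cache_state, mem_state):
    # cache_state: dict {processor_id: M/S/I}; mem_state: MV or MI.
    owner = next((p for p, s in cache_state.items() if s == M), None)

    if kind == "read":
        if owner is not None:        # owner supplies the data and updates memory
            cache_state[owner] = S
            mem_state = MV
        # otherwise memory is MV and supplies the copy directly
        cache_state[requester] = S
    elif kind == "write":
        # Data comes from the owner if memory is MI, otherwise from memory.
        for p, s in cache_state.items():
            if s != I:
                cache_state[p] = I   # all other valid copies are invalidated
        cache_state[requester] = M
        mem_state = MI               # memory will ignore subsequent misses
    return cache_state, mem_state

# A write hit to a Shared copy is handled as a write miss (see footnote 1):
caches = {0: S, 1: S, 2: I, 3: I}
caches, mem = handle_miss("write", 0, caches, MV)
assert caches == {0: M, 1: I, 2: I, 3: I} and mem == MI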
`
`2.4 Self-Regulation and Credit-Based Flow Control
`
`In the asynchronous cache system, the number of pending protocol transactions is only limited by
`
the finite sizes of the queues. To avoid overflow of the request and data buffers, a simple credit-based flow control scheme suffices. Each processor limits the number of its outstanding requests to
`
n. As shown in Figure 3, each processor has its own dedicated buffer at the memory controller and thus each buffer must hold n entries. Initially, the processor has n credits.
`
`The processor decreases its credits by one when a new request is sent and increases its credits by
`
`one when a reply is propagated back from the memory controller.
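A minimal sketch of this credit scheme (the names are ours) follows; a processor may send a request only when it holds a credit, so the dedicated n-entry buffer at the memory controller can never overflow.

class LinkCredits:
    def __init__(self, n):
        self.credits = n             # one credit per buffer entry at the controller

    def can_send(self):
        return self.credits > 0

    def on_send(self):
        assert self.credits > 0, "request issued without a credit"
        self.credits -= 1            # one buffer entry is now occupied

    def on_reply(self):
        self.credits += 1            # a reply frees the entry at the controller

# Usage with n = 4 outstanding requests:
link = LinkCredits(4)
sent = []
for addr in (0x40, 0x80, 0xC0, 0x100, 0x140):
    if link.can_send():
        link.on_send()
        sent.append(addr)            # request goes out on the links
assert len(sent) == 4                # the fifth request waits: no credit left
link.on_reply()                      # a returning reply restores one credit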
`
The memory controller may also inform the processor of the amount of available buffer space in
`
`its incoming request and data buffers. For instance, when the memory controller propagates a
`snoop request to processor pi, it may piggyback the number of available entries for pi’s out-bound
`data buffers with the request.
`
`In the asynchronous design, flow control is achieved by self-regulation without overhead. By con-
`
`trast, in the pipelined shared-bus design, processors and memory controller must arbitrate for the
`
memory bus when they want to send requests. For data transfer, the sender must listen to the ready signal from the receiver, and wait until the data bus is freed by a prior transfer.

1. This is done to simplify the protocol. However, the protocol would be more efficient and consume less interconnect bandwidth if the processor issued an Upgrade request instead [9].
`
To illustrate the complexity introduced by the shared-bus interface design, consider the case of back-to-back requests issued to the bus for the same memory block, shown in Figure 4.
`
[Figure: timing over cycles 1-5 for two back-to-back requests to the same address (addr1, addr2=addr1); the first request is snooped and latched, looked up, and its new state computed and installed, while the second request is snooped and latched, looked up, and its snoop_result produced before the new state of the first request is available.]

Figure 4 Race Condition for Back-to-Back Requests in a Shared-bus System.
`
`In this example, every processor receives two back-to-back requests to the same memory block, in
`
`cycle 1 and in cycle 3. For the first request, the processor looks up its cache in cycle 2, computes a
`
`new state in cycle 3, and then installs the new state in cycle 4. Unfortunately, the snooping result
`
of the second request depends on the new state computed after the first request, which will only be
`
`available after cycle 4. The situation is even more complex from the point of view of the processor
`
`issuing the first request. This processor may receive the second request before it even knows if its
`
`own request will complete, be deferred or be retried [16]. This is why concurrent requests to the
`
`same block are often disallowed in split-transaction bus systems. In the proposed asynchronous
`
`design, all these problems are solved without undue complexity by respecting the FIFO order of
`
`all requests and responses throughout the memory system. The asynchronous protocol does not
`
check for nor constrain the number of concurrent transactions to the same block.
`
`2.5 Trade-offs Made in the Proposed Design
`
`2.5.1 Elimination of the Exclusive State
`
`Traditionally, the exclusive (E) state has been advocated to reduce the consumption of bus band-
`
`width when a load miss is followed by a store hit to the same cache line. This is because store
`
`accesses to cache lines in the E state are resolved locally in the processor without incurring bus
`
`accesses.
`
`9
`
`
`
`Since processors do not report sharing status in our proposed design, the exclusive state cannot be
`
supported, at least in a simple way. The minimum cost of not having the exclusive state is the
`
`additional bus bandwidth required to propagate an Upgrade request (one control packet) on the bus
`
`on the first write following a read miss. Since our design accommodates a much larger bandwidth
`
`than current bus designs and is highly pipelined, the absence of an exclusive state is not a problem.
`
Even in the context of bus-based systems, the absence of the exclusive state creates negligible addi-
`
`tional bandwidth consumption in parallel and multiprogrammed workloads as is shown in [9]
`
`when Upgrades are used. In our MSI protocol, the absence of an exclusive state is more costly
`
`because we take a miss on every write hit to a Shared copy (see Section 2.3) in order to simplify
`
`the protocol.
`
`In the future, we can expect that the exclusive state may become totally useless if hardware and/or
`
`compiler can detect migratory sharing in parallel workloads [7, 21] and accesses to private data in
`
multiprogrammed workloads and give hints to the cache on load misses. In Section 5, we will show
`
`the impact of dropping the exclusive state on system performance.
`
`2.5.2 Increased Memory Bandwidth Consumption
`
`Another potential drawback of the proposed design is that it may require more memory bandwidth
`
`support because, if the valid bit in each memory block is stored in the same DRAM as the data, the
`
valid bit in memory must always be consulted and sometimes updated on every main memory
`
`access.
`
By contrast, a traditional shared-bus design may allow a memory access to be squashed to save
`
`memory bandwidth whenever a processor replies. As shown in Figure 5, when the processors
`
`probe their local caches on snoop requests, the memory controller normally starts a “speculative”
`
`access in parallel in order to reduce latency in case no processor owns the requested block. This
`
`memory access is termed “speculative” because it proceeds without knowing whether the data
`
`copy from memory will be used. With a proper design, “speculative” memory accesses may be
`
`dequeued from the memory queues when one of the processors asserts the Dirty line on the bus. It
`
`is, however, important to note that in many cases the memory access may still be unavoidable:
`
Since the “speculative” access may start as soon as the address is known, the snoop result may
`
`arrive too late to cancel the access. As a result, the speculative access may still have to complete,
`
`10
`
`
`
`but the memory does not return the copy.
`
[Figure: P0 puts a request and address on the control bus; P1 through Pn latch in the request and perform local cache lookups while the memory controller also latches in the request and starts a speculative memory access; if the Dirty signal is asserted, the fetched memory block is dropped.]

Figure 5 Speculative Memory Access by the Memory Controller in Snooping Protocols.
`
`Another problem is that the valid bit must at times be updated, which requires another DRAM
`
`access in some cases.
`
`A radical solution to these problems would be to store the valid bits in a separate, fast SRAM chip,
`
which is accessed in parallel with the memory. However, our results show that, for our bench-
`
`marks, bus-based snooping protocols are not really able to save much memory bandwidth by can-
`
`celling DRAM block accesses.
`
`The idea of a valid bit associated with each memory block is not new. In the Synapse protocol
`
`[11], when the cache with a modified line does not respond fast enough to a miss, the valid bit
`
`inhibits the memory from responding. This idea was also adopted in [3] in the design of a snoop-
`
`ing protocol for unidirectional slotted rings. In this design a snoop request propagates from node to
`
`node and the memory state must be snooped together with the cache state because a snoop request
`
`visits each node only once.
`
`3 Advances in Serial-Links Technology
`
`The performance of high-speed CMOS serial links is growing rapidly [10, 25]. These links are
`
suitable for point-to-point connections in an impedance-controlled environment. The reflection problems plaguing multi-drop buses are eliminated. Data rates up to 4Gbit/sec have been demonstrated
`
`using conventional CMOS technology.
`
`It is still premature to consider these technologies in massively parallel implementations because
`
`of their power, area, latency, and error rate characteristics. However, with the advances in CMOS
`
`11
`
`
`
`technology and circuit implementation, it is conceivable that differential 4Gbit/sec links will
`
`become available in the near future. The chip area needed by each link is basically the I/O pad area
`
`and power consumption is sufficiently low that hundreds of these links could be implemented on a
`
`single chip. The latency could be as small as 2-3 cycles plus time of flight.
`
`The following performance evaluations of the proposed asynchronous cache system have been
`
`carried out assuming links with bandwidths of 1 and 2 Gbit/sec.
`
`4 Methodology for Performance Evaluation
`
`The primary goal of this study is to provide some quantitative comparison of the proposed asyn-
`
`chronous protocol design with typical bus-based designs. We use cycle-accurate, trace-driven sim-
`
`ulation models for both the asynchronous design and traditional bus-based SMPs. Both models are
`
described below in some detail.
`
`4.1 Workloads
`
In this study, we use an Oracle TPC-C trace which was collected on a 4-way HP server by a tool
`
`based on object-code-translation technology [1]. The tool instruments the code to dump memory
`
`traces. In this configuration, the TPC-C traces are composed of trace files for 22 processes. These
`
`processes are partitioned and scheduled on the four processors in a dedicated manner. The traces
`
contain user-space accesses only. We also use three benchmarks from the SPLASH2 suite [24].
`
`Table 1 lists the footprints and number of instruction fetches, loads and stores in each benchmark.
`
Table 1 Characterization of Workloads.

Benchmark  Code Space Footprint  Data-set Size  Num. of Instructions  Num. of Loads  Num. of Stores
TPCC       2MB                   15MB           200,000,000           47,278,610     26,462,984
FFT        4KB                   3MB            36,709,821            4,964,999      2,868,340
RADIX      4KB                   8MB            101,397,013           41,039,729     9,544,862
OCEAN      45KB                  15MB           292,565,301           77,345,166     18,326,143
`
`4.2 Simulation Model
`
`4.2.1 The Asynchronous Cache Model
`
`The asynchronous caching model is shown in Figure 3. We simulate a 4-way SMP node. Every
`
`12
`
`
`
`processor has a 128KB, level-1 write-through data cache, a 128KB, level-1 instruction cache and a
`
`1MB level-2 write-back cache. All caches are 4-way set-associative and lockup-free.
`
`The Processor Model
`
We approximate a processor core that issues memory accesses out of order and completes them in order. It is an approximation because data-dependency information is not available in our traces. The traces contain all instructions. Instructions are inserted one by one in a 96-
`
`entry buffer and are retired one by one. Loads and stores can retire at the head of the buffer if they
`
`do not access a pending block in the second level caches. Instruction accesses are also simulated.
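The following Python sketch illustrates this approximation; the class and method names are ours, and the L2 lookup is stubbed out since the real model consults the simulated cache tags.

from collections import deque

WINDOW = 96

class TraceWindow:
    # Up to 96 in-flight instructions, inserted and retired in program order; a
    # load or store at the head retires only once its block is no longer pending.
    def __init__(self):
        self.window = deque()
        self.pending_blocks = set()              # L2 misses still outstanding

    def is_l2_miss(self, block):
        return True                              # stub: the real model checks the L2 tags

    def insert(self, kind, block):
        assert len(self.window) < WINDOW, "window full: retire before inserting"
        if kind in ("load", "store") and block is not None:
            if block not in self.pending_blocks and self.is_l2_miss(block):
                self.pending_blocks.add(block)   # the miss is issued immediately
        self.window.append((kind, block))

    def try_retire(self):
        if not self.window:
            return False
        kind, block = self.window[0]
        if kind in ("load", "store") and block in self.pending_blocks:
            return False                         # head waits for its miss to complete
        self.window.popleft()
        return True

    def miss_completed(self, block):
        self.pending_blocks.discard(block)

proc = TraceWindow()
proc.insert("load", 0x1000)
proc.insert("alu", None)
assert not proc.try_retire()                     # the head load waits for its L2 miss
proc.miss_completed(0x1000)
assert proc.try_retire() and proc.try_retire()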
`
The histogram of the number of pending instruction, load and store misses at the time when a
`
`new second-level cache miss is issued is displayed in Figure 6. In most cases there are no or few
`
`pending misses. The figure shows that the average number of pending misses is quite reasonable
`
`and that the absence of dependency information in the trace does not unduly stress the memory
`
`system.
`
`Figure 6 The Number of Pending Misses When Issuing a New Miss.
`
`Furthermore, in the SPLASH2 benchmarks, most data elements are accessed by indexing into data
`
`arrays within loops. For instance, one code line may read as:
`
dest[n+m] = x[i*k] + x[i*k+1]
`
`where n, m, i, k are typically private variables of processes that control the loop. The computation
`
`13
`
`
`
`of the indices should not cause global coherence traffic and should hit in the cache. As a result, the
`
`danger of issuing the loads to data array x much too early is limited. The only problem is the store
`
`to the dest data array. In this case, our approximation tends to aggressively prefetch the destination
`
`data block based on addresses. Nevertheless, we believe the number of pending accesses as shown
`
`in Figure 6 is reasonable, in particular for future processors with deep speculative execution.
`
`The Memory Subsystem
`
`A cache miss injects a request into the processor’s outgoing request queue. The request packet is
`
`128 bits long. The request packet is then muxed out through the parallel links (1gbps or 2gbps),
`
`entering the request queue (ReqQ) of the memory controller. Subsequently, the request is routed to
`
the snoop queues of all processors as well as to the memory queue via a 128-bit wide internal bus. Finally, data blocks are routed through a separate 256-bit wide internal data bus in the model.
`
[Figure: SDRAM clock timing showing an ACTIVE command followed after tRCD by a READ command, data words D0-D3 appearing after tCAS, and the tRRD and tRC intervals between ACTIVE commands.]

Figure 7 Basic Timing for a Read Burst.
`
There are four memory banks. The memory devices are 100MHz SDRAMs2 with four internal banks [20]. For these devices, a memory access starts with an ACTIVE command. A READ/WRITE command is given after tRCD. For a READ, the first bit of data is available after a read latency tCAS; for a write, the data is driven on the inputs in the same cycle in which the WRITE command is given. Figure 7 shows the basic timings.

2. The experiment can be extended to other types of DRAMs such as DDR-SDRAM and DRDRAM. We choose the simple SDRAMs to study memory bandwidth issues. The simulated system supports a data bandwidth of 6.4GB/s. It is a reasonable configuration for today's designs.

It is important to note that a subsequent ACTIVE command to a different row in the same internal bank can only be issued after the previous active row is closed.
`
`interval between successive accesses to the same internal bank is defined by tRC. On the other
`
`hand, a subsequent ACTIVE command to another internal bank can be issued when the first bank
`
`is being accessed. The minimum time interval between successive ACTIVE commands to differ-
`
`ent banks is defined by tRRD. In our simulation, we model the DRAM modules accordingly.
`
`However, we do not model an optimal memory controller design that may schedule accesses to
`
`interleaved banks in order to maximize throughput.
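The following Python sketch shows how these constraints can be modeled; the tRCD, tCAS and tRC values are those listed in Table 2 below, while the tRRD value is an assumption of ours since the paper does not list it.

tRCD, tCAS, tRC, tRRD = 2, 2, 8, 2               # in 100MHz memory cycles; tRRD assumed

class SDRAMBankTiming:
    def __init__(self, num_banks=4):
        self.last_active = [None] * num_banks    # cycle of the last ACTIVE per bank
        self.last_any_active = None              # cycle of the last ACTIVE to any bank

    def earliest_active(self, bank, now):
        t = now
        if self.last_active[bank] is not None:
            t = max(t, self.last_active[bank] + tRC)   # same bank: wait tRC
        if self.last_any_active is not None:
            t = max(t, self.last_any_active + tRRD)    # different bank: wait tRRD
        return t

    def do_active(self, bank, cycle):
        self.last_active[bank] = cycle
        self.last_any_active = cycle
        return cycle + tRCD + tCAS               # cycle at which the first data appears

timing = SDRAMBankTiming()
print(timing.do_active(0, 0))                    # first data word at cycle 4
print(timing.earliest_active(0, 1))              # same bank again: not before cycle 8
print(timing.earliest_active(1, 1))              # another bank: as early as cycle 2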
`
`We list the parameters for the simulated asynchronous cache system in Table 2.
`
Table 2 Parameters for Simulated Asynchronous Cache Configurations.

Module          Parameters          Ranges
Processor Core  Speed               500MHz
Cache           block size          64 bytes
                I$                  128KB, 4-way set-associative, 0-cycle cache hit
                D$                  128KB, 4-way set-associative, 0-cycle cache hit
                L2$                 1MB, 4-way set-associative, 6-cycles cache hit
Link            Address             8 parallel links (1gbps, or 2gbps)
                Data                16 parallel links (1gbps, or 2gbps)
                Flight Time (Req)   16ns for 8x1gbps links, 8ns for 8x2gbps links
                Flight Time (Data)  32ns for 16x1gbps links, 16ns for 16x2gbps links
Memory          Controller          200MHz, 128bits addr/control path, 256bits data path
                SDRAM               100MHz, 128bits-width/bank, 4 banks, tRCD=2 memory cycles, tCAS=2 memory cycles, tRC=8 memory cycles for a 64B cache line
`
`4.2.2 The Shared-Bus Based Synchronous Cache Model
`
`As a point of reference, we also simulate the traditional shared-bus based design. The bus-based
`
`model has the same memory configurations as the model of Figure 3 except that the parallel links
`
`are replaced by a multi-drop shared-bus.
`
`We assume a P6 bus-like configuration and protocol [16]. The basic timing for this pipelined bus is
`
`illustrated below. In our simulation, we assume an ideal design that does not limit the number of
`
`15
`
`
`
`concurrent bus transactions.
`
[Figure: five overlapping bus transactions, each consisting of an arbitration cycle (Arb), two request cycles (Req), two error cycles (Err), a snoop cycle (SNP), a response cycle (RSP) and a data phase of one to eight data cycles (D0-D7).]

Figure 8 Pipelined Bus Transactions.
`
`In general, a transaction starts with an arbitration cycle, followed by two request cycles. The snoop
`
`result is reported at the 4th cycle after the first request phase. Finally, the data phase starts at the
`
`6th cycle after the first request cycle. As clearly shown in the timing diagram of Figure 8, the max-
`
`imum request rate to the bus is one request every three bus cycles. Furthermore, the data phase
`
`may incur a long queueing delay for data returns.
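As a back-of-the-envelope check (our own calculation, based on the Figure 8 timing and the bus widths of Table 3 below), the address phase allows at most one transaction every three bus cycles, and for the narrower data bus configurations the data phase becomes the limit well before that:

bus_clock_hz = 100e6                          # bus clock from Table 3
line_bytes = 64                               # cache line size

addr_rate = bus_clock_hz / 3                  # one request per 3 bus cycles (Figure 8)
for data_cycles in (8, 4, 2):                 # 64-bit, 64-bit double-pumped, 128-bit double-pumped
    data_rate = bus_clock_hz / data_cycles    # cache lines per second on the data bus
    limit = min(addr_rate, data_rate)
    print(f"{data_cycles}-cycle data phase: "
          f"{data_rate * line_bytes / 1e9:.1f} GB/s data bandwidth, "
          f"at most {limit / 1e6:.1f}M transactions/s")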
`
`Table 3 summarizes the parameters for our simulations of bus-based systems.
`
Table 3 Parameters for Simulated Shared-Bus Configurations.

Module          Parameters            Ranges
Processor Core  Speed                 500MHz
Cache           block size            64 bytes
                I$                    128KB, 4-way set-associative, 0-cycle cache hit
                D$                    128KB, 4-way set-associative, 0-cycle cache hit
                L2$                   1MB, 4-way set-associative, 6-cycles cache hit
Bus             clock                 100MHz
                Data                  64-bits data bus, 8 bus cycles for 64B cache line
                Data (double-pumped)  64-bits data bus, 4 bus cycles for 64B cache line
                Data (double-pumped)  128-bits data bus, 2 bus cycles for 64B cache line
Memory          SDRAM                 100MHz, 128bits-width/bank, 4 banks, tRCD=2 memory cycles, tCAS=2 memory cycles, tRC=8 memory cycles for a 64B cache line