Design and Performance of SMPs With Asynchronous Caches

Fong Pong, Michel Dubois*, Ken Lee
Computer Systems and Technology Laboratory
HP Laboratories Palo Alto
HPL-1999-149
November, 1999

E-mail: fpong@hpl.hp.com
        kenlee@exch.hpl.hp.com
        dubois@paris.usc.edu

Keywords: asynchronous, cache coherence, shared-memory multiprocessor, SMP

We propose and evaluate a cache-coherent symmetric multiprocessor system (SMP) based on asynchronous caches. In a system with asynchronous caches, processors and memory controllers may observe the same coherence request at different points in time. All protocol transactions are unidirectional and processors do not report snoop results. The need for an extensive interlocking protocol between processor nodes and the memory controller, which is characteristic of snooping buses, is thus removed.

This design overcomes some of the scalability problems of a multi-drop shared bus by employing high-speed point-to-point links, whose scalability prospects are much better than for shared buses. Memory and processors communicate through a set of queues. This highly pipelined memory system design is a better match to emerging ILP processors than bus-based snooping. Simulation results for ILP processors show that the shared-bus design is limited by its bandwidth. By contrast, the parallel link design has ample bandwidth and yields large performance gains for the transaction processing and scientific benchmarks that we have considered.

Besides higher performance, the asynchronous design we propose considerably simplifies the behavior expected from the hardware. This is important because snooping bus protocols are so complex today that their verification has become a major challenge.

* Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, California

© Copyright Hewlett-Packard Company 1999

1 Introduction

Because of their simplicity, shared-bus SMPs are a dominant architecture for small-scale systems. By taking advantage of the broadcast nature of a shared bus, snooping protocols are the de facto schemes for achieving cache coherence. Figure 1 shows the basic configuration to support a four-state MESI protocol [22], which is widely used in cache-coherent shared-memory multiprocessor systems. In such a system, every processor is associated with a bus watcher (a “snooper”) which monitors all bus activities. When a processor initiates a coherence transaction on the bus, such as a load miss, the snoopers in all processors latch in the request. They consult the local caches, take the necessary actions and respond with the appropriate snoop results. Each protocol transaction on the bus is deemed complete when all caches have reported their snoop result.

Although the snooping bus-based design is a classic, well-understood design and offers many good features, it is becoming harder to design a shared-bus-based SMP that keeps pace with emerging ILP processor technology.

First and foremost, the multi-drop bus architecture is reaching its speed limit. When clocking speeds were low, the electrical length of the bus was short enough that the distributed behavior of the bus could be ignored. However, as bus speeds increase, the processor boards connected to the bus behave as stubs, resulting in reflections and ringing of bus signals. There exist several termination and signaling schemes to reduce reflections, but none solves the fundamental problem of stubs. Because of design constraints such as heat dissipation, the spacing needed between stubs is longer at high speeds. This limits the operating speed of buses to 150MHz in current systems.
Figure 1 A Bus-based Cache System with Shared/Dirty Signals. (Four processors P0-P3 with caches C0-C3, each with a bus watcher, attached to a shared Control+Address bus with Shared and Dirty lines.)
For future processor designs with deep speculation, multiple cores, and/or multithreading [13, 14, 23], the shared bus will no doubt become a major bottleneck even in small multiprocessor configurations. An alternative is to have multiple short bus segments bridged by the memory controller [15, 18]. This approach has its own limitations. It is difficult to connect more than two full busses to the memory controller without an inordinate number of I/O pins. Furthermore, all bus transactions that occur in one bus segment must be propagated to the other bus segments unless the memory controller is equipped with coherence filters. A coherence filter essentially keeps track of the memory blocks that are cached in a bus segment. Regardless of these possible extensions, the fundamental problem of a shared-bus design remains. For instance, every request must start with an arbitration cycle and spend one cycle for bus turnaround at the end. The protocol sets an upper bound on the maximum attainable throughput.

Secondly, the bus snooping scheme requires all processors (and all snooping agents such as I/O bridge chips with I/O caches) to synchronize their responses. Generally speaking, the snoop results serve three purposes: 1) they indicate the safe completion of the snoop request in the cache hierarchy of the local processor, 2) they provide sharing information, and 3) they identify which entity should respond with the missing data block, i.e., either another processor or memory. For the purpose of illustration (see Figure 1), assume that the snooping results are propagated to all processors via two bus lines, Shared and Dirty. For a load miss, the processor may load the data block into the exclusive (E) or shared (S) state depending on whether the Shared or Dirty signal is asserted. In the case where a processor has the most recently modified copy of the requested data, it asserts the Dirty signal, preventing the memory from responding with the data.
Figure 2 Illustration of Snooping Paths in Modern Multiprocessors. (A snoop request must probe the I$, D$, L2$, L3$, write-back and prefetching buffers, and the individual snoop results must be concentrated into a single response.)
The common approach is to require that all caches connected to the bus generate their snoop results in exactly the same bus cycle. This requirement imposes a fixed latency constraint between receiving each bus request and producing the snoop result, and this fixed snooping delay must accommodate the worst possible case. This constraint presents a number of challenges for the designers of highly-integrated processors with cache hierarchies. As illustrated in Figure 2, many modules must be snooped and the final results must be concentrated. In order to meet the fixed latency constraint, the processors may require ultra-fast snooping logic paths. The processor may have to adopt a priority scheme assigning a higher priority to snoop requests than to requests from the processor’s execution unit.

More relaxed designs such as Intel’s P6 bus [16] allow processors and memory controllers to insert wait states when they are slow to respond. This scheme complicates the logic design because every processor must closely monitor the activities of the other processors on the bus in order to regenerate its snooping results when wait states are observed. In the case of Intel’s bus, for instance, the processor must repeat its snooping result two cycles after observing a wait state cycle.

Yet another approach to synchronizing snoop results is the use of a shared wired-OR Inhibit signal on the bus, as was implemented in the SGI Challenge [12]. Processors may snoop at different speeds but must report the end of their snoop cycle on this new bus line. The transaction remains “open” for as long as any processor has not pulled down its Inhibit line. Again, this interlocking of processor and memory signals on the bus results in complex bus protocols and bookkeeping of pending transactions in the interface of each processor, which in fact limits the number of concurrent transactions on the bus. The design still relies on a shared line and on bus watching and is complex to verify. This complexity increases with the number of concurrent transactions tolerated on the bus. Current designs are so complex that verification has become a major development cost.
In this paper, we propose an efficient cache-coherent shared-memory multiprocessor system based on an asynchronous-snooping scheme. In this paper, asynchronous refers to a model of cache-coherent systems first theoretically introduced and proven correct in [2]. It does not refer to variable-delay snooping using an Inhibit line as described above. In fact, it does not require reporting snoop results or synchronizing the snoops in any way. Because of this simplification, fast, high-bandwidth point-to-point links can be used to communicate requests and data between processors and memory. Snooping requests to different processors are propagated independently through queues. The number of pending and concurrent protocol transactions is only limited by the size of these FIFO queues. However, by emulating a shared-bus broadcast protocol, the topology of the point-to-point interconnection is made transparent to the processors and the memory.

The various design aspects of the new system are given in Sections 2 and 3. Section 4 is dedicated to the evaluation methodology and system simulation models. We compare the effectiveness of various bus-based configurations with our parallel link design. These results are presented and discussed in Section 5, and our final comments conclude the paper in Section 6.
2 A New Asynchronous Cache System Design

2.1 The Architectural Organization

We advocate an asynchronous cache system such as the one shown in Figure 3. In this design, processors and the memory controller communicate via unidirectional high-speed links [10, 25]. A set of queues buffers requests and data blocks in case of access conflicts and also serves as an adaptor between data paths of different widths.
Figure 3 The Proposed Asynchronous Cache System. (Four processor nodes, processor 0, processor 1, p2 and p3, each with a scheduling window/memory disambiguation unit, I$, D$ and L2$, are connected to the memory controller by unidirectional parallel links carrying requests and data, with an incoming buffer per processor. Inside the memory controller, per-processor request queues (ReqQ), snoop queues (SnoopQ), data queues (DataQ) and memory queues (MemQ) are tied together by an internal address bus and data bus leading to four memory banks, bank0-bank3, whose blocks each hold state and data.)
On the memory controller side, received requests are stored in a request queue. Through a high-speed address bus internal to the memory controller, requests are routed to the snoop queues of all processors. Emerging from the snoop queue, request packets are transmitted through the links to the incoming buffers of the processors. When they reach their destination, packets may snoop the various caches in the node, and a data response may be generated as a point-to-point packet to the requesting processor through a data bus internal to the memory controller chip.

Note that the on-chip address and data busses may both be implemented as multiplexers onto the data lines feeding all processors’ incoming queues. This path can be made very fast, depending on the distance and the RC delay of the process technology.
2.2 Memory Disambiguation and Access Ordering

All the queues in this design are FIFOs (First-In-First-Out). In terms of the ordering of memory accesses, the behavior of this system has been shown to be indistinguishable from a platform with a shared bus [2, 19]. The address bus internal to the memory controller serves as a global ordering point for all memory accesses.

For correctness, there are certain rules the system must support:
1. The processor keeps track of all pending instructions in the scheduling window or the memory disambiguation buffer. Instructions are retired in program order.

2. Loads that have executed but have yet to retire from the memory disambiguation buffer are subject to snooping requests. A write-invalidation arriving before the load retires will cause the load to be re-issued and all instructions depending on the load to be replayed.

3. Values of stores are written to the cache/memory system only when they are retired from the scheduling window, although the fetch of an exclusive cache line can proceed speculatively. Specifically, when a store operation is inserted into the scheduling window, a speculative access may be issued to the cache system to obtain a local writable copy of the block. However, the store is actually executed only when it emerges at the head of the scheduling window.

4. When the memory controller receives a request, the request is routed via the internal address bus to the input snoop queues of all processors, including the processor that made the request. At this point, a global ordering point has been established so that all processors observe the same sequence of events.

5. Before processor pi can retire a request from its incoming snoop buffer, all requests received before it in the buffer must have been completed.
It is not difficult to see that a total order on all stores can be constructed in this system, and that all stores can only be observed in that order by any processor. Thus the system is sequentially consistent, the strongest form of memory consistency.
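To make rules 2 and 5 concrete, the sketch below shows one way a processor node could drain its incoming snoop queue in FIFO order and mark for replay any executed but not-yet-retired load hit by a write-invalidation. This is only our illustration of the rules, written in C; the structure and function names (SchedWindow, apply_write_invalidation) are hypothetical and not taken from the report.

    /* Illustrative sketch only: FIFO snoop processing with load replay
     * (rules 2 and 5 of Section 2.2).  All names are hypothetical. */
    #include <stdbool.h>

    #define WINDOW_SIZE 96   /* matches the 96-entry buffer of Section 4.2.1 */

    typedef struct {
        bool     is_load;
        bool     executed;     /* value already obtained speculatively   */
        bool     needs_replay; /* invalidated before retirement          */
        unsigned block_addr;   /* cache-block address accessed           */
    } WindowEntry;

    typedef struct {
        WindowEntry entry[WINDOW_SIZE];
        int head, count;       /* retirement is strictly from the head   */
    } SchedWindow;

    /* Apply one snoop request taken from the head of the incoming snoop
     * queue.  Rule 5: requests are handed over strictly in FIFO order,
     * so the caller must not pass a later request before this one has
     * completed. */
    void apply_write_invalidation(SchedWindow *w, unsigned inv_block)
    {
        for (int i = 0; i < w->count; i++) {
            WindowEntry *e = &w->entry[(w->head + i) % WINDOW_SIZE];
            /* Rule 2: an executed but not-yet-retired load to the
             * invalidated block must be re-issued, and its dependents
             * replayed, before it is allowed to retire. */
            if (e->is_load && e->executed && e->block_addr == inv_block)
                e->needs_replay = true;
        }
        /* ...the local cache state for inv_block is also updated here... */
    }

Because instructions retire in program order (rule 1), a replayed load simply re-executes before it can reach the head of the window, which is what makes the total store order visible to every processor.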
2.3 The Cache Protocol

Because we have no Shared, Dirty or Inhibit bus lines, we need to adapt the parts of the protocol that rely on such lines. First, we cannot implement the E (Exclusive) state of the MESI protocol. This is easily done by using a simpler MSI protocol. Second, we need to make sure that one and only one cache or memory responds with the data on a cache miss. A classical solution to this problem is to maintain one Valid bit per block in memory. A memory block can be in one of two states: memory-invalid (MI) and memory-valid (MV). If a memory block is in the MI state, one of the processors must have a modified copy and supply the copy on a miss. If a memory block is in the MV state, some processor caches may have a shared block that must be consistent with the memory copy, and the memory delivers the copy. Because of this valid bit, processors and the memory controller do not need to report who provides a data copy. Third, the Inhibit line or the fixed-latency snoop delay becomes useless because the protocol is now asynchronous and does not rely on broadcasting snoop results.
The protocol uses only the MSI states, a derivative of the MESI protocol. By definition, we have:

1. Modified (M). When a cache has a block in the modified state, it has the most recent copy. This cache is also responsible for responding to a subsequent miss. The memory state must be invalid (MI).

2. Shared (S). The block may be shared among processors. The memory copy is valid (MV) and the memory responds to a subsequent miss.

3. Invalid (I). The cache does not have the memory block.

The protocol works as follows:
1. Read miss. If another cache has a modified copy, it responds to the requesting cache with a data copy, and updates the main memory as well. Otherwise, the memory must have a valid copy, which is directly supplied to the requesting cache. At the end, all cached copies are in the shared state, and the memory is in the memory-valid state.

2. Write hit. If the block is in the modified state, the write can proceed without delay. Otherwise, the cache block is always re-loaded, as in a write miss¹.

3. Write miss. A write miss request is sent to the memory and to all other processors. If the memory has a valid copy, the memory will provide its copy to the requesting processor. Otherwise, the current owner will respond to the request. At the end, the requesting processor will load the block in the modified state. All other processors with a valid block set their state to invalid. The memory state is set to memory-invalid so that the memory will ignore subsequent misses.

¹ This is done to simplify the protocol. However, the protocol would be more efficient and consume less interconnect bandwidth if the processor issued an Upgrade request instead [9].
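The rules above can be summarized as a small state machine over the three cache states and the per-block memory valid bit. The following C sketch is only our illustration of the transitions just described, not the actual controller logic; names such as read_miss and write_miss are ours.

    /* Minimal sketch of the MSI cache states plus the per-block memory
     * valid bit (MV/MI) described in Section 2.3.  Illustrative only. */
    typedef enum { INVALID, SHARED, MODIFIED } CacheState;
    typedef enum { MEM_VALID, MEM_INVALID }    MemState;

    /* Handle a read miss by processor p for one block. */
    void read_miss(CacheState cache[], int nproc, int p, MemState *mem)
    {
        for (int q = 0; q < nproc; q++) {
            if (q != p && cache[q] == MODIFIED) {
                /* The owner supplies the copy and updates main memory. */
                cache[q] = SHARED;
                *mem = MEM_VALID;
            }
        }
        /* Otherwise *mem == MEM_VALID and the memory supplies the copy. */
        cache[p] = SHARED;
    }

    /* Handle a write miss (a write hit to a SHARED copy is treated the
     * same way in this protocol). */
    void write_miss(CacheState cache[], int nproc, int p, MemState *mem)
    {
        for (int q = 0; q < nproc; q++) {
            if (q != p) {
                /* The owner, if any, supplies the copy; if *mem is
                 * MEM_VALID the memory supplies it instead.  Every
                 * other valid copy is invalidated. */
                cache[q] = INVALID;
            }
        }
        cache[p] = MODIFIED;
        *mem = MEM_INVALID;   /* memory ignores subsequent misses */
    }

Treating a write hit to a Shared copy as a write miss is exactly the simplification that footnote 1 refers to.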
2.4 Self-Regulation and Credit-Based Flow Control

In the asynchronous cache system, the number of pending protocol transactions is only limited by the finite sizes of the queues. To avoid overflow of the request and data buffers, a simple credit-based flow control scheme will do. Each processor limits the number of its outstanding requests to n. As shown in Figure 3, each processor has its own dedicated buffer at the memory controller, and thus each buffer at the memory controller must hold n entries. Initially, the processor has n credits. The processor decreases its credits by one when a new request is sent and increases its credits by one when a reply is propagated back from the memory controller.
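A minimal sketch of this credit scheme is shown below. It is an illustration only; the constant N_CREDITS and the function names are our own assumptions and are not specified in the report.

    /* Sketch of per-processor credit-based flow control (Section 2.4).
     * Each processor starts with n credits, one per entry of its
     * dedicated request buffer at the memory controller. */
    #define N_CREDITS 8          /* assumed queue depth, n */

    typedef struct { int credits; } LinkInterface;

    void link_init(LinkInterface *l)            { l->credits = N_CREDITS; }

    /* Returns 1 if the request may be sent now, 0 if it must wait. */
    int try_send_request(LinkInterface *l)
    {
        if (l->credits == 0)
            return 0;            /* dedicated buffer may be full */
        l->credits--;            /* one entry consumed at the controller */
        /* ...transmit the 128-bit request packet over the links... */
        return 1;
    }

    /* Called when a reply comes back from the memory controller and the
     * corresponding buffer entry has been freed. */
    void on_reply(LinkInterface *l)             { l->credits++; }

Because a credit is returned only after the controller frees the corresponding entry, the dedicated buffer can never overflow and no explicit back-pressure signalling is required.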
The memory controller may also inform the processor of the amount of available buffer space in its incoming request and data buffers. For instance, when the memory controller propagates a snoop request to processor pi, it may piggyback the number of available entries in pi’s out-bound data buffers with the request.

In the asynchronous design, flow control is achieved by self-regulation without overhead. By contrast, in the pipelined shared-bus design, processors and the memory controller must arbitrate for the memory bus when they want to send requests. For data transfers, the sender must listen to the ready signal from the receiver, and wait until the data bus is freed by a prior transfer.

To illustrate the complexity introduced by the shared-bus interface design, consider the case of back-to-back requests issued to the bus for the same memory block, shown in Figure 4.
Figure 4 Race Condition for Back-to-Back Requests in a Shared-bus System. (A first request to addr1 is snooped and latched in cycle 1; the cache lookup occurs in cycle 2, the new state is computed in cycle 3 and installed in cycle 4. A second request to addr2 = addr1 is snooped and latched in cycle 3 and looked up in cycle 4, but its snoop result depends on the state installed by the first request.)
In this example, every processor receives two back-to-back requests to the same memory block, in cycle 1 and in cycle 3. For the first request, the processor looks up its cache in cycle 2, computes a new state in cycle 3, and then installs the new state in cycle 4. Unfortunately, the snooping result of the second request depends on the new state computed for the first request, which is only available after cycle 4. The situation is even more complex from the point of view of the processor issuing the first request. This processor may receive the second request before it even knows whether its own request will complete, be deferred or be retried [16]. This is why concurrent requests to the same block are often disallowed in split-transaction bus systems. In the proposed asynchronous design, all these problems are solved without undue complexity by respecting the FIFO order of all requests and responses throughout the memory system. The asynchronous protocol neither checks for nor constrains the number of concurrent transactions to the same block.
2.5 Trade-offs Made in the Proposed Design

2.5.1 Elimination of the Exclusive State

Traditionally, the exclusive (E) state has been advocated to reduce the consumption of bus bandwidth when a load miss is followed by a store hit to the same cache line. This is because store accesses to cache lines in the E state are resolved locally in the processor without incurring bus accesses.
Since processors do not report sharing status in our proposed design, the exclusive state cannot be supported, at least in a simple way. The minimum cost of not having the exclusive state is the additional bus bandwidth required to propagate an Upgrade request (one control packet) on the bus on the first write following a read miss. Since our design accommodates a much larger bandwidth than current bus designs and is highly pipelined, the absence of an exclusive state is not a problem. Even in the context of bus-based systems, the absence of the exclusive state creates negligible additional bandwidth consumption in parallel and multiprogrammed workloads, as is shown in [9], when Upgrades are used. In our MSI protocol, the absence of an exclusive state is more costly because we take a miss on every write hit to a Shared copy (see Section 2.3) in order to simplify the protocol.

In the future, we can expect that the exclusive state may become totally useless if the hardware and/or the compiler can detect migratory sharing in parallel workloads [7, 21] and accesses to private data in multiprogrammed workloads, and give hints to the cache on load misses. In Section 5, we will show the impact of dropping the exclusive state on system performance.
2.5.2 Increased Memory Bandwidth Consumption

Another potential drawback of the proposed design is that it may require more memory bandwidth because, if the valid bit of each memory block is stored in the same DRAM as the data, the valid bit in memory must always be consulted, and sometimes updated, on every main memory access.
By contrast, a traditional shared-bus design may allow a memory access to be squashed to save memory bandwidth whenever a processor replies. As shown in Figure 5, when the processors probe their local caches on snoop requests, the memory controller normally starts a “speculative” access in parallel in order to reduce latency in case no processor owns the requested block. This memory access is termed “speculative” because it proceeds without knowing whether the data copy from memory will be used. With a proper design, “speculative” memory accesses may be dequeued from the memory queues when one of the processors asserts the Dirty line on the bus. It is, however, important to note that in many cases the memory access may still be unavoidable: since the “speculative” access may start as soon as the address is known, the snoop result may arrive too late to cancel the access. As a result, the speculative access may still have to complete, but the memory does not return the copy.
Figure 5 Speculative Memory Access by the Memory Controller in Snooping Protocols. (P0 puts request A and its address on the control bus; P1 through Pn latch in the request and perform local cache lookups while the memory controller also latches in the request and starts a speculative memory access. If the Dirty signal is asserted, the fetched memory block is dropped.)
Another problem is that the valid bit must at times be updated, which requires another DRAM access in some cases.

A radical solution to these problems would be to store the valid bits in a separate, fast SRAM chip, which is accessed in parallel with the memory. However, our results show that, for our benchmarks, bus-based snooping protocols are not really able to save much memory bandwidth by cancelling DRAM block accesses.

The idea of a valid bit associated with each memory block is not new. In the Synapse protocol [11], when the cache with a modified line does not respond fast enough to a miss, the valid bit inhibits the memory from responding. This idea was also adopted in [3] in the design of a snooping protocol for unidirectional slotted rings. In that design, a snoop request propagates from node to node, and the memory state must be snooped together with the cache state because a snoop request visits each node only once.
3 Advances in Serial-Link Technology

The performance of high-speed CMOS serial links is growing rapidly [10, 25]. These links are suitable for point-to-point connections in impedance-controlled environments. The reflection problems plaguing multi-drop buses are eliminated. Data rates up to 4Gbit/sec have been demonstrated using conventional CMOS technology.

It is still premature to consider these technologies in massively parallel implementations because of their power, area, latency, and error-rate characteristics. However, with the advances in CMOS technology and circuit implementation, it is conceivable that differential 4Gbit/sec links will become available in the near future. The chip area needed by each link is basically the I/O pad area, and power consumption is sufficiently low that hundreds of these links could be implemented on a single chip. The latency could be as small as 2-3 cycles plus time of flight.

The following performance evaluations of the proposed asynchronous cache system have been carried out assuming links with bandwidths of 1 and 2 Gbit/sec.
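Under these assumptions, the cost of crossing the links is dominated by serialization: a packet of B bits sent over L links at r Gbit/s occupies the links for B/(L*r) nanoseconds. The short C sketch below is our own back-of-envelope check; with the 128-bit request packets and 64-byte data blocks of Section 4.2.1 it reproduces the flight times listed later in Table 2.

    #include <stdio.h>

    /* Serialization time in ns for a packet sent over a bundle of
     * serial links: bits_per_link / (Gbit/s) = ns.  Sketch only. */
    static double serialize_ns(int packet_bits, int n_links, double gbps)
    {
        return (double)packet_bits / n_links / gbps;
    }

    int main(void)
    {
        /* 128-bit request over 8 links: 16 ns at 1 Gbit/s, 8 ns at 2 Gbit/s */
        printf("req : %.0f ns / %.0f ns\n",
               serialize_ns(128, 8, 1.0), serialize_ns(128, 8, 2.0));
        /* 512-bit (64 B) data block over 16 links: 32 ns / 16 ns */
        printf("data: %.0f ns / %.0f ns\n",
               serialize_ns(512, 16, 1.0), serialize_ns(512, 16, 2.0));
        return 0;
    }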
4 Methodology for Performance Evaluation

The primary goal of this study is to provide a quantitative comparison of the proposed asynchronous protocol design with typical bus-based designs. We use cycle-accurate, trace-driven simulation models for both the asynchronous design and traditional bus-based SMPs. Both models are described below in some detail.
4.1 Workloads

In this study, we use an Oracle TPC-C trace which was collected on a 4-way HP server by a tool based on object-code-translation technology [1]. The tool instruments the code to dump memory traces. In this configuration, the TPC-C traces are composed of trace files for 22 processes. These processes are partitioned and scheduled on the four processors in a dedicated manner. The traces contain user-space accesses only. We also use three benchmarks from the SPLASH2 suite [24]. Table 1 lists the footprints and the number of instruction fetches, loads and stores in each benchmark.
Table 1 Characterization of Workloads.

Benchmark   Code Space   Data-set   Num. of        Num. of      Num. of
            Footprint    Size       Instructions   Loads        Stores
TPCC        2MB          15MB       200,000,000    47,278,610   26,462,984
FFT         4KB          3MB         36,709,821     4,964,999    2,868,340
RADIX       4KB          8MB        101,397,013    41,039,729    9,544,862
OCEAN       45KB         15MB       292,565,301    77,345,166   18,326,143
4.2 Simulation Model

4.2.1 The Asynchronous Cache Model

The asynchronous caching model is shown in Figure 3. We simulate a 4-way SMP node. Every processor has a 128KB level-1 write-through data cache, a 128KB level-1 instruction cache and a 1MB level-2 write-back cache. All caches are 4-way set-associative and lockup-free.
The Processor Model

We approximate a processor core that issues memory accesses out of order and completes them in order. It is an approximation because data-dependency information is not available in our traces. The traces contain all instructions. Instructions are inserted one by one into a 96-entry buffer and are retired one by one. Loads and stores can retire at the head of the buffer if they do not access a pending block in the second-level cache. Instruction accesses are also simulated.
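The sketch below captures our reading of this approximation: trace instructions enter a 96-entry buffer and the head entry may retire only if it is not a memory operation waiting on a pending second-level cache miss. It is an illustration only; the types and the l2_miss_pending hook are hypothetical stand-ins for the simulator’s internal interfaces.

    /* Sketch of the trace-driven processor approximation of Section 4.2.1:
     * instructions enter a 96-entry buffer and retire in program order.
     * Illustrative only; names and interfaces are hypothetical. */
    #include <stdbool.h>

    #define BUF_ENTRIES 96

    typedef struct {
        bool     is_mem_op;   /* load or store                      */
        unsigned block_addr;  /* cache-block address, if is_mem_op  */
    } TraceInstr;

    typedef struct {
        TraceInstr entry[BUF_ENTRIES];
        int head, count;
    } RetireBuffer;

    /* Stand-in for the cache model: true while the block still has an
     * outstanding miss in the second-level cache. */
    static bool l2_miss_pending(unsigned block_addr)
    {
        (void)block_addr;
        return false;         /* stub; the real simulator consults the L2 */
    }

    /* Retire at most one instruction per call, strictly in program
     * order: a load or store at the head must wait for its block. */
    static bool try_retire_one(RetireBuffer *b)
    {
        if (b->count == 0)
            return false;
        TraceInstr *head = &b->entry[b->head];
        if (head->is_mem_op && l2_miss_pending(head->block_addr))
            return false;
        b->head = (b->head + 1) % BUF_ENTRIES;
        b->count--;
        return true;
    }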
The histogram of the number of pending instruction, load and store misses at the time a new second-level cache miss is issued is displayed in Figure 6. In most cases there are no or few pending misses. The figure shows that the average number of pending misses is quite reasonable and that the absence of dependency information in the trace does not unduly stress the memory system.

Figure 6 The Number of Pending Misses When Issuing a New Miss.
Furthermore, in the SPLASH2 benchmarks, most data elements are accessed by indexing into data arrays within loops. For instance, one code line may read as:

    dest[n+m] = x[i*k] + x[i*k+1];

where n, m, i, k are typically private variables of the processes that control the loop. The computation of the indices should not cause global coherence traffic and should hit in the cache. As a result, the danger of issuing the loads to data array x much too early is limited. The only problem is the store to the dest data array. In this case, our approximation tends to aggressively prefetch the destination data block based on addresses. Nevertheless, we believe the number of pending accesses as shown in Figure 6 is reasonable, in particular for future processors with deep speculative execution.
The Memory Subsystem

A cache miss injects a request into the processor’s outgoing request queue. The request packet is 128 bits long. The request packet is then muxed out through the parallel links (1gbps or 2gbps), entering the request queue (ReqQ) of the memory controller. Subsequently, the request is routed to the snoop queues of all processors as well as to the memory queue via a 128-bit wide internal bus. Finally, data blocks are routed through a separate 256-bit wide internal data bus in the model.
Figure 7 Basic Timing for a Read Burst. (An ACTIVE command opens a row; after tRCD a READ command is issued, and after a further tCAS the burst data D0-D3 is returned. tRRD is the minimum spacing between ACTIVE commands to different banks, and tRC the minimum spacing between accesses to the same bank.)
There are four memory banks. The memory devices are 100MHz SDRAMs² with four internal banks [20]. For these devices, a memory access starts with an ACTIVE command. A READ/WRITE command is given after tRCD. For a READ, the first bit of data is available after a read latency tCAS; for a WRITE, the data is driven on the inputs in the same cycle in which the WRITE command is given. Figure 7 shows the basic timings.

² The experiment can be extended to other types of DRAMs such as DDR-SDRAM and DRDRAM. We choose the simple SDRAMs to study memory bandwidth issues. The simulated system supports a data bandwidth of 6.4GB/s. It is a reasonable configuration for today’s design.

It is important to note that a subsequent ACTIVE command to a different row in the same internal bank can only be issued after the previous active row is closed. Therefore, the minimum time interval between successive accesses to the same internal bank is defined by tRC. On the other hand, a subsequent ACTIVE command to another internal bank can be issued while the first bank is being accessed. The minimum time interval between successive ACTIVE commands to different banks is defined by tRRD. In our simulation, we model the DRAM modules accordingly. However, we do not model an optimal memory controller design that may schedule accesses to interleaved banks in order to maximize throughput.
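These constraints can be expressed as a small amount of per-bank bookkeeping. The sketch below is our own illustration of the timing checks described above, using the tRCD, tCAS and tRC values of Table 2; tRRD is set to 2 memory cycles purely as an assumption, since the report does not list its value.

    /* Sketch of the SDRAM bank-timing constraints of Section 4.2.1.
     * tRCD, tCAS, tRC follow Table 2; tRRD = 2 is our assumption.
     * Times are in 100 MHz memory cycles.  Illustrative only. */
    #define T_RCD 2   /* ACTIVE -> READ/WRITE                       */
    #define T_CAS 2   /* READ   -> first data                       */
    #define T_RC  8   /* ACTIVE -> next ACTIVE, same bank           */
    #define T_RRD 2   /* ACTIVE -> ACTIVE, different bank (assumed) */
    #define N_BANKS 4

    typedef struct {
        long last_active[N_BANKS];   /* cycle of last ACTIVE per bank  */
        long last_active_any;        /* cycle of last ACTIVE, any bank */
    } SdramTiming;

    /* Earliest cycle, not before 'now', at which an ACTIVE command may
     * be issued to 'bank', and the cycle at which the first data word
     * of the resulting burst would appear. */
    long earliest_data_cycle(SdramTiming *t, int bank, long now)
    {
        long act = now;
        if (act < t->last_active[bank] + T_RC)   /* same-bank spacing  */
            act = t->last_active[bank] + T_RC;
        if (act < t->last_active_any + T_RRD)    /* cross-bank spacing */
            act = t->last_active_any + T_RRD;

        t->last_active[bank] = act;
        t->last_active_any   = act;
        return act + T_RCD + T_CAS;  /* ACTIVE + tRCD = READ, + tCAS = data */
    }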
We list the parameters for the simulated asynchronous cache system in Table 2.
Table 2 Parameters for Simulated Asynchronous Cache Configurations.

Module           Parameters           Ranges
Processor Core   Speed                500MHz
Cache            block size           64 bytes
                 I$                   128KB, 4-way set-associative, 0-cycle cache hit
                 D$                   128KB, 4-way set-associative, 0-cycle cache hit
                 L2$                  1MB, 4-way set-associative, 6-cycle cache hit
Link             Address              8 parallel links (1gbps or 2gbps)
                 Data                 16 parallel links (1gbps or 2gbps)
                 Flight Time (Req)    16ns for 8x1gbps links; 8ns for 8x2gbps links
                 Flight Time (Data)   32ns for 16x1gbps links; 16ns for 16x2gbps links
Memory           Controller           200MHz, 128-bit addr/control path, 256-bit data path
                 SDRAM                100MHz, 128-bit width/bank, 4 banks, tRCD=2 memory cycles, tCAS=2 memory cycles, tRC=8 memory cycles for a 64B cache line
4.2.2 The Shared-Bus Based Synchronous Cache Model

As a point of reference, we also simulate the traditional shared-bus based design. The bus-based model has the same memory configuration as the model of Figure 3, except that the parallel links are replaced by a multi-drop shared bus.

We assume a P6 bus-like configuration and protocol [16]. The basic timing for this pipelined bus is illustrated in Figure 8. In our simulation, we assume an ideal design that does not limit the number of concurrent bus transactions.
Figure 8 Pipelined Bus Transactions. (Each transaction passes through an arbitration cycle (Arb), two request cycles (Req), two error cycles (Err), a snoop cycle (SNP) and a response cycle (RSP), followed by its data cycles, D0-D7 for a full cache line transfer or D0 for a short response; successive transactions are pipelined, with a new transaction starting every three bus cycles.)
In general, a transaction starts with an arbitration cycle, followed by two request cycles. The snoop result is reported at the 4th cycle after the first request phase. Finally, the data phase starts at the 6th cycle after the first request cycle. As clearly shown in the timing diagram of Figure 8, the maximum request rate to the bus is one request every three bus cycles. Furthermore, the data phase may incur a long queueing delay for data returns.
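A back-of-envelope calculation, ours rather than the authors', makes the ceiling explicit: at 100MHz the address pipeline accepts at most one transaction every three cycles (about 33M transactions/s), while the 64-bit data bus of Table 3 needs 8 cycles per 64-byte line and therefore tops out at roughly 800MB/s; the double-pumped variants raise this to 1.6GB/s and 3.2GB/s.

    #include <stdio.h>

    /* Back-of-envelope peak rates for the pipelined shared bus of
     * Figure 8 and Table 3.  Our own calculation, illustration only. */
    int main(void)
    {
        const double bus_hz = 100e6;   /* 100 MHz bus clock  */
        const double line_b = 64.0;    /* 64-byte cache line */

        /* Address side: one new transaction every 3 bus cycles. */
        double max_req_per_s = bus_hz / 3.0;
        printf("peak request rate: %.1f M transactions/s\n",
               max_req_per_s / 1e6);

        /* Data side: 8, 4 or 2 cycles per line (Table 3 variants). */
        const int cycles_per_line[3] = { 8, 4, 2 };
        for (int i = 0; i < 3; i++) {
            double lines_per_s = bus_hz / cycles_per_line[i];
            printf("data bus (%d cycles/line): %.1f M lines/s = %.1f GB/s\n",
                   cycles_per_line[i], lines_per_s / 1e6,
                   lines_per_s * line_b / 1e9);
        }
        return 0;
    }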
Table 3 summarizes the parameters for our simulations of bus-based systems.
Table 3 Parameters for Simulated Shared-Bus Configurations.

Module           Parameters             Ranges
Processor Core   Speed                  500MHz
Cache            block size             64 bytes
                 I$                     128KB, 4-way set-associative, 0-cycle cache hit
                 D$                     128KB, 4-way set-associative, 0-cycle cache hit
                 L2$                    1MB, 4-way set-associative, 6-cycle cache hit
Bus              clock                  100MHz
                 Data                   64-bit data bus, 8 bus cycles for 64B cache line
                 Data (double-pumped)   64-bit data bus, 4 bus cycles for 64B cache line
                 Data (double-pumped)   128-bit data bus, 2 bus cycles for 64B cache line
Memory           SDRAM                  100MHz, 128-bit width/bank, 4 banks, tRCD=2 memory cycles, tCAS=2 memory cycles, tRC=8 memory cycles for a 64B cache line
