Pattin et al.

United States Patent [19]
[11] Patent Number: 5,745,913
[45] Date of Patent: Apr. 28, 1998

US005745913A

5,479,640  12/1995  Cartman et al.
5,431,591   1/1996  Day, III et al.
5,481,707   1/1996  Murphy, Jr. et al.
5,495,339   2/1996  Stegbauer et al.
5,604,884   2/1997  Thome et al.

OTHER PUBLICATIONS

Synchronous DRAM Databook, IBM 16 Mbit, IBM03164-09C, pp. 1-100, Jan. 1996.

Primary Examiner: Tod R. Swann
Assistant Examiner: David Langjahr
Attorney, Agent, or Firm: Stuart T. Auvinen
`
`[57]
`
`ABSTRACT
`
Memory requests from multiple processors are re-ordered to
maximize DRAM row hits and minimize row misses.
Requests are loaded into a request queue and simultaneously
decoded to determine the DRAM bank of the request. The
last row address of the decoded DRAM bank is compared to
the row address of the new request and a row-hit bit is set
in the request queue if the row addresses match. The bank's
state machine is consulted to determine if RAS is low or
high, and a RAS-low bit in the request queue is set if RAS
is low and the row still open. A row counter is reset for every
new access but is incremented with a slow clock while the
row is open but not being accessed. After a predetermined
count, the row is considered "stale". A stale-row bit in the
request queue is set if the decoded bank has a stale row. A
request prioritizer reviews requests in the request queue and
processes row-hit requests first, then row misses which are
to a stale row. Lower in priority are row misses to non-stale
rows which have been more recently accessed. Requests
loaded into the request queue before the cache has deter-
mined if a cache hit has occurred are speculative requests
and can open a new row when the old row is stale or closed.
`
`19 Claims, 8 Drawing Sheets
`
[54] MULTI-PROCESSOR DRAM CONTROLLER THAT PRIORITIZES ROW-MISS REQUESTS TO STALE BANKS

[75] Inventors: Jay C. Pattin, Redwood City; James S. Blomgren, San Jose, both of Calif.

[73] Assignee: Exponential Technology, Inc., San Jose, Calif.

[21] Appl. No.: 691,005

[22] Filed: Aug. 5, 1996

[51] Int. Cl.: G06F 12/00

[52] U.S. Cl.: 711/105; 711/111; 395/413; 395/405; 364/DIG. 1

[58] Field of Search: 365/230.01; 395/405, 413, 438, 432

[56] References Cited

U.S. PATENT DOCUMENTS
`
5,051,889   9/1991  Fung et al.
5,265,236  11/1993  Mehring et al.
5,269,010  12/1993  MacDonald
5,280,571   1/1994  Keith et al.
5,289,584   2/1994  Thome et al.
5,301,287   4/1994  Herrell et al.
5,3012%     4/1994  Hilton et al.
5,307,320   4/1994  Fairer et al.
5,353,416  10/1994  Olson
5,357,606  10/1994  Adams
5,392,239   2/1995  Margulis et al.
5,392,436   2/1995  Jansen et al.
5,412,788   5/1995  Collins et al.
5,438,666   8/1995  Craft et al.
5,440,752   8/1995  Lentz et al.
5,448,702   9/1995  Garcia, Jr. et al.
5,450,564   9/1995  Hassler et al.
`
[Drawing sheets 1-8, U.S. Patent 5,745,913: FIG. 1 (multi-processor chip with LEVEL-2 CACHE and PCI INTERFACE on internal bus 52), FIG. 3 (row counter with NEXT_STATE, CLR, TIME_VAL, BURST_CNT, CLK, CLK_DIV10, ROW_ACTIVE, ROW_IDLE$, and ROW_OLD signals), the request queue and active request processing with ROW_HIT, LOW, OLD status and per-bank ROW ADDR registers 14, DRAM controller block diagrams and timing waveforms, and FIG. 9 (graph of latency in cycles versus miss rate per CPU and bus utilization).]
`
`MULTI-PROCESSOR DRAM CONTROLLER
`THAT PRIORITIZES ROW-MISS REQUESTS
`TO STALE BANKS
`
BACKGROUND OF THE INVENTION—FIELD
OF THE INVENTION

This invention relates to multi-processor systems, and
more particularly to DRAM controllers which re-order
requests from multiple sources.

BACKGROUND OF THE INVENTION—
DESCRIPTION OF THE RELATED ART
`
`Multi-processor systems are constructed from one or
`more processing elements. Each processing element has one
`or more processors, and a shared cache and/or memory.
`These processing elements are connected to other processing
`elements using a scaleable coherent interface (SCI). SCI
`provides a communication protocol for transferring memory
`data between processing elements. A single processing ele-
`ment which includes SCI can later be expanded or upgraded.
`An individual processing element typically contains two
`to four central processing unit (CPU) cores, with a shared
`cache. The shared cache is connected to an external memory
for the processing element. The external memory is constructed
from dynamic RAM (DRAM) modules such as
single-inline memory modules (SIMMs). The bandwidth
`between the shared cache and the external memory is critical
`and can limit system performance.
Standard DRAM controllers for uni-processors have been
available commercially and are well-known. However, these
controllers, when used for multi-processor systems, do not
take advantage of the fact that multiple processors generate
the requests to the external memory. Often requests from
different processors can be responded to in any order, not
just the order received by the memory controller.
Unfortunately, DRAM controllers for uni-processor systems
do not typically have the ability to re-order requests. Thus
standard DRAM controllers are not optimal for multi-
processor systems.
Synchronous DRAMs are becoming available which provide
extended features to optimize performance. The row
can be left open, allowing data to be accessed by pulsing
CAS without pulsing RAS again. Many CAS-only cycles
can be performed once the row address has been strobed into
the DRAM chips and the row left active. Burst cycles can
also be performed where CAS is strobed once while data that
sequentially follows the column address is bursted out in
successive clock cycles.
`What is desired is a DRAM controller which is optimized
`for a multi-processor system. It is desired to re-order
`requests from different CPU cores in an optimal fashion to
`increase bandwidth to the external DRAM memory. It is also
`desired to use burst features of newer synchronous DRAMs
`to further increase bandwidth.
`
SUMMARY OF THE INVENTION
`
A memory controller accesses an external memory in
response to requests from a plurality of general-purpose
processors. The memory controller has a plurality of bank controllers.
`Each bank controller accesses a bank of the external
`memory. Each bank controller has a state machine for
`sequencing control signals for timing access of the external
`memory. The state machine outputs a row-address-strobe
`RAS-active indication of a logical state of a RAS signal line
coupled to the bank of the external memory. A row address
register stores a last row address of a last-accessed row of
the bank of the external memory.
`A counter includes reset means for resetting the counter
`upon completion of a data-transfer access to the bank of the
`external memory. The counter periodically increments when
`the row is active and no data-transfer access occurs. The
`counter outputs a stale-row indication when a count from the
`counter exceeds a predetermined count.
`A request queue stores requests from the plurality of
`general-purpose processors for accessing the external
memory. The request queue stores row-status bits including:
`a) a row-hit indication when the last row address matches
`a row address of a request,
`b) the row-active indication from the state machine, and
`c) the stale-row indication from the counter.
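The three row-status bits above can be sketched as a simple enqueue-time computation (an illustrative Python model only; the function and argument names are assumptions, not part of the patent):

```python
def row_status_bits(request_row, last_row_address, ras_active, stale):
    """Compute the row-status bits stored with a queued request:
    (a) row-hit when the request's row matches the bank's last row address,
    (b) row-active from the bank's state machine,
    (c) stale-row from the bank's row counter."""
    return {
        "row_hit": request_row == last_row_address,  # (a)
        "row_active": ras_active,                    # (b)
        "stale_row": stale,                          # (c)
    }
```

A request whose decoded bank last opened the same row would thus be marked as a row hit regardless of how long the row has sat idle.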
`A request prioritizer is coupled to the request queue. It
`determines a next request from the request queue to generate
`a data-transfer access to the external memory. The request
`prioritizer re-orders the requests into an order other than the
`order the requests are loaded into the request queue. Thus
`requests from the plurality of processors are reordered for
`accessing the external memory.
`In further aspects of the invention the counter is incre-
`mented by a slow clock having a frequency divided down
`from a memory clock for incrementing the state machine
`when the row line is active and no data-transfer access
`occurs, but the counter is incremented by the memory clock
`during the data-transfer access. Thus the counter increments
`more slowly when no data-transfer occurs than when a
`data-transfer access is occurring.
`In further aspects the predetermined count for determin-
`ing the stale row indication is programmable. The request
`prioritizer re-orders requests having the stale-row indication
`before requests not having the stale-row indication when no
`requests have the row-hit indication. Thus miss requests to
`stale rows are processed before miss requests to more-
`recently-accessed rows.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a diagram of a multi-processor chip which
`connects to an external memory bus.
`FIG. 2 is a diagram of a state machine for accessing a
`dynamic random-access memory (DRAM) bank using page-
`mode access.
`
`FIG. 3 is a diagram of a burst counter which is also used
`to indicate how long the current row has been open without
`being accessed.
`FIG. 4 is a diagram of a DRAM controller for separately
`accessing multiple banks of DRAM.
`FIG. 5 is a diagram of the fields for a memory request in
`the request queue.
`FIG. 6 is a diagram showing that the status of the DRAM
`row is determined while the cache is determining if a cache
`hit or miss has occurred.
`
`FIG. 7 is a waveform showing a row being opened on one
`bank while a different bank is bursting data.
FIG. 8 is a timing diagram of SDRAM accesses using
normal and auto-precharge.
`FIG. 9 highlights a decrease in average latency when
`using request re-ordering.
`DETAILED DESCRIPTION
`
The present invention relates to an improvement in
embedded DRAM controllers. The following description is
presented to enable one of ordinary skill in the art to make
`and use the invention as provided in the context of a
`particular application and its requirements. Various modifi-
`cations to the preferred embodiment will be apparent to
`those with skill in the art, and the general principles defined
`herein may be applied to other embodiments. Therefore, the
`present invention is not intended to be limited to the par-
`ticular embodiments shown and described, but is to be
`accorded the widest scope consistent with the principles and
`novel features herein disclosed.
`Multi-Processor Die—FIG. 1
`FIG. 1 is a diagram of a multi-processor chip for con-
`necting to an external memory bus. Four CPU cores 42, 44,
`46, 48 each contain one or more execution pipelines for
fetching, decoding, and executing general-purpose instructions.
Each CPU core has a level-one primary cache. Operand
or instruction fetches which miss in the primary cache
are requested from the larger second-level cache 60 by
`sending a request on internal bus 52.
CPU cores 42, 44, 46, 48 may communicate with each
other by writing and reading data in second-level cache 60
without creating traffic outside of chip 30 on the external
memory bus. Thus inter-processor communication is some-
times accomplished without external bus requests. However,
communication to other CPUs on other dies does require
external cycles on the memory bus.
`Requests from CPU cores 42, 44, 46, 48 which cannot be
`satisfied by second-level cache 60 include external DRAM
`requests and PCI bus requests such as input-output and
`peripheral accesses. PCI requests are transferred from
`second-level cache 60 to PCI interface 58, which arbitrates
`for control of PCI bus 28, using PCI-specific arbitration
`signals. Bus 54 connects second-level cache 60 to PCI
`interface 58 and BIU 56.
Memory requests which are mapped into the address
space of the DRAMs rather than the PCI bus are sent from
second-level cache 60 to bus-interface unit (BIU) 56. BIU
56 contains a DRAM controller so that DRAM signals such
as chip-select (CS), RAS, CAS, and the multiplexed row/
column address are generated directly rather than by an
external DRAM controller chip connected to a local bus.
`Communication to the other processors is accomplished
`through scaleable—coherent interface SCI 57, which transfers
`data to other processing elements (not shown).
`DRAM Access Requirements
`The cost of a DRAM chip is typically reduced by multi-
`plexing the row and column addresses to the same input pins
`on a DRAM package. A row—address strobe (RAS) is
`asserted to latch the multiplexed address as the row address,
`which selects one of the rows in the memory array within the
`DRAM chip. A short period of time later a column-address
`strobe (CAS) is strobed to latch in the column address,
`which is the other half of the full address. Accessing DRAM
`thus requires that the full address be divided into a row
`address and a column address. The row address and the
`column address are strobed into the DRAM chip at different
`times using the same multiplexed-address pins on the
`DRAM chip.
`DRAM manufacturers require that many detailed timing
`specifications be met for the RAS and CAS signals and the
`multiplexed address. At the beginning of an access, the RAS
`signal must remain inactive for a period of time known as
the row or RAS precharge period. This often requires several
`clock periods of a processor's clock. After RAS precharge
`has occurred, the row address is driven out onto the multi-
`plexed address bus and RAS is asserted. Typically the row
`address must be driven one or more clock periods before
`
`RAS is asserted to meet the address set-up requirement.
`Then the column address is driven to the multiplexed
`address bus and CAS is asserted.
`For synchronous DRAMs, a burst mode is used where
`CAS is pulsed before the data is read out of the DRAM chip.
`On successive clock periods, a burst of data from successive
`memory locations is read out. CAS is not reasserted for
`each datum in the burst. The internal row is left asserted
`during this time although the external RAS signal may have
become inactive. CAS may then be asserted again with a
different column address when additional bursts of data have
the same row address but different column addresses.
Indeed, a row address may be strobed in once followed by
dozens or hundreds of CAS-only bursts. The internal row is
left asserted as long as possible, often until the next refresh
occurs. This is sometimes referred to as page-mode. Since a
typical multiplexed address has 11 or more address bits, each
row address or "page" has at least 2^11 or 2048 unique
column addresses.
When the arbitration logic for DRAM access determines
that the current access is to a row that is not requested by any
other CPUs, then a row precharge is performed upon
completion of requests to the row when other requests are
outstanding to other rows in the bank. This reduces latency
to these other rows.
`The row address is typically the higher-order address bits
`while the column address is the lowest-order address bits.
`This address partition allows a single row to contain 2K or
`more contiguous bytes which can be sequentially accessed
`without the delay to strobe in a new row address and
`precharge the row. Since many computer programs exhibit
`locality, where memory references tend to be closer rather
`than farther away from the last memory reference,
`the
`DRAM row acts as a small cache.
The DRAM is arranged into multiple banks, with the
highest-order address bits being used to determine which
bank is accessed. Sometimes lower-order address bits, espe-
cially address bits between the row and column address bits,
are used as bank-select bits. See "Page-interleaved memory
access", U.S. Pat. No. 5,051,889 by Fung et al., assigned to
Chips and Technologies, Inc. of San Jose, Calif. Each DRAM
`bank acts as a separate row-sized cache, reducing access
`time by avoiding the RAS precharge and strobing delay for
`references which match a row address of an earlier access.
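The address partition described above can be made concrete with a small sketch. The field widths below (11 column bits for a 2048-entry page, 2 bank-select bits sitting between the column and row fields) are illustrative assumptions, not figures from the patent:

```python
COL_BITS = 11   # an 11-bit column field gives 2^11 = 2048 column addresses per row
BANK_BITS = 2   # four banks, selected by bits just above the column field

def split_address(addr):
    """Split a flat address into (row, bank, column) fields, with the
    bank-select bits taken from between the column and row address bits."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = addr >> (COL_BITS + BANK_BITS)
    return row, bank, col
```

Two requests whose addresses yield the same (row, bank) pair can then share an open row without a new RAS precharge.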
`A synchronous DRAM adds a chip-select (CS) and a
`clock signal to the standard set of DRAM pins. The multi-
`plexed address, and the RAS, CAS, and WE control signals
`are latched into the DRAM chip when CS is active on a
`rising edge of the clock. Thus RAS, CAS, and WE are
`ignored when the chip-select signal is inactive. The internal
row may be activated by having RAS and CS low on a rising
`clock edge. The row remains activated when RAS becomes
`inactive as long as CS remains inactive to that DRAM chip.
`Other rows in other DRAM chips may then be activated by
`asserting a separate CS to these other DRAM chips. Nor-
`mally the multiplexed address and the RAS, CAS, WE
`control signals are shared among all banks of DRAMS, but
`each bank gets a separate chip-select. Only the bank having
`the active chip-select signal latches the shared RAS, CAS,
`WE signals; the other banks with CS inactive simply ignore
`the RAS, CAS, WE signals.
`Page-Mode DRAM State Machine—FIG. 2
FIG. 2 is a diagram of a state machine for accessing a
dynamic random-access memory (DRAM) bank using page-
mode access. A copy of DRAM state machine 10 is provided
for each bank of DRAM which is to be separately accessed.
Some DRAM banks may be disabled, indicated by disabled
state 91 which cannot be exited without re-configuring the
memory. Idle state 90 is first entered; the row is not yet
activated. Row precharge state 92 is then entered; the row is
still inactive. After a few clock periods have elapsed, the row
or RAS precharge requirement has been met and RAS is
asserted. Row active state 96 is then entered. CAS is then
asserted for reads in state 98 and for writes in state 99. After
the CAS cycle and any burst is completed, row active state
96 is re-entered and remains active until another access is
requested.
`Additional accesses that have the same row address are
`known as row hits, and simply assert CAS using states 98 or
`99. When the row address does not match the row address
`of the last access, then a row miss occurs. Row precharge
`state 92 is entered, and the new row address is strobed in as
`RAS becomes active (falls) when entering state 96. Then the
`column address is strobed in states 98 or 99.
A timer is used to signal when a refresh is needed, and
refresh state 93 is entered once any pending access is
`completed. Refresh is typically triggered by asserting CAS
`before the RAS signal, and a counter inside the DRAM chip
`increments for each refresh so that all rows are eventually
`refreshed.
These states can be encoded into a four-bit field known as
the BANK_STATE, which can be read by other logic
elements. Table 1 shows a simple encoding of these states
into a 4-bit field.

TABLE 1

DRAM State Encoding

Name              BANK_STATE
Idle, RAS High    0000
Row Precharge     0001
Row Active        0011
CAS Read          0100
CAS Write         0101
Refresh           0111
Not Installed     1111
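Table 1's encoding can be expressed directly in software (a sketch only; the enum and helper names are illustrative, and the "RAS low" states follow the later description of states 96, 98, and 99):

```python
from enum import IntEnum

class BankState(IntEnum):
    """4-bit BANK_STATE encoding from Table 1, readable by other logic."""
    IDLE          = 0b0000  # Idle, RAS high
    ROW_PRECHARGE = 0b0001
    ROW_ACTIVE    = 0b0011
    CAS_READ      = 0b0100
    CAS_WRITE     = 0b0101
    REFRESH       = 0b0111
    NOT_INSTALLED = 0b1111

def ras_low(state):
    # RAS is effectively "low" (row open) in the row-active and CAS states.
    return state in (BankState.ROW_ACTIVE, BankState.CAS_READ, BankState.CAS_WRITE)
```

Logic that loads the request queue could query `ras_low` on the decoded bank's state to set the ROW_ACTIVE bit.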
`
`Burst and Row Open Timer--FIG. 3
`FIG. 3 is a diagram of a burst counter which is also used
`to indicate how long the current row has been open without
`being accessed. Row counter 12 is provided for each bank
`of DRAMS which is separately accessible. Counter 20 is a
`4-bit binary upward-counter which increments for each
`rising edge of the increment clock input INCR. Once the
`maximum count of 1111 (15 decimal) is reached, the counter
`holds the terminal count rather than rolling over to zero.
Each time a new state is entered in DRAM state machine
10 of FIG. 2, NEXT_STATE is pulsed and counter 20 is
`cleared to 0000. Thus counter 20 counts the time that DRAM
`state machine 10 remains in any state.
`The processor’s memory clock CLK is selected by mux
`19 for all states except row active state 96. When any state
`requires multiple clock cycles, counter 20 is used to keep
`count of the number of clock cycles in that state. For
example, precharge may require 3 clock cycles. Row precharge
state 92 clears counter 20 when first entered, and
`remains active until the count from counter 20 reaches two,
`which occurs during the third clock period of CLK. Once the
`count TIME_VAL output from counter 20 reaches two, row
`precharge state 92 may be exited and row active state 96
`entered.
`When bursting of data is enabled, counter 20 is also used
`as a burst counter. When read CAS state 98 is entered,
`counter 20 is incremented each clock period as the data is
`
`burst until the burst count is reached. Finally read CAS state
`98 is exited once the desired burst count is reached. For a
`burst of four data items, four clock periods are required,
`while a burst of eight data items requires eight clock cycles.
`CAS may be pipelined so that a new column address is
`strobed in while the last data item(s) in the previous burst are
`being read or written. A simple pipeline is used to delay the
`data transfer relative to the state machine. For a CAS latency
`of three cycles, the burst of data actually occurs three clock
`cycles after it is indicated by the burst counter. In that case,
`read CAS state 98 is exited three cycles before the burst
`completes.
`Counter 20 is used in a second way for row active state 96.
`The second use of counter 20 is to keep track of the idle time
`since the last CAS cycle when RAS is low. Mux 19 selects
the clock divided by ten, CLK_DIV10, during row active
state 96 (ROW_ACTIVE). This slower clock increments
`counter 20 at a slower rate. The slower rate is needed
`because row active state 96 may be operative for many
cycles. In a sense, the DRAM bank is idle, but RAS has been
`left on for much of the time that row active state 96 is
`operative. Counter 20 can only count up to 15, so a fast clock
`would quickly reach the maximum count.
Counter 20 is slowly incremented by the divided-down
clock CLK_DIV10. A programmable parameter ROW_IDLE$
is programmed into a system configuration register
and may be adjusted by the system designer for optimum
performance. Comparator 18 signals ROW_OLD when the
count from counter 20 exceeds ROW_IDLE$, indicating
that row active state 96 has been active for longer than
ROW_IDLE$ periods of CLK_DIV10.
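A minimal model of counter 20's behavior (cleared on every state change, ticked by the selected clock, saturating at its 4-bit terminal count, compared against ROW_IDLE$) might look as follows; the class and method names are assumptions, and the clock mux of FIG. 3 is modeled by the caller choosing how often to tick:

```python
MAX_COUNT = 15  # 4-bit counter holds the terminal count rather than rolling over

class RowCounter:
    def __init__(self, row_idle_threshold):
        self.row_idle = row_idle_threshold  # programmable ROW_IDLE$ value
        self.count = 0

    def next_state(self):
        # NEXT_STATE pulse: clear the counter on every state change.
        self.count = 0

    def tick(self):
        # One edge of the selected clock (CLK, or CLK_DIV10 in row active state).
        self.count = min(self.count + 1, MAX_COUNT)

    @property
    def row_old(self):
        # Comparator 18: ROW_OLD when the count exceeds ROW_IDLE$.
        return self.count > self.row_idle
```

Ticking with the divided-down clock while the row sits open keeps the 4-bit count from saturating too quickly, as the text explains.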
`ROW Open Counter Indicates Stale Rows
ROW_OLD is thus signaled when row active state 96 has
been active for a relatively long period of time. No data
transfers from this bank have occurred in that time, since
another state such as read CAS state 98 would have been
entered, clearing counter 20. ROW_OLD is an indication of
how stale the open row is. Such information is useful when
determining whether another reference is likely to occur to
that open row. Once a long period of time has elapsed
without any accesses to that row, it is not likely that future
accesses will occur to this row. Rather, another row is likely
to be accessed.
Another row is more likely to be accessed when ROW_OLD
is active than when a shorter period of time has elapsed
since the last access and ROW_OLD is not asserted. The
priority of requests may be altered to lower the priority to a
stale bank. Scheduling logic in the DRAM controller may
examine ROW_OLD to determine when it is useful to close
a row and begin RAS precharge for another access, even
before another access is requested. This is known as a
speculative precharge, when the stale row is closed, and the
bank is pre-charged before an actual demand request arrives.
Multi-bank DRAM Controller—FIG. 4
FIG. 4 is a diagram of DRAM controller 56 for separately
accessing multiple banks of DRAM. Four DRAM banks 16
are shown, although typically 8 or 16 banks may be supported
by a simple extension of the logic shown. Each bank
has its own DRAM state machine 10, as detailed in FIG. 2,
and burst and row counter 12. Each bank also has its own
row-address register 14 which contains a copy of the row
address bits strobed into the DRAMs during the last row
open when RAS was asserted.
Bus 54 transmits memory requests from any of the four
CPU cores 42, 44, 46, 48 which missed in second-level
cache 60 of FIG. 1, or PCI interface 58 or SCI 57. These
requests are loaded into request queue 26 which are prioritized
by request prioritizer 22 before being processed by
`request processor 24 which activates one of the DRAM state
`machines 10 for accessing one of the DRAM banks 16.
Refresh timer 28 is a free-running counter/timer which
signals when a refresh is required, typically once every 0.1
msec. The different banks are refreshed in a staggered
fashion to reduce the power surge when the DRAM is
`refreshed. Four refresh request signals REFO:3 are sent to
`request prioritizer 22 which begins the refresh as soon as any
`active requests which have already started have finished.
`Thus refresh is given the highest priority by request priori-
`tizer 22.
`As each new request is being loaded into request queue
`26, the bank address is decoded and the row address is sent
`to row—address register 14 of the decoded bank to determine
`if the new request is to the same row-address as the last
`access of that bank. The DRAM state machine 10 of that
`bank is also consulted to determine if a requested row is
`active or inactive. The row counter is also checked to see if
`ROW_OLD has been asserted yet, indicating a stale row.
The results of the early query of the bank, the row-match
indicator ROW_HIT, the DRAM state machine's current
state of the row, ROW_ACTIVE, and RAS_LOW from the
row counter, are sent to request queue 26 and stored with the
`request.
`Request Queue Stores Row Hit, Count Status—FIG. 5
`FIG. 5 is a diagram of the fields for a memory request in
the request queue. Enable bit 31 is set when a new request
is loaded into request queue 26, but cleared when a request
`has been processed by a DRAM state machine which
`accesses the external memory. The full address of the
`request, including the bank, row, column, and byte address
`is stored in address field 32. An alternative is to store just the
`decoded bank number rather than the bank address. Status
`bits such as read or write indications are stored in field 34.
`An identifier for the source of the request is stored in source
`field 36.
Other fields are loaded after the bank's state machine, row
counter, and row address are consulted. If the new row
address of the request being loaded into request queue 26
matches the row address of the last access, which is stored
in row-address register 14, then ROW_HIT bit 38 is set;
otherwise it is cleared to indicate that the old row must be
closed and a new row address strobed in.
The DRAM state machine 10 of the decoded bank is
consulted to determine if the row is active (open) or inactive
(closed). RAS is effectively "low" and the row active for
states 96, 98, 99. When RAS is "low", bit ROW_ACTIVE
39 is set. The row counter 12 is also consulted for the
decoded bank, and ROW_OLD is copied to ROW_OLD
bit 35 in request queue 26.
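The queue entry of FIG. 5 can be sketched as a simple record (an illustrative Python dataclass; the field names follow the text, while the types are assumptions):

```python
from dataclasses import dataclass

@dataclass
class QueueEntry:
    enable: bool      # enable bit 31: set on load, cleared when processed
    address: int      # address field 32: bank, row, column, and byte address
    is_read: bool     # status field 34: read/write indication
    source: int       # source field 36: identifier of the requesting CPU/source
    row_hit: bool     # ROW_HIT bit 38: new row matches row-address register 14
    row_active: bool  # ROW_ACTIVE bit 39: bank state machine has RAS "low"
    row_old: bool     # ROW_OLD bit 35: copied from the bank's row counter
```

The prioritizer described next only needs the enable bit and the three row-status bits of each entry to pick the next request.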
`Row Status Looked-Up During Cache Access—FIG. 6
`FIG. 6 is a diagram showing that the status of the DRAM
`row is obtained while the cache is determining if a cache hit
or miss has occurred. Requests from CPU cores 42, 44, 46,
48, and requests from SCI 57 and PCI interface 58, are sent
to second-level cache 60 to determine if the data requested
is present in the on-chip cache. At the same time as the cache
lookup, before the cache hit or miss has been determined, the
request's address is decoded by bank decoder 62 to determine
which DRAM bank contains the data. The decoded
bank's state machine 10 and row-address register 14 are then
selected. Row-address comparator 64 signals ROW_HIT if
the new row address of the request matches the row address
of the last access, stored in register 14. ROW_ACTIVE is
signaled by DRAM state machine 10 if the current state has
the RAS signal active (low).
`
Row counter 12 of the decoded bank is also consulted to
determine if the row has been open for a long period of time
without a recent access. ROW_OLD is signaled for such a
stale row. These row status signals, ROW_HIT, ROW_ACTIVE,
and ROW_OLD, are loaded into request queue
26, possibly before cache 60 has determined if a hit or a miss
has occurred. Should a cache hit occur, the row status
information is simply discarded since cache 60 can supply
the data without a DRAM access.
`The bank status information in request queue 26 is
`updated when the status of a bank changes. Alternately, the
`bank’s status may be read during each prioritization cycle,
`or as each request is processed.
`Request prioritizer 22 then examines the pending requests
`in request queue 26 to determine which request should be
`processed next by request processor 24. Request prioritizer
22 often processes requests in a different order than the
`requests are received in order to maximize memory perfor-
`mance.
`
`In the preferred embodiment, request prioritizer 22
`attempts to group requests to the same row together.
`Requests to the row with the most requests are processed
`before requests to other rows with fewer requests.
`Prioritizer Re-Orders Requests
`Request prioritizer 22 re-orders the requests to maximize
`memory bandwidth. Bandwidth can be increased by reduc-
`ing the number of row misses: requests to a row other than
`the open row. The prioritizer searches pending requests in
`request queue 26 and groups the requests together by bank
`and row. Any requests to a row that is already open are
`processed first. These requests to open rows are rapidly filled
`since the RAS precharge and opening delays are avoided.
`Row hits are requests to rows which are already open.
Row-hit requests are re-ordered and processed before other
row-miss requests. The row-miss request may be an older
request which normally is processed before the younger
row-hit request. However, the older row-miss request
requires closing the row that is already open. If this row were
closed, then the row-hit request would require spending
additional delay to precharge RAS and open the row again.
Once all of the row-hit requests have been processed, only
row-miss requests remain. Any requests to banks with RAS
inactive (ROW_ACTIVE=0) are processed first, since no
row is open. Then row-miss requests to banks with open
rows are processed. The row counter is used to determine
priority of these requests: any row-miss requests that have
ROW_OLD active are processed before requests with
ROW_OLD inactive, so that stale rows are closed before
more-recently accessed rows are closed. The lowest priority
request is a row-miss to a bank with a recently-accessed row
open. Recently-accessed banks are likely to be accessed
again, and thus a delay in closing the bank may allow a new
request to that open row to be received and processed before
closing the row to process the row-miss request.
The pending requests are processed in this order:
DRAM Refresh Request
Row-Hit Request
Row-Miss Request to Bank with Row Closed
Row-Miss Request to Bank with Row Open, ROW_OLD=1
Row-Miss Request to Bank with Row Open, ROW_OLD=0
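The five-level ordering above can be sketched as a sort key (illustrative Python only; a full prioritizer also groups requests by row and breaks ties by group size, as the surrounding text describes):

```python
def priority(req):
    """Lower value = processed sooner, per the order listed above.
    `req` is a dict with refresh/row_hit/row_active/row_old flags."""
    if req.get("refresh"):
        return 0                 # DRAM refresh request
    if req["row_hit"]:
        return 1                 # row hit: CAS-only access
    if not req["row_active"]:
        return 2                 # row miss to a bank with its row closed
    if req["row_old"]:
        return 3                 # row miss to an open but stale row
    return 4                     # row miss to an open, recently-accessed row

def next_request(queue):
    # Pick the pending request with the best (lowest) priority value.
    return min(queue, key=priority)
```

A row miss to a recently-accessed open row thus always sorts last, leaving that bank open as long as possible.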
Thus a recently-accessed bank with ROW_OLD=0 is left
open as long as possible because any row-miss request to
this bank is given the lowest priority. When multiple
requests have the same priority, then requests in a group
`
`having a larger number of individual requests to a row are
`processed before smaller groups of requests.
`When multiple requests are pending to the current row,
`then these requests are processed before other requests to
`other rows. Efficient CAS-only cycles are used. When there
`is only one request to the current row, then an auto-precharge
`CAS cycle is performed to precharge the row as soon as
`possible. Auto-precharge is a special type of CAS cycle
`whereby a row precharge is requested immediately after the
`data has been internally read from the RAM array, even
`before the data has been bursted out of the DRAM chip.
Auto-precharge is supported by some synchronous
DRAMs, allowing row precharge to begin before the data
`has been completely transfe