throbber
1208
`
`IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 8, AUGUST 1998
`
`A Replica Technique for Wordline and
`Sense Control in Low-Power SRAM’s
`
`Bharadwaj S. Amrutur and Mark A. Horowitz, Senior Member, IEEE
`
`Abstract— With the migration toward low supply voltages in
`low-power SRAM designs, threshold and supply voltage fluc-
`tuations will begin to have larger impacts on the speed and
`power specifications of SRAM’s. We present techniques based on
`replica circuits which minimize the effect of operating conditions’
`variability on the speed and power. Replica memory cells and
`bitlines are used to create a reference signal whose delay tracks
`that of the bitlines. This signal is used to generate the sense clock
`with minimal slack time and control wordline pulsewidths to limit
`bitline swings. We implemented the circuits for two variants of
`the technique, one using bitline capacitance ratioing in a 1.2-m
`8-kbyte SRAM, and the other using cell current ratioing in a 0.35-
`m 2-kbyte SRAM. Both the RAM’s were measured to operate
`over a wide range of supply voltages, with the latter dissipating
`3.6 mW at 150 MHz at 1 V and 5.2 W at 980 kHz at 0.4 V.
`
`Index Terms—Low power, low swing bus, low voltage, pulsed
`decoder, replica technique, self-timing, sense clock control,
`SRAM’s, threshold variation, wordline pulsing.
`
`I. INTRODUCTION
`
`LOW-POWER circuit designers have been continually
`
`pushing down supply voltages to minimize the energy
`consumption of chips for portable applications [1]–[3]. The
`same trend has also applied to low-power SRAM’s in the
`past few years [4]–[6]. While the supply voltages are scaling
`down at a rapid rate, to control subthreshold leakage, the
`threshold voltages have not scaled down as fast, which has
`resulted in a corresponding reduction of the gate overdrive for
`the transistors. With the fluctuations in the threshold voltages
`also not expected to decrease in future submicron devices [7],
`[8], the delay variability of low-power circuits across process
`corners will increase in the future [9]. The large delay spreads
`across process corners will necessitate bigger margins in the
`design of the bitline path in an SRAM, and also will result in
`larger bitline power dissipation and loss of speed. This problem
`can be mitigated by using a self-timed approach to designing
`the bitline path, based on delay generators which track the
`bitline delays across operating conditions.
`Traditionally, the bitline swings during a read access have
`been limited by using active loads of either diode-connected
`nMOS or resistive pMOS [10], [11], but these clamp the bitline
`swing at the expense of a steady bitline current. A more power-
`efficient way of limiting the bitline swings is to use high-
`
`impedance bitline loads and pulse the wordlines [12]–[15].
`Bitline power can be further minimized by controlling the
`wordline pulsewidth to be just wide enough to guarantee
`the minimum bitline swing development. This type of bitline
`swing control can be achieved by a precise pulse generator
`that can match the bitline delay. Low-power SRAM’s also
`use clocked sense amplifiers to limit the sense power. These
`are either the current mirror type [16], [17] or cross-coupled
`latch type [18], [19] designs. In the former, the sense clock
`turns on the amplifier sometime before the sensing, to set up
`the amplifier in the high-gain region. To reduce power, the
`amount of time the amplifier is ON should be minimized. In the
`latch-type amplifiers, the sense clock starts the amplification,
`and hence the sense clock needs to track the bitline delay to
`ensure correct and fast operation.
`Fundamentally, the clock path needs to match the data path
`to ensure fast and low-power operation. The data path starts
`from the local block select and/or global wordline, and goes
`through the wordline driver, memory cell, and bitline to the
`input of the sense amps. The clock path often starts from the
`local block select or some clock phase, and goes through a
`buffer chain to generate the sense clock. The delay variations
`in the former are dominated by the bitline delay since the
`memory cells are made out of minimum sized devices and are
`more vulnerable to process variations. Therefore, the delays
`of the two paths do not track each other very well over all
`process and environment conditions. Enough delay margin
`has to be provided to the sense clock path for worst case
`conditions, which reduces the average case performance. The
`rest of this paper describes methods of using replica circuits,
`which mimic the delay of the bitline path over all conditions
`to create the clocks, and gives experimental results from using
`these techniques. The next section presents simulation data
`comparing the matching of bitline delay with inverter chain
`delay and replica circuit delay under different operating con-
`ditions. The following two sections describe different methods
`of building replica circuits. Section III presents a clock circuit
`which uses a dummy memory cell that drives bitlines with
`reduced capacitance, and Section IV describes a circuit which
`uses a full bitline load. Results from two prototype chips which
`implement the two different replica techniques are presented
`in Section V.
`
`Manuscript received August 30, 1997; revised March 4, 1998. This work
`was supported by the Advanced Research Projects Agency under Contract
`J-FBI-92-194 and by Fujitsu Ltd.
`The authors are with the Center for Integrated Systems, Stanford University,
`Stanford, CA 94305-4070 USA (e-mail: amrutur@chroma.stanford.edu).
`Publisher Item Identifier S 0018-9200(98)05522-X.
`
`II. CLOCK MATCHING
`technique to generate the timing signals
`The prevalent
`within the array core essentially uses an inverter chain. This
`can take one of two forms—the first kind relies on a clock
`0018–9200/98$10.00 ª
`
`1998 IEEE
`
`Page 1 of 12
`
`SAMSUNG EXHIBIT 1012
`
`

`
`AMRUTUR AND HOROWITZ: REPLICA TECHNIQUE FOR WORDLINE AND SENSE CONTROL
`
`1209
`
`115 C and
`transistors have a
`and
`for 25 C). The
`2-sigma threshold variation unless suffixed by a 3, in which
`case they represent 3-sigma threshold variations. The process
`used is a typical 0.25- m CMOS process, and simulations
`are done for a bitline spanning 64 rows. We can observe that
`the bitline delay to inverter delay ratio can vary by a factor
`of two over these conditions, the primary reason being that,
`while the memory cell delay is mainly affected by the nMOS
`thresholds, the inverter chain delay is affected by both nMOS
`and pMOS thresholds. The worst case matching for the inverter
`delay chain occurs for process corners where the nMOS and
`pMOS thresholds move in the opposite direction. In the above
`simulations, it is assumed that they move independently, while
`in reality, there will be some correlation between them which
`would make the mismatch for the inverter delay chain less
`pronounced, but still worse than that of the replica element.
`The delay element is designed to match the delay of a
`nominal memory cell in a block. But in an actual block of
`cells, there will be variations in the cell currents across the
`cells in the block. Fig. 3 displays the ratio of delays for the
`bitline and the delay elements for varying amounts of threshold
`mismatch in the access device of the memory cell compared
`to the nominal cell. The graph is shown only for the case
`of the accessed cell being weaker than the nominal cell as
`this would result in a lower bitline swing. The curves for the
`inverter chain delay element (hatched) and the replica delay
`element (solid) are shown with error bars for the worst case
`fluctuations across process corners. The variation of the delay
`ratio across process corners in the case of the inverter chain
`delay element is large even with zero offset in the accessed
`cell, and grows further as the offsets increase. In the case of the
`replica delay element, the variation across the process corners
`is negligible at zero offsets, and starts growing with increasing
`offsets in the accessed cell. This is mainly due to the adverse
`impact of the higher nMOS thresholds in the accessed cell
`under slow nMOS conditions. It can be noted that the tracking
`of the replica delay element is better than that of the inverter
`chain delay element across process corners, even with offsets
`in the accessed memory cell.
`There are two more sources of variations that are not
`included in the graphs above and make the inverter matching
`even worse. The minimum sized transistors used in memory
`cells are more vulnerable to delta–
`variations than the
`nonminimum sized devices used typically in the delay chain.
`Furthermore, accurate characterization of the bitline capaci-
`tance is also required to enable a proper delay chain design.
`These two sources of variations would make the matching
`even worse for the inverter chain delay element.
`All of the sources of variations have to be taken into
`account in determining the speed and power specifications for
`the part. To guarantee functionality, the delay chain has to
`be designed for worst case conditions, which means that the
`clock circuit must be padded in the nominal case, degrading
`performance. Replica-based delay elements, by virtue of their
`good tracking, offer the possibility of designing SRAM’s with
`tight specifications across all process corners [22]. Two ways
`of creating and using these replica structures are explained in
`the following sections.
`
`(a)
`
`(b)
`
`Fig. 1. Common sense clock generation techniques.
`
`phase to do the timing [Fig. 1(a)] [20], and the second kind
`uses a delay chain within the accessed block, and is triggered
`by the block select signal [Fig. 1(b)] or a local wordline [21].
`The main problem in these approaches is that the inverter delay
`does not track the delay of the memory cell over all process
`and environment conditions. The tracking issue becomes more
`severe for low-power SRAM’s operating at low voltages due to
`enhanced impact of threshold and supply voltage fluctuations
`on delays as described by
`
`(1)
`
`which shows that delay variations are inversely proportional
`to the gate overdrive. Fig. 2 plots the ratio of bitline delay
`to obtain a bitline swing of 120 mV from a 1.2-V supply and
`the delay of two different delay elements for various operating
`conditions. One delay element is based on an inverter chain
`with a fan-out of four loading (diamonds), and the other is
`based on a replica structure consisting of a replica memory
`cell and a dummy bitline. The process and temperature are
`encoded as
`where
`represents the nMOS type (
`slow,
`fast,
`typical),
`represents the pMOS
`type (one of
`), and
`is the temperature (
`for
`
`Page 2 of 12
`
`

`
`1210
`
`IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 8, AUGUST 1998
`
`Fig. 2. Delay matching between the bitline delay to generate 120 mV and two delay elements, one based on an inverter chain and the other on a
`replica cell–bitline combination.
`
`III. FEEDBACK BASED ON CAPACITANCE RATIOING
`The replica delay stage is made up of a memory cell
`connected to a dummy bitline whose capacitance is set to
`be a fraction of the main bitline capacitance. The value of
`the fraction is determined by the required bitline swing for
`proper sensing. For the clocked voltage sense amplifiers we
`use (Fig. 4), the minimum bitline swing for correct sensing is
`around a tenth of the supply. An extra column in each memory
`block is converted into the dummy column by cutting its bitline
`pair to obtain a segment whose capacitance is the desired
`fraction of the main bitline (Fig. 5). The replica bitline has a
`similar structure to the main bitlines in terms of the wire and
`diode parasitic capacitances. Hence, its capacitance ratio to the
`main bitlines is set purely by the ratio of the geometric lengths
`. The replica memory cell is programmed to always store
`a zero so that, when activated, it discharges the replica bitline.
`The delay from the activation of the replica cell to the 50%
`
`discharge of the replica bitline tracks that of the main bitline
`very well (see Fig. 2—circles). The delays can be made equal
`by fine tuning of the replica bitline height using simulations.
`The replica structure takes up only one additional column per
`block, and hence has very little area overhead.
`The circuits to control
`the sense clock and wordline
`pulsewidths are shown in Fig. 6. The block decoder activates
`). The output of the replica
`the replica delay cell (node
`delay cell is fed to a buffer chain to start the local sensing,
`and is also fed back to the block decoder to reset the block
`select signal. Since the block select pulse is ANDed with the
`global wordline signal to generate the local wordline pulse,
`the latter’s pulsewidth is set by the width of block select
`signal. It is assumed that the block select signal does not
`arrive earlier than the global wordline. The delay of the buffer
`chain to drive the sense clock is compensated by activating
`the replica delay cell with the unbuffered block select signal.
`
`Page 3 of 12
`
`

`
`AMRUTUR AND HOROWITZ: REPLICA TECHNIQUE FOR WORDLINE AND SENSE CONTROL
`
`1211
`
`Fig. 5. Design of the replica bitline column.
`
`Fig. 3. Matching of the bitline delay with the inverter chain delay and the
`replica cell–bitline delay across process fluctuations over varying threshold
`offsets for the accessed memory cell.
`
`Fig. 4. Latch-type sense amplifier.
`
`,
`–
`The delay of the five inverters in the buffer chain,
`is set to match the delay of the four stages,
`, of the
`–
`block select to local wordline path (the sense clock needs to
`be a rising edge). The problem of delay matching has now
`been pushed from having to match bitline and inverter chain
`delay to having to match the delay of one inverter chain to
`a chain of inverters and an AND gate. The latter is easier
`to tackle, especially since the matching for only one pair of
`
`Fig. 6. Control circuits for sense clock activation and wordline pulse control.
`
`edges needs to be done. A simple heuristic for matching the
`delay of a rising edge of the five-long chain,
`–
`, to the
`rising delay of the four-long chain,
`–
`, is to ensure that
`the sums of falling delays in the two chains are equal, as well
`as the sum of rising delays [23] (Fig. 7). The
`chain has
`three rising delays and two falling delays, while the
`chain
`has two rising and falling delays. This simple sizing technique
`ensures that the rising and falling delays in the two chains
`are close to each other, giving good delay tracking between
`the two chains over all process corners. The delay from
`(see Fig. 6) to minimum bitline swing is
`
`Page 4 of 12
`
`

`
`1212
`
`IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 8, AUGUST 1998
`
`TABLE II
`BLOCK PARAMETERS: 64 ROWS, 32 COLUMNS
`
`Fig. 7. Delay matching of two buffer chains.
`
`TABLE I
`BLOCK PARAMETERS: 256 ROWS, 64 COLUMNS
`
`delay.
`and the delay to the senseclock is
`If
`, then the
`equals
`equals
`and
`sense clock fires exactly when the minimum bitline swings
`have developed.
`We next look at two design examples, one for a block of
`256 rows and 64 columns, and the other for a block with 64
`rows and 32 columns. The number of rows in these blocks is
`typical of large and small SRAM’s, respectively. For each
`block, the replica-based implementation is compared to an
`inverter-chain-based one. Table I summarizes the simulated
`performance of the 256 block design over various process
`corners. Five process corners are considered along with a 10%
`supply voltage variation at the
`corner. The delay elements
`are designed to yield a bitline swing of around 100 mV when
`the sense clock fires, under all conditions, with the weakest
`corner being the
`corner with slow nMOS and fast pMOS
`(since the memory cell is weak and inverters are relatively
`faster). For each type of delay element, the table provides the
`bitline swing when the sense clock fires and the maximum
`bitline swing after the wordline is shut off. An additional
`column notes the “slack” time of the sense clock with respect
`to an ideal sense clock as a fraction of a fan-out of 4 gate
`delay. This time indicates the time lost in the turning ON of
`the sense amp compared to an ideal sense clock generator
`
`which would magically fire under all conditions exactly when
`the bitline swings are 100 mV, and directly adds to the critical
`path delay for the SRAM. The last column shows the excess
`swing of the bitlines for the inverter chain case relative to
`the replica case as a percentage and represents the excess
`bitline power for the former over the latter. The last two rows
`show the performance at –10% of the nominal supply of 1.2
`V under typical conditions. Considering all of the rows of
`the table, we note that the slack time for the replica case is
`within 2.4 gate delays, while that of the inverter case goes
`up to 5.25 gate delays, indicating that the latter approach
`will lead to a speed specification which is at least 3 gate
`delays slower than the former. The bitline power overhead in
`the inverter-based approach can be up to 40% more than the
`replica-based approach. If we were to consider only correlated
`threshold fluctuations for the nMOS and the pMOS, then the
`delay spread for both of the approaches is lower by one gate
`delay, but the relative performance difference still exists. The
`main reason for the variation in the slack time for the replica
`approach is the mismatch in the delays of the buffer chains
`across the operating conditions. This comes about mainly due
`to the variation of the falling edge rate of the replica bitline.
`In the case of the inverter-based design, the spread in slack
`time comes about due to the lack of tracking between the
`bitline delay and the inverter chain delay, as discussed in the
`earlier section. To study the scalability of the replica technique,
`designs for a 64-row block are compared in Table II. The small
`bitline delay for short bitlines is easy to match even with
`an inverter chain delay element, and there is only a slight
`advantage for the replica design in terms of delay spread,
`while there is not much difference in the maximum bitline
`swings. The maximum bitline swings are considerably larger
`than the previous case, mainly due to the smaller bitline
`capacitance.
`This technique can be modified for clocked current mirror
`sense amplifiers, where the exact time of arrival of the sense
`clock is not critical as long as it arrives sufficiently ahead to
`set up the amplifiers in the high-gain stage by the time the
`bitline signal starts developing. Delaying the sense clock to
`be as late as safely possible minimizes the amplifier static
`power dissipation. This can be achieved in the above scheme
`
`Page 5 of 12
`
`

`
`AMRUTUR AND HOROWITZ: REPLICA TECHNIQUE FOR WORDLINE AND SENSE CONTROL
`
`1213
`
`Fig. 8. Current-ratio-based replica structure.
`
`chain with respect to
`by merely trimming the delay of the
`the
`chain to ensure that the sense clock turns on a fixed
`number of gate delays before the bitlines differentiate.
`
`IV. FEEDBACK BASED ON CELL-CURRENT RATIOING
`While the above technique works well, it can be modified
`further to improve the access time. If the reset timing signal for
`the wordline can be generated locally, then the wordline driver
`can be skewed to speed up the propagation of the rising block
`select transition, reducing the access time, with the falling
`wordline transition being triggered off the local reset signal,
`similar to the postcharge gates discussed in [24].
`An extra row and column containing replica memory cells
`can be used to provide local resetting timing information
`for the wordline drivers. The extra row contains memory
`cells whose pMOS devices are eliminated to act as current
`sources, with currents equal to that of an accessed memory
`cell (Fig. 8). All of their outputs are tied together, and they
`simultaneously discharge the replica bitline. This enables a
`multiple of memory cell current to discharge the replica bitline.
`The current sources are activated by the replica wordline,
`which is turned on during each access of the block. The replica
`bitline is identical in structure to the main bitlines, with dummy
`memory cells providing the same amount of drain parasitic
`loading as the regular cells. By connecting
`current sources
`to the replica bitline, the replica bitline slew rate can be made
`to be
`times that of the main bitline slew rate, achieving the
`same effect as bitline capacitance ratioing described earlier.
`
`Fig. 9. Skewed wordline driver.
`
`The local wordline drivers are skewed to speed up the
`rising transition, and they are reset by the replica bitline as
`shown in Fig. 9. The replica bitline signal is forwarded into
`the wordline driver through the dummy cell access transistor
`. This occurs only in the activated row since the access
`transistor of the dummy cell is controlled by the row wordline
`, minimizing the impact of the extra loading of
`on the
`replica bitline. The forward path of the wordline driver can
`be optimized for speed, independent of the resetting of the
`block-select or global wordline by skewing the transistor sizes.
`The control circuits to activate the replica bitline and the
`sense clock are shown in Fig. 10. The dummy wordline driver
`
`Page 6 of 12
`
`

`
`1214
`
`IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 8, AUGUST 1998
`
`Fig. 10. Control circuits for current-ratio-based replica structure.
`
`TABLE III
`BLOCK PARAMETERS: 256 ROWS, 64 COLUMNS
`
`. The replica
`is activated by the unbuffered block select
`bitline is detected by
`, and buffered to drive the sense clock
`signal. If the delay of the replica bitline is matched with the
`bitline delay and the delay of
`is made equal to
`, then the sense clock fires when the bitline swing is
`the desired amount. Also, if the delay of
`is equal to
`the delay of generating
`(Fig. 9) from the replica bitline,
`then the wordline pulsewidth will be the minimum needed
`to generate the required bitline swing. The performance for
`a 256-row block with 64 columns implementing this replica
`technique is summarized in Table III. The slack in activating
`the sense clock is less than 1.5 gate delays, and the maximum
`bitline swing is within 170 mV. When compared with the
`implementation based on capacitance ratioing discussed in
`the previous section,
`this design is faster by about
`two-
`thirds of a gate delay due to the skewing of the wordline
`driver.
`
`Fig. 11. Layout of a block with replica cell and column.
`
`The power dissipation overhead of this approach is the
`switching power for the replica bitline which has the same
`capacitance as the main bitline. The power overhead becomes
`small for large access width SRAM’s. The area overhead
`consists of one extra column and row, and the extra area
`required for the layout of the skewed local wordline drivers.
`
`V. MEASURED RESULTS
`
`A. SRAM Test Chip with Replica Feedback
`Based on Capacitance Ratioing
`The replica feedback technique based on capacitive ratioing
`was implemented in a 1.2- m process as part of a 2K
`32 SRAM [25]. The SRAM array was partitioned into eight
`blocks, each with 256 rows and 32 columns. Making the
`block width equal to the access width ensures that the bitline
`power is minimized since only the desired number of bitline
`columns swing in any access. Two extra columns near the
`wordline drivers and two extra rows near the sense amps are
`provided for each block, with the replica column being the
`second one from the wordline drivers and the replica cell being
`the second cell in the column (Fig. 11). The extra row and
`column surrounding the replica cell contain dummy cells for
`padding purposes, so that the replica cell does not suffer from
`any processing variations at the array edge. The bitlines in
`the replica column are cut at a height of 26 rows, yielding
`a capacitance for the replica bitline which is one-tenth that
`of the main bitline. Further control for the replica delay is
`provided for testing purposes by utilizing part of the replica
`structure above the cut to provide an adjustable current source.
`This consists of a pMOS transistor whose drain is tied to the
`replica bitline and whose gate is controlled from outside the
`block. Since the replica bitline runs only part way along the
`column, the bitline wiring track in the rest of the column can
`be used for running the gate control for this transistor. By
`varying the gate voltage, we can subtract from the replica cell
`
`Page 7 of 12
`
`

`
`AMRUTUR AND HOROWITZ: REPLICA TECHNIQUE FOR WORDLINE AND SENSE CONTROL
`
`1215
`
`Fig. 14. Level-to-pulse converter.
`
`Fig. 12. Column IO circuits.
`
`Fig. 15. Global IO circuits.
`
`access width requires that all of this peripheral circuitry be
`laid out in a bit pitch. The precharge and write control are
`locally derived from the block select, write enable, and the
`replica feedback signals.
`An 8- to 256-row decoder is implemented in three stages
`of two-input AND gates. The transistor sizes in the gates are
`skewed to favor one transition as described in [24] to speed
`up the clock to wordline rising delay. The input and output
`signals for the gates are in the form of pulses. In that design,
`the gates are activated by only one edge of the input pulse, and
`the resetting is done by a chain of inverters to create a pulse
`at the output. While this approach can lead to a cycle time
`faster than the access time, careful attention has to be paid to
`ensure sufficient overlap of pulses under all conditions. This
`is not a problem for pulses within the row decoder as all of
`the internal convergent paths are symmetric. But for the local
`wordline driver inputs, the global wordlines (the outputs of
`the row decoder) and the block select signal (output of the
`block decoder) converge from two different paths. To ensure
`
`Fig. 13. Pulsed global wordline driver.
`
`pulldown current, thus slowing the replica delay element if
`needed. The inverters
`and
`(Fig. 6) are laid close
`to the replica cell and to each other to minimize wire loading.
`Clocked voltage sense amplifiers, precharge devices and
`write buffers are all laid on one side of the block as shown in
`Fig. 12. Since the bitline swings during a read are significantly
`less than that for writes, the precharge devices are partitioned
`in two different groups, with one activated after every access
`and the other activated only after a write, to reduce power in
`driving the precharge signal. Having the block width equal to
`
`Page 8 of 12
`
`

`
`1216
`
`IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 8, AUGUST 1998
`
`TABLE IV
`MEASUREMENT DATA FOR THE 1.2-m PROTOTYPE CHIP
`
`Fig. 16. Die photo of a 1.2-m prototype chip.
`
`Fig. 18. On-chip probed databus waveforms.
`
`Fig. 17. On-chip probed bitline waveform.
`
`sufficient margins for their overlap, the global wordline signal
`is prolonged by the modifying the global wordline driver
`as shown in Fig. 13. Here, the reset signal is generated by
`ANDing the gate’s output and one of the address inputs to
`increase the output pulsewidth. Clearly, for the gate to have
`any speed advantage over a static gate, the extra loading of
`the inverter on one of the inputs must be negligibly small
`compared to the pMOS devices in a full NAND gate. The
`input pulses to the row decoder are created from level address
`inputs by a “chopper” circuit shown in Fig. 14, which could
`potentially be merged with a 2-bit decode function.
`The data bus in the SRAM, which connects the blocks to the
`IO ports, is implemented as a low-swing, shared differential
`
`Fig. 19. Die photo of a prototype chip in 0.25-m technology.
`
`bus. The voltage swings for reads and writes are limited
`by using a pulsing technique similar to that described in
`Section IV. During reads, the bitline data are amplified by
`the sense amps and transferred onto the local data bus through
`devices
`and
`(see Fig. 12). The swing on the local data
`bus is limited by pulsing the sense clock. The sense clock
`pulse width is set by a delay element consisting of stacked
`
`Page 9 of 12
`
`

`
`AMRUTUR AND HOROWITZ: REPLICA TECHNIQUE FOR WORDLINE AND SENSE CONTROL
`
`1217
`
`Fig. 20. Architecture of the 0.25-m prototype SRAM.
`
`pulldown devices connected to a global line which mimics
`the capacitance of the databus. The data bus mimic line also
`serves as a timing signal for the global sense amps at the end
`of the data bus, and has the same capacitance as that of the
`worst case data bit. The global sense amps, write drivers, and
`the data bus precharge are shown in Fig. 15. The same bus
`is also used to communicate the write data from the IO ports
`to the memory blocks during writes. The write data are gated
`with a pulse, whose width is controlled to create low swings
`on the databus. The low-swing write data are then amplified
`by the write amplifiers in the selected block and driven onto
`the bitlines (Fig. 12).
`Fig. 16 displays the die photo of the chip. Only two blocks
`are implemented in the prototype chip to reduce the test-
`chip die area, but extra wire loading on some of the global
`wordlines and IO lines is provided to emulate the loading in
`a full SRAM chip. Accesses to these rows and bits are used
`to measure the power and speed. The bitline waveforms were
`probed for different supply voltages to measure the bitline
`swing. Fig. 17 displays the on-chip measured waveform of a
`bitline pair at 3.5 V supply. The bitline swings are limited
`to be about 11% of the supply at 3.5 V. Table IV gives the
`measured speed, power, bitline swing and IO line swing for
`the chip at three different supply voltages of 1.5, 2.5, and 3.5
`V. The on-chip probed waveforms for the databus are shown
`in Fig. 18 for a consecutive read and write operation. The IO
`bus swings for writes are limited to 11% of the supply at 3.5
`V, but are 19% of the supply for reads. The rather large read
`swings are due to improper matching of the sense-amp mimic
`circuit with the read sense amps.
`
`B. SRAM Test Chip with Replica Feedback
`Based on Cell-Current Ratioing
`A cell-current ratioing-based replica technique was imple-
`mented in a 0.35- m (0.25- m drawn gate length) process
`from Texas Instruments. A 2-kbyte RAM is partitioned into
`two blocks of 256 rows by 32 columns. Two extra rows are
`added to the top of each block, with the topmost row serving
`as a padding for the array and the next row containing the
`replica cells.
`The prototype chip (Fig. 19) has two blocks, each 256
`32 bits, to form a 8-kbit memory. The premise in
`rows
`this chip was that the delay relationship between the global
`wordline and the block select is unknown, i.e., either one of
`them could be the late arriving signal, and hence the local
`wordline could be triggered by either. This implies that the
`replica bitline path too needs to be activated by either the
`local block select or a replica of the global wordline. The
`replica of the global wordline is created by mirroring the
`critical path of the row decoders as shown in Fig. 20. This
`involves creating replica predecoders, global word drivers,
`replica predecode, and global wordlines with loading identical
`to the main predecode and global wordlines [26]. The replica
`global wordline and the block select signal in each block
`generate the replica wordline which discharges the replica
`bitline as described in the previous section. Thus, the delay
`from the address inputs to the bitline is mirrored in the delay
`to the replica bitlines. But the additional delay of buffering
`this replica bitline to generate the sense-amp signal cannot
`be cancelled in this approach. Hence, the only alternative to
`
`Page 10 of 12
`
`

`
`1218
`
`IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 8, AUGUST 1998
`
`16 kbytes will dissipate 7 mW at 150 MHz at 1 V and 10.4
`uW at 980 kHz at 0.4 V.
`
`VI. SUMMARY
`We presented a replica-based delay circuit which is used to
`generate the sense clock and control wordline pulse widths to
`limit bitline swings in low-power SRAM’s. The circuit helps
`to minimize the variations in the speed and power of SRAM’s
`across varying operating conditions, and is especially suitable
`for SRAM’s with large bitline loads. Two flavors of the circuit,
`one which uses bitline capacitance ratioing and the other which
`uses cell current ratioing, were discussed. The former yields an
`implementation with the smallest area overhead and no other
`changes required in the rest of the SRAM circuitry. The latter
`allows for implementing skewed wordline driver designs to
`achieve a slightly faster operation. Prototype chips in 1.2 and
`0.35 m with both of the implementations were measured to
`have a wide operating range, with deep subvolt operation for
`the 0.35- m implementation.
`
`ACKNOWLEDGMENT
`The authors thank C.-K. K. Yang and D. Weinlader for help
`with the design and testing of the chips, C. Lemonds and
`Texas Instruments for providing the opportunity to fabricate
`the second chip, Rambus Inc. for providing access to their
`lasers, and T. Mori of Fujitsu Ltd. for helpful discussions.
`
`REFERENCES
`
`[1] A. P. Chandrakasan et al., “Low-power CMOS digital design,” IEEE J.
`Solid-State Circuits, vol. 27, pp. 473–484, Apr. 1992.
`[2] W. Lee et al., “A 1 V DSP for wireless commun

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket