`Embedded memory technologies vary widely, ranging from small blocks of
`ROMs, hundreds of kilobytes of cache RAMs, and high-density (several
`megabits) DRAMs to small- to medium-density nonvolatile memory blocks of
`EEPROMs and flash memories. For memories embedded in a logic process, the
`most important figure of merit is compatibility with that process. In
`general, embedded ROM used for microcode storage has the highest
`compatibility with the logic process; however, its application is rather
`limited. Programmable logic array (PLA) or ROM-based logic is widely used, but
`it is considered a special case of embedded ROM [1].
`Embedded SRAM is one of the most frequently used memories embedded in
`logic chips; typical applications include on-chip buffers, caches, register
`files, and so on. The standard six-transistor (6T) SRAM cell is also fairly
`compatible with a logic process, unless special structures are involved, but
`its bit density is not very high. Polysilicon resistor load (4T) cells provide higher
`bit density, but at the cost of process complexity associated with the additional
`polysilicon-layer resistors. Embedded DRAM (eDRAM), which provides high
`density, is also becoming quite popular in combination with RISC
`processors and other peripheral circuitry in system-on-chip (SOC) types of
`applications for graphics accelerators and multimedia chips. Embedded
`EPROM, EEPROM, and flash memory technologies require two to three or more
`masking steps in addition to the standard logic process. These are finding
`applications in microcontrollers, field programmable gate arrays (FPGAs), and
`complex programmable logic devices (CPLDs).
`In a typical logic VLSI design, the on-chip memory integrated with other
`circuitry may include anything from simple registers to caches of several
`megabits. In special applications such as microprocessors, memory can
`occupy more than 50% of the chip area. A processor in search of data or
`instructions looks first in the first-level (L1) cache memory, which is closest to
`the processor, and if the information is not found there, the request is passed
`on to the second-level cache (L2). The integration of on-chip L1 improves both
`the processor performance and bandwidth. The SRAM cells used in the L1
`cache are usually larger than the ones for commodity SRAMs and, for highest
`performance, are fabricated with the same process as the processor logic. Thus,
`an L1 cache tends to be faster than the L2 and L3 (the off-chip caches) and
`often utilizes special SRAM processes that minimize cell area.
`The major goal of cache memory design is to minimize the miss rate, that
`is, the probability that immediately needed bits of information are not available
`in the nearest level of cache memory and have to be fetched from the higher
`levels of caches or even the main memory. As processor performance has
`increased, the time spent waiting in idle memory cycles has also increased, which has
`led to the so-called memory-processor performance gap, as illustrated in Figure
`6.1. Therefore, the ability to integrate large memory close to the processor
`(possibly on-chip) helps remove some of the constraints of slow memory
`access, allowing the more conventional pin-limited bus widths to be increased to
`16-256 bits, or even as high as 1024 bits [2].
`Another trend that drives the integration of DRAMs into logic chips is the
`memory granularity, which refers to the smallest increment by which a memory
`size may be increased. For applications that require less than 2 Mb of memory,
`an embedded SRAM would probably be more cost effective and should be
`considered first. In the 0.18-µm process generation, it is expected that the
`embedded DRAM solution would become cost effective at above approximately
`2-Mb density. The embedding of DRAM not only reduces power by
`eliminating the need for off-chip drivers, but also allows for more active power
`management systems.
`Figure 6.1 Illustration of performance levels of processor versus DRAM, over several
`generations [2].
`In a DRAM manufacturing process, tight lithography and structural
`innovations are combined to fabricate very dense cell arrays, and the focus is on the
`minimization of cell area. Cell structure innovations include buried trench
`capacitor or stacked capacitor structures. Furthermore, in DRAM technologies,
`low leakage current levels, covering both the transistor off-current and the
`junction leakage, are crucial, because the goal is to minimize the loss
`of stored charge (that is, to increase its retention time). Typically, the low off-current levels are
`attained by using relatively high threshold voltages and longer channel lengths.
`In fact, unlike in a logic process, channel lengths are significantly longer than the
`minimum dimension (e.g., in a 0.25-µm process, the designed gate lengths may
`be as long as 0.35-0.40 µm). Therefore, when the longer channels are combined
`with the rather high threshold voltages, the device on-currents are low
`compared to those of devices fabricated in a same-generation standard
`logic process.
`In comparison to the DRAM process, logic technologies use gate-level
`minimum dimensions to drive short channel lengths to optimize the circuit timing
`and minimize the timing skew from channel length variations. The DRAM
`processes prefer a single work-function gate material (usually n-type) for
`submicron technologies, so that the p-channel FET operates as a
`buried-channel device. The gate of a buried-channel pFET device is more loosely coupled to
`the channel it controls, and the performance is poorer than that of the
`surface-channel device. The logic technology process runs dual work-function gates:
`n-poly gates for the nFETs and p-poly gates for the pFETs, to get the highest
`performance at the cost of complexity.
`In a given DRAM technology, the devices have thicker gate dielectric than
`would be consistent with a constant field scaling. For example, a 0.25-µm
`DRAM process would typically have about 8-nm gate oxide, whereas 0.25-µm
`logic process typically uses a 5-nm gate oxide to ensure maximum performance
`at 2.5 V. In a DRAM process, to minimize cost, all devices (and not just the
`transfer gates that are stressed with the boosted gate voltage) are fabricated
`with the thicker gate oxide, even though it provides suboptimal performance.
`The logic circuits often have self-aligned silicides (called salicides, in contrast
`to the polycide process) in which the gate and diffusion area are silicided
`simultaneously. However, in DRAM technology, the silicided diffusions are
`avoided because they can increase the junction leakage, even though the gate
`conductor usually has a silicide that is put on before the gate is defined.
`The density of a DRAM chip is mostly dependent on the cell size or array
`density, and its support circuits rarely push density limits and are less
`strictly controlled. Thus, the performance compromises made by commodity
`DRAM processes to reduce cost and increase yield result in a significant gap
`in performance between the logic and DRAM technologies. A simple logic
`circuit fabricated in a logic technology can be twice as fast as a comparable
`circuit fabricated in a high-yield, commodity DRAM technology. As a result
`of the tradeoffs, while the DRAM technologies are capable of dense features,
`the devices have lower drive capability. Thus, to drive the next stages, the
`devices have to be made larger (i.e., wider), and the indirect consequence is that
`despite the high circuit density of a DRAM process, it lags about one-half
`to one generation behind the logic technology process. The logic technology
`may have up to seven or eight levels of high-performance wiring (interconnect)
`as compared with two or three for the DRAMs, and it requires an
`extremely planar surface at every level.
`These relative tradeoffs between the two technologies have spawned
`arguments regarding which technology should be preferred for the embedded
`DRAMs: fabricating memory in a logic-based technology versus fabricating
`logic circuits in a DRAM-based technology.
`An ASIC with embedded DRAM costs about 40% more than the same size
`die with pure logic, and combining the two produces logic that is 10-30%
`slower than the pure logic in the same process geometry. Also, embedded
`DRAM requires internal built-in self-test (BIST) structures and additional
`testing, which can increase the time to market. Due to these disadvantages, the
`embedded DRAM approach may not be appropriate for every design. The first
`applications to use embedded DRAM technology were graphics, networking, and data
`storage. Graphics controllers use embedded DRAM to boost
`throughput, network controllers are replacing embedded SRAM with large
`DRAM buffers, and disk drive controllers are embedding DRAM because
`discrete devices take up too much board area. A number of traditional memory
`manufacturers, such as Mitsubishi, Toshiba, and Samsung, offer
`embedded DRAM technology for ASICs. Also, many logic process vendors
`supply ASICs with embedded memory macros and intellectual property (IP).
`A major advantage of embedding DRAM is the much lower power
`consumption, because it eliminates the need for power-consuming driver
`circuits and allows the circuits to operate at the chip's internal voltage, which is
`typically below 2 V. Embedding DRAM improves system performance,
`because the signals between CPU and memory no longer suffer the delays from
`the drivers and additional peripheral-chip (or PCB) routing. However, the
`major performance boost comes when the system is adapted to take full
`advantage of the embedded DRAM capabilities. For example, by widening the
`interface between the CPU and the DRAM array from the typical 64 bits to
`256 bits, a 100-MHz DRAM array can provide a memory bandwidth of about 3.2
`Gbytes/s. Such wide interfaces are not practical with discrete designs because
`the package for the ASIC would be prohibitively expensive. The embedded
`array has no such limitation.
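The bandwidth figure quoted above follows from simple arithmetic. The sketch below is illustrative only: it assumes one full-width transfer per clock cycle and ignores refresh, precharge, and bus turnaround overheads; the function name is hypothetical and not taken from the source.

```python
def peak_bandwidth_bytes_per_s(bus_width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth, assuming one full-width transfer per clock cycle."""
    return bus_width_bits / 8 * clock_hz


# Interface widths and clock rate taken from the paragraph above.
wide = peak_bandwidth_bytes_per_s(256, 100e6)
narrow = peak_bandwidth_bytes_per_s(64, 100e6)

print(f"256-bit interface at 100 MHz: {wide / 1e9:.1f} Gbytes/s")
print(f" 64-bit interface at 100 MHz: {narrow / 1e9:.1f} Gbytes/s")
```

Under these assumptions the 256-bit interface delivers four times the peak bandwidth of the 64-bit one at the same clock rate, which is the gain the text attributes to widening the on-chip bus.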
`Although DRAM is more complicated to work with than any other
`available memory, it is popular as a discrete implementation because it is the
`most compact memory technology available, and therefore the cheapest. The
`size difference between a DRAM and an SRAM cell can be nearly tenfold. In a
`0.25-µm process, a DRAM memory cell occupies 0.6-1.0 µm², whereas the
`SRAM cell occupies 5-9 µm². Therefore, an embedded DRAM can allow an
`ASIC to contain as much as 128 Mb in a 0.25-µm process.
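To see why the cell-size gap matters at 128 Mb, the rough estimate below multiplies the bit count by the cell areas quoted above. It is a sketch under stated assumptions: 1 Mb is taken as 2**20 bits, and only raw cell area is counted (sense amplifiers, decoders, and redundancy are ignored, so real arrays would be larger). The function name is hypothetical.

```python
BITS_PER_MB = 2**20  # assumption: 1 Mb = 2**20 bits


def raw_cell_area_mm2(capacity_mb: float, cell_area_um2: float) -> float:
    """Raw cell-array area in mm^2 (cells only, no peripheral circuitry)."""
    return capacity_mb * BITS_PER_MB * cell_area_um2 / 1e6  # 1 mm^2 = 1e6 um^2


# Cell areas for a 0.25-um process, taken from the paragraph above.
for label, cell_um2 in [("DRAM, 0.6 um^2", 0.6), ("DRAM, 1.0 um^2", 1.0),
                        ("SRAM, 5 um^2", 5.0), ("SRAM, 9 um^2", 9.0)]:
    print(f"128 Mb, {label}: {raw_cell_area_mm2(128, cell_um2):.0f} mm^2")
```

With these numbers the DRAM cells alone occupy on the order of 100 mm², whereas the same capacity in SRAM cells would need several hundred mm² or more, which is why 128 Mb is plausible only with embedded DRAM.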
`Another advantage is noise reduction. The interconnect between the
`processor and the memory carries some of the highest-frequency signals that generate
`electrical noise, which is increasingly difficult to control as the system clock
`speeds increase. The fact that a memory bus contains many high-speed signals
`routed together makes the problem worse. Therefore, bringing the signals
`inside the ASIC by using the embedded DRAM approach reduces the difficulty of
`controlling electrical noise on the board. Figure 6.2 shows (a) an on-chip
`DRAM that eliminates both the PCB traces and two sets of I/O drivers
`required and (b) an ASIC plus a discrete DRAM design approach [3]. The
`embedding of DRAM can also improve the granularity of the ASIC itself,
`because it allows the system designers to select an optimum size for their
`system memory without any waste.
`The embedded DRAM can reduce the system engineering effort and
`therefore, possibly, the time to market. While the discrete DRAM has many
`access modes, embedded DRAM has only one basic mode, which eliminates
`the need for extensive architectural analysis to determine which type of DRAM
`provides the best system performance. Despite these savings, there are some
`drawbacks also, such as the process costs, technical risks, and additional test
`requirements for the embedded memory approach. Although embedded
`DRAM lowers some of the system costs, the cost of the ASIC is much greater
`and may override the system savings. An embedded DRAM process costs
`nearly 40% more to run than a logic process, because it requires extra
`production steps (i.e., more masks). As a result, an ASIC with on-chip DRAM
`is almost always more expensive than a pure logic ASIC combined with a
`discrete DRAM approach.
`Figure 6.2 A comparison of (a) ASIC with embedded DRAM and (b) ASIC with PCB
`connection to a discrete DRAM [3].
`Memory testing has several components that differ from logic
`testing and needs specialized tests, such as functional test patterns to detect
`memory array pattern-sensitivity faults and data retention
`measurements under worst-case refresh timings and operating temperature extremes.
`While these tests are routinely performed by the commodity DRAM
`manufacturers, the ASIC manufacturers may not be equipped to perform this complex
`testing on the embedded DRAMs. Therefore, the embedded DRAMs in a logic
`process may require a different approach such as the design for testability
`(DFT) and built-in self-test (BIST) techniques. These DFT and BIST
`techniques for embedded memories were discussed in Semiconductor Memories,
`Chapter 5. The addition of testability requirements to embedded memories
`may increase the time to market, as well as the system cost.
`A key advantage of the embedded memory approach is the higher
`packaging density and board space saving, which is a very desirable feature for
`notebook computers, mobile computing, and portable communication devices.
`In the conventional multichip memory approach, interconnections require
`large I/O buffers to overcome package and board-level trace impedances. The
`resultant increased power means limited battery life and often reduced
`reliability. It is estimated that a graphics controller with embedded DRAM consumes
`roughly 500-750 mW, which is roughly 25% of the power of its multichip
`alternative, consuming approximately 2.5 W.
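As a quick check on the "roughly 25%" figure, the sketch below divides the quoted embedded-DRAM power range by the quoted multichip power. The numbers come from the paragraph above; the function and variable names are hypothetical.

```python
def power_fraction(embedded_mw: float, multichip_mw: float) -> float:
    """Embedded-solution power as a fraction of the multichip solution's power."""
    return embedded_mw / multichip_mw


MULTICHIP_MW = 2500.0  # approximately 2.5 W, from the text

low = power_fraction(500.0, MULTICHIP_MW)
high = power_fraction(750.0, MULTICHIP_MW)

print(f"Embedded DRAM uses {low:.0%} to {high:.0%} of the multichip power")
```

The 500-750 mW range brackets the quoted 25% figure, corresponding to 20% at the low end and 30% at the high end.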
`The transistor gate oxide thicknesses differ for the logic and memory cell
`processes. The standard logic transistors have thin oxides with low turn-on
`threshold voltages to minimize switching time and maximize performance. In
`contrast, many memory processes require thicker gate oxide transistors that
`have high turn-on threshold voltages to minimize off-transistor leakage
`current, which is a prime determinant of the required DRAM refresh frequency.
`Also, thicker oxides improve data retention characteristics for the EPROM,
`EEPROM, and flash memory, because they can help memory cells withstand
`the effects of high voltages and corresponding electrical field stresses during
`repeated programming and erase cycles.
`For its design processes larger than 0.25 µm, Mitsubishi uses normal thin
`gate oxide logic transistors for fabricating the DRAM memory cells as well,
`depending on the lower operating voltages to reduce leakage current. However,
`for processes at 0.25 µm and below, Mitsubishi's HyperDRAM process uses
`dual oxide thicknesses and adds three processing steps. Standard logic and
`HyperDRAM share the same metal pitch and therefore the same logic-layout
`libraries but have different logic timing. Mitsubishi's triple-well approach
`isolates the DRAM substrate from the bias and injected noise originating in
`the logic and standard SRAM circuits and also contains any required DRAM
`substrate bias (see Figure 6.3) [4]. Embedded memory density depends on the
`internal bus width, including optional parity support, array aspect ratio, the
`number of memory macros, and logic gate count.
`Figure 6.3 Mitsubishi embedded DRAM triple-well process cross section [4].
`In addition to providing embedded DRAM for ASIC applications,
`Mitsubishi leverages its eDRAM processes to build the 3-D RAM graphic chip,
`which combines 1.25 Mbytes of DRAM with a high-performance ALU, and the
`M32R/D, which combines a 32-bit RISC CPU with 2 Mbytes of DRAM. Mitsubishi
`has ported its 32-bit M32R CPU not only to the embedded DRAM with
`M32R/D, but also to the flash memory with the M32R/E family.
`Another example is Samsung Semiconductor, which supplies merged
`DRAM and logic on MDL90, a 3.3-V, 0.35-µm process with three-layer or
`four-layer metal. This process is the same one that the company uses for
`its 500-MHz Alpha 21164 CPU. It supports as much as 24 Mbits of extended
`data-out (EDO) DRAM or SDRAM. A four-well approach ensures isolation
`between the logic and DRAM subsections. There are other vendors offering
`embedded flash memory fabricated on an EEPROM process, such as Atmel,
`Hyundai, Lucent Technologies, Motorola, and Texas Instruments. EEPROM
`variants, although they have more complex cell structures than the NOR
`alternatives, offer efficient programming and erasing that minimizes the
`required size of on-chip charge pumps for scaling down to low-voltage operation,
`and, unlike the NAND flash memory, are appropriate for both code and data
`storage. Some companies are also planning to introduce FRAM as part of their
`embedded memory portfolio. However, FRAM has a more complex
`manufacturing process because of its specialized capacitor-dielectric material structure.
`The selection of embedded memory approach as compared to the discrete
`DRAM-based memory systems has advantages as well as disadvantages in
`three areas of consideration, as follows:
`1. On-Chip Memory Interface
`• The replacement of the off-chip drivers with smaller on-chip drivers can
`reduce power consumption significantly, because large board wire
`capacitive loads are avoided. For example, a system that needs a 4-Gbyte/s
`bandwidth and a bus width of 256 bits, built with discrete SDRAMs
`(16-bit interface at 100 MHz), would require about 10 times the power of
`an eDRAM with an internal 256-bit interface. However, even though the
`use of eDRAM may reduce overall power consumption of the system, the
`power consumption per chip may increase, and therefore the junction
`temperature may increase, which can affect the DRAM retention time.
`• Embedded DRAMs can achieve much higher clock frequencies than the
`discrete SDRAMs, because the chip interface can be 512 bits wide (or even
`higher) as compared to the discrete SDRAMs that are limited to 16-64
`bits. It is possible to make a 4-Mb eDRAM with a 256-bit interface,
`whereas it would require 16 discrete 4-Mb chips (organized as 256K x 16)
`to achieve the same width, and the granularity of such a discrete system
`is 64 Mb (overcapacity for an application that needs, say, 4 Mb of
`• In eDRAMs, interconnect wire lengths can be optimized for a given
`application, which can result in lower propagation delays and highe

