`
Design Tradeoffs for SSD Performance

Nitin Agrawal∗, Vijayan Prabhakaran, Ted Wobber,
`John D. Davis, Mark Manasse, Rina Panigrahy
`Microsoft Research, Silicon Valley
`∗University of Wisconsin-Madison
`
`Abstract
`Solid-state disks (SSDs) have the potential to revolution-
`ize the storage system landscape. However, there is little
`published work about their internal organization or the
`design choices that SSD manufacturers face in pursuit of
`optimal performance. This paper presents a taxonomy of
`such design choices and analyzes the likely performance
`of various configurations using a trace-driven simulator
`and workload traces extracted from real systems. We find
that SSD performance and lifetime are highly workload-
`sensitive, and that complex systems problems that nor-
`mally appear higher in the storage stack, or even in dis-
`tributed systems, are relevant to device firmware.
`
`1 Introduction
`
`The advent of the NAND-flash based solid-state stor-
`age device (SSD) is certain to represent a sea change in
`the architecture of computer storage subsystems. These
`devices are capable of producing not only exceptional
`bandwidth, but also random I/O performance that is
`orders of magnitude better than that of rotating disks.
`Moreover, SSDs offer both a significant savings in power
`budget and an absence of moving parts, improving sys-
`tem reliability.
`Although solid-state disks cost significantly more per
`unit capacity than their rotating counterparts, there are
`numerous applications where they can be applied to great
`benefit. For example, in transaction-processing systems,
`disk capacity is often wasted in order to improve oper-
ation throughput. In such configurations, many small
(cost-inefficient) rotating disks are deployed to increase
`I/O parallelism. Large SSDs, suitably optimized for ran-
`dom read and write performance, could effectively re-
`place whole farms of slow, rotating disks. At this writ-
`ing, small SSDs are starting to appear in laptop comput-
`ers because of their reduced power-profile and reliability
`in portable environments. As the cost of flash continues
`to decline, the potential application space for solid-state
`disks will certainly continue to grow.
`Despite the promise that SSDs hold, there is little in
`the literature about the architectural tradeoffs inherent in
`
`their design. Where such knowledge exists, it typically
`remains the intellectual property of SSD manufacturers.
`As a consequence, it is difficult to understand the archi-
`tecture of a given device, and harder still to interpret its
`performance characteristics.
`In this paper, we lay out a range of design tradeoffs
`that are relevant to NAND-flash solid-state storage. We
`then analyze several of these tradeoffs using a trace-
`based disk simulator that we have customized to char-
`acterize different SSD organizations. Since we can only
`speculate about the detailed internals of existing SSDs,
`we base our simulator on the specified properties of
`NAND-flash chips. Our analysis is driven by various
`traces captured from running systems such as a full-scale
`TPC-C benchmark, an Exchange server workload, and
`various standard file system benchmarks.
`We find that many of the issues that arise in SSD
`design appear to mimic problems that have previously
appeared higher in the storage stack. In solving these
`hard problems, there is considerable latitude for design
`choice. We show that the following systems issues are
`relevant to SSD performance:
`
`• Data placement. Careful placement of data across
`the chips of an SSD is critical not only to provide
`load balancing, but to effect wear-leveling.
`
`• Parallelism. The bandwidth and operation rate of
any given flash chip are not sufficient to achieve op-
`timal performance. Hence, memory components
`must be coordinated so as to operate in parallel.
`
`• Write ordering. The properties of NAND flash
`present hard problems to the SSD designer. Small,
`randomly-ordered writes are especially tricky.
`
`• Workload management. Performance is highly
`workload-dependent. For example, design deci-
`sions that produce good performance under sequen-
`tial workloads may not benefit workloads that are
`not sequential, and vice versa.
`
`As SSDs increase in complexity, existing disk models
will become insufficient for predicting performance. In
particular, random write performance and disk lifetime
`will vary significantly due to the locality of disk write
`operations. We introduce a new model for characterizing
`this behavior based on cleaning efficiency and suggest a
`new wear-leveling algorithm for extending SSD lifetime.
`The remainder of this paper is organized as follows. In
`the next section, we provide background on the proper-
`ties of NAND-flash memory. Section 3 describes the ba-
`sic functionality that SSD designers must provide and the
`major challenges in implementing these devices. Sec-
`tion 4 describes our simulation environment and presents
`an evaluation of the various design choices. Section 5
`provides a discussion of SSD wear-leveling and gives
`preliminary simulator results on this topic. Related work
`is discussed in Section 6, and Section 7 concludes.
`
`2 Background
`
`Our discussion of flash memory is based on the latest
`product specifications for Samsung’s K9XXG08UXM
`series NAND-flash part [29]. Other vendors such as
`Micron and Hynix offer products with similar features.
`For the remainder of this paper, we treat the 4GB Sam-
`sung part as a canonical exemplar, although the specifics
`of other vendors’ parts will differ in some respects.
`We present the specifications for single-level cell (SLC)
`flash. Multi-level cell (MLC) flash is cheaper than SLC,
`but has inferior performance and lifetime.
`Figure 1 shows a schematic for a flash package. A
flash package is composed of one or more dies (also
called chips). We describe a 4GB flash package consist-
`ing of two 2GB dies, sharing an 8-bit serial I/O bus and
`a number of common control signals. The two dies have
`separate chip enable and ready/busy signals. Thus, one
`of the dies can accept commands and data while the other
`is carrying out another operation. The package also sup-
`ports interleaved operations between the two dies.
`Each die within a package contains 8192 blocks, orga-
`nized among 4 planes of 2048 blocks. The dies can oper-
`ate independently, each performing operations involving
`one or two planes. Two-plane commands can be exe-
`cuted on either plane-pairs 0 & 1 or 2 & 3, but not across
`other combinations. Each block in turn consists of 64
4KB pages. In addition to data, each page includes a 128-
byte region to store metadata (identification and error-
`detection information). Table 1 presents the operational
`attributes of the Samsung 4GB flash memory.
`
`2.1 Properties of Flash Memory
`
Data reads are at the granularity of flash pages; a typ-
ical read operation takes 25µs to read a page from the
media into a 4KB data register, from which the data is
subsequently shifted out over the data bus. The serial
line transfers data at 25ns per byte, or roughly 100µs
per page.

Page Read to Register                  25µs
Page Program (Write) from Register     200µs
Block Erase                            1.5ms
Serial Access to Register (Data bus)   100µs
Die Size                               2 GB
Block Size                             256 KB
Page Size                              4 KB
Data Register                          4 KB
Planes per die                         4
Dies per package (2GB/4GB/8GB)         1, 2, or 4
Program/Erase Cycles                   100 K

Table 1: Operational flash parameters

Flash
`media blocks must be erased before they can be reused
`for new data. An erase operation takes 1.5ms, which is
`considerably more expensive than a read or write opera-
`tion. In addition, each block can be erased only a finite
`number of times before becoming unusable. This limit,
`100K erase cycles for current generation flash, places a
`premium on careful block reuse.
`Writing (or programming) is also done at page granu-
`larity by shifting data into the data register (100µs) and
`then writing it out to the flash cell (200µs). Pages must be
`written out sequentially within a block, from low to high
`addresses. The part also provides a specialized copy-
`back program operation from one page to another, im-
`proving performance by avoiding the need to transport
`data through the serial line to an external buffer.
`In this paper, we discuss a 2 x 2GB flash package, but
`extensions to larger dies and/or packages with more dies
`are straightforward.
`
`2.2 Bandwidth and Interleaving
`
`The serial interface over which flash packages receive
`commands and transmit data is a primary bottleneck
`for SSD performance. The Samsung part takes roughly
`100µs to transfer a 4KB page from the on-chip register to
`an off-chip controller. This dwarfs the 25µs required to
`move data into the register from the NAND cells. When
`these two operations are taken in series, a flash pack-
`age can only produce 8000 page reads per second (32
`MB/sec). If interleaving is employed within a die, the
`maximum read bandwidth from a single part improves
`to 10000 reads per second (40 MB/sec). Writes, on the
`other hand, require the same 100µs serial transfer time
`per page as reads, but 200µs programming time. With-
`out interleaving, this gives a maximum, single-part write
`rate of 3330 pages per second (13 MB/sec). Interleaving
`the serial transfer time and the program operation dou-
`bles the overall bandwidth. In theory, because there are
two independent dies in the packages we are consider-
ing, we can interleave three operations across the two
dies taken together. This would allow both writes and reads to
`progress at the speed of the serial interconnect.
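
To make these rates concrete, the arithmetic can be written out
directly. The sketch below (our own illustration in Python, not
simulator code) derives each figure from the Table 1 timings; the
MB/sec numbers are approximate, as in the text.

```python
# Single-package throughput arithmetic from the Table 1 timings.
PAGE_BYTES = 4096   # 4KB page
READ_US    = 25     # media -> register
SERIAL_US  = 100    # register <-> controller over the serial bus
PROG_US    = 200    # register -> media

def pages_per_sec(us_per_page):
    return 1e6 / us_per_page

serial_read  = pages_per_sec(READ_US + SERIAL_US)   # ~8000/sec (~32 MB/sec)
ilv_read     = pages_per_sec(SERIAL_US)             # 10000/sec (~40 MB/sec):
                                                    # bus-limited with in-die interleave
serial_write = pages_per_sec(SERIAL_US + PROG_US)   # ~3333/sec (~13 MB/sec)
ilv_write    = 2 * serial_write                     # interleaving transfer and
                                                    # program doubles write bandwidth

for name, rate in [("read", serial_read), ("read/ilv", ilv_read),
                   ("write", serial_write), ("write/ilv", ilv_write)]:
    print(f"{name:9s} {rate:7.0f} pages/sec  {rate * PAGE_BYTES / 1e6:5.1f} MB/sec")
```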
`
`58
`
`USENIX ’08: 2008 USENIX Annual Technical Conference
`
`USENIX Association
`
`
`
Figure 1: Samsung 4GB flash internals (two dies sharing one serial
connection; each die has four planes of 2048 blocks; each block holds
64 4KB pages; each plane has a 4KB data register)
`
`Interleaving can provide considerable speedups when
`the operation latency is greater than the serial access la-
`tency. For example, a costly erase command can in some
`cases proceed in parallel with other commands. As an-
`other example, fully interleaved page copying between
`two packages can proceed at close to 100µs per page as
`depicted in Figure 2 in spite of the 200µs cost of a single
`write operation. Here, 4 source planes and 4 destination
`planes copy pages at speed without performing simulta-
`neous operations on the same plane-pair and while opti-
`mally making use of the serial bus pins connected to both
`flash dies. Once the pipe is loaded, a write completes ev-
`ery interval (100µs).
`Even when flash architectures support interleaving,
`they do so with serious constraints. So, for example, op-
`erations on the same flash plane cannot be interleaved.
`This suggests that same-package interleaving is best em-
`ployed for a choreographed set of related operations,
`such as a multi-page read or write as depicted in Fig-
`ure 2. The Samsung parts we examined support a fast in-
`ternal copy-back operation that allows data to be copied
`to another block on-chip without crossing the serial pins.
`This optimization comes at a cost: the data can only be
`copied within the same flash plane (of 2048 blocks). Two
`such copies may themselves be interleaved on different
`planes, and the result yields similar performance to the
`fully-interleaved inter-package copying depicted in Fig-
`ure 2, but without monopolizing the serial pins.
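
The steady-state rate of such a pipeline can be checked with a small
scheduling model. The sketch below is our own illustration under the
Table 1 timings (it is not the simulator's or any firmware's code):
the serial pins are treated as the single contended resource, and the
200µs programs are absorbed by rotating across four destination planes.

```python
# Steady-state rate of fully interleaved page copying (cf. Figure 2).
READ_US, XFER_US, PROG_US = 25, 100, 200

def copy_completion_times(num_pages, num_planes=4):
    bus_free = READ_US             # first page must be read into a register
    plane_free = [0] * num_planes  # destination planes absorb the programs
    done = []
    for i in range(num_pages):
        p = i % num_planes                    # round-robin destination plane
        start = max(bus_free, plane_free[p])  # need the bus and a free plane
        bus_free = start + XFER_US            # serial pins busy for 100us
        plane_free[p] = bus_free + PROG_US    # program proceeds off the bus
        done.append(plane_free[p])
    return done

times = copy_completion_times(8)
print(times)                                      # [325, 425, 525, ...]
print([b - a for a, b in zip(times, times[1:])])  # 100us between completions
```

Once the pipe is loaded, completions arrive every 100µs, matching the
serial-interconnect-limited rate described above.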
`
Figure 2: Interleaved page copying

3 SSD Basics

In this section we outline some of the basic issues that
arise when constructing a solid-state disk from NAND-
flash components. Although we introduce a number of
`dimensions in which designs can differ, we leave the
`evaluation of specific choices until Section 4.
`All NAND-based SSDs are constructed from an ar-
`ray of flash packages similar to those described in the
`previous section. Figure 3 depicts a generalized block
`diagram for an SSD. Each SSD must contain host inter-
`face logic to support some form of physical host interface
connection (USB, Fibre Channel, PCI Express, SATA)
and logical disk emulation, such as a flash translation
layer that enables the SSD to mimic a hard disk drive.
`The bandwidth of the host interconnect is often a critical
`constraint on the performance of the device as a whole,
`and it must be matched to the performance available to
`and from the flash array. An internal buffer manager
`holds pending and satisfied requests along the primary
`data path. A multiplexer (Flash Demux/Mux) emits com-
`mands and handles transport of data along the serial con-
`nections to the flash packages. The multiplexer can in-
`clude additional logic, for example, to buffer commands
`and data. A processing engine is also required to manage
`the request flow and mappings from disk logical block
`
`USENIX Association
`
`USENIX ’08: 2008 USENIX Annual Technical Conference
`
`59
`
`
`
`
`
`
`
` &
`
`
` &
`
`
`
`
` "'%#""'
`
`
` "'%
`#
`
`%#&&#%
`
`(%
`"%
`
` &
` !()
` ()
`
` &
`
`
` &
`
`
`Figure 3: SSD Logic Components
`
`address to physical flash location. The processor, buffer-
`manager, and multiplexer are typically implemented in a
`discrete component such as an ASIC or FPGA, and data
`flow between these logic elements is very fast. The pro-
`cessor, and its associated RAM, may be integrated, as
`is the case for simple USB flash-stick devices, or stan-
`dalone as for designs with more substantial processing
`and memory requirements.
`As described in Section 2, flash packages export an
`8-bit wide serial data interface with a similar number of
`control pins. A 32GB SSD with 8 of the Samsung parts
`would require 136 pins at the flash controller(s) just for
`the flash components. With such a device, it might be
`possible to achieve full interconnection between the flash
`controller(s) and flash packages, but for larger configura-
`tions this is not likely to remain feasible. For the mo-
`ment, we assume full interconnection between data path,
`control logic, and flash. We return to the issue of inter-
`connect density in Section 3.3.
`This paper is primarily concerned with the organiza-
`tion of the flash array and the algorithms needed to man-
`age mappings between logical disk and physical flash ad-
`dresses. It is beyond the scope of this paper to tackle the
`many important issues surrounding the design and layout
`of SSD logic components.
`
`3.1 Logical Block Map
`
`As pointed out by Birrell et al. [2], the nature of NAND
`flash dictates that writes cannot be performed in place as
`on a rotating disk. Moreover, to achieve acceptable per-
`formance, writes must be performed sequentially when-
`ever possible, as in a log. Since each write of a single
`logical-disk block address (LBA) corresponds to a write
`of a different flash page, even the simplest SSD must
`maintain some form of mapping between logical block
`address and physical flash location. We assume that the
`
`logical block map is held in volatile memory and recon-
`structed from stable storage at startup time.
`We frame the discussion of logical block maps us-
`ing the abstraction of an allocation pool to think about
`how an SSD allocates flash blocks to service write re-
`quests. When handling a write request, each target log-
`ical page (4KB) is allocated from a pre-determined pool
`of flash memory. The scope of an allocation pool might
`be as small as a flash plane or as large as multiple flash
`packages. When considering the properties of allocation
`pools, the following variables come to mind.
`• Static map. A portion of each LBA constitutes a
`fixed mapping to a specific allocation pool.
`• Dynamic map. The non-static portion of a LBA is
`the lookup key for a mapping within a pool.
`• Logical page size. The size for the referent of a
`mapping entry might be as large as a flash block
(256KB), or as small as a quarter-page (1KB).
`• Page span. A logical page might span related pages
`on different flash packages thus creating the poten-
`tial for accessing sections of the page in parallel.
`
`These variables are then bound by three constraints:
`• Load balancing. Optimally, I/O operations should
`be evenly balanced between allocation pools.
`• Parallel access. The assignment of LBAs to phys-
`ical addresses should interfere as little as possible
`with the ability to access those LBAs in parallel. So,
`for example if LBA0..LBAn are always accessed at
`the same time, they should not be stored on a com-
`ponent that requires each to be accessed in series.
`• Block erasure. Flash pages cannot be re-written
`without first being erased. Only fixed-size blocks of
`contiguous pages can be erased.
`
`60
`
`USENIX ’08: 2008 USENIX Annual Technical Conference
`
`USENIX Association
`
`
`
`The variables that define allocation pools trade off
`against these constraints. For example, if a large portion
`of the LBA space is statically mapped, then there is little
`scope for load-balancing. If a contiguous range of LBAs
`is mapped to the same physical die, performance for se-
`quential access in large chunks will suffer. With a small
`logical page size, more work will be required to elimi-
`nate valid pages from erasure candidates. If the logical
`page size (with unit span) is equal to the block size, then
erasure is simplified because the write unit and erase unit
are the same; however, all writes smaller than the logical
`page size result in a read-modify-write operation involv-
`ing the portions of the logical page not being modified.
`RAID systems [26] often stripe logically contiguous
`chunks of data (e.g. 64KB or larger) across multiple
`physical disks. Here, we use striping at fine granular-
ity to distribute logical pages (4KB) across multiple flash
`dies or packages. Doing so serves both to distribute load
`and to arrange that consecutive pages will be placed on
`different packages that can be accessed in parallel.
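
As a concrete sketch of these choices, the code below statically maps
each LBA to a pool by striping consecutive 4KB logical pages across
eight pools (e.g., one per die), and keeps a dynamic page map within
each pool. The pool count, names, and structures are our assumptions
for illustration, not a prescribed design.

```python
# A striped logical block map: the pool index is a static function of
# the LBA; the flash location within a pool is a dynamic mapping that
# is updated on every write.
NUM_POOLS = 8          # hypothetical: one allocation pool per die
PAGES_PER_BLOCK = 64

class AllocationPool:
    def __init__(self):
        self.page_map = {}        # dynamic map: LBA -> (block, page)
        self.active_block = 0     # block currently receiving writes
        self.next_page = 0        # pages written low-to-high in a block

    def write(self, lba):
        if self.next_page == PAGES_PER_BLOCK:
            self.active_block += 1   # real firmware takes a recycled block
            self.next_page = 0
        self.page_map[lba] = (self.active_block, self.next_page)
        self.next_page += 1

class BlockMap:
    def __init__(self):
        self.pools = [AllocationPool() for _ in range(NUM_POOLS)]

    def pool_for(self, lba):
        # Static map: striping consecutive logical pages across pools
        # spreads load and lets sequential runs proceed in parallel.
        return lba % NUM_POOLS

    def write(self, lba):
        self.pools[self.pool_for(lba)].write(lba)

    def lookup(self, lba):
        pool = self.pool_for(lba)
        return pool, self.pools[pool].page_map[lba]

bm = BlockMap()
for lba in range(16):               # a sequential run touches all 8 pools
    bm.write(lba)
print(bm.lookup(0), bm.lookup(8))   # same pool, consecutive flash pages
```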
`
`3.2 Cleaning
`
`Fleshing out the design sketched by Birrell et al. [2], we
`use flash blocks as the natural allocation unit within an
`allocation pool. At any given time, a pool can have one or
`more active blocks available to hold incoming writes. To
`support the continued allocation of fresh active blocks,
`we need a garbage collector to enumerate previously-
`used blocks that must be erased and recycled. If the log-
`ical page granularity is smaller than the flash block size,
`then flash blocks must be cleaned prior to erasure. Clean-
`ing can be summarized as follows. When a page write
`is complete, the previously mapped page location is su-
`perseded since its contents are now out-of-date. When
`recycling a candidate block, all non-superseded pages in
`the candidate must be written elsewhere prior to erasure.
`In the worst case, where superseded pages are dis-
`tributed evenly across all blocks, N − 1 cleaning writes
`must be issued for every new data write (where there are
`N pages per block). Of course, most workloads produce
`clusters of write activity, which in turn lead to multiple
`superseded pages per block when the data is overwrit-
`ten. We introduce the term cleaning efficiency to quantify
`the ratio of superseded pages to total pages during block
`cleaning. Although there are many possible algorithms
`for choosing candidate blocks for recycling, it is always
`desirable to optimize cleaning efficiency. It’s worth not-
`ing that the use of striping to enhance parallel access for
`sequential addresses works against the clustering of su-
`perseded pages.
`For each allocation pool we maintain a free block list
`that we populate with recycled blocks. In this section and
`the next, we assume a purely greedy approach that calls
`
`for choosing blocks to recycle based on potential clean-
`ing efficiency. As described in Section 2, NAND flash
`sustains only a limited number of erasures per block.
`Therefore, it is desirable to choose candidates for recy-
`cling such that all blocks age evenly. This property is
`enforced through the process known as wear-leveling. In
`Section 5, we discuss how the choice of cleaning candi-
`dates interacts directly with wear-leveling, and suggest a
`modified greedy algorithm.
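
A minimal sketch of the greedy policy (our own illustration; the
simulator's internals may differ) scores each candidate block by its
cleaning efficiency, recycles the best one, and counts the extra
writes needed to relocate still-valid pages:

```python
# Greedy block recycling driven by cleaning efficiency: the fraction
# of pages in a block that have been superseded.
PAGES_PER_BLOCK = 64

def cleaning_efficiency(superseded_pages):
    return superseded_pages / PAGES_PER_BLOCK

def pick_victim(blocks):
    # blocks: dict block_id -> number of superseded pages.
    # Greedy: recycle the block with the highest cleaning efficiency.
    return max(blocks, key=lambda b: cleaning_efficiency(blocks[b]))

def recycle(blocks, victim):
    # Valid (non-superseded) pages must be rewritten elsewhere before
    # erasure; in the worst case that is N-1 extra writes per block.
    migration_writes = PAGES_PER_BLOCK - blocks[victim]
    del blocks[victim]
    return migration_writes

blocks = {0: 60, 1: 12, 2: 33}     # superseded-page counts (assumed)
v = pick_victim(blocks)
print(v, cleaning_efficiency(60))  # block 0, efficiency ~0.94
print(recycle(blocks, v))          # only 4 valid pages must be moved
```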
`In an SSD that emulates a traditional disk interface,
`there is no abstraction of a free disk sector. Hence, the
`SSD is always full with respect to its advertised capacity.
`In order for cleaning to work, there must be enough spare
`blocks (not counted in the overall capacity) to allow
`writes and cleaning to proceed, and to allow for block
`replacement if a block fails. An SSD can be substan-
`tially overprovisioned with such spare capacity in order
to reduce the demand for cleaning blocks in the foreground.
`Delayed block cleaning might also produce better clus-
`tering of superseded pages in non-random workloads.
`In the previous subsection, we stipulated that a given
`LBA is statically mapped to a specific allocation pool.
`Cleaning can, however, operate at a finer granularity.
`One reason for doing so is to exploit low-level efficiency
`in the flash architecture such as the internal copy-back
`operation described in Section 2.2, which only applies
`when pages are moved within the same plane. Since a
`single flash plane of 2048 blocks represents a very small
`allocation pool for the purposes of load distribution, we
`would like to allocate from a larger pool. However, if an
active block and cleaning state are maintained per plane,
`then cleaning operations within the same plane can be
`arranged with high probability.
`It might be tempting to view block cleaning as simi-
`lar to log-cleaning in a Log-Structured File System [28]
`and indeed there are similarities. However, apart from
`the obvious difference that we model a block store as op-
`posed to a file system, a log-structured store that writes
`and cleans in strict disk-order cannot choose candidate
`blocks so as to yield higher cleaning efficiency. And,
`as with LFS-like file systems, it’s altogether too easy
to construct workloads that would cause all recoverable
`space to be situated far from the log’s cleaning pointer.
`For example, writing the same sets of blocks over and
`over would require a full cycle over the disk content in
`order for the cleaning pointer to reach the free space
`near the end of the log. And, unlike a log-structured file
`system, the disk here is always “full”, corresponding to
`maximal cleaning pressure all the time.
`
`3.3 Parallelism and Interconnect Density
`
`If an SSD is going to achieve bandwidths or I/O rates
`greater than the single-chip maxima described in Sec-
`
`tion 2.2, it must be able to handle I/O requests on mul-
`tiple flash packages in parallel, making use of the ad-
`ditional serial connections to their pins. There are sev-
`eral possible techniques for obtaining such parallelism
`assuming full connectivity to the flash, some of which
`we have touched on already.
`
`#"'%#
`$
`" &
`
` '
`
` &
`
`
`
` &
`
`
`
` &
`
`
`
` &
`
`
`
`• Parallel requests. In a fully connected flash array,
`each element is an independent entity and can there-
`fore accept a separate flow of requests. The com-
`plexity of the logic necessary to maintain a queue
`per element, however, may be an issue for imple-
`mentations with reduced processing capacity.
`
`• Ganging. A gang of flash packages can be utilized
`in synchrony to optimize a multi-page request. Do-
`ing so can allow multiple packages to be used in
`parallel without the complexity of multiple queues.
`However, if only one queue of requests flows to
`such a gang, elements will lie idle when requests
don’t span all of them (see the sketch after this list).
`
`• Interleaving. As discussed in Section 2.2, inter-
`leaving can be used to improve bandwidth and hide
`the latency of costly operations.
`
`• Background cleaning. In a perfect world, clean-
`ing would be performed continuously in the back-
`ground on otherwise idle components. The use of
`operations that don’t require data to cross the serial
`interface, such as internal copy-back, can help hide
`the cost of cleaning.
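
The idle-element effect of a single ganged queue is easy to quantify.
In the toy model below (entirely our own assumption of uniformly
random request spans), each request occupies the gang for one
operation time while using only the elements it spans:

```python
# Toy utilization model for a 4-package gang fed by a single queue.
import random

GANG_SIZE = 4
random.seed(1)
spans = [random.randint(1, GANG_SIZE) for _ in range(100_000)]
busy = sum(spans)                  # element-slots actually used
total = GANG_SIZE * len(spans)     # element-slots available
print(f"gang utilization: {busy / total:.2f}")   # ~0.62 under this model
```

Per-element queues avoid this idleness, at the cost of the more
complex dispatch logic noted above.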
`
`The situation becomes more interesting when full con-
`nectivity to the flash packages is not possible. Two
`choices are readily apparent for organizing a gang of
`flash packages: 1) the packages are connected to a se-
`rial bus where a controller dynamically selects the target
of each command; and 2) each package has a separate data
`path to the controller, but the control pins are connected
`in a single broadcast bus. Configuration (1) is depicted in
`Figure 4. Data and control lines are shared, and an enable
`line for each package selects the target for a command.
`This scheme increases capacity without requiring more
`pins, but it does not increase bandwidth. Configuration
(2) is depicted in Figure 5. Here there is a shared set of
`control pins, but since there are individual data paths to
`each package, synchronous operations which span mul-
`tiple packages can proceed in parallel. The enable lines
`can be removed from the second configuration, but in this
`case all operations must apply to the entire gang, and no
package can lie idle.

Figure 4: Shared bus gang

Figure 5: Shared control gang

Interleaving can play a role within a gang. A long run-
ning operation such as block erasure can be performed on
one element while reads or writes are proceeding on oth-
ers (hence the control line need only be held long enough
to issue a command). The only constraint is competition
`for the shared data and command bus. This could be-
`come quite important during block recycling since block
`erasure is an order of magnitude more expensive than
`other operations.
`In another form of parallelism, intra-plane copy-back
can be used to implement block cleaning in the background.
`However, cleaning can take place with somewhat lower
`latency if pages are streamed at maximum speed between
`two chips. This benefit comes, of course, at the expense
`of occupying two sets of controller pins for the duration.
`It is unlikely that any one choice for exploiting paral-
`lelism can be optimal for all workloads, and certainly as
`SSD capacity scales up, it will be difficult to ensure full
`connectivity between controllers and flash packages. The
`best choices will undoubtedly be dictated by workload
`properties. For example, a highly sequential workload
`will benefit from ganging, a workload with inherent par-
`allelism will take advantage of a deeply parallel request
`queuing structure, and a workload with poor cleaning ef-
`ficiency (e.g. no locality) will rely on a cleaning strategy
`that is compatible with the foreground load.
`
`3.4 Persistence
`
`Flash memory is by definition persistent storage. How-
`ever, in order to recover SSD state, it is essential to re-
`build the logical block map and all related data struc-
`tures. This recovery must also reconstruct knowledge of
`failed blocks so that they are not re-introduced into active
`use. There are several possible approaches to this prob-
`lem. Most take advantage of the fact that each flash page
`contains a dedicated area (128 bytes) of metadata storage
`that can be used to store the logical block address that
`maps to a given flash page. In our simulator, we model
`the technique sketched by Birrell et al. [2]. This tech-
`
`62
`
`USENIX ’08: 2008 USENIX Annual Technical Conference
`
`USENIX Association
`
`
`
`nique eliminates the need to visit every page at startup
`by saving mapping information per block rather than per
`page. Note that in any such algorithm, pages whose con-
`tent has been superseded but not yet erased will appear
`multiple times during recovery. In these cases, the stable
`storage representation must allow the recovery algorithm
`to determine the most recent instance of a logical page
`within an allocation pool.
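
The following sketch illustrates the recovery idea, under the
assumption that each block's metadata summary records a monotonically
increasing sequence number and the LBA stored in each of its pages;
the exact on-flash format in [2] differs in its details.

```python
# Reconstructing the logical block map at startup from per-block
# metadata. When an LBA appears in several blocks (superseded but not
# yet erased pages), the instance with the highest sequence number wins.

def rebuild_map(block_summaries):
    # block_summaries: iterable of (block_id, seq, lbas), where
    # lbas[i] is the LBA stored in page i of the block.
    latest = {}   # LBA -> (seq, block_id, page_index)
    for block_id, seq, lbas in block_summaries:
        for page_index, lba in enumerate(lbas):
            if lba is None:
                continue          # unwritten page
            if lba not in latest or seq > latest[lba][0]:
                latest[lba] = (seq, block_id, page_index)
    return {lba: (blk, pg) for lba, (seq, blk, pg) in latest.items()}

# LBA 7 was rewritten in a newer block; recovery keeps the newest copy.
summaries = [
    (0, 1, [7, 9, None]),
    (1, 2, [7, 3, None]),
]
print(rebuild_map(summaries))   # {7: (1, 0), 9: (0, 1), 3: (1, 1)}
```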
`Flash parts do not, in general, provide error detection
`and correction. These functions must be provided by ap-
`plication firmware. The page metadata can also hold an
`error-detection code to determine which pages are valid
`and which blocks are failure-free, and an error-correction
`code to recover from the single-bit errors that are ex-
`pected with NAND flash. Samsung specifies block life-
`time assuming the presence of a single-bit ECC [29]. It
`may be possible to extend block lifetime by using more
`robust error correction.
`The problem of recovering SSD state can be bypassed
`altogether by holding the logical block map in Phase-
`Change RAM [14] or Magnetoresistive RAM [9]. These
`non-volatile memories are writable at byte granularity
`and don’t have the block-erasure constraints of NAND
`flash. The former is not yet widely available, and the lat-
`ter can be obtained in small capacities, but at 1000 times
`the cost of NAND flash. Alternatively, backup power
`(e.g. big capacitors) might be enough to flush the neces-
`sary recovery state to flash on demand.
`
`3.5 Industry Trends
`
`The market for NAND flash continues to grow both in
`volume and in breadth as storage capacity increases and
`cost per unit storage declines. As of late 2007, laptops
`are available that use an SSD as the primary disk drive.
`In addition, high-end flash systems have become avail-
`able for the enterprise marketplace. Most products in the
`marketplace can be placed into one of three categories.
`Consumer portable storage. These are inexpensive
`units with one or perhaps two flash packages and a sim-
`ple controller. They commonly appear as USB flash
`sticks or camera memories, and feature moderate band-
`width for sequential operations, moderate random read
`performance, and very poor random write performance.
`Laptop disk replacements. These disks provide sub-
`stantial bandwidth to approximate that of the IDE and
`SATA disks they replace. Random read performance is
`far superior to that of rotating media. Random write per-
`formance is comparable to that of rotating media.
`Enterprise/database accelerators. These units promise
`very fast sequential performance, random read perfor-
`mance superior to that of a high-end RAID array, and
`very strong random write performance. They have costs
to match. However, the specified random-write perfor-
mance numbers are often a bit mysterious, and some-
times are missing altogether. For example, it is often dif-
ficult to tell whether cleaning costs are included.

             Sequential                  Random 4K
             Read         Write          Read      Write
USB          11.7 MB/sec  4.3 MB/sec     150/sec   <20/sec
MTron        100 MB/sec   80 MB/sec      11K/sec   130/sec
Zeus         200 MB/sec   100 MB/sec     52K/sec   11K/sec
FusionIO     700 MB/sec   600 MB/sec     87K/sec   Not avail

Table 2: Sample Flash Device Performance
`In Table 2, we provide performance numbers for sam-
`ple devices in these categories. The sample devices are
`the Lexar JD Firefly 8GB USB flash stick, the MTron
MSD-SATA3025 SSD [20], the Zeus-IOPS SSD [30],
`and the FusionIO ioDrive [10]. The MTron device is in
`the second category above, while the Zeus and FusionIO
`devices are clearly enterprise-class. We are not privy to
`the internals of these devices, but we can speculate on
`how they are organized. Previous work [2] points out
`that consumer USB flash devices almost certainly use a
`logical page granularity close to the size of a flash block.
`Limited random read performance suggests that only a
`single I/O request is in flight at a time. The MTron device
`gives random read I/O performance better than can be
`expected with a single flash request stream, so it almost
`certainly employs some sort of parallel request structure.
`However, the poor random write I/O performance sug-
`gests that a large logical page size is being used (with
`substantial read-modify-write costs). The random write
performance of the Zeus and FusionIO devices suggests a
`sophisticated block mapping scheme, and multiple par-
`allel activities are clearly needed to reach the stated ran-
`dom I/O performance. The Zeus system appears to lower
`pressure on its cleaning algorithm by substantial over-
`provisioning [7]. The FusionIO system (for which only
`a preliminary specification was available in 2007) is the
`first to offer a PCI-Express interconnect.
`
`4 Design Details and Evaluation
`
`From our discussion of industry trends, it is clear that
`current SSDs can get very good performance numbers
`from ideal workloads.
`For exam