Design Tradeoffs for SSD Performance
Nitin Agrawal∗, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse, Rina Panigrahy
Microsoft Research, Silicon Valley
∗University of Wisconsin-Madison
Abstract

Solid-state disks (SSDs) have the potential to revolutionize the storage system landscape. However, there is little published work about their internal organization or the design choices that SSD manufacturers face in pursuit of optimal performance. This paper presents a taxonomy of such design choices and analyzes the likely performance of various configurations using a trace-driven simulator and workload traces extracted from real systems. We find that SSD performance and lifetime are highly workload-sensitive, and that complex systems problems that normally appear higher in the storage stack, or even in distributed systems, are relevant to device firmware.
1 Introduction

The advent of the NAND-flash based solid-state storage device (SSD) is certain to represent a sea change in the architecture of computer storage subsystems. These devices are capable of producing not only exceptional bandwidth, but also random I/O performance that is orders of magnitude better than that of rotating disks. Moreover, SSDs offer both a significant savings in power budget and an absence of moving parts, improving system reliability.

Although solid-state disks cost significantly more per unit capacity than their rotating counterparts, there are numerous applications where they can be applied to great benefit. For example, in transaction-processing systems, disk capacity is often wasted in order to improve operation throughput. In such configurations, many small (cost-inefficient) rotating disks are deployed to increase I/O parallelism. Large SSDs, suitably optimized for random read and write performance, could effectively replace whole farms of slow, rotating disks. At this writing, small SSDs are starting to appear in laptop computers because of their reduced power profile and reliability in portable environments. As the cost of flash continues to decline, the potential application space for solid-state disks will certainly continue to grow.

Despite the promise that SSDs hold, there is little in the literature about the architectural tradeoffs inherent in their design. Where such knowledge exists, it typically remains the intellectual property of SSD manufacturers. As a consequence, it is difficult to understand the architecture of a given device, and harder still to interpret its performance characteristics.
In this paper, we lay out a range of design tradeoffs that are relevant to NAND-flash solid-state storage. We then analyze several of these tradeoffs using a trace-based disk simulator that we have customized to characterize different SSD organizations. Since we can only speculate about the detailed internals of existing SSDs, we base our simulator on the specified properties of NAND-flash chips. Our analysis is driven by various traces captured from running systems such as a full-scale TPC-C benchmark, an Exchange server workload, and various standard file system benchmarks.

We find that many of the issues that arise in SSD design appear to mimic problems that have previously appeared higher in the storage stack. In solving these hard problems, there is considerable latitude for design choice. We show that the following systems issues are relevant to SSD performance:
• Data placement. Careful placement of data across the chips of an SSD is critical not only to provide load balancing, but to effect wear-leveling.

• Parallelism. The bandwidth and operation rate of any given flash chip are not sufficient to achieve optimal performance. Hence, memory components must be coordinated so as to operate in parallel.

• Write ordering. The properties of NAND flash present hard problems to the SSD designer. Small, randomly-ordered writes are especially tricky.

• Workload management. Performance is highly workload-dependent. For example, design decisions that produce good performance under sequential workloads may not benefit workloads that are not sequential, and vice versa.
As SSDs increase in complexity, existing disk models will become insufficient for predicting performance. In particular, random write performance and disk lifetime will vary significantly due to the locality of disk write operations. We introduce a new model for characterizing this behavior based on cleaning efficiency and suggest a new wear-leveling algorithm for extending SSD lifetime.

The remainder of this paper is organized as follows. In the next section, we provide background on the properties of NAND-flash memory. Section 3 describes the basic functionality that SSD designers must provide and the major challenges in implementing these devices. Section 4 describes our simulation environment and presents an evaluation of the various design choices. Section 5 provides a discussion of SSD wear-leveling and gives preliminary simulator results on this topic. Related work is discussed in Section 6, and Section 7 concludes.
2 Background

Our discussion of flash memory is based on the latest product specifications for Samsung's K9XXG08UXM series NAND-flash part [29]. Other vendors such as Micron and Hynix offer products with similar features. For the remainder of this paper, we treat the 4GB Samsung part as a canonical exemplar, although the specifics of other vendors' parts will differ in some respects. We present the specifications for single-level cell (SLC) flash. Multi-level cell (MLC) flash is cheaper than SLC, but has inferior performance and lifetime.

Figure 1 shows a schematic for a flash package. A flash package is composed of one or more dies (also called chips). We describe a 4GB flash package consisting of two 2GB dies, sharing an 8-bit serial I/O bus and a number of common control signals. The two dies have separate chip enable and ready/busy signals. Thus, one of the dies can accept commands and data while the other is carrying out another operation. The package also supports interleaved operations between the two dies.

Each die within a package contains 8192 blocks, organized among 4 planes of 2048 blocks. The dies can operate independently, each performing operations involving one or two planes. Two-plane commands can be executed on either plane-pairs 0 & 1 or 2 & 3, but not across other combinations. Each block in turn consists of 64 4KB pages. In addition to data, each page includes a 128-byte region to store metadata (identification and error-detection information). Table 1 presents the operational attributes of the Samsung 4GB flash memory.

    Page Read to Register                   25µs
    Page Program (Write) from Register      200µs
    Block Erase                             1.5ms
    Serial Access to Register (Data bus)    100µs
    Die Size                                2 GB
    Block Size                              256 KB
    Page Size                               4 KB
    Data Register                           4 KB
    Planes per die                          4
    Dies per package (2GB/4GB/8GB)          1, 2, or 4
    Program/Erase Cycles                    100 K

Table 1: Operational flash parameters
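To make this geometry concrete, the short Python sketch below decomposes a flat page number into (die, plane, block, page) coordinates under the parameters above. It is illustrative only; the particular field ordering is an assumption for exposition, not the vendor's addressing scheme.

    # Geometry of the 4GB package described above: two dies, four planes per
    # die, 2048 blocks per plane, 64 pages of 4KB per block. The flat-to-
    # coordinate mapping below is an illustrative choice, not a specified one.
    PAGES_PER_BLOCK = 64
    BLOCKS_PER_PLANE = 2048
    PLANES_PER_DIE = 4
    DIES_PER_PACKAGE = 2

    def decompose(page_no):
        """Split a flat page number into (die, plane, block, page) indices."""
        page = page_no % PAGES_PER_BLOCK
        block_no = page_no // PAGES_PER_BLOCK
        block = block_no % BLOCKS_PER_PLANE
        plane_no = block_no // BLOCKS_PER_PLANE
        plane = plane_no % PLANES_PER_DIE
        die = plane_no // PLANES_PER_DIE
        assert die < DIES_PER_PACKAGE, "page number exceeds package capacity"
        return die, plane, block, page

    # 2 dies x 4 planes x 2048 blocks x 64 pages x 4KB = 4GB of data pages.
    total_pages = DIES_PER_PACKAGE * PLANES_PER_DIE * BLOCKS_PER_PLANE * PAGES_PER_BLOCK
    assert total_pages * 4 * 1024 == 4 * 1024**3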
2.1 Properties of Flash Memory

Data reads are at the granularity of flash pages, and a typical read operation takes 25µs to read a page from the media into a 4KB data register, and then subsequently shift it out over the data bus. The serial line transfers data at 25ns per byte, or roughly 100µs per page. Flash media blocks must be erased before they can be reused for new data. An erase operation takes 1.5ms, which is considerably more expensive than a read or write operation. In addition, each block can be erased only a finite number of times before becoming unusable. This limit, 100K erase cycles for current generation flash, places a premium on careful block reuse.

Writing (or programming) is also done at page granularity by shifting data into the data register (100µs) and then writing it out to the flash cell (200µs). Pages must be written out sequentially within a block, from low to high addresses. The part also provides a specialized copy-back program operation from one page to another, improving performance by avoiding the need to transport data through the serial line to an external buffer.

In this paper, we discuss a 2 x 2GB flash package, but extensions to larger dies and/or packages with more dies are straightforward.
2.2 Bandwidth and Interleaving

The serial interface over which flash packages receive commands and transmit data is a primary bottleneck for SSD performance. The Samsung part takes roughly 100µs to transfer a 4KB page from the on-chip register to an off-chip controller. This dwarfs the 25µs required to move data into the register from the NAND cells. When these two operations are taken in series, a flash package can only produce 8000 page reads per second (32 MB/sec). If interleaving is employed within a die, the maximum read bandwidth from a single part improves to 10000 reads per second (40 MB/sec). Writes, on the other hand, require the same 100µs serial transfer time per page as reads, but 200µs programming time. Without interleaving, this gives a maximum, single-part write rate of 3330 pages per second (13 MB/sec). Interleaving the serial transfer time and the program operation doubles the overall bandwidth. In theory, because there are two independent dies on the packages we are considering, we can interleave three operations on the two dies put together. This would allow both writes and reads to progress at the speed of the serial interconnect.
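The figures above follow directly from the Table 1 timings. The sketch below reproduces the arithmetic; the helper names are illustrative, and MB is treated as 10^6 bytes so that the round numbers in the text fall out.

    # Back-of-the-envelope single-package throughput from the Table 1 timings.
    T_READ_US, T_PROG_US, T_XFER_US = 25.0, 200.0, 100.0   # microseconds
    PAGE_KB = 4

    def pages_per_sec(interval_us):
        return 1e6 / interval_us

    def mb_per_sec(interval_us):
        # MB taken as 10**6 bytes to match the round figures quoted above.
        return pages_per_sec(interval_us) * PAGE_KB / 1000

    # Reads, serialized: 25us cell read + 100us serial transfer per page.
    print(mb_per_sec(T_READ_US + T_XFER_US))      # ~32 MB/sec (8000 reads/sec)
    # Reads, interleaved within a die: the 100us transfer hides the 25us read.
    print(mb_per_sec(T_XFER_US))                  # ~40 MB/sec (10000 reads/sec)
    # Writes, serialized: 100us transfer + 200us program per page.
    print(mb_per_sec(T_XFER_US + T_PROG_US))      # ~13 MB/sec (~3330 writes/sec)
    # Writes with transfer/program interleaving: the stated effect is a doubling.
    print(2 * mb_per_sec(T_XFER_US + T_PROG_US))  # ~27 MB/sec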
Figure 1: Samsung 4GB flash internals (a shared serial connection feeding two dies; each die has four planes of 2048 blocks, each block holds 64 4KB pages, and each plane has a 4KB data register)
Interleaving can provide considerable speedups when the operation latency is greater than the serial access latency. For example, a costly erase command can in some cases proceed in parallel with other commands. As another example, fully interleaved page copying between two packages can proceed at close to 100µs per page as depicted in Figure 2, in spite of the 200µs cost of a single write operation. Here, 4 source planes and 4 destination planes copy pages at speed without performing simultaneous operations on the same plane-pair and while optimally making use of the serial bus pins connected to both flash dies. Once the pipe is loaded, a write completes every interval (100µs).

Figure 2: Interleaved page copying

Even when flash architectures support interleaving, they do so with serious constraints. For example, operations on the same flash plane cannot be interleaved. This suggests that same-package interleaving is best employed for a choreographed set of related operations, such as a multi-page read or write as depicted in Figure 2. The Samsung parts we examined support a fast internal copy-back operation that allows data to be copied to another block on-chip without crossing the serial pins. This optimization comes at a cost: the data can only be copied within the same flash plane (of 2048 blocks). Two such copies may themselves be interleaved on different planes, and the result yields similar performance to the fully-interleaved inter-package copying depicted in Figure 2, but without monopolizing the serial pins.
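A rough steady-state model of the pipelined copy makes the 100µs-per-page claim easy to check. The model below is a simplification with assumed structure (it ignores command overheads and the plane-pair restrictions noted above); it is not the simulator described later.

    # Steady-state per-page interval for fully interleaved copying between two
    # packages (Figure 2). Assumption: each package has its own serial pins,
    # so the outbound transfer from the source and the inbound transfer to the
    # destination overlap, and each bus is busy T_XFER_US per page. Reads and
    # programs are spread across 'planes_per_side' planes on each side.
    T_READ_US, T_PROG_US, T_XFER_US = 25.0, 200.0, 100.0

    def copy_interval_us(planes_per_side):
        return max(T_XFER_US,
                   T_READ_US / planes_per_side,
                   T_PROG_US / planes_per_side)

    print(copy_interval_us(1))   # 200us/page: a single destination plane is program-bound
    print(copy_interval_us(2))   # 100us/page: two planes hide the 200us program
    print(copy_interval_us(4))   # 100us/page: the serial buses are now the bottleneck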
3 SSD Basics

In this section we outline some of the basic issues that arise when constructing a solid-state disk from NAND-flash components. Although we introduce a number of dimensions in which designs can differ, we leave the evaluation of specific choices until Section 4.
All NAND-based SSDs are constructed from an array of flash packages similar to those described in the previous section. Figure 3 depicts a generalized block diagram for an SSD. Each SSD must contain host interface logic to support some form of physical host interface connection (USB, FiberChannel, PCI Express, SATA) and logical disk emulation, such as a flash translation layer mechanism, to enable the SSD to mimic a hard disk drive. The bandwidth of the host interconnect is often a critical constraint on the performance of the device as a whole, and it must be matched to the performance available to and from the flash array. An internal buffer manager holds pending and satisfied requests along the primary data path. A multiplexer (Flash Demux/Mux) emits commands and handles transport of data along the serial connections to the flash packages. The multiplexer can include additional logic, for example, to buffer commands and data. A processing engine is also required to manage the request flow and mappings from disk logical block address to physical flash location. The processor, buffer manager, and multiplexer are typically implemented in a discrete component such as an ASIC or FPGA, and data flow between these logic elements is very fast. The processor, and its associated RAM, may be integrated, as is the case for simple USB flash-stick devices, or standalone as for designs with more substantial processing and memory requirements.

Figure 3: SSD Logic Components

As described in Section 2, flash packages export an 8-bit wide serial data interface with a similar number of control pins. A 32GB SSD with 8 of the Samsung parts would require 136 pins at the flash controller(s) just for the flash components. With such a device, it might be possible to achieve full interconnection between the flash controller(s) and flash packages, but for larger configurations this is not likely to remain feasible. For the moment, we assume full interconnection between data path, control logic, and flash. We return to the issue of interconnect density in Section 3.3.

This paper is primarily concerned with the organization of the flash array and the algorithms needed to manage mappings between logical disk and physical flash addresses. It is beyond the scope of this paper to tackle the many important issues surrounding the design and layout of SSD logic components.
3.1 Logical Block Map

As pointed out by Birrell et al. [2], the nature of NAND flash dictates that writes cannot be performed in place as on a rotating disk. Moreover, to achieve acceptable performance, writes must be performed sequentially whenever possible, as in a log. Since each write of a single logical-disk block address (LBA) corresponds to a write of a different flash page, even the simplest SSD must maintain some form of mapping between logical block address and physical flash location. We assume that the logical block map is held in volatile memory and reconstructed from stable storage at startup time.

We frame the discussion of logical block maps using the abstraction of an allocation pool to think about how an SSD allocates flash blocks to service write requests. When handling a write request, each target logical page (4KB) is allocated from a pre-determined pool of flash memory. The scope of an allocation pool might be as small as a flash plane or as large as multiple flash packages. When considering the properties of allocation pools, the following variables come to mind.

• Static map. A portion of each LBA constitutes a fixed mapping to a specific allocation pool.
• Dynamic map. The non-static portion of an LBA is the lookup key for a mapping within a pool.
• Logical page size. The size for the referent of a mapping entry might be as large as a flash block (256KB), or as small as a quarter-page (1KB).
• Page span. A logical page might span related pages on different flash packages, thus creating the potential for accessing sections of the page in parallel.

These variables are then bound by three constraints:

• Load balancing. Optimally, I/O operations should be evenly balanced between allocation pools.
• Parallel access. The assignment of LBAs to physical addresses should interfere as little as possible with the ability to access those LBAs in parallel. So, for example, if LBA0..LBAn are always accessed at the same time, they should not be stored on a component that requires each to be accessed in series.
• Block erasure. Flash pages cannot be re-written without first being erased. Only fixed-size blocks of contiguous pages can be erased.
The variables that define allocation pools trade off against these constraints. For example, if a large portion of the LBA space is statically mapped, then there is little scope for load-balancing. If a contiguous range of LBAs is mapped to the same physical die, performance for sequential access in large chunks will suffer. With a small logical page size, more work will be required to eliminate valid pages from erasure candidates. If the logical page size (with unit span) is equal to the block size, then erasure is simplified because the write unit and erase unit are the same; however, all writes smaller than the logical page size result in a read-modify-write operation involving the portions of the logical page not being modified.

RAID systems [26] often stripe logically contiguous chunks of data (e.g., 64KB or larger) across multiple physical disks. Here, we use striping at fine granularity to distribute logical pages (4KB) across multiple flash dies or packages. Doing so serves both to distribute load and to arrange that consecutive pages will be placed on different packages that can be accessed in parallel.
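The following minimal sketch shows how a static pool assignment, a dynamic per-pool map, and page-granularity striping fit together. It is illustrative only: the pool count, the choice of low-order LBA bits as the static portion, and the class layout are assumptions, not the design evaluated in Section 4.

    # Allocation-pool block map with 4KB logical pages (unit span) and
    # page-level striping. One pool per die (assumed); the low bits of the
    # LBA select the pool (static map), so consecutive logical pages land on
    # different dies and can be accessed in parallel. Within a pool, a
    # dictionary holds the dynamic mapping from LBA to (block, page).
    NUM_POOLS = 8            # illustrative: one pool per die in a small array
    PAGES_PER_BLOCK = 64

    class AllocationPool:
        def __init__(self):
            self.map = {}                   # lba -> (block, page)
            self.active_block = 0           # block receiving new writes
            self.next_page = 0              # pages written low-to-high

        def write(self, lba):
            if self.next_page == PAGES_PER_BLOCK:
                self.active_block = self.allocate_fresh_block()
                self.next_page = 0
            superseded = self.map.get(lba)  # old location is now out-of-date
            self.map[lba] = (self.active_block, self.next_page)
            self.next_page += 1
            return superseded

        def allocate_fresh_block(self):
            # Placeholder: a real design pops from the pool's free block list,
            # which is replenished by the cleaner of Section 3.2.
            return self.active_block + 1

    pools = [AllocationPool() for _ in range(NUM_POOLS)]

    def write_logical_page(lba):
        pool = pools[lba % NUM_POOLS]       # static portion of the map
        return pool.write(lba)              # dynamic portion, kept per pool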
3.2 Cleaning

Fleshing out the design sketched by Birrell et al. [2], we use flash blocks as the natural allocation unit within an allocation pool. At any given time, a pool can have one or more active blocks available to hold incoming writes. To support the continued allocation of fresh active blocks, we need a garbage collector to enumerate previously-used blocks that must be erased and recycled. If the logical page granularity is smaller than the flash block size, then flash blocks must be cleaned prior to erasure. Cleaning can be summarized as follows. When a page write is complete, the previously mapped page location is superseded since its contents are now out-of-date. When recycling a candidate block, all non-superseded pages in the candidate must be written elsewhere prior to erasure.

In the worst case, where superseded pages are distributed evenly across all blocks, N − 1 cleaning writes must be issued for every new data write (where there are N pages per block). Of course, most workloads produce clusters of write activity, which in turn lead to multiple superseded pages per block when the data is overwritten. We introduce the term cleaning efficiency to quantify the ratio of superseded pages to total pages during block cleaning. Although there are many possible algorithms for choosing candidate blocks for recycling, it is always desirable to optimize cleaning efficiency. It is worth noting that the use of striping to enhance parallel access for sequential addresses works against the clustering of superseded pages.

For each allocation pool we maintain a free block list that we populate with recycled blocks. In this section and the next, we assume a purely greedy approach that calls for choosing blocks to recycle based on potential cleaning efficiency. As described in Section 2, NAND flash sustains only a limited number of erasures per block. Therefore, it is desirable to choose candidates for recycling such that all blocks age evenly. This property is enforced through the process known as wear-leveling. In Section 5, we discuss how the choice of cleaning candidates interacts directly with wear-leveling, and suggest a modified greedy algorithm.
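A compact sketch of the greedy policy and the cleaning-efficiency measure follows. The block representation is deliberately simplified and assumed for illustration; it is not the simulator's code.

    # Greedy victim selection driven by cleaning efficiency. A block is
    # modeled simply as a list of page states; True marks a superseded page.
    def cleaning_efficiency(block):
        """Fraction of pages in the block that are already superseded."""
        return sum(block) / len(block)

    def pick_victim(used_blocks):
        # Purely greedy: erase the block that frees the most space per
        # copy-forward write. In the worst case (superseded pages spread
        # evenly across blocks), N - 1 live pages must be rewritten for
        # every page of space reclaimed.
        return max(range(len(used_blocks)),
                   key=lambda i: cleaning_efficiency(used_blocks[i]))

    # Block 0: 16 of 64 pages superseded; block 1: 48 of 64 superseded.
    blocks = [[True] * 16 + [False] * 48,
              [True] * 48 + [False] * 16]
    victim = pick_victim(blocks)
    print(victim, cleaning_efficiency(blocks[victim]))   # 1 0.75
    live_pages_to_copy = blocks[victim].count(False)      # 16 copy-forward writes

A pure greedy choice optimizes cleaning efficiency but ignores block age; the modified algorithm discussed in Section 5 folds wear-leveling into this decision.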
In an SSD that emulates a traditional disk interface, there is no abstraction of a free disk sector. Hence, the SSD is always full with respect to its advertised capacity. In order for cleaning to work, there must be enough spare blocks (not counted in the overall capacity) to allow writes and cleaning to proceed, and to allow for block replacement if a block fails. An SSD can be substantially overprovisioned with such spare capacity in order to reduce the demand for cleaning blocks in the foreground. Delayed block cleaning might also produce better clustering of superseded pages in non-random workloads.

In the previous subsection, we stipulated that a given LBA is statically mapped to a specific allocation pool. Cleaning can, however, operate at a finer granularity. One reason for doing so is to exploit low-level efficiencies in the flash architecture such as the internal copy-back operation described in Section 2.2, which only applies when pages are moved within the same plane. Since a single flash plane of 2048 blocks represents a very small allocation pool for the purposes of load distribution, we would like to allocate from a larger pool. However, if active-block and cleaning state is maintained per plane, then cleaning operations within the same plane can be arranged with high probability.

It might be tempting to view block cleaning as similar to log-cleaning in a Log-Structured File System [28], and indeed there are similarities. However, apart from the obvious difference that we model a block store as opposed to a file system, a log-structured store that writes and cleans in strict disk-order cannot choose candidate blocks so as to yield higher cleaning efficiency. And, as with LFS-like file systems, it is altogether too easy to combine workloads that would cause all recoverable space to be situated far from the log's cleaning pointer. For example, writing the same sets of blocks over and over would require a full cycle over the disk content in order for the cleaning pointer to reach the free space near the end of the log. And, unlike a log-structured file system, the disk here is always "full", corresponding to maximal cleaning pressure all the time.
3.3 Parallelism and Interconnect Density

If an SSD is going to achieve bandwidths or I/O rates greater than the single-chip maxima described in Section 2.2, it must be able to handle I/O requests on multiple flash packages in parallel, making use of the additional serial connections to their pins. There are several possible techniques for obtaining such parallelism assuming full connectivity to the flash, some of which we have touched on already.
• Parallel requests. In a fully connected flash array, each element is an independent entity and can therefore accept a separate flow of requests. The complexity of the logic necessary to maintain a queue per element, however, may be an issue for implementations with reduced processing capacity. (A simplified sketch of per-element queuing appears after this list.)

• Ganging. A gang of flash packages can be utilized in synchrony to optimize a multi-page request. Doing so can allow multiple packages to be used in parallel without the complexity of multiple queues. However, if only one queue of requests flows to such a gang, elements will lie idle when requests don't span all of them.

• Interleaving. As discussed in Section 2.2, interleaving can be used to improve bandwidth and hide the latency of costly operations.

• Background cleaning. In a perfect world, cleaning would be performed continuously in the background on otherwise idle components. The use of operations that don't require data to cross the serial interface, such as internal copy-back, can help hide the cost of cleaning.
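The toy dispatcher below illustrates the first option: one request queue per flash element, so an expensive erase on one element can overlap reads on the others. The names, the fixed per-operation latencies (taken from the serialized figures of Section 2), and the event model are all assumptions for illustration.

    # Toy per-element queue dispatcher: each element processes its own queue
    # serially, while different elements run in parallel.
    import heapq

    LATENCY_US = {"read": 125.0, "write": 300.0, "erase": 1500.0}

    def simulate(queues):
        """queues: list of per-element lists of operation names, in order."""
        makespan, events = 0.0, []
        for elem, ops in enumerate(queues):
            t = 0.0
            for op in ops:
                t += LATENCY_US[op]          # operations on one element serialize
                heapq.heappush(events, (t, elem, op))
            makespan = max(makespan, t)
        while events:                         # completion order across elements
            t, elem, op = heapq.heappop(events)
            print(f"{t:8.1f}us  element {elem}: {op} done")
        return makespan                       # elements ran in parallel

    # An erase on element 0 overlaps four reads on element 1.
    simulate([["erase"], ["read", "read", "read", "read"]])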
The situation becomes more interesting when full connectivity to the flash packages is not possible. Two choices are readily apparent for organizing a gang of flash packages: 1) the packages are connected to a serial bus where a controller dynamically selects the target of each command; and 2) each package has a separate data path to the controller, but the control pins are connected in a single broadcast bus. Configuration (1) is depicted in Figure 4. Data and control lines are shared, and an enable line for each package selects the target for a command. This scheme increases capacity without requiring more pins, but it does not increase bandwidth. Configuration (2) is depicted in Figure 5. Here there is a shared set of control pins, but since there are individual data paths to each package, synchronous operations which span multiple packages can proceed in parallel. The enable lines can be removed from the second configuration, but in this case all operations must apply to the entire gang, and no package can lie idle.

Figure 4: Shared bus gang
Figure 5: Shared control gang

Interleaving can play a role within a gang. A long-running operation such as block erasure can be performed on one element while reads or writes are proceeding on others (hence the control line need only be held long enough to issue a command). The only constraint is competition for the shared data and command bus. This could become quite important during block recycling since block erasure is an order of magnitude more expensive than other operations.

In another form of parallelism, intra-plane copy-back can be used to implement block cleaning in the background. However, cleaning can take place with somewhat lower latency if pages are streamed at maximum speed between two chips. This benefit comes, of course, at the expense of occupying two sets of controller pins for the duration.

It is unlikely that any one choice for exploiting parallelism can be optimal for all workloads, and certainly as SSD capacity scales up, it will be difficult to ensure full connectivity between controllers and flash packages. The best choices will undoubtedly be dictated by workload properties. For example, a highly sequential workload will benefit from ganging, a workload with inherent parallelism will take advantage of a deeply parallel request queuing structure, and a workload with poor cleaning efficiency (e.g., no locality) will rely on a cleaning strategy that is compatible with the foreground load.
3.4 Persistence

Flash memory is by definition persistent storage. However, in order to recover SSD state, it is essential to rebuild the logical block map and all related data structures. This recovery must also reconstruct knowledge of failed blocks so that they are not re-introduced into active use. There are several possible approaches to this problem. Most take advantage of the fact that each flash page contains a dedicated area (128 bytes) of metadata storage that can be used to store the logical block address that maps to a given flash page. In our simulator, we model the technique sketched by Birrell et al. [2]. This technique eliminates the need to visit every page at startup by saving mapping information per block rather than per page. Note that in any such algorithm, pages whose content has been superseded but not yet erased will appear multiple times during recovery. In these cases, the stable storage representation must allow the recovery algorithm to determine the most recent instance of a logical page within an allocation pool.
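The sketch below shows the simpler per-page variant of such a recovery scan, assuming each page's metadata region records its LBA and a monotonically increasing sequence number stamped at write time. The metadata layout is an assumption for illustration, and the per-block summary described above avoids visiting every page; the recovery rule (the highest sequence number wins) is the same idea.

    # Rebuild the logical block map at startup from per-page metadata.
    # Assumption: each page's 128-byte metadata area holds (lba, seq), where
    # seq increases with every write, so the newest instance of a superseded
    # logical page can be identified.
    def recover_map(pages):
        """pages: iterable of (physical_addr, lba, seq) read from flash."""
        newest = {}                          # lba -> (seq, physical_addr)
        for phys, lba, seq in pages:
            if lba is None:                  # free or failed page: nothing to map
                continue
            if lba not in newest or seq > newest[lba][0]:
                newest[lba] = (seq, phys)
        return {lba: phys for lba, (seq, phys) in newest.items()}

    # Two surviving copies of LBA 7; the one with the higher sequence wins.
    print(recover_map([(0, 7, 10), (1, 7, 42), (2, 9, 5)]))   # {7: 1, 9: 2}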
Flash parts do not, in general, provide error detection and correction. These functions must be provided by application firmware. The page metadata can also hold an error-detection code to determine which pages are valid and which blocks are failure-free, and an error-correction code to recover from the single-bit errors that are expected with NAND flash. Samsung specifies block lifetime assuming the presence of a single-bit ECC [29]. It may be possible to extend block lifetime by using more robust error correction.

The problem of recovering SSD state can be bypassed altogether by holding the logical block map in Phase-Change RAM [14] or Magnetoresistive RAM [9]. These non-volatile memories are writable at byte granularity and don't have the block-erasure constraints of NAND flash. The former is not yet widely available, and the latter can be obtained in small capacities, but at 1000 times the cost of NAND flash. Alternatively, backup power (e.g., large capacitors) might be enough to flush the necessary recovery state to flash on demand.
3.5 Industry Trends

The market for NAND flash continues to grow both in volume and in breadth as storage capacity increases and cost per unit storage declines. As of late 2007, laptops are available that use an SSD as the primary disk drive. In addition, high-end flash systems have become available for the enterprise marketplace. Most products in the marketplace can be placed into one of three categories.

Consumer portable storage. These are inexpensive units with one or perhaps two flash packages and a simple controller. They commonly appear as USB flash sticks or camera memories, and feature moderate bandwidth for sequential operations, moderate random read performance, and very poor random write performance.

Laptop disk replacements. These disks provide substantial bandwidth to approximate that of the IDE and SATA disks they replace. Random read performance is far superior to that of rotating media. Random write performance is comparable to that of rotating media.

Enterprise/database accelerators. These units promise very fast sequential performance, random read performance superior to that of a high-end RAID array, and very strong random write performance. They have costs to match. However, the specified random-write performance numbers are often a bit mysterious, and sometimes are missing altogether. For example, it is often difficult to tell whether cleaning costs are included.
In Table 2, we provide performance numbers for sample devices in these categories. The sample devices are the Lexar JD Firefly 8GB USB flash stick, the MTron MSD-SATA3025 SSD [20], the Zeus-IOPS SSD [30], and the FusionIO ioDrive [10]. The MTron device is in the second category above, while the Zeus and FusionIO devices are clearly enterprise-class. We are not privy to the internals of these devices, but we can speculate on how they are organized. Previous work [2] points out that consumer USB flash devices almost certainly use a logical page granularity close to the size of a flash block. Limited random read performance suggests that only a single I/O request is in flight at a time. The MTron device gives random read I/O performance better than can be expected with a single flash request stream, so it almost certainly employs some sort of parallel request structure. However, the poor random write I/O performance suggests that a large logical page size is being used (with substantial read-modify-write costs). The random write performance of the Zeus and FusionIO devices suggests a sophisticated block mapping scheme, and multiple parallel activities are clearly needed to reach the stated random I/O performance. The Zeus system appears to lower pressure on its cleaning algorithm by substantial overprovisioning [7]. The FusionIO system (for which only a preliminary specification was available in 2007) is the first to offer a PCI-Express interconnect.

               Sequential Read   Sequential Write   Random 4K Read   Random 4K Write
    USB        11.7 MB/sec       4.3 MB/sec         150/sec          <20/sec
    MTron      100 MB/sec        80 MB/sec          11K/sec          130/sec
    Zeus       200 MB/sec        100 MB/sec         52K/sec          11K/sec
    FusionIO   700 MB/sec        600 MB/sec         87K/sec          Not avail.

Table 2: Sample Flash Device Performance
4 Design Details and Evaluation

From our discussion of industry trends, it is clear that current SSDs can get very good performance numbers from ideal workloads. For exam
