RAID: High-Performance, Reliable Secondary Storage

PETER M. CHEN
Department of Electrical Engineering and Computer Science, University of Michigan,
1301 Beal Avenue, Ann Arbor, Michigan 48109-2122

EDWARD K. LEE
DEC Systems Research Center, 130 Lytton Avenue, Palo Alto, California 94301-1044

GARTH A. GIBSON
School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue,
Pittsburgh, Pennsylvania 15213-3891

RANDY H. KATZ
Department of Electrical Engineering and Computer Science, 571 Evans Hall,
University of California, Berkeley, California 94720

DAVID A. PATTERSON
Department of Electrical Engineering and Computer Science, 571 Evans Hall,
University of California, Berkeley, California 94720
`
Disk arrays were proposed in the 1980s as a way to use parallelism between multiple disks to improve aggregate I/O performance. Today they appear in the product lines of most major computer manufacturers. This article gives a comprehensive overview of disk arrays and provides a framework in which to organize current and future work. First, the article introduces disk technology and reviews the driving forces that have popularized disk arrays: performance and reliability. It discusses the two architectural techniques used in disk arrays: striping across multiple disks to improve performance and redundancy to improve reliability. Next, the article describes seven disk array architectures, called RAID (Redundant Arrays of Inexpensive Disks) levels 0-6, and compares their performance, cost, and reliability. It goes on to discuss advanced research and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency. Last, the article describes six disk array prototypes or products and discusses future opportunities for research, with an annotated bibliography of disk array-related literature.
`
Categories and Subject Descriptors: B.4.2 [Input/Output and Data Communications]: Input/Output Devices; B.4.5 [Input/Output and Data Communications]: Reliability, Testing, and Fault-Tolerance; D.4.2 [Operating Systems]: Storage Management; E.4 [Data]: Coding and Information Theory

General Terms: Design, Performance, Reliability

Additional Key Words and Phrases: Disk array, parallel I/O, RAID, redundancy, storage, striping
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0360-0300/94/0600-0145 $03.50
`
`
CONTENTS

1. INTRODUCTION
2. BACKGROUND
   2.1 Disk Terminology
   2.2 Data Paths
   2.3 Technology Trends
3. DISK ARRAY BASICS
   3.1 Data Striping and Redundancy
   3.2 Basic RAID Organizations
   3.3 Performance and Cost Comparisons
   3.4 Reliability
   3.5 Implementation Considerations
4. ADVANCED TOPICS
   4.1 Improving Small Write Performance for RAID Level 5
   4.2 Declustered Parity
   4.3 Exploiting On-Line Spare Disks
   4.4 Data Striping in Disk Arrays
   4.5 Performance and Reliability Modeling
5. CASE STUDIES
   5.1 Thinking Machines Corporation ScaleArray
   5.2 StorageTek Iceberg 9200 Disk Array Subsystem
   5.3 NCR 6298
   5.4 TickerTAIP/DataMesh
   5.5 The RAID-II Storage Server
   5.6 IBM Hagar Disk Array Controller
6. OPPORTUNITIES FOR FUTURE RESEARCH
   6.1 Experience with Disk Arrays
   6.2 Interaction among New Organizations
   6.3 Scalability, Massively Parallel Computers, and Small Disks
   6.4 Latency
7. CONCLUSIONS
ACKNOWLEDGMENTS
ANNOTATED BIBLIOGRAPHY
1. INTRODUCTION

In recent years, interest in RAID, Redundant Arrays of Inexpensive Disks,¹ has grown explosively. The driving force behind this phenomenon is the sustained exponential improvements in the performance and density of semiconductor technology. Improvements in semiconductor technology make possible faster microprocessors and larger primary memory systems, which in turn require larger, higher-performance secondary storage systems.

¹ Because of the restrictiveness of "Inexpensive," RAID is sometimes said to stand for "Redundant Arrays of Independent Disks."
`
`
These improvements have both quantitative and qualitative consequences.

On the quantitative side, Amdahl's Law [Amdahl 1967] predicts that large improvements in microprocessors will result in only marginal improvements in overall system performance unless accompanied by corresponding improvements in secondary storage systems. Unfortunately, while RISC microprocessor performance has been improving 50% or more per year [Patterson and Hennessy 1994, p. 27], disk access times, which depend on improvements of mechanical systems, have been improving less than 10% per year. Disk transfer rates, which track improvements in both mechanical systems and magnetic-media densities, have improved at the faster rate of approximately 20% per year, but this is still far slower than the rate of processor improvement. Assuming that semiconductor and disk technologies continue their current trends, we must conclude that the performance gap between microprocessors and magnetic disks will continue to widen.
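The effect Amdahl's Law describes can be made concrete with a short numerical sketch in Python. The 90/10 split between processor-bound and storage-bound time below is an illustrative assumption, not a figure from the article.

# Amdahl's Law: overall speedup when a fraction f of the work is sped up
# by a factor s, while the remaining (1 - f) is unchanged.
def amdahl_speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

# Suppose 90% of an application's time is CPU work that gets faster and
# 10% is storage I/O that does not improve.
print(amdahl_speedup(f=0.9, s=10))    # ~5.3x, not 10x
print(amdahl_speedup(f=0.9, s=100))   # ~9.2x, capped near 1/0.1 = 10x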
In addition to the quantitative effect, a second, perhaps more important, qualitative effect is driving the need for higher-performance secondary storage systems. As microprocessors become faster, they make possible new applications and greatly expand the scope of existing applications. In particular, image-intensive applications such as video, hypertext, and multimedia are becoming common. Even in existing application areas such as computer-aided design and scientific computing, faster microprocessors make it possible to tackle new problems requiring faster access to larger data sets. This shift in applications, along with a trend toward large, shared, high-performance, network-based storage systems, is causing us to reevaluate the way we design and use secondary storage systems [Katz 1992].
Disk arrays, which organize multiple, independent disks into a large, high-performance logical disk, are a natural solution to the problem.
Disk arrays stripe data across multiple disks and access them in parallel to achieve both higher data transfer rates on large data accesses and higher I/O rates on small accesses [Salem and Garcia-Molina 1986; Livny et al. 1987]. Data striping also results in uniform load balancing across all of the disks, eliminating hot spots that otherwise saturate a small number of disks while the majority of disks sit idle.
Large disk arrays are highly vulnerable to disk failures, however. A disk array with 100 disks is 100 times more likely to fail than a single-disk array. An MTTF (mean-time-to-failure) of 200,000 hours, or approximately 23 years, for a single disk implies an MTTF of 2,000 hours, or approximately three months, for a disk array with 100 disks. The obvious solution is to employ redundancy in the form of error-correcting codes to tolerate disk failures. This allows a redundant disk array to avoid losing data for much longer than an unprotected single disk. However, redundancy has negative consequences. Since all write operations must update the redundant information, the performance of writes in redundant disk arrays can be significantly worse than the performance of writes in nonredundant disk arrays. Also, keeping the redundant information consistent in the face of concurrent I/O operations and system crashes can be difficult.
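A quick back-of-the-envelope check of the MTTF figures quoted above. The division by the number of disks follows from the assumption of independent failures; the function name is ours, introduced only for illustration.

# Nonredundant array MTTF under independent disk failures: any single
# failure loses data, so the array MTTF is the disk MTTF divided by N.
def array_mttf_hours(disk_mttf_hours: float, num_disks: int) -> float:
    return disk_mttf_hours / num_disks

DISK_MTTF = 200_000                      # hours, roughly 23 years
print(DISK_MTTF / (24 * 365))            # ~22.8 years for one disk
print(array_mttf_hours(DISK_MTTF, 100))  # 2,000 hours, roughly 3 months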
A number of different data-striping and redundancy schemes have been developed. The combinations and arrangements of these schemes lead to a bewildering set of options for users and designers of disk arrays. Each option presents subtle tradeoffs among reliability, performance, and cost that are difficult to evaluate without understanding the alternatives. To address this problem, this article presents a systematic tutorial and survey of disk arrays. We describe seven basic disk array organizations along with their advantages and disadvantages and compare their reliability, performance, and cost. We draw attention to the general principles governing the design and configuration of disk arrays, as well as practical issues that must be addressed in the implementation of disk arrays.
A later section describes optimizations and variations to the seven basic disk array organizations. Finally, we discuss existing research in the modeling of disk arrays and fruitful avenues for future research. This article should be of value to anyone interested in disk arrays, including students, researchers, designers, and users of disk arrays.
`
2. BACKGROUND

This section provides basic background material on disks, I/O data paths, and disk technology trends for readers who are unfamiliar with secondary storage systems.
`
2.1 Disk Terminology

Figure 1 illustrates the basic components of a simplified magnetic disk drive. A disk consists of a set of platters coated with a magnetic medium rotating at a constant angular velocity and a set of disk arms with magnetic read/write heads that are moved radially across the platters' surfaces by an actuator. Once the heads are correctly positioned, data is read and written in small arcs called sectors on the platters' surfaces as the platters rotate relative to the heads. Although all heads are moved collectively, in almost every disk drive only a single head can read or write data at any given time. A complete circular swath of data is referred to as a track, and each platter's surface consists of concentric rings of tracks. A vertical collection of tracks at the same radial position is logically referred to as a cylinder. Sectors are numbered so that a sequential scan of all sectors traverses the entire disk in the minimal possible time.
Given the simplified disk described above, disk service times can be broken into three primary components: seek time, rotational latency, and data transfer time. Seek time is the amount of time needed to move a head to the correct radial position and typically ranges from 1 to 30 milliseconds depending on the seek distance and the particular disk.
`
`
Figure 1. Disk terminology. Heads reside on arms which are positioned by actuators. Tracks are concentric rings on a platter. A sector is the basic unit of reads and writes. A cylinder is a stack of tracks at one actuator position. An HDA (head-disk assembly) is everything in the figure plus the air-tight casing. In some devices it is possible to transfer data from multiple surfaces simultaneously, but this is both rare and expensive. The collection of heads that participate in a single logical transfer that is spread over multiple surfaces is called a head group.
`
Rotational latency is the amount of time needed for the desired sector to rotate under the disk head. Full rotation times for disks currently vary from 8 to 28 milliseconds. The data transfer time is dependent on the rate at which data can be transferred to/from a platter's surface and is a function of the platter's rate of rotation, the density of the magnetic media, and the radial distance of the head from the center of the platter (some disks use a technique called zone-bit recording to store more data on the longer outside tracks than on the shorter inside tracks). Typical data transfer rates range from 1 to 5 MB per second. The seek time and rotational latency are sometimes collectively referred to as the head-positioning time. Table 1 tabulates the statistics for a typical high-end disk available in 1993.
The slow head-positioning time and fast data transfer rate of disks lead to very different performance for a sequence of accesses depending on the size and relative location of each access. Suppose we need to transfer 1 MB from the disk in Table 1, and the data is laid out in two ways: sequential within a single cylinder or randomly placed in 8 KB blocks.
`
`
Table 1. Specifications for the Seagate ST43401N Elite-3 SCSI Disk Drive.

Form Factor/Disk Diameter                    5.25 inch
Capacity                                     2.8 GB
Cylinders                                    2,627
Tracks Per Cylinder                          21
Sectors Per Track                            ~99
Bytes Per Sector                             512
Full Rotation Time                           11.1 ms
Minimum Seek (single cylinder)               1.7 ms
Average Seek (random cylinder to cylinder)   11.0 ms

Average seek in this table is calculated assuming a uniform distribution of accesses. This is the standard way manufacturers report average seek times. In reality, measurements of production systems show that spatial locality significantly lowers the effective average seek distance [Hennessy and Patterson 1990, p. 559].
`
In either case, the time for the actual data transfer of 1 MB is about 200 ms. But the time for positioning the head goes from about 16 ms in the sequential layout to about 2,000 ms in the random layout. This sensitivity to the workload is why I/O-intensive applications are categorized as high data rate, meaning minimal head positioning via large, sequential accesses, or high I/O rate, meaning lots of head positioning via small, more random accesses. For example, scientific programs that manipulate large arrays of data fall in the high data rate category, while transaction-processing programs fall in the high I/O rate category.
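To make the arithmetic in this example explicit, the following sketch estimates both layouts from the Table 1 parameters. The model (one average seek plus half a rotation per positioning, one full rotation to read a track) is a simplification we introduce for illustration, not the authors' exact calculation.

# Rough service-time model for the 1 MB example, using Table 1 parameters.
BYTES_PER_SECTOR = 512
SECTORS_PER_TRACK = 99
FULL_ROTATION_MS = 11.1
AVG_SEEK_MS = 11.0

TRACK_BYTES = BYTES_PER_SECTOR * SECTORS_PER_TRACK   # ~50 KB per track

def sequential_1mb():
    """1 MB laid out contiguously within a single cylinder."""
    tracks = -(-1_000_000 // TRACK_BYTES)             # ceiling division
    positioning = AVG_SEEK_MS + FULL_ROTATION_MS / 2  # one seek, half a rotation
    transfer = tracks * FULL_ROTATION_MS              # one rotation per track
    return positioning, transfer

def random_8kb():
    """1 MB scattered as 125 independent 8 KB blocks."""
    blocks = 1_000_000 // 8_000
    positioning = blocks * (AVG_SEEK_MS + FULL_ROTATION_MS / 2)
    transfer = blocks * (8_000 / TRACK_BYTES) * FULL_ROTATION_MS
    return positioning, transfer

for name, (pos, xfer) in [("sequential", sequential_1mb()),
                          ("random 8 KB", random_8kb())]:
    print(f"{name:12s} positioning ~{pos:5.0f} ms, transfer ~{xfer:4.0f} ms")

The output matches the figures above: tens of milliseconds of positioning for the sequential layout versus about 2,000 ms for the random layout, for roughly the same 200 ms of data transfer.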
`
2.2 Data Paths

A hierarchy of industry-standard interfaces has been defined for transferring data recorded on a disk platter's surface to or from a host computer. In this section we review the complete data path, from a disk to a user's application (Figure 2). We assume a read operation for the purposes of this discussion.
On the disk platter's surface, information is represented as reversals in the direction of stored magnetic fields. These "flux reversals" are sensed, amplified, and digitized into pulses by the lowest-level read electronics. The protocol ST506/412 is one standard that defines an interface to disk systems at this lowest, most inflexible, and technology-dependent level. Above this level of the read-electronics path, pulses are decoded to separate data bits from timing-related flux reversals. The bit-level ESDI (Enhanced Small Device Interface) and SMD (Storage Module Device) standards define an interface at this more flexible, encoding-independent level. Then, to be transformed into the highest, most flexible, packet level, these bits are aligned into bytes, error-correcting codes are applied, and the extracted data is delivered to the host as data blocks over a peripheral bus interface such as SCSI (Small Computer System Interface) or IPI-3 (the third level of the Intelligent Peripheral Interface). These steps are performed today by intelligent on-disk controllers, which often include speed matching and caching "track buffers." SCSI and IPI-3 also include a level of data mapping: the computer specifies a logical block number, and the controller embedded on the disk maps that block number to a physical cylinder, track, and sector.
`
`
This mapping allows the embedded disk controller to avoid bad areas of the disk by remapping logical blocks that are affected to new areas of the disk.
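This logical-to-physical mapping can be pictured with a deliberately idealized calculation. The sketch below assumes a fixed geometry with no zone-bit recording and no remapped defects, both of which real embedded controllers handle, so it is an illustration rather than how any particular drive works.

# Idealized mapping from a logical block number to (cylinder, track, sector).
SECTORS_PER_TRACK = 99      # approximate Table 1 geometry
TRACKS_PER_CYLINDER = 21

def logical_to_physical(block: int):
    blocks_per_cylinder = SECTORS_PER_TRACK * TRACKS_PER_CYLINDER
    cylinder = block // blocks_per_cylinder
    track = (block % blocks_per_cylinder) // SECTORS_PER_TRACK
    sector = block % SECTORS_PER_TRACK
    return cylinder, track, sector

print(logical_to_physical(123456))   # (59, 8, 3)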
Topologies and devices on the data path between disk and host computer vary widely depending on the size and type of I/O system. Mainframes have the richest I/O systems, with many devices and complex interconnection schemes to access them. An IBM channel path, which encompasses the set of cables and associated electronics that transfer data and control information between an I/O device and main memory, consists of a channel, a storage director, and a head of string. The collection of disks that share the same pathway to the head of string is called a string. In the workstation/file server world, the channel processor is usually called an I/O controller or host-bus adaptor (HBA), and the functionality of the storage director and head of string is contained in an embedded controller on the disk drive. As in the mainframe world, the use of high-level peripheral interfaces such as SCSI and IPI-3 allows multiple disks to share a single peripheral bus or string.

From the HBA, data is transferred via direct memory access, over a system bus such as VME (Versa Module Eurocard), S-Bus, MicroChannel, EISA (Extended Industry Standard Architecture), or PCI (Peripheral Component Interconnect), to the host operating system's buffers. Then, in most operating systems, the CPU performs a memory-to-memory copy over a high-speed memory bus from the operating system buffers to buffers in the application's address space.
`
2.3 Technology Trends

Much of the motivation for disk arrays comes from the current trends in disk technology. As Table 2 shows, magnetic disk drives have been improving rapidly by some metrics and hardly at all by other metrics.
`
`
Figure 2. Host-to-device pathways. Data that is read from a magnetic disk must pass through many layers on its way to the requesting processor. Each dashed line marks a standard interface. Lower interfaces such as ST506 deal more closely with the raw magnetic fields and are highly technology dependent. Higher layers such as SCSI deal with packets or blocks of data and are more technology independent. A string connects multiple disks to a single I/O controller; control of the string is distributed between the I/O and disk controllers.
`
Smaller distances between the magnetic read/write head and the disk surface, more accurate positioning electronics, and more advanced magnetic media have dramatically increased the recording density on the disks. This increased density has improved disks in two ways. First, it has allowed disk capacities to stay constant or increase, even while disk sizes have decreased from 5.25 inches in 1983 to 1.3 inches in 1993. Second, the increased density, along with an increase in the rotational speed of the disk, has made possible a substantial increase in the transfer rate of disk drives. Seek times, on the other hand, have improved very little, decreasing from approximately 20 ms in 1980 to 10 ms today. Rotational speeds have increased at a similarly slow rate, from 3,600 revolutions per minute in 1980 to 5,400-7,200 today.
`
3. DISK ARRAY BASICS

This section examines basic issues in the design and implementation of disk arrays.
`
`
`
`
Table 2. Trends in Disk Technology.

                              1993                        Historical Rate of Improvement
Areal Density                 50-150 Mbits/sq. inch       27% per year
Linear Density                40,000-60,000 bits/inch     13% per year
Inter-Track Density
Capacity (3.5" form factor)
Transfer Rate
Seek Time

Magnetic disks are improving rapidly in density and capacity, but more slowly in performance. Areal density is the recording density per square inch of magnetic media. In 1989, IBM demonstrated a 1 Gbit/sq.-inch recording density in a laboratory environment. Linear density is the number of bits written along a track. Inter-track density refers to the number of concentric tracks on a single platter.
`
In particular, we examine the concepts of data striping and redundancy; basic RAID organizations; performance and cost comparisons between the basic RAID organizations; the reliability of RAID-based systems in the face of system crashes, uncorrectable bit-errors, and correlated disk failures; and, finally, issues in the implementation of block-interleaved, redundant disk arrays.
`
3.1 Data Striping and Redundancy

Redundant disk arrays employ two orthogonal concepts: data striping for improved performance and redundancy for improved reliability. Data striping transparently distributes data over multiple disks to make them appear as a single fast, large disk. Striping improves aggregate I/O performance by allowing multiple I/Os to be serviced in parallel. There are two aspects to this parallelism. First, multiple independent requests can be serviced in parallel by separate disks. This decreases the queuing time seen by I/O requests. Second, single multiple-block requests can be serviced by multiple disks acting in coordination. This increases the effective transfer rate seen by a single request.
The more disks in the disk array, the larger the potential performance benefits. Unfortunately, a large number of disks lowers the overall reliability of the disk array, as mentioned before. Assuming independent failures, 100 disks collectively have only 1/100th the reliability of a single disk. Thus, redundancy is necessary to tolerate disk failures and allow continuous operation without data loss.
We will see that the majority of redundant disk array organizations can be distinguished based on two features: (1) the granularity of data interleaving and (2) the method and pattern in which the redundant information is computed and distributed across the disk array. Data interleaving can be characterized as either fine grained or coarse grained. Fine-grained disk arrays conceptually interleave data in relatively small units, so that all I/O requests, regardless of their size, access all of the disks in the disk array. This results in very high data transfer rates for all I/O requests but has the disadvantages that (1) only one logical I/O request can be in service at any given time and (2) all disks must waste time positioning for every request.
`
`
Coarse-grained disk arrays interleave data in relatively large units, so that small I/O requests need access only a small number of disks while large requests can access all the disks in the disk array. This allows multiple small requests to be serviced simultaneously while still allowing large requests to see the higher transfer rates afforded by using multiple disks.
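As an illustration of coarse-grained (block-level) striping, the sketch below maps a logical block to a disk and an offset on that disk using a round-robin layout. The layout, unit size, and names are illustrative assumptions, not a prescription from the article.

# Block-level striping: contiguous "striping units" are placed on the
# disks round-robin, so small requests touch one disk and large requests
# spread across all of them.
NUM_DISKS = 4
UNIT_BLOCKS = 8                      # blocks per striping unit

def locate(logical_block: int):
    unit = logical_block // UNIT_BLOCKS
    offset_in_unit = logical_block % UNIT_BLOCKS
    disk = unit % NUM_DISKS                       # round-robin placement
    unit_on_disk = unit // NUM_DISKS
    return disk, unit_on_disk * UNIT_BLOCKS + offset_in_unit

print([locate(b) for b in range(4)])        # one disk, consecutive offsets
print([locate(b) for b in range(0, 32, 8)]) # one unit from each of the 4 disks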
The incorporation of redundancy in disk arrays brings up two somewhat orthogonal problems. The first problem is to select the method for computing the redundant information. Most redundant disk arrays today use parity, though some use Hamming or Reed-Solomon codes. The second problem is that of selecting a method for distributing the redundant information across the disk array. Although there are an unlimited number of patterns in which redundant information can be distributed, we classify these patterns roughly into two different distribution schemes: those that concentrate redundant information on a small number of disks and those that distribute redundant information uniformly across all of the disks. Schemes that uniformly distribute redundant information are generally more desirable because they avoid hot spots and other load-balancing problems suffered by schemes that do not distribute redundant information uniformly. Although the basic concepts of data striping and redundancy are conceptually simple, selecting between the many possible data striping and redundancy schemes involves complex tradeoffs among reliability, performance, and cost.
`
3.2 Basic RAID Organizations

This section describes the basic RAID organizations that will be used as the basis for further examinations of the performance, cost, and reliability of disk arrays. In addition to presenting RAID levels 1 through 5, which first appeared in the landmark paper by Patterson, Gibson, and Katz [Patterson et al. 1988], we present two other RAID organizations, levels 0 and 6, that have since become generally accepted.²
`
`
For the benefit of those unfamiliar with the original numerical classification of RAID, we will use English phrases in preference to the numerical classifications. It should come as no surprise to the reader that even the original authors have sometimes been confused with regard to the disk array organization referred to by a particular RAID level! Figure 3 illustrates the seven RAID organizations schematically.
`
3.2.1 Nonredundant (RAID Level 0)

A nonredundant disk array, or RAID level 0, has the lowest cost of any RAID organization because it does not employ redundancy at all. This scheme offers the best write performance since it never needs to update redundant information. Surprisingly, it does not have the best read performance. Redundancy schemes that duplicate data, such as mirroring, can perform better on reads by selectively scheduling requests on the disk with the shortest expected seek and rotational delays [Bitton and Gray 1988]. Without redundancy, any single disk failure will result in data loss. Nonredundant disk arrays are widely used in supercomputing environments where performance and capacity, rather than reliability, are the primary concerns.
`
3.2.2 Mirrored (RAID Level 1)

The traditional solution, called mirroring or shadowing, uses twice as many disks as a nonredundant disk array [Bitton and Gray 1988]. Whenever data is written to a disk, the same data is also written to a redundant disk, so that there are always two copies of the information. When data is read, it can be retrieved from the disk with the shorter queuing, seek, and rotational delays [Chen et al. 1990]. If a disk fails, the other copy is used to service requests. Mirroring is frequently used in database applications where availability and transaction rate are more important than storage efficiency [Gray et al. 1990].
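A rough sketch of how a mirrored pair might be driven, to make the read-scheduling advantage concrete. The seek-distance cost estimate and the class name are our own simplifications, not the policy of any particular product.

# Mirrored (RAID level 1) request handling: writes update both copies,
# reads go to whichever copy is expected to position faster.
import random

class MirroredPair:
    def __init__(self):
        self.head_pos = [0, 0]                 # current cylinder of each copy

    def write(self, cylinder, data):
        for disk in (0, 1):                    # both copies must be updated
            self.head_pos[disk] = cylinder
            # ... issue the physical write of `data` to this disk ...

    def read(self, cylinder):
        # Crude positioning estimate: seek distance plus a random rotation term.
        def cost(disk):
            return abs(self.head_pos[disk] - cylinder) + random.uniform(0, 1)
        disk = min((0, 1), key=cost)
        self.head_pos[disk] = cylinder
        return disk                            # the copy chosen to service the read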
`
² Strictly speaking, RAID level 0 is not a type of redundant array of inexpensive disks since it stores no error-correcting codes.
`
`
Figure 3. RAID levels 0 through 6: nonredundant, mirrored, memory-style ECC, bit-interleaved parity, block-interleaved parity, block-interleaved distributed-parity, and P+Q redundancy. All RAID levels are illustrated at a user capacity of four disks. Disks with multiple platters indicate block-level striping, while disks without multiple platters indicate bit-level striping. The shaded platters represent redundant information.
`
`
3.2.3 Memory-Style ECC (RAID Level 2)

Memory systems have provided recovery from failed components with much less cost than mirroring by using Hamming codes [Peterson and Weldon 1972]. Hamming codes contain parity for distinct overlapping subsets of components. In one version of this scheme, four data disks require three redundant disks, one less than mirroring.
`
`
Since the number of redundant disks is proportional to the log of the total number of disks in the system, storage efficiency increases as the number of data disks increases.

If a single component fails, several of the parity components will have inconsistent values, and the failed component is the one held in common by each incorrect subset. The lost information is recovered by reading the other components in a subset, including the parity component, and setting the missing bit to 0 or 1 to create the proper parity value for that subset. Thus, multiple redundant disks are needed to identify the failed disk, but only one is needed to recover the lost information.

Readers unfamiliar with parity can think of the redundant disk as having the sum of all the data on the other disks. When a disk fails, you can subtract all the data on the good disks from the parity disk; the remaining information must be the missing information. Parity is simply this sum modulo two.
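A tiny sketch of the sum-modulo-two idea: parity is the bitwise XOR of the data blocks, and a lost block is recovered by XORing the survivors with the parity. The byte values are arbitrary.

# Parity as sum modulo two: XOR of all data blocks. Recovering a failed
# block is XORing the surviving blocks with the parity block.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]  # three data disks
parity = xor_blocks(data)                                       # redundant disk

failed = 1
survivors = [blk for i, blk in enumerate(data) if i != failed] + [parity]
assert xor_blocks(survivors) == data[failed]                    # lost data recovered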
`
`
3.2.4 Bit-Interleaved Parity (RAID Level 3)

One can improve upon memory-style ECC disk arrays by noting that, unlike memory component failures, disk controllers can easily identify which disk has failed. Thus, one can use a single parity disk rather than a set of parity disks to recover lost information.

In a bit-interleaved parity disk array, data is conceptually interleaved bit-wise over the data disks, and a single parity disk is added to tolerate any single disk failure. Each read request accesses all data disks, and each write request accesses all data disks and the parity disk. Thus, only one request can be serviced at a time. Because the parity disk contains only parity and no data, the parity disk cannot participate on reads, resulting in slightly lower read performance than for redundancy schemes that distribute the parity and data over all disks. Bit-interleaved parity disk arrays are frequently used in applications that require high bandwidth but not high I/O rates. They are also simpler to implement than RAID levels 4, 5, and 6.
`
`
3.2.5 Block-Interleaved Parity (RAID Level 4)

The block-interleaved parity disk array is similar to the bit-interleaved parity disk array except that data is interleaved across disks in blocks of arbitrary size rather than in bits. The size of these blocks is called the striping unit [Chen and Patterson 1990]. Read requests smaller than the striping unit access only a single data disk. Write requests must update the requested data blocks and must also compute and update the parity block. For large writes that touch blocks on all disks, parity is easily computed by exclusive-oring the new data for each disk. For small write requests that update only one data disk, parity is computed by noting how the new data differs from the old data and applying those differences to the parity block. Small write requests thus require four disk I/Os: one to write the new data, two to read the old data and old parity for computing the new parity, and one to write the new parity. This is referred to as a read-modify-write procedure. Because a block-interleaved parity disk array has only one parity disk, which must be updated on all write operations, the parity disk can easily become a bottleneck. Because of this limitation, the block-interleaved distributed-parity disk array is universally preferred over the block-interleaved parity disk array.
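The small-write read-modify-write procedure can be written down directly. In this sketch, read_block and write_block are placeholders for whatever physical I/O layer sits underneath; the four calls correspond to the four disk I/Os described above.

# RAID level 4/5 small write: new_parity = old_parity XOR old_data XOR new_data.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data_disk, parity_disk, block_no, new_data, read_block, write_block):
    old_data = read_block(data_disk, block_no)        # I/O 1: read old data
    old_parity = read_block(parity_disk, block_no)    # I/O 2: read old parity
    new_parity = xor(xor(old_parity, old_data), new_data)
    write_block(data_disk, block_no, new_data)        # I/O 3: write new data
    write_block(parity_disk, block_no, new_parity)    # I/O 4: write new parity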
`
`
3.2.6 Block-Interleaved Distributed-Parity (RAID Level 5)

The block-interleaved distributed-parity disk array eliminates the parity disk bottleneck present in the block-interleaved parity disk array by distributing the parity uniformly over all of the disks. An additional, frequently overlooked advantage to distributing the parity is that it also distributes data over all of the disks rather than over all but one. This allows all disks to participate in servicing read operations, in contrast to redundancy schemes with dedicated parity disks, in which the parity disk cannot participate in servicing read requests. Block-interleaved distributed-parity disk arrays have the best small read, large read, and large write performance of any redundant disk array. Small write requests are somewhat inefficient compared with redundancy schemes such as mirroring, however, due to the need to perform read-modify-write operations to update parity. This is the major performance weakness of RAID level-5 disk arrays and has been the subject of intensive research [Menon et al. 1993; Stodolsky and Gibson 1993].

The exact method used to distribute parity in block-interleaved distributed-parity disk arrays can affect performance. Figure 4 illustrates the best parity distribution of those investigated in [Lee and Katz 1991b], called the left-symmetric parity distribution. A useful property of the left-symme