RAID: High-Performance, Reliable Secondary Storage

PETER M. CHEN
Department of Electrical Engineering and Computer Science, University of Michigan, 1301 Beal Avenue, Ann Arbor, Michigan 48109-2122

EDWARD K. LEE
DEC Systems Research Center, 130 Lytton Avenue, Palo Alto, California 94301-1044

GARTH A. GIBSON
School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania 15213-3891

RANDY H. KATZ
Department of Electrical Engineering and Computer Science, University of California, Berkeley, 571 Evans Hall, Berkeley, California 94720

DAVID A. PATTERSON
Department of Electrical Engineering and Computer Science, University of California, Berkeley, 571 Evans Hall, Berkeley, California 94720
Disk arrays were proposed in the 1980s as a way to use parallelism between multiple disks to improve aggregate I/O performance. Today they appear in the product lines of most major computer manufacturers. This article gives a comprehensive overview of disk arrays and provides a framework in which to organize current and future work. First, the article introduces disk technology and reviews the driving forces that have popularized disk arrays: performance and reliability. It discusses the two architectural techniques used in disk arrays: striping across multiple disks to improve performance and redundancy to improve reliability. Next, the article describes seven disk array architectures, called RAID (Redundant Arrays of Inexpensive Disks) levels 0-6, and compares their performance, cost, and reliability. It goes on to discuss advanced research and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency. Last, the article describes six disk array prototypes or products and discusses future opportunities for research, with an annotated bibliography of disk-array-related literature.
Categories and Subject Descriptors: B.4.2 [Input/Output and Data Communications]: Input/Output Devices; B.4.5 [Input/Output and Data Communications]: Reliability, Testing, and Fault-Tolerance; D.4.2 [Operating Systems]: Storage Management; E.4 [Data]: Coding and Information Theory

General Terms: Design, Performance, Reliability

Additional Key Words and Phrases: Disk array, parallel I/O, RAID, redundancy, storage, striping
CONTENTS

1. INTRODUCTION
2. BACKGROUND
   2.1 Disk Terminology
   2.2 Data Paths
   2.3 Technology Trends
3. DISK ARRAY BASICS
   3.1 Data Striping and Redundancy
   3.2 Basic RAID Organizations
   3.3 Performance and Cost Comparisons
   3.4 Reliability
   3.5 Implementation Considerations
4. ADVANCED TOPICS
   4.1 Improving Small Write Performance for RAID Level 5
   4.2 Declustered Parity
   4.3 Exploiting On-Line Spare Disks
   4.4 Data Striping in Disk Arrays
   4.5 Performance and Reliability Modeling
5. CASE STUDIES
   5.1 Thinking Machines Corporation ScaleArray
   5.2 StorageTek Iceberg 9200 Disk Array Subsystem
   5.3 NCR 6298
   5.4 TickerTAIP/DataMesh
   5.5 The RAID-II Storage Server
   5.6 IBM Hagar Disk Array Controller
6. OPPORTUNITIES FOR FUTURE RESEARCH
   6.1 Experience with Disk Arrays
   6.2 Interaction among New Organizations
   6.3 Scalability, Massively Parallel Computers, and Small Disks
   6.4 Latency
7. CONCLUSIONS
ACKNOWLEDGMENTS
ANNOTATED BIBLIOGRAPHY
1. INTRODUCTION

In recent years, interest in RAID, Redundant Arrays of Inexpensive Disks,¹ has grown explosively. The driving force behind this phenomenon is the sustained exponential improvements in the performance and density of semiconductor technology. Improvements in semiconductor technology make possible faster microprocessors and larger primary memory systems, which in turn require larger, higher-performance secondary storage systems. These improvements have both quantitative and qualitative consequences.

¹ Because of the restrictiveness of "Inexpensive," RAID is sometimes said to stand for "Redundant Arrays of Independent Disks."
On the quantitative side, Amdahl's Law [Amdahl 1967] predicts that large improvements in microprocessors will result in only marginal improvements in overall system performance unless accompanied by corresponding improvements in secondary storage systems. Unfortunately, while RISC microprocessor performance has been improving 50% or more per year [Patterson and Hennessy 1994, p. 27], disk access times, which depend on improvements of mechanical systems, have been improving less than 10% per year. Disk transfer rates, which track improvements in both mechanical systems and magnetic-media densities, have improved at the faster rate of approximately 20% per year, but this is still far slower than the rate of processor improvement. Assuming that semiconductor and disk technologies continue their current trends, we must conclude that the performance gap between microprocessors and magnetic disks will continue to widen.
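To make the quantitative argument concrete, Amdahl's Law can be evaluated directly. The sketch below is illustrative only; the 90%/10% split between CPU-bound work and storage-bound work is an assumed example, not a figure from this article.

    # Amdahl's Law: overall speedup when a fraction f of the running time
    # is sped up by a factor s, and the remaining (1 - f) is unchanged.
    def amdahl_speedup(f, s):
        return 1.0 / ((1.0 - f) + f / s)

    # Assumed example: 90% of time is CPU-bound, 10% waits on secondary storage.
    print(amdahl_speedup(0.9, 10))    # ~5.3x overall from a 10x faster CPU
    print(amdahl_speedup(0.9, 1e9))   # ~10x ceiling while storage is unchanged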
In addition to the quantitative effect, a second, perhaps more important, qualitative effect is driving the need for higher-performance secondary storage systems. As microprocessors become faster, they make possible new applications and greatly expand the scope of existing applications. In particular, image-intensive applications such as video, hypertext, and multimedia are becoming common. Even in existing application areas such as computer-aided design and scientific computing, faster microprocessors make it possible to tackle new problems requiring faster access to larger data sets. This shift in applications, along with a trend toward large, shared, high-performance, network-based storage systems, is causing us to reevaluate the way we design and use secondary storage systems [Katz 1992].
Disk arrays, which organize multiple, independent disks into a large, high-performance logical disk, are a natural solution to the problem. Disk arrays stripe data across multiple disks and access them in parallel to achieve both higher data transfer rates on large data accesses and higher I/O rates on small data accesses [Salem and Garcia-Molina 1986; Livny et al. 1987]. Data striping also results in uniform load balancing across all of the disks, eliminating hot spots that otherwise saturate a small number of disks while the majority of disks sit idle.
Large disk arrays are highly vulnerable to disk failures, however. A disk array with 100 disks is 100 times more likely to fail than a single-disk array. An MTTF (mean-time-to-failure) of 200,000 hours, or approximately 23 years, for a single disk implies an MTTF of 2,000 hours, or approximately three months, for a disk array with 100 disks. The obvious solution is to employ redundancy in the form of error-correcting codes to tolerate disk failures. This allows a redundant disk array to avoid losing data for much longer than an unprotected single disk. However, redundancy has negative consequences. Since all write operations must update the redundant information, the performance of writes in redundant disk arrays can be significantly worse than the performance of writes in nonredundant disk arrays. Also, keeping the redundant information consistent in the face of concurrent I/O operations and system crashes can be difficult.
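The MTTF arithmetic above can be restated as a short calculation. This is a minimal sketch assuming independent disk failures, which is the same assumption the text makes.

    HOURS_PER_YEAR = 24 * 365

    def array_mttf(disk_mttf_hours, num_disks):
        # With independent failures, the expected time to the first failure
        # among N identical disks is the single-disk MTTF divided by N.
        return disk_mttf_hours / num_disks

    single_disk_mttf = 200_000                       # hours, about 23 years
    print(single_disk_mttf / HOURS_PER_YEAR)         # ~22.8 years
    print(array_mttf(single_disk_mttf, 100))         # 2000.0 hours
    print(array_mttf(single_disk_mttf, 100) / 720)   # ~2.8 months (30-day months)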
A number of different data-striping and redundancy schemes have been developed. The combinations and arrangements of these schemes lead to a bewildering set of options for users and designers of disk arrays. Each option presents subtle tradeoffs among reliability, performance, and cost that are difficult to evaluate without understanding the alternatives.

To address this problem, this article presents a systematic tutorial and survey of disk arrays. We describe seven basic disk array organizations along with their advantages and disadvantages and compare their reliability, performance, and cost. We draw attention to the general principles governing the design and configuration of disk arrays as well as practical issues that must be addressed in the implementation of disk arrays. A later section describes optimizations and variations to the seven basic disk array organizations. Finally, we discuss existing research in the modeling of disk arrays and fruitful avenues for future research. This article should be of value to anyone interested in disk arrays, including students, researchers, designers, and users of disk arrays.
2. BACKGROUND

This section provides basic background material on disks, I/O data paths, and disk technology trends for readers who are unfamiliar with secondary storage systems.
2.1 Disk Terminology

Figure 1 illustrates the basic components of a simplified magnetic disk drive. A disk consists of a set of platters coated with a magnetic medium rotating at a constant angular velocity and a set of disk arms with magnetic read/write heads that are moved radially across the platters' surfaces by an actuator. Once the heads are correctly positioned, data is read and written in small arcs called sectors on the platters' surfaces as the platters rotate relative to the heads. Although all heads are moved collectively, in almost every disk drive, only a single head can read or write data at any given time. A complete circular swath of data is referred to as a track, and each platter's surface consists of concentric rings of tracks. A vertical collection of tracks at the same radial position is logically referred to as a cylinder. Sectors are numbered so that a sequential scan of all sectors traverses the entire disk in the minimal possible time.

Figure 1. Disk terminology. Heads reside on arms which are positioned by actuators. Tracks are concentric rings on a platter. A sector is the basic unit of reads and writes. A cylinder is a stack of tracks at one actuator position. An HDA (head-disk assembly) is everything in the figure plus the air-tight casing. In some devices it is possible to transfer data from multiple surfaces simultaneously, but this is both rare and expensive. The collection of heads that participate in a single logical transfer that is spread over multiple surfaces is called a head group.

Given the simplified disk described above, disk service times can be broken into three primary components: seek time, rotational latency, and data transfer time. Seek time is the amount of time needed to move a head to the correct radial position and typically ranges from 1 to 30 milliseconds depending on the seek distance and the particular disk. Rotational latency is the amount of time needed for the desired sector to rotate under the disk head. Full rotation times for disks currently vary from 8 to 28 milliseconds. The data transfer time is dependent on the rate at which data can be transferred to/from a platter's surface and is a function of the platter's rate of rotation, the density of the magnetic media, and the radial distance of the head from the center of the platter (some disks use a technique called zone-bit-recording to store more data on the longer outside tracks than the shorter inside tracks). Typical data transfer rates range from 1 to 5 MB per second. The seek time and rotational latency are sometimes collectively referred to as the head-positioning time. Table 1 tabulates the statistics for a typical high-end disk available in 1993.

Table 1. Specifications for the Seagate ST43401N Elite-3 SCSI Disk Drive.

    Form Factor/Disk Diameter                   5.25 inch
    Capacity                                    2.8 GB
    Cylinders                                   2627
    Tracks Per Cylinder                         21
    Sectors Per Track                           ~99
    Bytes Per Sector                            512
    Full Rotation Time                          11.1 ms
    Minimum Seek (single cylinder)              1.7 ms
    Average Seek (random cylinder to cylinder)  11.0 ms

Average seek in this table is calculated assuming a uniform distribution of accesses. This is the standard way manufacturers report average seek times. In reality, measurements of production systems show that spatial locality significantly lowers the effective average seek distance [Hennessy and Patterson 1990, p. 559].

The slow head-positioning time and fast data transfer rate of disks lead to very different performance for a sequence of accesses depending on the size and relative location of each access. Suppose we need to transfer 1 MB from the disk in Table 1, and the data is laid out in two ways: sequential within a single cylinder or randomly placed in 8 KB blocks. In either case the time for the actual data transfer of 1 MB is about 200 ms. But the time for positioning the head goes from about 16 ms in the sequential layout to about 2000 ms in the random layout. This sensitivity to the workload is why I/O-intensive applications are categorized as high data rate, meaning minimal head positioning via large, sequential accesses, or high I/O rate, meaning lots of head positioning via small, more random accesses. For example, scientific programs that manipulate large arrays of data fall in the high data rate category, while transaction-processing programs fall in the high I/O rate category.
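The 1 MB example can be reproduced from the Table 1 figures. The sketch below is an approximation: it derives a sustained transfer rate from the sector, track, and rotation numbers, and charges one average seek plus half a rotation per head positioning, which are simplifying assumptions rather than values from the drive's data sheet.

    # Figures from Table 1 (Seagate ST43401N Elite-3).
    avg_seek_ms = 11.0
    full_rotation_ms = 11.1
    sectors_per_track = 99
    bytes_per_sector = 512

    # Approximate sustained media rate: one full track per rotation.
    bytes_per_ms = sectors_per_track * bytes_per_sector / full_rotation_ms
    transfer_1mb_ms = 2**20 / bytes_per_ms               # pure transfer time for 1 MB

    positioning_ms = avg_seek_ms + full_rotation_ms / 2  # seek + average rotational latency
    sequential_ms = positioning_ms                       # one positioning for the whole 1 MB
    random_ms = (2**20 // 8192) * positioning_ms         # 128 positionings for 8 KB blocks

    print(round(transfer_1mb_ms), round(sequential_ms), round(random_ms))
    # About 230 ms, 17 ms, and 2100 ms: close to the ~200 ms, ~16 ms, and ~2000 ms quoted above.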
2.2 Data Paths

A hierarchy of industry standard interfaces has been defined for transferring data recorded on a disk platter's surface to or from a host computer. In this section we review the complete data path, from a disk to a user's application (Figure 2). We assume a read operation for the purposes of this discussion.

On the disk platter's surface, information is represented as reversals in the direction of stored magnetic fields. These "flux reversals" are sensed, amplified, and digitized into pulses by the lowest-level read electronics. The ST506/412 protocol is one standard that defines an interface to disk systems at this lowest, most inflexible, and technology-dependent level. Above this level of the read electronics path, pulses are decoded to separate data bits from timing-related flux reversals. The bit-level ESDI (Enhanced Small Device Interface) and SMD (Storage Module Device) standards define an interface at this more flexible, encoding-independent level. Then, to be transformed into the highest, most flexible packet level, these bits are aligned into bytes, error-correcting codes are applied, and the extracted data is delivered to the host as data blocks over a peripheral bus interface such as SCSI (Small Computer System Interface) or IPI-3 (the third level of the Intelligent Peripheral Interface). These steps are performed today by intelligent on-disk controllers, which often include speed matching and caching "track buffers." SCSI and IPI-3 also include a level of data mapping: the computer specifies a logical block number, and the controller embedded on the disk maps that block number to a physical cylinder, track, and sector. This mapping allows the embedded disk controller to avoid bad areas of the disk by remapping logical blocks that are affected to new areas of the disk.
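The logical-to-physical mapping mentioned above can be sketched as follows. This is a deliberately simplified illustration: it assumes a fixed geometry with a uniform number of sectors per track and ignores zone-bit recording and bad-block remapping, both of which real embedded controllers handle.

    def lba_to_chs(lba, tracks_per_cylinder, sectors_per_track):
        """Map a logical block number to (cylinder, track, sector)."""
        sectors_per_cylinder = tracks_per_cylinder * sectors_per_track
        cylinder = lba // sectors_per_cylinder
        track = (lba % sectors_per_cylinder) // sectors_per_track
        sector = lba % sectors_per_track
        return cylinder, track, sector

    # Example with a Table 1-like geometry: 21 tracks per cylinder, 99 sectors per track.
    print(lba_to_chs(123456, tracks_per_cylinder=21, sectors_per_track=99))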
Topologies and devices on the data path between disk and host computer vary widely depending on the size and type of I/O system. Mainframes have the richest I/O systems, with many devices and complex interconnection schemes to access them. An IBM channel path, which encompasses the set of cables and associated electronics that transfer data and control information between an I/O device and main memory, consists of a channel, a storage director, and a head of string. The collection of disks that share the same pathway to the head of string is called a string. In the workstation/file server world, the channel processor is usually called an I/O controller or host-bus adaptor (HBA), and the functionality of the storage director and head of string is contained in an embedded controller on the disk drive. As in the mainframe world, the use of high-level peripheral interfaces such as SCSI and IPI-3 allows multiple disks to share a single peripheral bus or string.
From the HBA, data is transferred via direct memory access, over a system bus such as VME (Versa Module Eurocard), S-Bus, MicroChannel, EISA (Extended Industry Standard Architecture), or PCI (Peripheral Component Interconnect), to the host operating system's buffers. Then, in most operating systems, the CPU performs a memory-to-memory copy over a high-speed memory bus from the operating system buffers to buffers in the application's address space.

Figure 2. Host-to-device pathways. Data that is read from a magnetic disk must pass through many layers on its way to the requesting processor. Each dashed line marks a standard interface. Lower interfaces such as ST506 deal more closely with the raw magnetic fields and are highly technology dependent. Higher layers such as SCSI deal with packets or blocks of data and are more technology independent. A string connects multiple disks to a single I/O controller; control of the string is distributed between the I/O and disk controllers.
2.3 Technology Trends

Much of the motivation for disk arrays comes from the current trends in disk technology. As Table 2 shows, magnetic disk drives have been improving rapidly by some metrics and hardly at all by other metrics. Smaller distances between the magnetic read/write head and the disk surface, more accurate positioning electronics, and more advanced magnetic media have increased the recording density on the disks dramatically. This increased density has improved disks in two ways. First, it has allowed disk capacities to stay constant or increase, even while disk sizes have decreased from 5.25 inches in 1983 to 1.3 inches in 1993. Second, the increased density, along with an increase in the rotational speed of the disk, has made possible a substantial increase in the transfer rate of disk drives.

Table 2. Trends in Disk Technology. Magnetic disks are improving rapidly in density and capacity, but more slowly in performance. Areal density is the recording density per square inch of magnetic media. In 1989, IBM demonstrated a 1 Gbit/sq.-inch recording density in a laboratory environment. Linear density is the number of bits written along a track. Inter-track density refers to the number of concentric tracks on a single platter.

                                   1993                       Historical Rate of Improvement
    Areal Density                  50-150 Mbits/sq. inch      27% per year
    Linear Density                 40,000-60,000 bits/inch    13% per year
    Inter-Track Density
    Capacity (3.5" form factor)
    Transfer Rate
    Seek Time

On the other hand, seek times have improved very little, only decreasing from approximately 20 ms in 1980 to 10 ms today. Rotational speeds have increased at a similarly slow rate from 3600 revolutions per minute in 1980 to 5400-7200 today.
3. DISK ARRAY BASICS

This section examines the basic design and implementation issues in disk arrays. In particular, we examine the concepts of data striping and redundancy; basic RAID organizations; performance and cost comparisons between the basic RAID organizations; reliability of RAID-based systems in the face of system crashes, uncorrectable bit-errors, and correlated disk failures; and finally, issues in the implementation of block-interleaved, redundant disk arrays.
3.1 Data Striping and Redundancy

Redundant disk arrays employ two orthogonal concepts: data striping for improved performance and redundancy for improved reliability. Data striping transparently distributes data over multiple disks to make them appear as a single fast, large disk. Striping improves aggregate I/O performance by allowing multiple I/Os to be serviced in parallel. There are two aspects to this parallelism. First, multiple independent requests can be serviced in parallel by separate disks. This decreases the queuing time seen by I/O requests. Second, single multiple-block requests can be serviced by multiple disks acting in coordination. This increases the effective transfer rate seen by a single request. The more disks in the disk array, the larger the potential performance benefits. Unfortunately, a large number of disks lowers the overall reliability of the disk array, as mentioned before. Assuming independent failures, 100 disks collectively have only 1/100th the reliability of a single disk. Thus, redundancy is necessary to tolerate disk failures and allow continuous operation without data loss.

We will see that the majority of redundant disk array organizations can be distinguished based on two features: (1) the granularity of data interleaving and (2) the method and pattern in which the redundant information is computed and distributed across the disk array. Data interleaving can be characterized as either fine grained or coarse grained. Fine-grained disk arrays conceptually interleave data in relatively small units so that all I/O requests, regardless of their size, access all of the disks in the disk array. This results in very high data transfer rates for all I/O requests but has the disadvantages that (1) only one logical I/O request can be in service at any given time and (2) all disks must waste time positioning for every request. Coarse-grained disk arrays interleave data in relatively large units so that small I/O requests need access only a small number of disks while large requests can access all the disks in the disk array. This allows multiple small requests to be serviced simultaneously while still allowing large requests to see the higher transfer rates afforded by using multiple disks.
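A coarse-grained (block-interleaved) layout can be described by a simple mapping from a logical block to a disk and a block offset on that disk. The round-robin rule and parameter names below are illustrative assumptions, and redundancy is ignored; the point is only that small requests touch few disks while large requests spread across all of them.

    def stripe_map(logical_block, num_disks, stripe_unit):
        """Return (disk, block_on_disk) for a block-interleaved, nonredundant array."""
        stripe_number = logical_block // stripe_unit       # which stripe unit overall
        offset_in_unit = logical_block % stripe_unit
        disk = stripe_number % num_disks                   # units rotate round-robin over disks
        block_on_disk = (stripe_number // num_disks) * stripe_unit + offset_in_unit
        return disk, block_on_disk

    # With 4 disks and a stripe unit of 8 blocks, blocks 0-7 sit on one disk,
    # while a request covering blocks 0-31 touches every disk in the array.
    for lb in (0, 7, 8, 24, 32):
        print(lb, stripe_map(lb, num_disks=4, stripe_unit=8))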
The incorporation of redundancy in disk arrays brings up two somewhat orthogonal problems. The first problem is to select the method for computing the redundant information. Most redundant disk arrays today use parity, though some use Hamming or Reed-Solomon codes. The second problem is that of selecting a method for distributing the redundant information across the disk array. Although there are an unlimited number of patterns in which redundant information can be distributed, we classify these patterns roughly into two different distribution schemes: those that concentrate redundant information on a small number of disks and those that distribute redundant information uniformly across all of the disks. Schemes that uniformly distribute redundant information are generally more desirable because they avoid hot spots and other load-balancing problems suffered by schemes that do not distribute redundant information uniformly. Although the basic concepts of data striping and redundancy are conceptually simple, selecting between the many possible data striping and redundancy schemes involves complex tradeoffs among reliability, performance, and cost.
3.2 Basic RAID Organizations

This section describes the basic RAID organizations that will be used as the basis for further examinations of the performance, cost, and reliability of disk arrays. In addition to presenting RAID levels 1 through 5 that first appeared in the landmark paper by Patterson, Gibson, and Katz [Patterson et al. 1988], we present two other RAID organizations, levels 0 and 6, that have since become generally accepted.² For the benefit of those unfamiliar with the original numerical classification of RAID, we will use English phrases in preference to the numerical classifications. It should come as no surprise to the reader that even the original authors have sometimes been confused with regard to the disk array organization referred to by a particular RAID level! Figure 3 illustrates the seven RAID organizations schematically.

² Strictly speaking, RAID level 0 is not a type of redundant array of inexpensive disks since it stores no error-correcting codes.

Figure 3. RAID levels 0 through 6: Non-Redundant (RAID level 0), Mirrored (RAID level 1), Memory-Style ECC (RAID level 2), Bit-Interleaved Parity (RAID level 3), Block-Interleaved Parity (RAID level 4), Block-Interleaved Distributed-Parity (RAID level 5), and P+Q Redundancy (RAID level 6). All RAID levels are illustrated at a user capacity of four disks. Disks with multiple platters indicate block-level striping while disks without multiple platters indicate bit-level striping. The shaded platters represent redundant information.
3.2.1 Nonredundant (RAID Level 0)

A nonredundant disk array, or RAID level 0, has the lowest cost of any RAID organization because it does not employ redundancy at all. This scheme offers the best write performance since it does not need to update redundant information. Surprisingly, it does not have the best read performance. Redundancy schemes that duplicate data, such as mirroring, can perform better on reads by selectively scheduling requests on the disk with the shortest expected seek and rotational delays [Bitton and Gray 1988]. Without redundancy, any single disk failure will result in data loss. Nonredundant disk arrays are widely used in supercomputing environments where performance and capacity, rather than reliability, are the primary concerns.
3.2.2 Mirrored (RAID Level 1)

The traditional solution, called mirroring or shadowing, uses twice as many disks as a nonredundant disk array [Bitton and Gray 1988]. Whenever data is written to a disk, the same data is also written to a redundant disk, so that there are always two copies of the information. When data is read, it can be retrieved from the disk with the shorter queuing, seek, and rotational delays [Chen et al. 1990]. If a disk fails, the other copy is used to service requests. Mirroring is frequently used in database applications where availability and transaction rate are more important than storage efficiency [Gray et al. 1990].
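A rough sketch of the read-scheduling idea mentioned above: with two copies of every block, the controller can send a read to whichever copy is expected to respond sooner. The cost model below (queue length plus a seek estimate proportional to arm travel) is an invented illustration, not the policy of any particular system described in this article.

    def pick_mirror_copy(copies, target_cylinder):
        """Choose the replica expected to service a read soonest."""
        def estimated_cost(copy):
            # Each queued request costs roughly one average access (~15 ms here),
            # plus a seek estimate proportional to how far the arm must travel.
            return len(copy["queue"]) * 15.0 + 0.01 * abs(copy["head_pos"] - target_cylinder)
        return min(copies, key=estimated_cost)

    mirror = [
        {"name": "copy0", "head_pos": 100,  "queue": ["r1", "r2", "r3"]},
        {"name": "copy1", "head_pos": 2500, "queue": []},
    ]
    print(pick_mirror_copy(mirror, target_cylinder=120)["name"])  # copy1: the idle copy wins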
3.2.3 Memory-Style ECC (RAID Level 2)

Memory systems have provided recovery from failed components with much less cost than mirroring by using Hamming codes [Peterson and Weldon 1972]. Hamming codes contain parity for distinct overlapping subsets of components. In one version of this scheme, four data disks require three redundant disks, one less than mirroring. Since the number of redundant disks is proportional to the log of the total number of disks in the system, storage efficiency increases as the number of data disks increases.

If a single component fails, several of the parity components will have inconsistent values, and the failed component is the one held in common by each incorrect subset. The lost information is recovered by reading the other components in a subset, including the parity component, and setting the missing bit to 0 or 1 to create the proper parity value for that subset. Thus, multiple redundant disks are needed to identify the failed disk, but only one is needed to recover the lost information.

Readers unfamiliar with parity can think of the redundant disk as having the sum of all the data in the other disks. When a disk fails, you can subtract all the data on the good disks from the parity disk; the remaining information must be the missing information. Parity is simply this sum modulo two.
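The "sum modulo two" described above is exactly a bitwise exclusive-or. The sketch below illustrates it with made-up block contents: the parity block is the XOR of the data blocks, and XORing the parity with the surviving blocks "subtracts" them out, leaving the lost block.

    from functools import reduce

    def xor_blocks(blocks):
        """Bytewise XOR of equally sized blocks."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    data_disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
    parity = xor_blocks(data_disks)                    # the "sum" of all data, modulo two

    # Suppose disk 1 fails: recover its contents from parity and the survivors.
    recovered = xor_blocks([parity, data_disks[0], data_disks[2]])
    assert recovered == data_disks[1]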
3.2.4 Bit-Interleaved Parity (RAID Level 3)

One can improve upon memory-style ECC disk arrays by noting that, unlike memory component failures, disk controllers can easily identify which disk has failed. Thus, one can use a single parity disk rather than a set of parity disks to recover lost information.

In a bit-interleaved parity disk array, data is conceptually interleaved bit-wise over the data disks, and a single parity disk is added to tolerate any single disk failure. Each read request accesses all data disks, and each write request accesses all data disks and the parity disk. Thus, only one request can be serviced at a time. Because the parity disk contains only parity and no data, the parity disk cannot participate on reads, resulting in slightly lower read performance than for redundancy schemes that distribute the parity and data over all disks. Bit-interleaved parity disk arrays are frequently used in applications that require high bandwidth but not high I/O rates. Also, they are simpler to implement than RAID levels 4, 5, and 6.
3.2.5 Block-Interleaved Parity (RAID Level 4)

The block-interleaved parity disk array is similar to the bit-interleaved parity disk array except that data is interleaved across disks in blocks of arbitrary size rather than in bits. The size of these blocks is called the striping unit [Chen and Patterson 1990]. Read requests smaller than the striping unit access only a single data disk. Write requests must update the requested data blocks and must compute and update the parity block. For large writes that touch blocks on all disks, parity is easily computed by exclusive-oring the new data for each disk. For small write requests that update only one data disk, parity is computed by noting how the new data differs from the old data and applying those differences to the parity block. Small write requests thus require four disk I/Os: one to write the new data, two to read the old data and old parity for computing the new parity, and one to write the new parity. This is referred to as a read-modify-write procedure.
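The read-modify-write procedure amounts to two exclusive-ors on top of the four disk I/Os. In the sketch below, disk_read and disk_write are assumed placeholder functions standing in for whatever block interface the controller uses; they are not part of any real driver API.

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def small_write(disk_read, disk_write, data_disk, parity_disk, block, new_data):
        """Read-modify-write parity update for a write to a single data block."""
        old_data = disk_read(data_disk, block)        # I/O 1: read old data
        old_parity = disk_read(parity_disk, block)    # I/O 2: read old parity
        # New parity = old parity with the old data removed and the new data added in.
        new_parity = xor(xor(old_parity, old_data), new_data)
        disk_write(data_disk, block, new_data)        # I/O 3: write new data
        disk_write(parity_disk, block, new_parity)    # I/O 4: write new parity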
Because a block-interleaved parity disk array has only one parity disk, which must be updated on all write operations, the parity disk can easily become a bottleneck. Because of this limitation, the block-interleaved distributed-parity disk array is universally preferred over the block-interleaved parity disk array.
3.2.6 Block-Interleaved Distributed-Parity (RAID Level 5)

The block-interleaved distributed-parity disk array eliminates the parity disk bottleneck present in the block-interleaved parity disk array by distributing the parity uniformly over all of the disks. An additional, frequently overlooked advantage to distributing the parity is that it also distributes data over all of the disks rather than over all but one. This allows all disks to participate in servicing read operations, in contrast to redundancy schemes with dedicated parity disks, in which the parity disk cannot participate in servicing read requests. Block-interleaved distributed-parity disk arrays have the best small read, large read, and large write performance of any redundant disk array. Small write requests are somewhat inefficient compared with redundancy schemes such as mirroring, however, due to the need to perform read-modify-write operations to update parity. This is the major performance weakness of RAID level 5 disk arrays and has been the subject of intensive research [Menon et al. 1993; Stodolsky and Gibson 1993].

The exact method used to distribute parity in block-interleaved distributed-parity disk arrays can affect performance. Figure 4 illustrates the best of those parity distributions investigated in [Lee and Katz 1991b], called the left-symmetric parity distribution. A useful property of the left-symmetric ...
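For concreteness, the sketch below generates a rotated-parity placement in the left-symmetric style: parity moves one disk to the left on each successive stripe, and data blocks are numbered starting on the disk that follows the parity disk. The exact rule is an inference from [Lee and Katz 1991b] and from the layout's name, not a quotation of Figure 4.

    def left_symmetric_layout(num_disks, num_stripes):
        """layout[stripe][disk] is either 'P' (parity) or a data-block number."""
        layout, block = [], 0
        for stripe in range(num_stripes):
            parity_disk = (num_disks - 1 - stripe) % num_disks   # parity rotates leftward
            row = [None] * num_disks
            row[parity_disk] = "P"
            for i in range(1, num_disks):                        # data wraps around after parity
                row[(parity_disk + i) % num_disks] = block
                block += 1
            layout.append(row)
        return layout

    for row in left_symmetric_layout(num_disks=5, num_stripes=5):
        print(row)
    # [0, 1, 2, 3, 'P'], [5, 6, 7, 'P', 4], [10, 11, 'P', 8, 9], ...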