Network and Channel Based Storage

RANDY H. KATZ, SENIOR MEMBER, IEEE

Invited Paper
`
In the traditional mainframe-centered view of a computer system, storage devices are coupled to the system through complex hardware subsystems called I/O channels. With the dramatic shift toward workstation-based computing, and its associated client/server model of computation, storage facilities are now found attached to file servers and distributed throughout the network. In this paper, we discuss the underlying technology trends that are leading to high-performance network-based storage, namely advances in networks, storage devices, and I/O controller and server architectures. We review several commercial systems and research prototypes that are leading to a new approach to high-performance computing based on network-attached storage.
`
I. INTRODUCTION
The traditional mainframe-centered model of computing can be characterized by small numbers of large-scale mainframe computers, with shared storage devices attached via I/O channel hardware. Today, we are experiencing a major paradigm shift away from centralized mainframes to a distributed model of computation based on workstations and file servers connected via high-performance networks.

What makes this new paradigm possible is the rapid development and acceptance of the client/server model of computation. The client/server model is a message-based protocol in which clients make requests of service providers, which are called servers. Perhaps the most successful application of this concept is the widespread use of file servers in networks of computer workstations and personal computers. Even a high-end workstation has rather limited capabilities for data storage. A distinguished machine on the network, customized either by hardware, software, or both, provides a file service. It
`
Manuscript received October 1, 1991; revised March 24, 1992. This work was supported by the Defense Advanced Research Projects Agency and the National Aeronautics and Space Administration under contract NAG2-591 (Diskless Supercomputers: High Performance I/O for the TeraOp Technology Base). Additional support was provided by the State of California MICRO Program in conjunction with industrial matching support provided by DEC, Emulex, Exabyte, IBM, NCR, and Storage Technology Corporations.
The author is with the Computer Science Division, Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720.
IEEE Log Number 9203610.
`
accepts network messages from client machines containing open/close/read/write file requests and processes these, transmitting the requested data back and forth across the network.
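
For concreteness, the following minimal C sketch shows one plausible shape for such a request message on the wire. The message layout, field names, and operation codes are illustrative assumptions, not those of any particular file server protocol.

#include <stdint.h>

/* Hypothetical request header for a remote file service. */
enum fs_op { FS_OPEN, FS_CLOSE, FS_READ, FS_WRITE };

struct fs_request {
    uint32_t client_id;    /* which client workstation is asking         */
    uint32_t op;           /* one of enum fs_op                          */
    uint32_t file_handle;  /* returned by an earlier FS_OPEN             */
    uint64_t offset;       /* starting byte offset within the file       */
    uint32_t length;       /* bytes to read or write                     */
    /* for FS_WRITE, 'length' bytes of data follow this header */
};

struct fs_reply {
    uint32_t status;       /* 0 on success, an error code otherwise      */
    uint32_t length;       /* bytes of data that follow (FS_READ)        */
};

The server's main loop simply receives such messages, dispatches on the op field, and transmits a reply across the network.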
This is in contrast to the pure distributed storage model, in which the files are dispersed among the storage on workstations rather than centralized in a server. The advantages of a distributed organization are that resources are placed near where they are needed, leading to better performance, and that the environment can be more autonomous because individual machines continue to perform useful work even in the face of network failures. While this has been the more popular approach over the last few years, there has emerged a growing awareness of the advantages of the centralized view. That is, every user sees the same file system, independent of the machine they are currently using. The view of storage is pervasive and transparent. Further, it is much easier to administer a centralized system, to provide software updates and archival backups. The resulting organization combines distributed processing power with a centralized view of storage.
Admittedly, centralized storage also has its weaknesses. A server or network failure renders the client workstations unusable and the network represents the critical performance bottleneck. A highly tuned remote file system on a 10 megabit (Mbit) per second Ethernet can provide perhaps 500K bytes per second to remote client applications. Sixty 8K byte I/O's per second would fully utilize this bandwidth. Obtaining the right balance of workstations to servers depends on their relative processing power, the amount of memory dedicated to file caches on workstations and servers, the available network bandwidth, and the I/O bandwidth of the server. It is interesting to note that today's servers are not I/O limited: the Ethernet bandwidth can be fully utilized by the I/O bandwidth of only two magnetic disks!
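
The arithmetic behind these figures is easy to reproduce. The short C fragment below recomputes them from the round numbers quoted above; it is a back-of-the-envelope sketch, not a measurement.

#include <stdio.h>

int main(void) {
    double delivered_bps = 500e3;  /* ~500K bytes/s through the server    */
    double io_size       = 8192;   /* 8K byte I/O's                       */
    double disk_ios      = 30;     /* random 8K I/O's per second per disk */

    double ios = delivered_bps / io_size;              /* ~60 I/O's/s     */
    printf("I/O's per second at saturation: %.0f\n", ios);
    printf("Disks needed to supply them:    %.1f\n", ios / disk_ios); /* ~2 */
    return 0;
}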
Meanwhile, other technology developments in processors, networks, and storage systems are affecting the relationship between clients and servers. It is well known that processor performance, as measured in MIPS ratings,
`""
`
`OOI8-9219N2S03.00 CI t992 IEEE
`
`PRQC£EDlN(lS 01' 1\1£ tf.EE . VOL. SO. NO. 8, AUCIlST t'l9~
`
`CROSSROADS EXHIBIT 2038
`Cisco Systems et al. v. Crossroads Systems, Inc.
`IPR2014-01544
`
`1 of 24
`
`
`
`
is increasing at an astonishing rate, doubling on the order of once every 18 months to two years. The newest generation of RISC processors has performance in the 50 to 60 MIPS range. For example, a recent workstation announced by the Hewlett-Packard Corporation, the HP 9000/730, has been rated at 72 SPECmarks (1 SPECmark is roughly the processing power of a single Digital Equipment Corporation VAX 11/780 on a particular benchmark set). Powerful shared memory multiprocessor systems, now available from companies such as Silicon Graphics and Solbourne, provide well over 100 MIPS performance. One of Amdahl's famous laws equated one MIPS of processing power with one megabit of I/O per second. Obviously such processing rates far exceed anything that can be delivered by existing server, network, or storage architectures.
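
Amdahl's rule of thumb makes the mismatch easy to quantify. The sketch below applies it to the figures above; the 100-MIPS server and 10 Mbit/second Ethernet are taken from this section, and the rule itself is only a rough guide.

#include <stdio.h>

int main(void) {
    double mips           = 100.0;      /* a modern multiprocessor server   */
    double needed_mbits   = mips * 1.0; /* Amdahl: 1 Mbit/s of I/O per MIPS */
    double ethernet_mbits = 10.0;       /* what the network can deliver     */

    printf("I/O demand %.0f Mbit/s vs. supply %.0f Mbit/s: %.0fx shortfall\n",
           needed_mbits, ethernet_mbits, needed_mbits / ethernet_mbits);
    return 0;
}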
`
Unlike processor power, network technology evolves at a slower rate, but when it advances, it does so in order of magnitude steps. In the last decade we have advanced from 3 Mbit/second Ethernet to 10 Mbit/second Ethernet. We are now on the verge of a new generation of network technology, based on fiber-optic interconnect, called FDDI. This technology promises 100 Mbits per second, and at least initially, it will move the server bottleneck from the network to the server CPU or its storage system. With more powerful processors available on the horizon, the performance challenge is very likely to be in the storage system, where a typical magnetic disk can service 30 8K byte I/O's per second and can sustain a data rate in the range of 1 to 3 Mbytes per second. And even faster networks and interconnects, in the gigabit range, are now commercially available and will become more widespread as their costs begin to drop [1].
`
To keep up with the advances in processors and networks, storage systems are also experiencing rapid improvements. Magnetic disks have been doubling in storage capacity once every three years. As disk form factors shrink from 14 inch to 3.5 inch and below, the disks can be made to spin faster, thus increasing the sequential transfer rate. Unfortunately, the random I/O rate is improving only very slowly, owing to mechanically limited positioning delays. Since I/O and data rates are primarily disk actuator limited, a new storage system approach called disk arrays addresses this problem by replacing a small number of large-format disks by a very large number of small-format disks. Disk arrays maintain the high capacity of the storage system, while enormously increasing the system's disk actuators and thus the aggregate I/O and data rate.
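
The case for arrays is a matter of simple multiplication: actuators, and hence I/O's per second, scale with the number of spindles. The sketch below makes the comparison concrete; the per-disk figures are illustrative assumptions in the spirit of the numbers quoted above.

#include <stdio.h>

int main(void) {
    double large_ios = 30, large_rate = 3.0; /* one large-format disk:
                                                I/O's/s and Mbytes/s     */
    int    n = 10;                           /* small disks replacing it */
    double small_ios = 30, small_rate = 1.5; /* per small-format disk    */

    printf("Single large disk: %4.0f I/O's/s, %5.1f Mbytes/s\n",
           large_ios, large_rate);
    printf("Array of %2d disks: %4.0f I/O's/s, %5.1f Mbytes/s\n",
           n, n * small_ios, n * small_rate);
    return 0;
}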
`
The confluence of developments in processors, networks, and storage offers the possibility of extending the client/server model so effectively used in workstation environments to higher performance environments, which integrate supercomputers, near supercomputers, workstations, and storage services on a very high performance network. The technology is rapidly reaching the point where it is possible to think in terms of diskless supercomputers in much the same way as we think about diskless workstations. Thus, the network is emerging as the future "backplane" of high-performance systems. The challenge is to develop the new hardware and software architectures that will be suitable for this world of network-based storage.
`
The emphasis of this paper is on the integration of storage and network services, and the challenges of managing the complex storage hierarchy of the future: file caches, on-line disk storage, near-line data libraries, and off-line archives. We specifically ignore existing mainframe I/O architectures, as these are well described elsewhere (for example, in [2]). The rest of this paper is organized as follows. In the next three sections, we will review the recent advances in interconnect, storage devices, and distributed software, to better understand the underlying changes in network, storage, and software technologies. Section V contains detailed case studies of commercially available high-performance networks, storage servers, and file servers, as well as a prototype high-performance network-attached I/O controller being developed at the University of California, Berkeley. Our summary, conclusions, and suggestions for future research are found in Section VI.
`
`
`
II. INTERCONNECT TRENDS

A. Networks, Channels, and Backplanes
`
Interconnect is a generic term for the "glue" that interfaces the components of a computer system. Interconnect consists of high-speed hardware interfaces and the associated logical protocols. The former consists of physical wires or control registers. The latter may be interpreted by either hardware or software. From the viewpoint of the storage system, interconnect can be classified as high-speed networks, processor-to-storage channels, or system backplanes that provide ports to a memory system through direct memory access techniques.
`
Networks, channels, and backplanes differ in terms of the interconnection distances they can support, the bandwidth and latencies they can achieve, and the fundamental assumptions about the inherent unreliability of data transmission. While no statement we can make is universally true, in general, backplanes can be characterized by parallel wide data paths and centralized arbitration, and are oriented toward read/write "memory mapped" operations. That is, access to control registers is treated identically to memory word access. Networks, on the other hand, provide serial data, distributed arbitration, and support more message-oriented protocols. The latter require a more complex handshake, usually involving the exchange of high-level request and acknowledgment messages. Channels fall between the two extremes, consisting of wide data paths of medium distance and often incorporating simplified versions of networklike protocols.
`
These considerations are summarized in Table 1. Networks typically span more than 1 km, sustain 10 Mbit/second (Ethernet) to 100 Mbit/second (FDDI) and beyond, experience latencies measured in several milliseconds (ms), and the network medium itself is considered to be inherently unreliable. Networks include extensive data integrity features within their protocols,
`
`
`
`
Table 1  Comparison of Interconnect Alternatives

                  Network         Channel         Backplane
  Distance        >1000 m         10-100 m        1 m
  Bandwidth       10-100 Mb/s     40-1000 Mb/s    320-1000+ Mb/s
  Latency         high (>ms)      medium          low (<µs)
  Reliability     low             medium          high
  Data Integrity  Extensive CRC   Byte Parity     Byte Parity

The comparison is based upon the interconnection distance, transmission bandwidth, transmission latency, inherent reliability, and typical techniques for improving data integrity.
`
including CRC checksums at the packet and message levels, and the explicit acknowledgment of received packets.
Channels span small 10's of meters, transmit at anywhere from 4.5 Mbytes/second (IBM channel interfaces) to 100 Mbytes/second (HiPPI channels), incur latencies of under 100 µs per transfer, and have medium reliability. Byte parity at the individual transfer word is usually supported, although packet-level check-summing might also be supported.
Backplanes are about 1 m in length, transfer from 40 (VME) to over 100 (FutureBus) Mbytes/second, incur sub-µs latencies, and the interconnect is considered to be highly reliable. Backplanes typically support byte parity, although some backplanes (unfortunately) dispense with parity altogether.
In the remainder of this section, we will look at each of the three kinds of interconnect, network, channel, and backplane, in more detail.
`
B. Communications Networks and Network Controllers
An excellent overview of networking technology can be found in [3]. For a futuristic view, see [4] and [5]. The decade of the 1980's has seen a slow maturation of network technology, but the 1990's promise much more rapid developments. Today, 10 Mbit/second Ethernets are pervasive, with many environments advancing to the next generation of 100 Mbit/second networks based on the FDDI (Fiber Distributed Data Interface) standard [6]. FDDI provides higher bandwidth, longer distances, and reduced error rates, largely because of the introduction of fiber optics for data transmission. Unfortunately cost, especially for replacing the existing copper wire network with fiber, coupled with disappointing transmission latencies, has slowed the acceptance of these higher speed networks. The latency problems have more to do with FDDI's protocols, which are based on a token passing arbitration scheme, than anything intrinsic in fiber-optic technology.
A network system is decomposed into multiple protocol layers, from the application interface down to the method of physical communication of bits on the network. Figure 1 summarizes the popular seven-layer ISO protocol model. The physical and link levels are closely tied to the
`
[Fig. 1 figure: the seven layers and their roles: Application (detailed information about the data being exchanged), Presentation (data representation), Session (management of connections between programs), Transport (delivery of packet sequences), Network (format of individual packets), Link (access to and control of transmission medium), Physical (medium of transmission).]

Fig. 1. Seven-layer ISO protocol model. The physical layer describes the actual transmission medium, be it coax cable, fiber optics, or a parallel backplane. The link layer describes how stations gain access to the medium. This layer deals with the protocols for arbitrating for and obtaining grant permission to the media. The network layer defines the format of data packets to be transmitted over the media, including destination and sender information as well as any check sums. The transport layer is responsible for the reliable delivery of packets. The session layer establishes communications between the sending program and the receiving program. The presentation layer determines the detailed formats of the data embedded within packets. The application layer has the responsibility of understanding how this data should be interpreted within an applications context.
`
underlying transport medium, and deal with the physical attachment to the network and the method of acquiring access to it. The network, transport, and session levels focus on the detailed formats of communications packets and the methods for transmitting them from one program to another. The presentation and applications layers define the formats of the data embedded within the packets and the application-specific semantics of that data.
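
Layering shows up concretely as nested headers: each layer prepends its own header to the bytes handed down from the layer above, and the receiver peels them off in reverse order. The C sketch below illustrates the idea; the field names and sizes are illustrative assumptions, not those of any particular protocol suite.

#include <stdint.h>

struct transport_hdr { uint16_t src_port, dst_port; uint32_t sequence; };
struct network_hdr   { uint32_t src_addr, dst_addr; uint16_t check_sum; };
struct link_hdr      { uint8_t  dst_station[6], src_station[6]; };

struct frame {                   /* what finally appears on the wire      */
    struct link_hdr      link;   /* link layer: access to the medium      */
    struct network_hdr   net;    /* network layer: packet format          */
    struct transport_hdr tp;     /* transport layer: reliable delivery    */
    uint8_t payload[1500];       /* session/presentation/application data */
};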
A number of performance measurements of network transmission services point out that the significant overhead is not protocol interpretation (approximately 10% of instructions are spent in interpreting the network headers). The culprits are memory system overheads arising from data movement and operating system overheads related to context switches and data copying [7]-[10]. We will see this again and again in the sections to follow.
The network controller is the collection of hardware and firmware that implements the interface between the network and the host processor. It is typically implemented on a small printed circuit board, and contains its own processor, memory-mapped control registers, interface to the network, and small memory to hold messages being transmitted and received. The on-board processor, usually in conjunction with VLSI components within the network interface, implements the physical and link-level protocols of the network.
The interaction between the network controller and the host's memory is depicted in Fig. 2. Lists of blocks containing packets to be sent and packets that have been received are maintained in the host processor's memory. The locations of buffers for these blocks are made known to the network controller, and it will copy packets to and from the request/receive block areas using direct memory access (DMA) techniques. This means that the copy of data across the peripheral bus is under the control of the network controller, and does not require the intervention of the host processor. The controller will interrupt the host whenever a message has been received or sent.
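
A minimal C sketch of the host-memory structures just described might look as follows; the descriptor layout, ownership convention, and register names are assumptions for illustration, not any particular controller's interface.

#include <stdint.h>

struct block_desc {
    uint32_t buf_addr;  /* physical address of the message buffer       */
    uint16_t length;    /* bytes to transmit, or buffer size on receive */
    uint16_t owner;     /* 1 = controller may DMA it, 0 = host owns it  */
};

#define NBLOCKS 16
struct block_desc request_blocks[NBLOCKS]; /* packets awaiting transmission */
struct block_desc receive_blocks[NBLOCKS]; /* buffers for arriving packets  */

/* Memory-mapped control registers: the host writes the list locations here
 * once; thereafter the controller DMAs blocks across the peripheral bus and
 * interrupts the host as each message is sent or received. */
volatile uint32_t *ctl_request_base;
volatile uint32_t *ctl_receive_base;
volatile uint32_t *ctl_interrupt_ack;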
`
`
`
`
Fig. 2. Network controller/processor memory interaction. The figure describes the interaction between the network controller and the memory of the network node. The controller contains an on-board microprocessor, various memory-mapped control registers through which service requests can be made and status checked, a physical interface to the network media, and a buffer memory to hold request and receive blocks. These contain network messages to be transmitted or which have been received, respectively. A list of pending requests and messages already received resides in the host processor's memory. Direct memory operations (DMA's), under the control of the node processor, copy these blocks to and from this memory.
`
While this presents a particularly clean interface between the network controller and the operating system, it points out some of the intrinsic memory system latencies that reduce network performance. Consider a message that will be transmitted to the network. First the contents of the message are created within a user application. A call to the operating system results in a process switch and a data copy from the user's address space to the operating system's area. A protocol-specific network header is then appended to the data to form a packaged network message. This must be copied one more time, to place the message into a request block that can be accessed by the network controller. The final copy is the DMA operation that moves the message within the request block to memory within the network controller.
Data integrity is the aspect of system reliability concerned with the transmission of correct data and the explicit flagging of incorrect data. An overriding consideration of network protocols is their concern with reliable transmission. Because of the distances involved and the complexity of the transmission path, network transmission is inherently lossy. The solution is to append check-sum protection bits to all network packets and to include explicit acknowledgment as part of the network protocols. For example, if the check sum computed at the receiving end does not match the transmitted check sum, the receiver sends a negative acknowledgment to the sender.
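
This section does not prescribe a particular check-sum algorithm; as a representative example, the C function below computes the 16-bit ones'-complement check sum familiar from the Internet protocols over a packet's bytes.

#include <stddef.h>
#include <stdint.h>

uint16_t check_sum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)    /* sum 16-bit words        */
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)                               /* pad an odd trailing byte */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                          /* fold carries back in    */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;   /* receiver recomputes; on a mismatch it
                                sends a negative acknowledgment */
}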
`
C. Channel Architectures
Channels provide the logical and physical pathways between I/O controllers and storage devices. They are medium-distance interconnect that carry signals in parallel, usually with some parity technique to provide data integrity. In this subsection, we will describe three alternative channel organizations that characterize the opposite ends of the performance spectrum: SCSI (small computer system interface), HIPPI (high-performance parallel interface), and FCS (fibre channel standard).
1) Small Computer System Interface: SCSI is the channel interface most frequently encountered in small form factor (5.25 in diameter and smaller) disk drives, as well as a wide variety of peripherals such as tape drives, optical disk readers, and image scanners. SCSI treats peripheral devices in a largely device-independent fashion. For example, a disk drive is viewed as a linear byte stream; its detailed structure in terms of sectors, tracks, and cylinders is not visible through the SCSI interface. A SCSI channel can support up to eight devices sharing a common bus with an 8-bit-wide data path. In SCSI terminology, the I/O controller counts as one of these devices, and is called the host bus adapter (HBA). Burst transfers at 4 to 5 Mbytes/s are widely available today. In SCSI terminology, a device that requests service from another device is called the master or the initiator. The device that is providing the service is called the slave or the target.
SCSI provides a high-level message-based protocol for communications between initiators and targets. While this makes it possible to mix widely different kinds of devices on the same channel, it does lead to relatively high overheads. The protocol has been designed to allow initiators to manage multiple simultaneous operations. Targets are intelligent in the sense that they explicitly notify the initiator when they are ready to transmit data or when they need to throttle a transfer.
It is worthwhile to examine the SCSI protocol in some detail, to clearly distinguish what it does from the kinds of messages exchanged on a computer network. The SCSI protocol proceeds in a series of phases, which we summarize below:
Bus Free: No device currently has the bus allocated.
Arbitration: Initiators arbitrate for access to the bus. A device's physical address determines its priority.
Selection: The initiator informs the target that it will participate in an I/O operation.
Reselection: The target informs the initiator that an outstanding operation is to be resumed. For example, an operation could have been previously suspended because the I/O device had to obtain more data.
Command: Command bytes are written to the target by the initiator. The target begins executing the operation.
Data Transfer: The protocol supports two forms of the data transfer phase, Data In and Data Out. The former refers to the movement of data from the target to the initiator. In the latter, data move from the initiator to the target.
Message: The message phase also comes in two forms, Message In and Message Out. Message In consists of several alternatives. Identify identifies the reselected target. Save Data Pointer saves the place in the current data transfer if the target is about to disconnect. Restore Data Pointer restores this pointer. Disconnect notifies the initiator that the target is about to give up the data bus. Command Complete occurs when the target tells the initiator that the operation has completed. Message
`
Out has just one form: Identify. This is used to identify the requesting initiator and its intended target.
Status: Just before command completion, the target sends a status message to the initiator.

Fig. 3. SCSI phase transitions on a read. The basic phase sequencing for a read (from disk) operation is shown. First the initiator sets up the read command and sends it to the I/O device. The target device disconnects from the SCSI bus to perform a seek and to begin to fill its internal buffer. It then transfers the data to the initiator. This may be interspersed with additional disconnects, as the transfer gets ahead of the internal buffering. A command complete message terminates the operation. This figure is adapted from [40].
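
The phase structure is compact enough to write down directly. The C fragment below encodes the phases as an enumeration, together with the nominal sequence of a read that disconnects once for a seek (compare Fig. 3); it is a sketch of the protocol's shape, not a working driver.

/* The SCSI bus phases, and the nominal phase sequence of a read that
 * disconnects once while the target seeks (cf. Fig. 3). */
enum scsi_phase {
    BUS_FREE, ARBITRATION, SELECTION, RESELECTION,
    COMMAND, DATA_IN, DATA_OUT, MESSAGE_IN, MESSAGE_OUT, STATUS
};

static const enum scsi_phase read_with_seek[] = {
    ARBITRATION, SELECTION, MESSAGE_OUT,   /* Identify                    */
    COMMAND,                               /* "read a sequence of bytes"  */
    MESSAGE_IN,                            /* Disconnect during the seek  */
    BUS_FREE,
    ARBITRATION, RESELECTION, MESSAGE_IN,  /* Identify                    */
    DATA_IN,                               /* the requested bytes         */
    STATUS, MESSAGE_IN,                    /* Command Complete            */
    BUS_FREE
};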
To better understand the sequencing among the phases, see Fig. 3. This illustrates the phase transitions for a typical SCSI read operation. The sequencing of an I/O operation actually begins when the host's operating system establishes data and status blocks within its memory. Next, it issues an I/O command to the HBA, passing it pointers to command, status, and data blocks, as well as the SCSI address of the target device. These are staged from host memory to device-specific queues within the HBA's memory using direct memory access techniques.
Now the I/O operation can begin in earnest. The HBA arbitrates for and wins control of the SCSI bus. It then indicates the target device it wishes to communicate with during the selection phase. The target responds by identifying itself during a following message out phase. Now the actual command, such as "read a sequence of bytes," is transmitted to the device.
We assume that the target device is a disk. If the disk must first seek before it can obtain the requested data, it will disconnect from the bus. It sends a disconnect message to the initiator, which in turn gives up the bus. Note that the HBA can communicate with other devices on the SCSI channel, initiating additional I/O operations. Now the device will seek