`
Internet SCSI Adapter*

Rodney Van Meter†, Gregory G. Finn and Steve Hotz
Information Sciences Institute
University of Southern California
Marina del Rey, CA 90292
{rdv,finn,hotz}@ISI.Edu
`
`Abstract
`
In this paper we describe the implementation of VISA, our Virtual Internet SCSI Adapter. VISA was built to evaluate the performance impact on the host operating system of using IP to communicate with peripherals, especially storage devices. We have built and benchmarked file systems on VISA-attached emulated disk drives using UDP/IP. By using IP, we expect to take advantage of its scaling characteristics and support for heterogeneous media to build large, long-lived systems. Detailed file system and network CPU utilization and performance data indicate that it is possible for UDP/IP to reach more than 80% of SCSI’s maximum throughput without the use of network coprocessors. We conclude that IP is a viable alternative to special-purpose storage network protocols, and that it presents numerous advantages.
`
1 Introduction

Storage system architectures are increasingly network-oriented, exploiting the ubiquity of networks to replace the direct host channel. Peripherals attached directly to networks are called network-attached peripherals (NAPs), or more specifically, network-attached storage devices (NASDs).

We have proposed that NAPs use the Internet protocol suite; in this paper we provide data on a sample implementation which supports the claim that the Internet Protocol (IP) can perform acceptably for NAPs.

In the experiments presented in this paper, we have used the User Datagram Protocol (UDP) as a transport protocol to send and receive SCSI commands and data to emulated network-attached disk drives. We have achieved data rates of 70+ megabits per second (Mbps) for read and write through the file system over Myrinet before becoming CPU limited, and known optimizations could be expected to raise that to approximately 95 for write and 110 for read. TCP is predicted to be 10-25% slower than UDP. Fast ethernet is only 10% slower than Myrinet, with the difference caused primarily by ethernet’s smaller MTU raising the amount of per-packet processing which must be done. Our analysis predicts that SCSI performance on the same hardware would become CPU-bound at 110 Mbps write, 133 Mbps read. Our conclusion is that more than 80% of maximum SCSI performance can be achieved using IP.

By demonstrating comparable performance, we open the door to the adoption of the Internet protocol suite by operating system vendors and disk drive manufacturers as an alternative to the numerous storage networks now being developed. This adoption would improve the sharability and scalability of storage systems with respect to numbers of network-attached devices and client systems, while substantially reducing legacy problems as technology advances, shortening development time and leveraging networking technology. In addition, TCP/IP enables such new wide-area uses as remote backup and mirroring of devices across the Internet.

This work was done in the course of the Netstation project, which concentrates on operating systems, network protocols, hardware mechanisms, and security and sharing models for network-attached peripherals. Some of the goals of the project are to demonstrate (1) that IP can provide acceptable performance in a host operating system when used to access peripherals, (2) that IP can be implemented efficiently inside network-attached peripherals, and (3) that our derived virtual device model enables efficient, secure use of NAPs. This paper addresses only the first point; the other two are presented in separate papers [HVF98, VHF96].

The paper begins with a brief description of Netstation, the network-as-backplane system architecture of which VISA forms a part. Section 3 describes the principles of networking as applied to network-attached peripherals, with special emphasis on the problems of scalability and host OS adaptation. We then describe related work, followed by the VISA architecture, performance, and possible performance improvements. Finally, we present our conclusions.

*This research was sponsored by the Defense Advanced Research Projects Agency under Contract No. DABT63-93-C-0062. Views and conclusions contained in this report are the authors’ and should not be interpreted as representing the official opinion or policies, either expressed or implied, of ARPA, the U.S. Government, or any person or agency connected with them.
†Author’s current address: Quantum Corp., Milpitas, CA, Rodney.VanMeter@qntm.com.
`
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ASPLOS VIII 10/98 CA, USA
© 1998 ACM 1-58113-107-0/98/0010...$5.00

2 Netstation
`
Netstation is a heterogeneous distributed system composed of processor nodes and network-attached peripherals [Fin91, FM94]. The peripherals are attached to a shared 640 Mbps Myrinet network or to a 100 Mbps ethernet, as in Figure 1.
`
`
`Ex.1075.001
`
`DELL
`
`
`
Figure 1: A Netstation network (diagram: a Hi-Def display, a camera, a magnetic disk, a RAM disk and a further disk attached to a local area network)
`
The display and camera NAPs have been built; the other NAPs are currently emulated. The CPU nodes are SPARC 20/71 workstations running SunOS 4.1.3; the ideal Netstation CPU node would have only a CPU, memory and a network interface.

Connecting peripherals directly to the network allows sharing of resources and improves system configuration flexibility. Network clients can access peripherals without the intervention of a server.

Because the devices are attached to an open network with both trusted and untrusted nodes on the net, security at the NAPs is critical. We have developed a model we refer to as the derived virtual device, or DVD [VHF96]. DVDs provide a protected execution context at the device, allowing direct use of the devices by untrusted clients, such as user applications. The owner of a device defines the security policy, downloads a description to the NAP, and the NAP enforces the policy. This allows the owner to define a set of resources and operations allowed. Thus, a camera can be granted write access to only a specific region of a frame buffer, or a user application can be given read-only access to a DVD which represents a disk-based file or disk partition.

The only network-attached peripheral used during the experiments described in this paper is IPdisk, our emulated network-attached disk drive. IPdisk will be described in more detail in section 5.1.

VISA, our Virtual Internet SCSI Adapter, is the primary topic of this paper. It is the OS mechanism for supporting access to storage peripherals via the network.
`
3 Networking for NAPs

Netstation’s shift to network-attached peripherals has a profound impact on the overall system architecture. In this section, we explore the technological and architectural motivations to make the change and its effect on host operating systems. We discuss the problems that must be solved to build large systems on heterogeneous networks, and show how choosing the TCP/IP suite solves these problems. Finally, we briefly discuss the appropriate high-level interface NAPs should present.

Developers of storage network technologies have started with different goals and assumptions about important issues such as number and type of devices and hosts to be interconnected, physical distance, cost and bandwidth. The result has been a proliferation of technologies developed primarily for NAPs, including 1394 (Firewire), Fibre Channel fabrics and Arbitrated Loop, HiPPI, and Serial Storage Architecture (SSA), as well as vendor-specific networks. Most of these have included development of complementary new physical, link, network and transport layers [Van96].
`
3.1 Motivation

Netstation uses network-attached peripherals to better share peripherals and take advantage of the relative technological trends of buses, networks, and peripheral processors.

The architectural reasons to shift from host adapter-attached devices to network-attached are better sharing of devices and reduction of the server’s workload [RG96]. By allowing clients to directly access the devices, the server is no longer in the data path, reducing latency and demands on its buses, memory and processors. Devices can also communicate directly with each other without sending data across a single shared system bus.

Buses do not scale well. They do not scale in distance; shortening a bus can raise data rates, forcing a direct tradeoff in design. They do not scale in the number of devices interconnected, generally having a firm upper bound below twenty. They do not scale in aggregate bandwidth with the number of devices, as bandwidth is shared among the connected devices, and, due to increased capacitance, available bandwidth may actually decrease as devices are added.

Networks, especially serial optical networks, are improving rapidly in speed, typically scale well to large numbers of nodes, and can stretch over significant distances. Such networks are pushing into the gigabit per second range, a speed comparable to popular low-end buses such as PCI. Multicomputers and clustered systems have interconnected processors via networks for many years; I/O systems are now adopting networks to receive the same benefits.
`
3.2 Host I/O System and Networking Interaction

There are several characteristics which differentiate network I/O for file systems from interactive application-level I/O such as HTTP or telnet. Netstation uses some of these to improve the operating system efficiency, but further improvements are possible.

File transfers are generally large, page-aligned multiples of the system’s page size (in our case, 4KB). On write, they are already pinned into memory and mapped into the kernel address space by the file system before they are handed to the bus or network subsystem. This may reduce the mapping and copying operations typically used to move data from user buffers to kernel buffers. On read, pages have already been selected, so the memory destination of incoming data is known in advance.

The file system cares only about the completion of the entire I/O operation. Data buffers will not be released to the application or virtual memory (VM) system until the read or write is finished. Thus, it is unnecessary to send partial completion status up the protocol stack until the entire transfer has either completed or failed. Typically, modern SCSI host bus adapters post only one interrupt to the host on completion or error, with requests that may be megabytes long. The host OS maintains a simple timer for the completion of the entire I/O. Sophisticated cards maintain one context per target device on the bus, and are capable of multiplexing among them. Current Fibre Channel interfaces are approaching a similarly autonomous level of operation, implementing the FC transport layer in the network interface card.
`
3.3 Scaling Problems Faced by I/O Networks

Netstation adopted the TCP/IP suite to leverage off the Internet community’s experience with scale and heterogeneity. We believe that this experience makes the TCP/IP suite an increasingly appropriate choice and LAN-specific solutions increasingly inappropriate as more clients and servers are attached to more and larger heterogeneous storage networks.

The I/O network technologies deployed to date appear to offer only limited scalability, due to weakness in one or more areas. Media bridging, heterogeneity along many axes, security, latency tolerance, and congestion and flow control are several important areas that must be addressed.

Media bridging or routing becomes important as more hosts spread over more hops, and more legacy systems (both hosts and nets) must be accommodated. In many environments, support for multiple networks and complex topologies, of the same or different types, is likely to be critical. This brings up issues of formatting, addressing, and especially underlying protocol interoperability.
`
3.4 Networking Protocols

In the Netstation project, we have chosen to work with UDP/IP and TCP/IP, reasoning that the more general architecture provides features that will be critical to network-attached peripherals in the long run. We expect that the performance shortcomings will be resolved by the networking research and development community.

Media-specific development of network and transport layers leaves systems with potential interoperability and legacy problems. While it is possible, for example, to route Fibre Channel frames across HiPPI networks, creating such exchange protocols for every possible pair of network technologies results in O(N²) protocols. If other networks cannot provide in-order delivery and support the flow control mechanisms Fibre Channel expects (either for class 2 or class 3 service), interoperability will be difficult. Additions of new clients or devices to the network are limited to network technologies which interoperate, regardless of external forces such as economic constraints, availability of new network technologies, etc.
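The quadratic growth in exchange protocols can be made concrete with a small count. The sketch below is illustrative only: the function names and the particular list of technologies are chosen for this example, and it assumes one exchange protocol per unordered pair of technologies, compared with one IP encapsulation per technology when a common internetwork layer is used.

```python
from itertools import combinations

def pairwise_exchange_protocols(technologies):
    """One exchange protocol for every unordered pair of technologies."""
    return list(combinations(technologies, 2))

def common_layer_adaptations(technologies):
    """With a shared internetwork layer, each technology needs one IP mapping."""
    return [(tech, "IP") for tech in technologies]

# Technologies named in the text; the selection is only an example.
techs = ["Fibre Channel", "HiPPI", "SSA", "1394", "ethernet", "Myrinet"]

pairs = pairwise_exchange_protocols(techs)
adaptations = common_layer_adaptations(techs)
print(len(pairs), len(adaptations))   # N*(N-1)/2 versus N
```

For six technologies this gives 15 pairwise protocols against 6 encapsulations, and the gap widens with every technology added.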
Most developers and users of such networks (and devices for them) have discarded the possibility of adapting the TCP/IP protocol suite for communicating with NAPs¹. Reasons cited include inappropriateness for the type of traffic, TCP’s wide-area tuning, the complexity of TCP, and especially performance of the entire TCP/IP suite in both bandwidth and latency [G+97]. We argue that these concerns are misplaced for several reasons.

Complexity is inherent as systems become larger. TCP/IP’s complexity was not created arbitrarily, but developed in response to particular external stimuli, solving the problems presented above. Fibre Channel and other network transport protocols will have to face similar decisions as they attempt to address the same problems.

Earlier analyses of TCP/IP performance may have been based on poorly tuned or now outdated TCP/IP implementations. The Fibre Channel standardization effort, for example, was begun in 1988; much of the research on efficient networking implementations cited here has been conducted in the last decade. Older implementations often impose penalties such as extra data copies, separate checksumming passes, and inefficient demultiplexing of incoming data. Some of the improvements will be discussed in section 7.
`
3.5 Device Command Model

Once the architectural decision has been made to connect the peripheral to a network, the most important choice is the command request interface. Disk drives traditionally use a block-level interface, while network file servers use a file model appropriate to the needs of a particular set of clients. In the course of rearchitecting distributed storage systems, other choices are possible, such as an “object” model in which the disk is responsible for layout decisions but not file naming and security [G+96].

We have chosen a block interface, enhanced for security with derived virtual devices, rather than a file or file-like model. This allows the simplest reuse of existing file system and operating system technology (data layout, partitioning, mode manipulation, the sd SCSI disk driver, fsck and format, virtual memory interaction, etc.). This also allows the broadest range of uses of the device, providing clients with the choice to build a Fast File System (FFS) or log-structured file systems, non-Unix file systems, network RAID, and other uses of raw partitions such as swap space, hierarchical storage management cache and databases.

¹Most of these net technologies can carry IP traffic for host-to-host communication, however.
`
4 Related Work

The projects most similar to Netstation are MIT’s ViewStation [HAI+ and Cambridge’s Desk Area Network (DAN) [BHMP95]. Both use ATM networks as their device interconnect, and establish a physical boundary to the system for security purposes, while Netstation uses protocol-based security. They have defined a useful taxonomy of dumb, supervised and smart devices.

Network-attached storage is an area of much current research and development. Fibre Channel disk drives, new distributed file server architectures, and custom development of storage networks all play a part.

A TCP/IP RAID controller was developed at Lawrence Livermore National Labs; it is the first TCP disk device of which we are aware [WM95].

Mainframe channels have been extended to run over phone lines and even WANs for remote device mirroring; products from CNT, EMC and others perform such functions. However, they typically use media-specific protocols.

Fibre Channel-attached disk drives utilize a simple SCSI block interface with no security on a moderately complex network. As described above, this raises concerns about security, scalability, legacy systems and interoperability with other types of networks.

The CMU Parallel Data Lab’s Network Attached Secure Disk (NASD) project divides NFS-like functionality between a file manager and the disk drives themselves, so that not only read and write commands but also attribute set and get are executed at the drive [G+96, RG96, G+97]. The file manager is responsible primarily for verifying credentials and establishing access tokens.

Soltis’ Global File System (GFS) uses Fibre Channel disk drives modified to support a lock primitive [SRO96, Sol97]. This provides simple, efficient distributed locking. The drive itself attaches no meaning to the locks; by convention among the clients, they are used to lock inodes and other data structures.

The Petal distributed disk system provides a virtual disk model on which the Frangipani distributed file system is built [LT96, TML97]. Due to the virtualization of storage space, their model moves the actual backing store allocation to the disk, though the interface between Petal and Frangipani is a block-level one.

Various system vendors have developed their own networks on which distributed device sharing takes place. VAXclusters [KLS86] and ServerNet [HG97] are two examples which use message passing between devices and hosts; the new SGI Origin series uses custom hardware to implement a shared address space on a switched network which includes many processors and I/O nodes [LL97].
`
Figure 2: Netstation CPU node OS components. sd = SCSI disk device driver, esp = SCSI bus adapter driver, VISA = Virtual Internet SCSI Adapter, VM = virtual memory, FFS = Fast File System (diagram of a Netstation CPU node (Sun): user applications above the kernel; in the kernel, TCP and UDP over IP, the Myrinet API and ethernet drivers leading to Myrinet and ethernet, and SCSI bus and Fibre Channel attachments)
`
5 VISA Architecture

VISA, Netstation’s Virtual Internet SCSI Adapter, is an operating system module which makes Internet-attached peripherals appear as if they were attached to a local SCSI bus.

VISA implements an instantiation of the scsi_transport structure. SunOS uses a layered device driver model for SCSI devices. Device drivers which present the standard block or raw interfaces found in /dev are specific to a type of peripheral, such as sd for SCSI disks and st for SCSI tapes. For SCSI devices, these drivers in turn depend on lower-level services to communicate with the specific type of host adapter present. The scsi_transport structure consists primarily of pointers to nine high-level functions that implement a well-defined interface for sending commands to a SCSI device and managing the memory for returned data.
In figure 2, the sd SCSI disk driver is shown transmitting requests to VISA and to esp. Esp is the standard SCSI adapter type present on Sun SPARC workstations. Additional implementations exist for third-party SCSI adapters. It is at this layer that support for networked SCSI, such as Fibre Channel, is installed. Because we are using standard SCSI commands, no additional packaging or formatting, such as XDR, is required.
VISA transmits packets by calling UDP, which uses IP to send packets over any supported network medium. We have used both 100bT ethernet and Myrinet in these experiments.

We use UDP as the transport-layer protocol because we expected UDP to be faster than TCP as well as easier to work with inside the kernel. On top of UDP we found it necessary to build a simple reliability layer. While the network itself is highly reliable, SunOS provides only limited buffering in the sockets, and packets are discarded when the buffer is full. Our reliability layer works with a fixed-size window and assumes in-order delivery; on every 48KB transmitted, the sender pauses for an ACK. On receipt of out-of-order data, a NAK is sent and the sender rolls back. A timeout of one second is currently implemented only at the device (IPdisk). The host depends on either IPdisk to discover network glitches, or the higher-level I/O request to time out and be restarted within the sd device driver. This is purely a convenience; a production-quality implementation would have timers at both ends.
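The window behaviour described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the VISA code: the function name `transfer`, the `lost_attempts` parameter and the in-memory "channel" are hypothetical, and the one-second device timeout is modelled simply as the sender resuming from the receiver's expected sequence number. Only the 8KB payload and the 48KB window come from the text.

```python
PAYLOAD = 8 * 1024                    # 8 KB data payload per packet (from the text)
WINDOW = 48 * 1024                    # sender pauses for an ACK every 48 KB (from the text)
PKTS_PER_WINDOW = WINDOW // PAYLOAD   # 6 packets per window

def transfer(data, lost_attempts=frozenset()):
    """Simulate one transfer over a lossy, in-order channel.

    lost_attempts holds the indices of transmission attempts that the
    (hypothetical) channel drops. Returns (bytes delivered, attempts made).
    """
    chunks = [data[i:i + PAYLOAD] for i in range(0, len(data), PAYLOAD)]
    delivered = []      # receiver: payloads accepted so far, in order
    expected = 0        # receiver: next sequence number it will accept
    attempts = 0        # sender: packets put on the wire, including resends
    base = 0            # sender: first sequence number of the current window

    while base < len(chunks):
        for seq in range(base, min(base + PKTS_PER_WINDOW, len(chunks))):
            dropped = attempts in lost_attempts
            attempts += 1
            if dropped:
                continue          # packet never reaches the receiver
            if seq != expected:
                break             # out of order: receiver NAKs, sender rolls back
            delivered.append(chunks[seq])
            expected += 1
        # An ACK, a NAK, or a timeout after a lost window tail all leave the
        # sender resuming at the receiver's expected sequence number.
        base = expected

    return b"".join(delivered), attempts

data = bytes(range(256)) * 400        # 102,400 bytes, i.e. 13 packets
out, sent = transfer(data, lost_attempts={2})
```

With the third transmission dropped, the fourth packet arrives out of order, the sender rolls back to the missing sequence number, and the transfer completes in 15 transmissions instead of 13.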
We use 8 kilobyte data payloads, plus a small header including sequence numbers, on top of the UDP packet. Over Myrinet, this is a single packet. Over ethernet, this forces IP fragmentation in order to work within the 1500 byte MTU.
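The amount of fragmentation this implies can be estimated from standard IPv4 and UDP header sizes. The following is a rough sketch: the 16-byte VISA header and the 9000-byte Myrinet MTU are assumptions made only for illustration (the text says just that the header is small and that the datagram fits in one Myrinet packet), and the function name is hypothetical.

```python
import math

IP_HEADER = 20      # bytes, IPv4 header without options
UDP_HEADER = 8      # bytes

def ip_fragments(payload, visa_header, mtu):
    """Number of IPv4 fragments needed to carry one UDP datagram."""
    datagram = UDP_HEADER + visa_header + payload   # bytes carried inside IP
    per_fragment = (mtu - IP_HEADER) // 8 * 8       # fragment data is a multiple of 8 bytes
    return math.ceil(datagram / per_fragment)

ASSUMED_VISA_HEADER = 16     # hypothetical size of the sequence-number header
ASSUMED_MYRINET_MTU = 9000   # hypothetical; only needs to exceed the datagram size

ethernet_frags = ip_fragments(8 * 1024, ASSUMED_VISA_HEADER, 1500)
myrinet_frags = ip_fragments(8 * 1024, ASSUMED_VISA_HEADER, ASSUMED_MYRINET_MTU)
print(ethernet_frags, myrinet_frags)
```

Under these assumptions each 8KB payload becomes six ethernet frames but a single Myrinet packet, which is consistent with the higher per-packet processing cost the paper attributes to ethernet's smaller MTU.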
As is common with UDP, checksumming is turned off. The data integrity is still protected by the link layer checksum. Both hardware and software mechanisms for doing the TCP/UDP checksum with zero CPU cost are known. It can be done during data copy, DMA or transmission [PP93, FHV96].

We use one kernel pseudo-process per device attached to the VISA virtual bus. Much like NFS biods, these are responsible for the communication with the device, and are necessary because the SunOS kernel is not multithreaded.
`
5.1 IPdisk

IPdisk is our Netstation emulated disk drive. It runs as a user process on another Sun, emulating the SCSI block device command set running over UDP or TCP. It can be configured to use RAM to emulate disk storage, or to use a regular file, or access an actual raw SCSI disk. For the experiments described in this paper, IPdisk was used with a 32 MB RAM buffer, the fastest form of backing store, in order to stress the host operating system. In this mode, the physical characteristics (rotational and seek latency, zone-variable transfer rates, etc.) of a disk are not emulated, so the host CPU becomes the bottleneck.

IPdisk supports our derived virtual device model. This allows the owner of the device to download small programs which act as filters on the SCSI RPCs before those RPCs are actually executed. This feature is primarily expected to be used for sharing devices among multiple hosts and executing third-party (direct device-to-device) copies. Because the purpose of these experiments is to saturate the host CPU, DVDs, which only affect execution time at the device, are not enabled for the experiments presented here.
`
6 Performance

We have run experiments to measure the performance of regular Berkeley fast file systems built on top of VISA-attached disks. The read and write throughput for sequential accesses is detailed in the following subsections; in the next section we discuss potential improvements to this performance.

We measured the CPU utilization for file systems on VISA-attached disks and on a directly-attached fast, narrow (10 MB/sec.) SCSI bus. This provides a very direct comparison of the actual impact of different data transport mechanisms in a complete file system environment.

Our test configurations utilize a 75 MHz SPARC 20/71 with 64 MB of RAM and an 800 Mbps Sbus as the client. The CPU has 20KB/16KB I/D on-chip caches and 1MB of external cache. The STREAM benchmark reports memory copy bandwidth of 61.7 MB/s [McC98]. Identical 20/71s running IPdisk emulate the network-attached disk drives. Our absolute performance numbers are low by today’s standards due to the age of the systems used, but the relative numbers and conclusions remain valid.
The biggest performance problem we encountered is excess calls to bcopy. On send, the Myrinet device driver copies the data from the mbuf chain to a special send buffer allocated and maintained by the device driver itself in main memory. From that buffer, the data is then DMAed to the buffer RAM on the network interface card, from where it is transmitted onto the network. The nominal reason for the copy is the expense and complexity of establishing DMA mappings for arbitrary memory addresses, as well as concerns about data alignment. However, when we are sending data from complete VM pages, as in VISA or NFS, the pages are already pinned in memory, properly aligned, and large enough to more than amortize the cost of creating the mapping, making direct DMA mapping the proper choice.

The benchmark we are using is a modified version of Bonnie which reports additional CPU utilization and can loop on individual subtests to reduce the overhead of process creation. Bonnie, written by Tim Bray and available on the Internet, performs several tests to measure I/O operation overhead and throughput; we have concentrated here on throughput. 25MB files were written or read repeatedly until the total amount of data was one gigabyte.
`
6.1 Write Performance

Table 1 lists the measured and predicted performance of file writes on VISA-attached disk drives. Our write throughput is CPU-limited at 72 Mbps when using Myrinet, with kernel profiling disabled. Table 2 lists the measured write throughputs of other configurations and subsystems for comparison. Writing just to the file system buffer cache (technically, virtual memory pages, under SunOS) achieves approximately 115 Mbps (direct measurement of this number is difficult). We measured the write throughput of a file system on a physical SCSI disk (a Seagate ST31200W) as only 22 Mbps, but extrapolation from the CPU consumption to CPU saturation suggests a throughput of 110 Mbps can be achieved (assuming the presence of adequately fast disks and SCSI buses). Our current VISA throughput, therefore, is only 65% of the estimated SCSI throughput.

VISA configuration    thru
Myrinet UDP           72

Table 1: VISA Write Rates. Throughput in Mbps.
The CPU utilization figures in tables 3 and 4 were measured using a kernel compiled with gprof profiling enabled. The kernel subsystem figures are calculated by assigning each kernel function profiled to one of several groups. The raw data and tables of which functions were assigned to which groups are available for verification. These numbers are used to estimate the potential performance gains for different optimizations.

It can be seen from these CPU utilization numbers that, except for bcopy, the networking CPU cost is only about half the file system CPU cost and comparable to the “miscellaneous” kernel functions.

As described above, when using Myrinet under SunOS, write carries a penalty of an unnecessary data copy, and the 100bT ethernet driver appears to have a similar problem. We estimate the performance possible if this copy can be eliminated at approximately 89 Mbps on Myrinet, or approximately 81% of the maximum estimated SCSI throughput, as follows: 147.5 - 16.4 = 131.1 CPU seconds for measured VISA UDP after removing the gprof overhead; 131.1 - 24.7 = 106.4 CPU seconds after removing the bcopy. Multiplying by the measured bandwidth, 72 * 131/106 = 89 Mbps for our estimate of the bcopy-less CPU saturation point. The other numbers are estimated similarly.
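The estimate above can be reproduced in a few lines. The sketch below uses only figures quoted in the text (the Table 3 CPU times, the measured 72 Mbps and the 110 Mbps SCSI estimate); the function and variable names are chosen for this example, and the scaling assumes, as the paper does, that throughput at CPU saturation is inversely proportional to the CPU seconds consumed.

```python
def scaled_throughput(measured_mbps, cpu_seconds_before, cpu_seconds_after):
    """Throughput at CPU saturation if the same work took fewer CPU seconds."""
    return measured_mbps * cpu_seconds_before / cpu_seconds_after

TOTAL_CPU = 147.5      # CPU seconds to write 1GB with profiling enabled
GPROF_CPU = 16.4       # CPU seconds spent in gprof profiling code
BCOPY_CPU = 24.7       # CPU seconds spent in the Myricom driver bcopy
MEASURED_MBPS = 72     # measured Myrinet UDP write throughput
SCSI_EST_MBPS = 110    # estimated SCSI write throughput at CPU saturation

unprofiled = TOTAL_CPU - GPROF_CPU     # about 131.1 CPU seconds
no_bcopy = unprofiled - BCOPY_CPU      # about 106.4 CPU seconds
estimate = scaled_throughput(MEASURED_MBPS, unprofiled, no_bcopy)

print(round(estimate), round(100 * estimate / SCSI_EST_MBPS))   # prints "89 81"
```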
For reasons explained in section 7.1, we have also predicted the performance using a UDP which consumes 40% fewer CPU cycles in conjunction with a bcopy-less write.

Application benchmarks show that sending 1GB using TCP consumed approximately 30 seconds more CPU time on send than UDP. We added these 30 seconds to the 131 seconds (after removing gprof overhead) required to write 1GB on a file system built on a VISA-attached disk. From this we estimate the TCP throughput to be 58 Mbps. Approximately one-third of the increase in CPU time is in calculating the TCP checksum; the rest is sequence management with acknowledgements and timer processing. The “faster TCP” estimate assumes elimination of checksum and bcopy, but no reduction in the other overhead.
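The TCP figure follows from the same proportional scaling. This sketch uses only the numbers quoted in the text (72 Mbps measured, 131 CPU seconds for UDP, 30 extra CPU seconds for TCP); the function name is hypothetical.

```python
def throughput_with_extra_cpu(measured_mbps, cpu_seconds, extra_cpu_seconds):
    """Scale a CPU-limited throughput down when the same transfer costs more CPU."""
    return measured_mbps * cpu_seconds / (cpu_seconds + extra_cpu_seconds)

UDP_CPU_SECONDS = 131     # VISA UDP write of 1GB, gprof overhead removed
TCP_EXTRA_SECONDS = 30    # additional CPU time measured for TCP sends
MEASURED_MBPS = 72        # measured Myrinet UDP write throughput

tcp_estimate = throughput_with_extra_cpu(MEASURED_MBPS, UDP_CPU_SECONDS, TCP_EXTRA_SECONDS)
print(f"{tcp_estimate:.1f}")
```

The result is a little under 59 Mbps, in line with the 58 Mbps estimate quoted in the text.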
Configuration
tuned NFS
app. to FS buffer cache (VM pages)
host memory to NIC memory via Sbus
SCSI through FS @19% CPU
SCSI @100% CPU (est.)
UDP blast
TCP blast

Table 2: Write Rate Comparison (Mbps)

Module                       time (secs)   % CPU
user:
  benchmark application         31.7       21.2%
kernel:
  file system code              37.9       25.4%
  Myricom driver bcopy          24.7       16.6%
  other networking code         18.6       12.5%
  gprof profiling code          16.4       11%
  miscellaneous kernel          18.2       12.2%
total                          147.5

Table 3: CPU Utilization to Write 1GB on UDP VISA

Module                       time (secs)   % CPU
user:
  benchmark application
kernel:

Table 4: CPU Utilization to Write 1GB on an esp SCSI Bus

VISA configuration               thru
Myrinet UDP @85% CPU              60
Myrinet UDP @100% CPU (est.)      71
100bT UDP @82% CPU                53
100bT UDP @100% CPU (est.)        64
without bcopy (est.)              99
with faster UDP (est.)           109
TCP (est.)                        59
faster TCP (est.)                 82

Table 5: VISA Read Rates. Throughput in Mbps.

We recorded the distribution of command sizes on one experimental run of Bonnie writing one gigabyte. A total of 9,409 write commands were sent to the device for a total of 2,064,821 blocks (1.008GB), counting format, newfs, mount, and 1GB of file write with metadata updates.

Although the application always writes in chunks of 8KB, the operating system’s write-behind mechanism, called kluster, coalesces the smaller individual writes into larger ones when it can. As a result, more than 85% of the total data written is in commands larger than 100KB. This reduces the collective I/O operation overhead and wastes fewer disk revolutions than smaller operations.
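The command counts reported for the Bonnie write run give a feel for how much coalescing takes place. The sketch below assumes 512-byte SCSI blocks and a gigabyte counted as 1000 MB of 2^20 bytes; neither is stated in the text, but together they reproduce the quoted 1.008GB. The function names are chosen for this example.

```python
BLOCK_BYTES = 512            # assumed SCSI block size; not stated in the text
MB = 1 << 20                 # bytes per megabyte
COMMANDS = 9_409             # write commands sent to the device
BLOCKS = 2_064_821           # total blocks written

def total_gigabytes(blocks, block_bytes=BLOCK_BYTES):
    """Total data written, counting 1GB as 1000 MB (an assumption)."""
    return blocks * block_bytes / (1000 * MB)

def average_command_kb(blocks, commands, block_bytes=BLOCK_BYTES):
    """Mean size of one write command in KB."""
    return blocks * block_bytes / commands / 1024

total_gb = total_gigabytes(BLOCKS)
avg_kb = average_command_kb(BLOCKS, COMMANDS)
print(round(total_gb, 3), round(avg_kb, 1))   # prints "1.008 109.7"
```

Under these assumptions the average command carries about 110KB, or roughly fourteen of the application's 8KB writes.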
`
6.2 Read Performance

For the read tests, a 25MB file is read sequentially, the VM/file buffer cache is flushed using a SunOS ioctl, and it is reread. This process is repeated until 1GB of data has been read. The file system clearly detects sequential activity effectively and reads ahead of the process, with the result that almost all operations are 56KB long. This slightly exceeds our protocol window, creating some idle time and reducing our throughput.

Table 5 lists the measured and estimated throughput for several VISA configurations, and table 6 lists the measured and estimated throughput for related subsystems. Our VISA read throughput is 60 Mbps at 85% CPU utilization, from which we infer the CPU will saturate at 71 Mbps, essentially the same as our write rate. This is a lower percentage, 53%, of the SCSI potential