VISA: Netstation's Virtual Internet SCSI Adapter*

Rodney Van Meter†, Gregory G. Finn and Steve Hotz
Information Sciences Institute
University of Southern California
Marina del Rey, CA 90292
{rdv,finn,hotz}@ISI.Edu

*This research was sponsored by the Defense Advanced Research Projects Agency under Contract No. DABT63-93-C-0062. Views and conclusions contained in this report are the authors' and should not be interpreted as representing the official opinion or policies, either expressed or implied, of ARPA, the U.S. Government, or any person or agency connected with them.

†Author's current address: Quantum Corp., Milpitas, CA; Rodney.VanMeter@qntm.com.

Abstract

In this paper we describe the implementation of VISA, our Virtual Internet SCSI Adapter. VISA was built to evaluate the performance impact on the host operating system of using IP to communicate with peripherals, especially storage devices. We have built and benchmarked file systems on VISA-attached emulated disk drives using UDP/IP. By using IP, we expect to take advantage of its scaling characteristics and support for heterogeneous media to build large, long-lived systems. Detailed file system and network CPU utilization and performance data indicate that it is possible for UDP/IP to reach more than 80% of SCSI's maximum throughput without the use of network coprocessors. We conclude that IP is a viable alternative to special-purpose storage network protocols, and presents numerous advantages.

1 Introduction

Storage system architectures are increasingly network-oriented, exploiting the ubiquity of networks to replace the direct host channel. Peripherals attached directly to networks are called network-attached peripherals (NAPs), or more specifically, network-attached storage devices (NASDs).

We have proposed that NAPs use the Internet protocol suite; in this paper we provide data on a sample implementation which supports the claim that the Internet Protocol (IP) can perform acceptably for NAPs.

In the experiments presented in this paper, we have used the User Datagram Protocol (UDP) as a transport protocol to send and receive SCSI commands and data to emulated network-attached disk drives. We have achieved data rates of 70+ megabits per second (Mbps) for read and write through the file system over Myrinet before becoming CPU limited, and known optimizations could be expected to raise that to approximately 95 Mbps for write and 110 Mbps for read. TCP is predicted to be 10-25% slower than UDP. Fast ethernet is only 10% slower than Myrinet, with the difference caused primarily by ethernet's smaller MTU raising the amount of per-packet processing which must be done. Our analysis predicts that SCSI performance on the same hardware would become CPU-bound at 110 Mbps write, 133 Mbps read. Our conclusion is that more than 80% of maximum SCSI performance can be achieved using IP.

By demonstrating comparable performance, we open the door to the adoption of the Internet protocol suite by operating system vendors and disk drive manufacturers as an alternative to the numerous storage networks now being developed. This adoption would improve the sharability and scalability of storage systems with respect to numbers of network-attached devices and client systems, while substantially reducing legacy problems as technology advances, shortening development time and leveraging networking technology. In addition, TCP/IP enables such new wide-area uses as remote backup and mirroring of devices across the Internet.

This work was done in the course of the Netstation project, which concentrates on operating systems, network protocols, hardware mechanisms, and security and sharing models for network-attached peripherals. Some of the goals of the project are to demonstrate (1) that IP can provide acceptable performance in a host operating system when used to access peripherals, (2) that IP can be implemented efficiently inside network-attached peripherals, and (3) that our derived virtual device model enables efficient, secure use of NAPs. This paper addresses only the first point; the other two are presented in separate papers [HVF98, VHF96].

The paper begins with a brief description of Netstation, the network-as-backplane system architecture of which VISA forms a part. Section 3 describes the principles of networking as applied to network-attached peripherals, with special emphasis on the problems of scalability and host OS adaptation. We then describe related work, followed by the VISA architecture, performance, and possible performance improvements. Finally, we present our conclusions.

2 Netstation

Netstation is a heterogeneous distributed system composed of processor nodes and network-attached peripherals [Fin91, FM94]. The peripherals are attached to a shared 640 Mbps Myrinet network or to a 100 Mbps ethernet, as in Figure 1.

[Figure 1: A Netstation network. CPU nodes and network-attached peripherals (hi-def display, camera, magnetic disk, RAM disk) connected by a local area network.]

The display and camera NAPs have been built; the other NAPs are currently emulated. The CPU nodes are SPARC 20/71 workstations running SunOS 4.1.3; the ideal Netstation CPU node would have only a CPU, memory and a network interface.

Connecting peripherals directly to the network allows sharing of resources and improves system configuration flexibility. Network clients can access peripherals without the intervention of a server.

Because the devices are attached to an open network with both trusted and untrusted nodes on the net, security at the NAPs is critical. We have developed a model we refer to as the derived virtual device, or DVD [VHF96]. DVDs provide a protected execution context at the device, allowing direct use of the devices by untrusted clients, such as user applications. The owner of a device defines the security policy, downloads a description to the NAP, and the NAP enforces the policy. This allows the owner to define a set of resources and operations allowed. Thus, a camera can be granted write access to only a specific region of a frame buffer, or a user application can be given read-only access to a DVD which represents a disk-based file or disk partition.

The only network-attached peripheral used during the experiments described in this paper is IPdisk, our emulated network-attached disk drive. IPdisk will be described in more detail in section 5.1.

VISA, our Virtual Internet SCSI Adapter, is the primary topic of this paper. It is the OS mechanism for supporting access to storage peripherals via the network.

3 Networking for NAPs

Netstation's shift to network-attached peripherals has a profound impact on the overall system architecture. In this section, we explore the technological and architectural motivations to make the change and its effect on host operating systems. We discuss the problems that must be solved to build large systems on heterogeneous networks, and show how choosing the TCP/IP suite solves these problems. Finally, we briefly discuss the appropriate high-level interface NAPs should present.

Developers of storage network technologies have started with different goals and assumptions about important issues such as number and type of devices and hosts to be interconnected, physical distance, cost and bandwidth. The result has been a proliferation of technologies developed primarily for NAPs, including 1394 (Firewire), Fibre Channel fabrics and Arbitrated Loop, HiPPI, and Serial Storage Architecture (SSA), as well as vendor-specific networks. Most of these have included development of complementary new physical, link, network and transport layers [Van96].

3.1 Motivation

Netstation uses network-attached peripherals to better share peripherals and take advantage of the relative technological trends of buses, networks, and peripheral processors.

The architectural reasons to shift from host adapter-attached devices to network-attached are better sharing of devices and reduction of the server's workload [RG96]. By allowing clients to directly access the devices, the server is no longer in the data path, reducing latency and demands on its buses, memory and processors. Devices can also communicate directly with each other without sending data across
a single shared system bus.

Buses do not scale well. They do not scale in distance; shortening a bus can raise data rates, forcing a direct trade-off in design. They do not scale in the number of devices interconnected, generally having a firm upper bound below twenty. They do not scale in aggregate bandwidth with the number of devices, as bandwidth is shared among the connected devices, and, due to increased capacitance, available bandwidth may actually decrease as devices are added.

Networks, especially serial optical networks, are improving rapidly in speed, typically scale well to large numbers of nodes, and can stretch over significant distances. Such networks are pushing into the gigabit per second range, a speed comparable to popular low-end buses such as PCI. Multicomputers and clustered systems have interconnected processors via networks for many years; I/O systems are now adopting networks to receive the same benefits.

3.2 Host I/O System and Networking Interaction

There are several characteristics which differentiate I/O for file systems from interactive application-level network I/O such as HTTP or telnet. Netstation uses some of these to improve the operating system efficiency, but further improvements are possible.

File transfers are generally large, page-aligned multiples of the system's page size (in our case, 4KB). On write, they are already pinned into memory and mapped into the kernel address space by the file system before they are handed to the bus or network subsystem. This may reduce the mapping and copying operations typically used to move data from user buffers to kernel buffers. On read, pages have already been selected, so the memory destination of incoming data is known in advance.

The file system cares only about the completion of the entire I/O operation. Data buffers will not be released to the application or virtual memory (VM) system until the read or write is finished. Thus, it is unnecessary to send partial completion status up the protocol stack until the entire transfer has either completed or failed. Typically, modern SCSI host bus adapters post only one interrupt to the host on completion or error, with requests that may be megabytes long. The host OS maintains a simple timer for the completion of the entire I/O. Sophisticated cards maintain one context per target device on the bus, and are capable of multiplexing among them. Current Fibre Channel interfaces are approaching a similarly autonomous level of operation, implementing the FC transport layer in the network interface card.

3.3 Scaling Problems Faced by I/O Networks

Netstation adopted the TCP/IP suite to leverage off the Internet community's experience with scale and heterogeneity. We believe that this experience makes the TCP/IP suite an increasingly appropriate choice and LAN-specific solutions increasingly inappropriate as more clients and servers are attached to more and larger heterogeneous storage networks.

The I/O network technologies deployed to date appear to offer only limited scalability, due to weakness in one or more areas. Media bridging, heterogeneity along many axes, security, latency tolerance, and congestion and flow control are several important areas that must be addressed.

Media bridging or routing becomes important as more hosts spread over more hops, and more legacy systems (both hosts and nets) must be accommodated. In many environments, support for multiple networks and complex topologies, of the same or different types, is likely to be critical. This brings up issues of formatting, addressing, and especially underlying protocol interoperability.

3.4 Networking Protocols

In the Netstation project, we have chosen to work with UDP/IP and TCP/IP, reasoning that the more general architecture provides features that will be critical to network-attached peripherals in the long run. We expect that the performance shortcomings will be resolved by the networking research and development community.

Media-specific development of network and transport layers leaves systems with potential interoperability and legacy problems. While it is possible, for example, to route Fibre Channel frames across HiPPI networks, creating such exchange protocols for every possible pair of network technologies results in O(N^2) protocols. If other networks cannot provide in-order delivery and support the flow control mechanisms Fibre Channel expects (either for class 2 or class 3 service), interoperability will be difficult. Additions of new clients or devices to the network are limited to network technologies which interoperate, regardless of external forces such as economic constraints, availability of new network technologies, etc.

Most developers and users of such networks (and devices for them) have discarded the possibility of adapting the TCP/IP protocol suite for communicating with NAPs¹. Reasons cited include inappropriateness for the type of traffic, TCP's wide-area tuning, the complexity of TCP, and especially performance of the entire TCP/IP suite in both bandwidth and latency [G+97]. We argue that these concerns are misplaced for several reasons.

Complexity is inherent as systems become larger. TCP/IP's complexity was not created arbitrarily, but developed in response to particular external stimuli, solving the problems presented above. Fibre Channel and other network transport protocols will have to face similar decisions as they attempt to address the same problems.

Earlier analyses of TCP/IP performance may have been based on poorly tuned or now outdated TCP/IP implementations. The Fibre Channel standardization effort, for example, was begun in 1988; much of the research on efficient networking implementations cited here has been conducted in the last decade. Older implementations often impose penalties such as extra data copies, separate checksumming passes, and inefficient demultiplexing of incoming data. Some of the improvements will be discussed in section 7.

¹Most of these network technologies can carry IP traffic for host-to-host communication, however.

3.5 Device Command Model

Once the architectural decision has been made to connect the peripheral to a network, the most important choice is the command request interface. Disk drives traditionally use a block-level interface, while network file servers use a file model appropriate to the needs of a particular set of clients. In the course of rearchitecting distributed storage systems, other choices are possible, such as an "object" model in which the disk is responsible for layout decisions but not file naming and security [G+96].

We have chosen a block interface, enhanced for security with derived virtual devices, rather than a file or file-like
model. This allows the simplest reuse of existing file system and operating system technology (data layout, partitioning, mode manipulation, the sd SCSI disk driver, fsck and format, virtual memory interaction, etc.). This also allows the broadest range of uses of the device, providing clients with the choice to build a Fast File System (FFS) or log-structured file systems, non-Unix file systems, network RAID, and other uses of raw partitions such as swap space, hierarchical storage management cache and databases.

4 Related Work

The projects most similar to Netstation are MIT's ViewStation [HAI+] and Cambridge's Desk Area Network (DAN) [BHMPSS]. Both use ATM networks as their device interconnect, and establish a physical boundary to the system for security purposes, while Netstation uses protocol-based security. They have defined a useful taxonomy of dumb, supervised and smart devices.

Network-attached storage is an area of much current research and development. Fibre Channel disk drives, new distributed file server architectures, and custom development of storage networks all play a part.

A TCP/IP RAID controller was developed at Lawrence Livermore National Labs; it is the first TCP disk device of which we are aware [WM95].

Mainframe channels have been extended to run over phone lines and even WANs for remote device mirroring; products from CNT, EMC and others perform such functions. However, they typically use media-specific protocols.

Fibre Channel-attached disk drives utilize a simple SCSI block interface with no security on a moderately complex network. As described above, this raises concerns about security, scalability, legacy systems and interoperability with other types of networks.

The CMU Parallel Data Lab's Network Attached Secure Disk (NASD) project divides NFS-like functionality between a file manager and the disk drives themselves, so that not only read and write commands but also attribute set and get are executed at the drive [G+96, RG96, G+97]. The file manager is responsible primarily for verifying credentials and establishing access tokens.

Soltis' Global File System (GFS) uses Fibre Channel disk drives modified to support a lock primitive [SRO96, Sol97]. This provides simple, efficient distributed locking. The drive itself attaches no meaning to the locks; by convention among the clients, they are used to lock inodes and other data structures.

The Petal distributed disk system provides a virtual disk model on which the Frangipani distributed file system is built [LT96, TML97]. Due to the virtualization of storage space, their model moves the actual backing store allocation to the disk, though the interface between Petal and Frangipani is a block-level one.

Various system vendors have developed their own networks on which distributed device sharing takes place. VAXclusters [KLS86] and ServerNet [HG97] are two examples which use message passing between devices and hosts; the new SGI Origin series uses custom hardware to implement a shared address space on a switched network which includes many processors and I/O nodes [LL97].

[Figure 2: Netstation CPU node OS components. User applications sit above a kernel containing FFS, VM, the sd driver, VISA (over TCP/UDP/IP and the Myrinet API or ethernet driver), and esp, attached to Myrinet, ethernet, the SCSI bus, and Fibre Channel. sd = SCSI disk device driver, esp = SCSI bus adapter driver, VISA = Virtual Internet SCSI Adapter, VM = virtual memory, FFS = Fast File System.]

5 VISA Architecture

VISA, Netstation's Virtual Internet SCSI Adapter, is an operating system module which makes Internet-attached peripherals appear as if they were attached to a local SCSI bus.

VISA implements an instantiation of the scsi_transport structure. SunOS uses a layered device driver model for SCSI devices. Device drivers which present the standard block or raw interfaces found in /dev are specific to a type of peripheral, such as sd for SCSI disks and st for SCSI tapes. For SCSI devices, these drivers in turn depend on lower-level services to communicate with the specific type of host adapter present. The scsi_transport structure consists primarily of pointers to nine high-level functions that implement a well-defined interface for sending commands to a SCSI device and managing the memory for returned data.

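To make the layering concrete, a sketch of what such a transport vector might look like in C follows. The field names and signatures are hypothetical illustrations; the paper says only that the structure holds nine high-level function pointers for sending commands and managing memory for returned data, and the actual SunOS 4.1.3 definitions are not reproduced in it.

    /*
     * Illustrative sketch only: VISA plugs into SunOS's layered SCSI
     * driver model by supplying an instantiation of the transport
     * structure.  The member names below are assumptions, not the
     * real SunOS 4.1.3 declarations.
     */
    struct scsi_pkt;                 /* per-command packet, used only via pointer */
    struct scsi_address;

    struct visa_scsi_transport {
        int  (*tran_start)(struct scsi_pkt *pkt);      /* queue a command        */
        int  (*tran_abort)(struct scsi_pkt *pkt);      /* abort an in-flight cmd */
        int  (*tran_reset)(struct scsi_address *ap);   /* reset target or bus    */
        struct scsi_pkt *(*tran_pktalloc)(struct scsi_address *ap,
                                          int cmdlen, int statuslen);
        void (*tran_pktfree)(struct scsi_pkt *pkt);
        int  (*tran_dmaget)(struct scsi_pkt *pkt);     /* map data buffers       */
        void (*tran_dmafree)(struct scsi_pkt *pkt);
        int  (*tran_getcap)(struct scsi_address *ap, char *cap);
        int  (*tran_setcap)(struct scsi_address *ap, char *cap, int value);
    };

In this picture, VISA supplies network-backed implementations of the entry points while esp supplies bus-backed ones, which is what lets sd run unchanged above either.
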
In Figure 2, the sd SCSI disk driver is shown transmitting requests to VISA and to esp. Esp is the standard SCSI adapter type present on Sun SPARC workstations. Additional implementations exist for third-party SCSI adapters. It is at this layer that support for networked SCSI, such as Fibre Channel, is installed. Because we are using standard SCSI commands, no additional packaging or formatting, such as XDR, is required.

VISA transmits packets by calling UDP, which uses IP to send packets over any supported network medium. We have used both 100bT ethernet and Myrinet in these experiments.

We use UDP as the transport-layer protocol because we expected UDP to be faster than TCP as well as easier to work with inside the kernel. On top of UDP we found it necessary to build a simple reliability layer. While the network itself is highly reliable, SunOS provides only limited buffering in the sockets, and packets are discarded when the buffer is full. Our reliability layer works with a fixed-size window and assumes in-order delivery; on every 48KB transmitted, the sender pauses for an ACK. On receipt of out-of-order data, a NAK is sent and the sender rolls back. A timeout of one second is currently implemented only at the device (IPdisk). The host depends on either IPdisk to discover network glitches, or the higher-level I/O request to timeout and be restarted within the sd device driver. This is purely a convenience; a production-quality implementation would have timers at both ends.

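A minimal sketch of the sender side of this reliability layer, under the assumptions stated above (fixed 8KB segments, a 48KB window, in-order delivery), might look as follows. The helpers send_segment() and wait_for_reply() are hypothetical stand-ins for the kernel's UDP send path and reply handling, not real SunOS interfaces.

    #define SEG_PAYLOAD   (8 * 1024)     /* 8 KB of data per UDP segment   */
    #define WINDOW_BYTES  (48 * 1024)    /* pause for an ACK every 48 KB   */

    enum reply { REPLY_ACK, REPLY_NAK };

    extern int        send_segment(const char *buf, int len, unsigned seq); /* assumed */
    extern enum reply wait_for_reply(unsigned *acked_seq);                  /* assumed */

    static int visa_send(const char *data, int total)
    {
        unsigned seq = 0;
        int sent = 0, in_window = 0;

        while (sent < total) {
            int len = (total - sent < SEG_PAYLOAD) ? total - sent : SEG_PAYLOAD;
            send_segment(data + sent, len, seq);
            sent += len;
            in_window += len;
            seq++;

            if (in_window >= WINDOW_BYTES || sent == total) {
                unsigned acked;
                if (wait_for_reply(&acked) == REPLY_NAK) {
                    /* receiver saw out-of-order data: roll back to just
                     * after the last acknowledged segment and resend
                     * (simplified: assumes full-sized segments)          */
                    seq  = acked + 1;
                    sent = (int)(seq * SEG_PAYLOAD);
                    if (sent > total)
                        sent = total;
                }
                in_window = 0;
            }
        }
        return 0;
    }
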
We use 8 kilobyte data payloads, plus a small header including sequence numbers, on top of the UDP packet. Over Myrinet, this is a single packet. Over ethernet, this forces IP fragmentation in order to work within the 1500 byte MTU.

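The paper does not give the header layout beyond "a small header including sequence numbers"; the struct below is purely illustrative of the kind of framing involved.

    #include <stdint.h>

    struct visa_seg_hdr {            /* hypothetical per-segment header        */
        uint32_t seq;                /* segment sequence number                */
        uint32_t offset;             /* byte offset of this payload in the I/O */
        uint16_t length;             /* payload bytes that follow (<= 8192)    */
        uint16_t flags;              /* e.g. last-segment, ACK, NAK            */
    };
    /* Over Myrinet an 8 KB payload fits in one link-layer packet; over 100bT
     * ethernet IP fragments the datagram to fit the 1500-byte MTU.            */
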
As is common with UDP, checksumming is turned off. The data integrity is still protected by the link layer checksum. Both hardware and software mechanisms for doing the TCP/UDP checksum with zero CPU cost are known. It can be done during data copy, DMA or transmission [PP93, FHV96].

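The software variant of that optimization folds the 16-bit ones-complement Internet checksum into the copy loop itself, so no separate pass over the data is needed. The fragment below is a generic illustration of the technique referenced from [PP93, FHV96], not code taken from VISA.

    #include <stdint.h>

    static uint16_t copy_and_checksum(void *dst, const void *src, int len)
    {
        const uint16_t *s = (const uint16_t *)src;
        uint16_t *d = (uint16_t *)dst;
        uint32_t sum = 0;

        while (len > 1) {
            uint16_t w = *s++;
            *d++ = w;                  /* copy ...                          */
            sum += w;                  /* ... and accumulate in one pass    */
            len -= 2;
        }
        if (len) {                     /* odd trailing byte                 */
            uint8_t b = *(const uint8_t *)s;
            *(uint8_t *)d = b;
            sum += b;                  /* byte-order handling simplified    */
        }
        while (sum >> 16)              /* fold carries                      */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }
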
We use one kernel pseudo-process per device attached to the VISA virtual bus. Much like NFS biods, these are responsible for the communication with the device, and are necessary because the SunOS kernel is not multithreaded.

5.1 IPdisk

IPdisk is our Netstation emulated disk drive. It runs as a user process on another Sun, emulating the SCSI block device command set running over UDP or TCP. It can be configured to use RAM to emulate disk storage, or to use a regular file, or access an actual raw SCSI disk. For the experiments described in this paper, IPdisk was used with a 32 MB RAM buffer, the fastest form of backing store, in order to stress the host operating system. In this mode, the physical characteristics (rotational and seek latency, zone-variable transfer rates, etc.) of a disk are not emulated, so the host CPU becomes the bottleneck.

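As a rough illustration of what such an emulator involves, the fragment below sketches the core loop of an IPdisk-like user process: receive a SCSI READ(10) or WRITE(10) over a UDP socket and service it against a RAM buffer. The wire format, port number, and error handling are assumptions made for the sketch; the real IPdisk also implements the reliability layer, the rest of the block command set, and the DVD support described next.

    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define BLOCK_SIZE 512
    #define DISK_BYTES (32u * 1024 * 1024)           /* 32 MB RAM backing store */

    struct request {                                 /* assumed wire format     */
        uint8_t cdb[10];                             /* SCSI-2 10-byte CDB      */
        uint8_t data[8192];                          /* payload for WRITE       */
    };

    static uint8_t disk[DISK_BYTES];

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in me, peer;
        socklen_t plen = sizeof peer;
        struct request req;

        memset(&me, 0, sizeof me);
        me.sin_family      = AF_INET;
        me.sin_addr.s_addr = htonl(INADDR_ANY);
        me.sin_port        = htons(7777);            /* arbitrary example port  */
        bind(s, (struct sockaddr *)&me, sizeof me);

        for (;;) {
            ssize_t n = recvfrom(s, &req, sizeof req, 0,
                                 (struct sockaddr *)&peer, &plen);
            if (n < (ssize_t)sizeof req.cdb)
                continue;

            /* READ(10)/WRITE(10): LBA in CDB bytes 2-5, block count in 7-8 */
            uint32_t lba  = ((uint32_t)req.cdb[2] << 24) | ((uint32_t)req.cdb[3] << 16) |
                            ((uint32_t)req.cdb[4] << 8)  |  (uint32_t)req.cdb[5];
            uint32_t nblk = ((uint32_t)req.cdb[7] << 8)  |  (uint32_t)req.cdb[8];
            size_t   off  = (size_t)lba * BLOCK_SIZE;
            size_t   len  = (size_t)nblk * BLOCK_SIZE;

            if (off + len > DISK_BYTES || len > sizeof req.data)
                continue;                            /* real code: CHECK CONDITION */

            if (req.cdb[0] == 0x2A) {                /* WRITE(10)               */
                memcpy(disk + off, req.data, len);
                sendto(s, "OK", 2, 0, (struct sockaddr *)&peer, plen);
            } else if (req.cdb[0] == 0x28) {         /* READ(10)                */
                sendto(s, disk + off, len, 0, (struct sockaddr *)&peer, plen);
            }
        }
    }
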
IPdisk supports our derived virtual device model. This allows the owner of the device to download small programs which act as filters on the SCSI RPCs before those RPCs are actually executed. This feature is primarily expected to be used for sharing devices among multiple hosts and executing third-party (direct device-to-device) copies. Because the purpose of these experiments is to saturate the host CPU, DVDs, which only affect execution time at the device, are not enabled for the experiments presented here.

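As a purely illustrative sketch of the filtering idea (the DVD model itself is defined in [VHF96]), a derived virtual device can be pictured as an LBA window plus permissions that each incoming SCSI RPC is checked against before execution:

    #include <stdint.h>

    struct dvd {                     /* hypothetical simplification of a DVD   */
        uint32_t base_lba;           /* start of the region this client may use */
        uint32_t nblocks;            /* size of the region                      */
        int      writable;           /* 0 = read-only DVD                       */
    };

    static int dvd_admit(const struct dvd *d, uint8_t opcode,
                         uint32_t lba, uint32_t nblk)
    {
        if (lba < d->base_lba || lba + nblk > d->base_lba + d->nblocks)
            return 0;                          /* outside the virtual device    */
        if (opcode == 0x2A && !d->writable)    /* WRITE(10) on a read-only DVD  */
            return 0;
        return 1;
    }
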
6 Performance

We have run experiments to measure the performance of regular Berkeley fast file systems built on top of VISA-attached disks. The read and write throughput for sequential accesses are detailed in the following subsections, then in the next section we discuss potential improvements to this performance.

We measured the CPU utilization for file systems on VISA-attached disks and on a directly-attached fast, narrow (10 MB/sec.) SCSI bus. This provides a very direct comparison of the actual impact of different data transport mechanisms in a complete file system environment.

Our test configurations utilize a 75 MHz SPARC 20/71 with 64 MB of RAM and an 800 Mbps Sbus as the client. The CPU has 20KB/16KB I/D on-chip caches and 1MB of external cache. The STREAM benchmark reports memory copy bandwidth of 61.7 MB/s [McC98]. Identical 20/71s running IPdisk emulate the network-attached disk drives. Our absolute performance numbers are low by today's standards due to the age of the systems used, but the relative numbers and conclusions remain valid.

The biggest performance problem we encountered is excess calls to bcopy. On send, the Myrinet device driver copies the data from the mbuf chain to a special send buffer allocated and maintained by the device driver itself in main memory. From that buffer, the data is then DMAed to the buffer RAM on the network interface card, from where it is transmitted onto the network. The nominal reason for the copy is the expense and complexity of establishing DMA mappings for arbitrary memory addresses, as well as concerns about data alignment. However, when we are sending data from complete VM pages, as in VISA or NFS, the pages are already pinned in memory, properly aligned, and large enough to more than amortize the cost of creating the mapping, making it the proper choice.

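Schematically, that send path looks like the fragment below: one bcopy per mbuf into a driver-owned staging buffer, then a DMA into the NIC's buffer RAM. The struct and the dma_to_nic() helper are hypothetical placeholders for Myrinet driver internals; the point is where the extra copy sits, and that for pinned, page-aligned VM pages it could be replaced by mapping the pages for DMA directly, which is what the bcopy-elimination estimates in section 6.1 assume.

    #include <strings.h>                         /* bcopy()                      */

    struct mbuf { struct mbuf *m_next; char *m_data; int m_len; };  /* simplified */

    extern char driver_sendbuf[];                /* driver-owned staging buffer  */
    extern void dma_to_nic(char *buf, int len);  /* hypothetical DMA helper      */

    static void myri_output(struct mbuf *m)
    {
        int off = 0;

        /* the per-packet copy that shows up as "Myricom driver bcopy" in Table 3 */
        for (; m != NULL; m = m->m_next) {
            bcopy(m->m_data, driver_sendbuf + off, m->m_len);
            off += m->m_len;
        }
        dma_to_nic(driver_sendbuf, off);         /* then DMA to the NIC's RAM    */
    }
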
The benchmark we are using is a modified version of Bonnie which reports additional CPU utilization and can loop on individual subtests to reduce the overhead of process creation. Bonnie, written by Tim Bray and available on the Internet, performs several tests to measure I/O operation overhead and throughput; we have concentrated here on throughput. 25MB files were written or read repeatedly until the total amount of data was one gigabyte.

6.1 Write Performance

Table 1 lists the measured and predicted performance of file writes on VISA-attached disk drives. Our write throughput is CPU-limited at 72 Mbps when using Myrinet, with kernel profiling disabled. Table 2 lists the measured write throughputs of other configurations and subsystems for comparison.

  VISA configuration    thru
  Myrinet UDP             72

Table 1: VISA Write Rates. Throughput in Mbps.

Writing just to the file system buffer cache (technically, virtual memory pages, under SunOS) achieves approximately 115 Mbps (direct measurement of this number is difficult). We measured the write throughput of a file system on a physical SCSI disk (a Seagate ST31200W) as only 22 Mbps, but extrapolation from the CPU consumption to CPU saturation suggests a throughput of 110 Mbps can be achieved (assuming the presence of adequately fast disks and SCSI buses). Our current VISA throughput, therefore, is only 65% of the estimated SCSI throughput.

The CPU utilization figures in tables 3 and 4 were measured using a kernel compiled with gprof profiling enabled. The kernel subsystem figures are calculated by assigning each kernel function profiled to one of several groups. The raw data and tables of which functions were assigned to which groups are available for verification. These numbers are used to estimate the potential performance gains for different optimizations.

It can be seen from these CPU utilization numbers that, except for bcopy, the networking CPU cost is only about half the file system CPU cost and comparable to the "miscellaneous" kernel functions.

As described above, when using Myrinet under SunOS, write carries a penalty of an unnecessary data copy, and the 100bT ethernet driver appears to have a similar problem. We estimate the performance possible if this copy can be eliminated at approximately 89 Mbps on Myrinet, or approximately 81% of the maximum estimated SCSI throughput, as follows: 147.5 - 16.4 = 131.1 CPU seconds for measured VISA UDP after removing the gprof overhead; 131.1 - 24.7 = 106.4 CPU seconds after removing the bcopy. Multiplying by the measured bandwidth, 72 * 131/106 = 89 Mbps for our estimate of the bcopy-less CPU saturation point. The other numbers are estimated similarly.

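The same extrapolation can be written as a small helper, shown here with the Table 3 numbers as an example; the function is ours, not part of the paper's tooling.

    /* Scale measured bandwidth by the ratio of CPU work remaining after
     * subtracting profiling overhead and any component assumed eliminated. */
    static double estimate_mbps(double measured_mbps,
                                double total_cpu_s,    /* 147.5 s in Table 3   */
                                double gprof_cpu_s,    /* 16.4 s               */
                                double removed_cpu_s)  /* 24.7 s for the bcopy */
    {
        double base = total_cpu_s - gprof_cpu_s;       /* 131.1 s              */
        return measured_mbps * base / (base - removed_cpu_s);
    }
    /* estimate_mbps(72.0, 147.5, 16.4, 24.7) is roughly 89 Mbps, matching the
     * bcopy-less figure above.  The "@100% CPU" rows of Table 5 use a simpler
     * scaling: measured throughput divided by CPU utilization, e.g. 60 / 0.85
     * is about 71 Mbps.                                                        */
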
For reasons explained in section 7.1, we have also predicted the performance using a UDP which consumes 40% fewer CPU cycles in conjunction with a bcopy-less write.

Application benchmarks show that sending 1GB using TCP consumed approximately 30 seconds more CPU time on send than UDP. We added this 30 seconds to the 131 seconds (after removing gprof overhead) required to write 1GB on a file system built on a VISA-attached disk. From this we estimate the TCP throughput to be 58 Mbps. Approximately one-third of the increase in CPU time is in calculating the TCP checksum; the rest is sequence management with acknowledgements and timer processing. The "faster TCP" estimate assumes elimination of checksum and bcopy, but no reduction in the other overhead.

We recorded the distribution of command sizes on one experimental run of Bonnie writing one gigabyte. A total of 9,409 write commands were sent to the device for a total of 2,064,821 blocks (1.008GB), counting format, newfs, mount, and 1GB of file write with metadata updates.

[Table 2: Write Rate Comparison (Mbps). Configurations compared: tuned NFS; app. to FS buffer cache (VM pages); host memory to NIC memory via Sbus; SCSI through FS @19% CPU; SCSI @100% CPU (est.); UDP blast; TCP blast.]

  Module                          time (secs)   % CPU
  user: benchmark application          31.7     21.2%
  kernel: file system code             37.9     25.4%
          Myricom driver bcopy         24.7     16.6%
          other networking code        18.6     12.5%
          gprof profiling code         16.4       11%
          miscellaneous kernel         18.2     12.2%
  total                               147.5

Table 3: CPU Utilization to Write 1GB on UDP VISA

[Table 4: CPU Utilization to Write 1GB on an esp SCSI Bus. Module-by-module CPU time (secs) and % CPU for the user benchmark application and the kernel subsystems.]

  VISA configuration              thru
  Myrinet UDP @85% CPU              60
  Myrinet UDP @100% CPU (est.)      71
  100bT UDP @82% CPU                53
  100bT UDP @100% CPU (est.)        64
  without bcopy (est.)              99
  with faster UDP (est.)           109
  TCP (est.)                        59
  faster TCP (est.)                 82

Table 5: VISA Read Rates. Throughput in Mbps.

Although the application always writes in chunks of 8KB, the operating system's write-behind mechanism, called kluster, coalesces the smaller individual writes into larger ones when it can. As a result, more than 85% of the total data written is in commands larger than 100KB. This reduces the collective I/O operation overhead and wastes fewer disk revolutions than smaller operations.

6.2 Read Performance

For the read tests, a 25MB file is read sequentially, the VM/file buffer cache is flushed using a SunOS ioctl, and it is reread. This process is repeated until 1GB of data has been read. The file system is clearly effectively detecting sequential activity and reading ahead of the process, with the result that almost all operations are 56KB long. This slightly exceeds our protocol window, creating some idle time and reducing our throughput.

Table 5 lists the measured and estimated throughput for several VISA configurations, and table 6 lists the measured and estimated throughput for related subsystems. Our VISA read throughput is 60 Mbps at 85% CPU utilization, from which we infer the CPU will saturate at 71 Mbps, essentially the same as our write rate. This is a lower percentage, 53%, of the SCSI potential