VISA: Netstation’s Virtual Internet SCSI Adapter
Rodney Van Meter, Gregory G. Finn and Steve Hotz
Information Sciences Institute, University of Southern California
Marina del Rey, CA 90292
In this paper we describe the implementation of VISA, our Virtual Internet SCSI Adapter. VISA was built to evaluate the performance impact on the host operating system of using IP to communicate with peripherals, especially storage devices. We have built and benchmarked file systems on VISA-attached emulated disk drives using UDP/IP. By using IP, we expect to take advantage of its scaling characteristics and support for heterogeneous media to build long-lived systems. Detailed file system and network CPU and performance data indicate that it is possible for UDP/IP to reach more than 80% of SCSI’s maximum throughput without the use of network coprocessors. We conclude that IP is a viable alternative to special-purpose storage network protocols, and presents numerous advantages.
Storage system architectures are increasingly network-oriented, using the ubiquity of networks to replace the direct host channel. Peripherals attached directly to networks are called network-attached peripherals (NAPs), or more specifically, network-attached storage devices (NASDs). We have proposed that NAPs use the Internet protocol suite; in this paper we provide data on a sample implementation which supports the claim that the Internet Protocol (IP) can perform acceptably for NAPs.
In the experiments presented in this paper, we have used the User Datagram Protocol (UDP) as a transport protocol to send and receive SCSI commands and data to emulated network-attached disk drives. We have achieved data rates of 70+ megabits per second (Mbps) for read and write through the file system over Myrinet before becoming CPU limited, and known optimizations could be expected to raise these to approximately 95 Mbps for write and 110 Mbps for read. TCP is predicted to be 10-25% slower than UDP. Fast ethernet is only 10% slower than Myrinet, with the difference caused by ethernet’s smaller MTU increasing the amount of per-packet processing which must be done. Our analysis predicts that SCSI performance on the same hardware would become CPU-bound at 110 Mbps write, 133 Mbps read. Our conclusion is that more than 80% of maximum SCSI performance can be achieved using IP.

This research was sponsored by the Defense Advanced Research Projects Agency under Contract No. DABT63-93-C-0062. Views and conclusions contained in this report are the authors’ and should not be interpreted as representing the official opinion or policies, either expressed or implied, of ARPA, the U.S. Government, or any person or agency connected with them.
By demonstrating comparable performance, we open the door to the adoption of the Internet protocol suite by operating system vendors and disk drive manufacturers as an alternative to the numerous storage networks now being developed. This adoption would improve the sharability and scalability of storage systems with respect to numbers of network-attached devices and client systems, while [...] legacy problems as technology advances, shortening development time and leveraging networking [...]. In addition, TCP/IP enables such new wide-area uses as remote backup and mirroring of devices across the Internet.
This work was done in the course of the Netstation project, which concentrates on operating systems, network protocols, hardware mechanisms, and security and sharing models for network-attached peripherals. Some of the goals of the project are to demonstrate (1) that IP can provide acceptable performance in a host operating system when used to access peripherals, (2) that IP can be implemented inside network-attached peripherals, and (3) that our derived virtual device model enables efficient, secure use of NAPs. This paper addresses only the first point; the other two are presented in separate papers [HVF98, VHF96].
The paper begins with a brief description of Netstation, the network-as-backplane system architecture of which VISA forms a part. Section 3 describes the principles of networking as applied to network-attached peripherals, with special emphasis on the problems of scalability and host OS adaptation. We then describe related work, followed by the VISA architecture, its measured performance, and possible performance improvements. Finally, we present our conclusions.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. © 1998 ACM

2 Netstation

Netstation is a heterogeneous distributed system composed of processor nodes and network-attached peripherals [FM94]. The peripherals are attached to a shared 640 Mbps Myrinet network or to a 100 Mbps ethernet, as in Figure 1.


Figure 1: A Netstation. (Diagram labels: camera, Local Area Network, RAM Disk.)

The display and camera NAPs have been built; the other NAPs are currently under development. The CPU nodes are Sparc 20/71 workstations running SunOS 4.1.3; the ideal Netstation CPU node would have only a CPU, memory and a network interface. Attaching peripherals to the network allows sharing of resources and improves system configuration flexibility. Network clients can access peripherals without the intervention of a server.
Because the devices are attached to an open network with both trusted and untrusted nodes on the net, security at the NAPs is critical. We have developed a model we refer to as the derived virtual device, or DVD [VHF96]. DVDs provide a protected execution context at the device, allowing direct use of the devices by untrusted clients, such as user applications. The owner of a device defines the security policy, downloads a description to the NAP, and the NAP enforces the policy. This allows the owner to define a set of resources and operations allowed. Thus, a camera can be granted write access to only a specific region of a frame buffer, or a user application can be given read-only access to a DVD which represents a disk-based file or disk partition.
The only network-attached peripheral used during the experiments in this paper is IPdisk, our emulated disk drive. IPdisk will be described in more detail in section 5.1.

VISA, our Virtual Internet SCSI Adapter, is the primary topic of this paper. It is the OS mechanism for supporting access to storage peripherals via the network.
3 Networking for NAPs

The shift to network-attached peripherals has a profound impact on the overall system architecture. In this section, we explore the technological and architectural motivations to make the change and its effect on host operating systems. We discuss the problems that must be solved to build large systems on heterogeneous networks, and show how choosing the TCP/IP suite solves these problems. Finally, we briefly discuss the appropriate interface NAPs should present.
Developers of storage network technologies have started with different goals and assumptions about issues such as number and type of devices and hosts to be connected, physical distance, cost and bandwidth. The result has been a proliferation of technologies developed primarily for NAPs, including 1394 (Firewire), Fibre Channel fabrics and Arbitrated Loop, HiPPI, and Serial Storage Architecture (SSA), as well as vendor-specific networks. Most of these have included development of complementary link, network and transport layers [Van96].
Netstation uses network-attached peripherals to better share peripherals and take advantage of the relative trends of buses, networks, and peripheral processors.

The architectural motivations to shift from host adapter-attached devices to network-attached devices are better sharing of devices and reduction of the server’s workload [RG96]. By allowing clients to directly access the devices, the server is no longer in the data path, reducing latency and demands on its buses, memory and processors. Devices can also communicate directly with each other without sending data across a single shared system bus.
Buses do not scale well. They do not scale in distance; shortening a bus can raise data rates, forcing a direct trade-off in design. They do not scale in the number of devices attached, generally having a firm upper bound below twenty. They do not scale in aggregate bandwidth with the number of devices, as bandwidth is shared among the connected devices, and, due to increased capacitance, available bandwidth may actually decrease as devices are added.

Networks, especially serial optical networks, are improving rapidly in speed, scale well to large numbers of nodes, and can stretch over significant distances. Such networks are pushing into the gigabit per second range, a speed comparable to popular low-end buses such as PCI. [...] and clustered systems have interconnected processors via networks for many years; I/O systems are now adopting networks to receive the same benefits.
I/O System and Networking

There are several characteristics which differentiate file system traffic from other network applications such as HTTP or telnet. Netstation uses some of these to improve the operating system efficiency, but further improvements are possible.
File transfers are generally large, page-aligned multiples of the system’s page size (in our case, 4KB). On write, pages are already pinned into memory and mapped into the kernel address space by the file system before they are handed to the bus or network subsystem. This may reduce the mapping and copying operations used to move data from user buffers to kernel buffers. On read, pages have already been selected, so the memory destination of incoming data is known in advance.
The file system cares only about the completion of the I/O operation. Data buffers will not be released to the application or virtual memory (VM) system until the read or write is finished. It is unnecessary to send partial completion status up the protocol stack until the entire transfer has either completed or failed. Typically, modern SCSI host bus adapters post only one interrupt to the host on completion or error, with requests that may be megabytes long. The host OS maintains a simple timeout for the completion of the entire I/O. Sophisticated cards maintain one context per target device on the bus, and are capable of multiplexing among them. Current Fibre Channel
interfaces are capable of a similarly high level of operation, with the FC transport layer implemented in the network interface.
Scaling Problems Faced by I/O Networks

We have chosen the TCP/IP suite to leverage off the Internet community’s experience with scale and heterogeneity. We believe this experience makes the TCP/IP suite an [...] choice [...] and LAN-specific [...] as more clients and servers are attached to more and larger heterogeneous storage networks. I/O network technologies deployed to date appear to offer only partial solutions due to weakness in one or more areas. Media bridging, heterogeneity along many axes, fault tolerance, and congestion and flow control are several areas that must be addressed.
Media bridging or routing becomes important as more hosts spread over more hops, and more legacy systems (both hosts and nets) must be accommodated. In many environments, support for multiple networks and complex topologies, of the same or different media, is likely to be critical. This brings up issues of formatting, addressing, and especially underlying [...].

For the Netstation project, we have chosen to work with [...] and TCP/IP; the more general architecture provides [...] that will be critical to network-attached peripherals in the long run. We expect that any shortcomings will be resolved by the networking research and development community.
Media-specific development of network and transport layers leaves systems with potential interoperability and legacy problems. While it is possible, for example, to route Fibre Channel frames across HiPPI networks, creating such exchange protocols for every possible pair of network technologies results in O(N^2) protocols. If other networks cannot provide in-order delivery and support the flow control mechanisms Fibre Channel expects (either for class 2 or class 3 service), interoperation will be difficult. Additions of new clients or devices to the network are limited to network technologies which [...], regardless of external forces such as economic constraints, availability of new network technologies, etc.
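
The O(N^2) growth mentioned above can be made concrete with a small counting sketch. It is an illustration only: the function names are hypothetical, and it assumes one dedicated exchange protocol per unordered pair of technologies, versus one adaptation per technology when all of them carry a common internetwork layer such as IP.

```python
def pairwise_exchange_protocols(n_technologies: int) -> int:
    """Dedicated exchange protocols needed if every pair of network
    technologies is bridged directly (one per unordered pair)."""
    return n_technologies * (n_technologies - 1) // 2


def common_layer_adaptations(n_technologies: int) -> int:
    """Adaptations needed if each technology instead carries one shared
    internetwork layer (assumption: one adaptation per technology)."""
    return n_technologies


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n, pairwise_exchange_protocols(n), common_layer_adaptations(n))
```

The pairwise count grows quadratically while the common-layer count grows linearly, which is the scaling argument made in the paragraph above.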
Most developers and users of such networks (and devices for them) have discarded the possibility of adapting the TCP/IP protocol suite for communicating with NAPs (footnote 1). Reasons cited include unsuitability for the type of traffic, TCP’s wide-area orientation, the complexity of TCP, and especially performance of the entire TCP/IP stack in both throughput and latency [G+97]. We argue that these concerns are misplaced for several reasons.
Complexity is inherent as systems become larger. TCP/IP’s complexity was not created arbitrarily, but developed in response to particular problems, including the problems presented above. Fibre Channel and other network transport protocols will have to face similar decisions as they evolve to address the same problems.

Earlier analyses of TCP/IP performance may have been based on poorly tuned or now outdated TCP/IP implementations. The Fibre Channel standardization effort, for example, was begun in 1988; much of the research on efficient networking implementations cited here has been conducted in the last decade. Older implementations often impose penalties such as extra data copies, separate checksumming passes, and inefficient demultiplexing of incoming data. Some of the improvements will be discussed in section 7.
Device Command Interface

Once the architectural decision has been made to connect the peripheral to a network, the most important choice is the command interface. Disk drives use a block-level interface, while network file servers use a file model appropriate to the needs of a particular set of clients. In the course of rearchitecting storage systems, other choices are possible, such as an “object” model in which the disk is responsible for layout decisions but not file naming and security.

We have chosen a block interface, enhanced with derived virtual devices for security, rather than a file or file-like model. This allows the simplest reuse of existing file system and operating system code, including layout, partitioning, mode manipulation, the sd SCSI disk driver, and virtual memory. This also allows the broadest range of uses of the device, providing clients with the choice to build a Fast File System (FFS) or [...] file systems, non-Unix file systems, network RAID, and other uses of raw partitions such as swap space, storage management cache and databases.

Footnote 1: Some of these networks can carry IP traffic for host-to-host communication.
4 Related Work

The projects most similar to Netstation are MIT’s ViewStation [HAI+ and Cambridge’s Desk Area Network. Both use ATM networks as their device interconnect and establish a physical boundary to the system for security purposes, while Netstation uses protocol-based security. They have defined a useful classification of dumb, supervised and smart devices.
Network-attached storage is an area of much current research and development. Fibre Channel disk drives, new distributed file server architectures, and custom development of storage networks all play a part.

A TCP/IP RAID controller was developed at Lawrence Livermore National Labs; it is the first TCP disk device of which we are aware [WM95].
Mainframe channels have been extended to run over phone lines and even WANs for remote device mirroring; products from CNT, EMC and others perform such functions. However, they use media-specific protocols.

Fibre Channel-attached disk drives utilize a simple SCSI interface with no security on a moderately [...] network. As described above, this raises concerns about security, scalability, legacy systems and interoperability across types of networks.
The CMU Parallel Data Lab’s Network Attached Secure Disk (NASD) project divides NFS-like functionality between a file manager and the disk drives themselves, so that not only read and write commands but also attribute set and get are executed at the drive [G+96, RG96, G+97]. The file manager is responsible primarily for verifying credentials and establishing access tokens.
Soltis’ Global File System (GFS) uses Fibre Channel disk drives modified to support a lock primitive [SRO96, Sol97]. This provides simple, efficient distributed locking. The drive itself attaches no meaning to the locks; by convention among the clients, they are used to lock inodes and other data structures.

The Petal distributed disk system provides a virtual disk model on which the Frangipani file system is built [LT96, TML97]. Due to the virtualization of storage space, their model moves the actual backing store allocation to the disk, though the interface between Petal and Frangipani is a block-level one.
Various system vendors have developed their own networks on which distributed device sharing takes place. VAXclusters [KLS86] and ServerNet [HG97] are two examples which use message passing between devices and hosts; the new SGI Origin series uses custom hardware to implement a shared address space on a switched network which connects many processors and I/O nodes [LL97].
Figure 2: Netstation CPU node OS components. sd = SCSI disk device driver, esp = SCSI bus adapter driver, VISA = Virtual Internet SCSI Adapter, VM = virtual memory, FFS = Fast File System. (Diagram labels: Netstation CPU Node (Sun), Myrinet API, SCSI bus.)
5 VISA Architecture

VISA, Netstation’s Virtual Internet SCSI Adapter, is an operating system module which makes network-attached peripherals appear as if they were attached to a local SCSI bus. It implements an instantiation of the host adapter driver interface.

SunOS uses a layered device driver model for SCSI devices. Device drivers which present the standard block or raw interfaces in /dev are specific to a type of peripheral, such as sd for SCSI disks and st for SCSI tapes. For SCSI devices, these drivers in turn depend on lower-level services to communicate with the specific type of host adapter present. The scsi_transport structure consists primarily of pointers to nine high-level functions, providing a well-defined interface for sending commands to a SCSI device and managing the memory for returned data.
In figure 2, the sd SCSI disk driver is shown transmitting requests to VISA and to esp. Esp is the standard SCSI host adapter type present on Sun SPARC workstations. Other drivers exist for third-party SCSI adapters. It is at this level that support for networked SCSI, such as Fibre Channel, is installed. Because we are using standard SCSI commands, no additional packaging or formatting, such as XDR, is required.
VISA transmits packets by calling UDP, which uses IP to send packets over any supported network medium. We have used both 100bT ethernet and Myrinet in these experiments.

We use UDP as the transport-layer protocol because we expected UDP to be faster than TCP as well as easier to work with inside the kernel. On top of UDP we found it necessary to build a simple reliability layer. While the network is highly reliable, SunOS provides only limited buffering in the sockets, and packets are discarded when the buffer is full. Our reliability layer works with a fixed-size window and assumes in-order delivery; on every 48KB boundary the sender pauses for an ACK. On receipt of out-of-order data, a NAK is sent and the sender rolls back. A timeout of one second is currently implemented only at the device (IPdisk). The host depends on either IPdisk to discover network glitches, or the higher-level SCSI command to time out and be restarted within the sd device driver. This is purely a convenience; a production-quality implementation would have timers at both ends.
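
The behaviour of such a fixed-window layer can be sketched as a small in-memory simulation. This is not the VISA code: the class and function names are hypothetical, packet loss is simulated by a set of sequence numbers dropped on first transmission, and a lost packet at the end of a window is modelled as an immediate rollback rather than the one-second device timeout described above.

```python
PAYLOAD = 8 * 1024                     # 8 KB data payload per packet
WINDOW = 48 * 1024                     # sender pauses for an ACK every 48 KB
PKTS_PER_WINDOW = WINDOW // PAYLOAD    # 6 packets per window


class Receiver:
    """Accepts packets strictly in order; anything else produces a NAK."""

    def __init__(self):
        self.expected = 0
        self.chunks = []

    def deliver(self, seq, data):
        if seq != self.expected:
            return ("NAK", self.expected)      # ask the sender to roll back
        self.chunks.append(data)
        self.expected += 1
        return ("OK", self.expected)


def send(data, receiver, drop_once=frozenset()):
    """Send `data` through the window protocol. Sequence numbers listed in
    `drop_once` are lost on their first transmission only (simulated loss).
    Returns the number of packets put on the wire."""
    packets = [data[i:i + PAYLOAD] for i in range(0, len(data), PAYLOAD)]
    already_dropped = set()
    sent = 0
    seq = 0
    while seq < len(packets):
        # The window ends at the next 48 KB boundary (or at end of data).
        window_end = min((seq // PKTS_PER_WINDOW + 1) * PKTS_PER_WINDOW,
                         len(packets))
        rollback_to = None
        for s in range(seq, window_end):
            sent += 1
            if s in drop_once and s not in already_dropped:
                already_dropped.add(s)
                continue                        # packet lost in the network
            status, expected = receiver.deliver(s, packets[s])
            if status == "NAK":
                rollback_to = expected          # out-of-order data seen
                break
        if rollback_to is not None:
            seq = rollback_to
        elif receiver.expected < window_end:
            seq = receiver.expected             # tail loss: modelled timeout
        else:
            seq = window_end                    # ACK for the whole window
    return sent


if __name__ == "__main__":
    payload = bytes(i % 251 for i in range(96 * 1024))
    rx = Receiver()
    wire_packets = send(payload, rx, drop_once={2})
    print(wire_packets, b"".join(rx.chunks) == payload)
```

With a single early loss the sender retransmits only from the missing packet onward, which is why a rollback scheme is adequate on a network that is, as stated above, highly reliable.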
We use 8 kilobyte data payloads, plus a small header including sequence numbers, on top of the UDP packet. Over Myrinet, this is a single packet. Over ethernet, this forces IP fragmentation in order to work within the 1500 byte MTU.
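
As a rough illustration of the per-packet cost on ethernet, the following sketch counts the IPv4 fragments needed for one such datagram. The 16-byte size for the small header and the 9000-byte large MTU are assumptions chosen for the example; neither value is given in the text. The IPv4 and UDP header sizes are the standard 20 and 8 bytes.

```python
import math

IP_HEADER = 20      # bytes, IPv4 header without options
UDP_HEADER = 8      # bytes
VISA_HEADER = 16    # bytes; hypothetical size for the "small header"


def ip_fragments(udp_payload: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to carry one UDP datagram."""
    datagram = UDP_HEADER + udp_payload
    if IP_HEADER + datagram <= mtu:
        return 1
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(datagram / per_fragment)


if __name__ == "__main__":
    payload = 8 * 1024 + VISA_HEADER
    print(ip_fragments(payload, 1500))   # ethernet MTU from the text
    print(ip_fragments(payload, 9000))   # hypothetical large-MTU medium
```

Under these assumptions each 8 KB transfer unit becomes six fragments on ethernet and a single packet on a medium with a sufficiently large MTU, which is the source of the extra per-packet processing.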
As is common with UDP, checksumming is turned off. The data integrity is still protected by the link layer checksum. Both hardware and software mechanisms for doing the TCP/UDP checksum with zero CPU cost are known. It can be done during data copy, DMA or transmission.
We use one kernel pseudo-process per device attached to the VISA virtual bus. Much like NFS biods, these are responsible for the communication with the device, and are necessary because the SunOS kernel is not multithreaded.
IPdisk is our Netstation emulated disk drive. It runs as a user process on another Sun, emulating the SCSI block device command set running over UDP or TCP. It can be configured to use RAM to emulate disk storage, or to use a regular file, or access an actual raw SCSI disk. For the experiments in this paper, IPdisk was used with a 32 MB RAM buffer, the fastest form of backing store, in order to stress the host operating system. In this mode, the physical characteristics (rotational and seek latency, zone-dependent transfer rates, etc.) of a disk are not emulated, so the host CPU becomes the bottleneck.
IPdisk supports our derived virtual device model. This allows the owner of the device to download small programs which act as filters on the SCSI RPCs before those RPCs are actually executed. This is primarily intended to be used for sharing devices among multiple hosts and executing third-party (direct device-to-device) copies. Because the purpose of these experiments is to saturate the host CPU, DVDs, which only affect execution time at the device, are not enabled for the experiments presented here.
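
The idea of a filter applied to each request before execution can be sketched as follows. This is a conceptual illustration only, not the DVD program format: the `Request` type, the `make_filter` helper and the block range are all hypothetical, and real SCSI commands carry far more state than is modelled here.

```python
from dataclasses import dataclass

READ, WRITE = "read", "write"   # simplified stand-ins for SCSI READ/WRITE


@dataclass(frozen=True)
class Request:
    op: str        # READ or WRITE
    lba: int       # first logical block addressed
    nblocks: int   # number of blocks transferred


def make_filter(first_block, last_block, allowed_ops):
    """Build a filter that admits only `allowed_ops` on blocks within
    [first_block, last_block]; everything else is rejected."""
    def admit(req: Request) -> bool:
        if req.op not in allowed_ops or req.nblocks <= 0:
            return False
        return first_block <= req.lba and req.lba + req.nblocks - 1 <= last_block
    return admit


# Example policy: read-only access to a partition spanning blocks 1000-1999.
read_only_partition = make_filter(1000, 1999, {READ})

if __name__ == "__main__":
    print(read_only_partition(Request(READ, 1000, 1000)))   # whole partition
    print(read_only_partition(Request(WRITE, 1000, 1)))     # writes rejected
    print(read_only_partition(Request(READ, 1990, 20)))     # runs past the end
```

The device would evaluate such a filter on every incoming request, so the policy is enforced at the peripheral rather than by a trusted server in the data path.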
We have run experiments to measure the performance of regular Berkeley fast file systems built on top of VISA-attached disks. The read and write results for sequential accesses are detailed in the following subsections, then in the next section we discuss potential improvements to this performance.

We measured the CPU utilization of file systems on VISA-attached disks and on a directly-attached fast, narrow (10 MB/sec.) SCSI bus. This provides a very direct comparison of the actual CPU impact of different data paths in a complete file system environment.
The test configurations utilize a 75 MHz Sparc 20/71 with 64 MB of RAM and an 800 Mbps Sbus as the client. The CPU has 20KB/16KB I/D on-chip caches and 1MB of external cache. The STREAM benchmark reports memory copy bandwidth of 61.7 MB/s. Other Suns running IPdisk emulate the network-attached disk drives. Our absolute performance numbers are low by today’s standards due to the age of the systems used, but the relative numbers and conclusions remain valid.
The biggest performance problem we encountered is excess calls to bcopy. On send, the Myrinet device driver copies the data from the mbuf chain to a special send buffer allocated and maintained by the device driver itself in main memory. From that buffer, the data is then DMAed to the buffer RAM on the network interface card, from where it is sent onto the network. The nominal reason for the copy is the expense and complexity of establishing DMA mappings for arbitrary memory addresses, as well as concerns about data alignment. However, when we are sending data from complete VM pages, as in VISA or NFS, the pages are already pinned in memory, properly aligned, and large enough to more than amortize the cost of creating the mapping, making direct DMA from the pages the proper choice.
The benchmark we are using is a modified version of Bonnie which reports additional CPU utilization statistics and can loop on individual tests to reduce the overhead of process creation. Bonnie, written by Tim Bray and available on the Internet, performs several tests to measure I/O operation overhead and throughput; we have concentrated on throughput. Files were written or read repeatedly until the total amount of data was one gigabyte.
6.1 Write Performance

Table 1 lists the measured and predicted performance of file writes on VISA-attached disk drives. Our write throughput is CPU-limited at 72 Mbps when using Myrinet, with kernel profiling disabled. Table 2 lists the measured write throughput of other configurations and subsystems for comparison. Writing to the file system buffer cache (technically, virtual memory pages, under SunOS) achieves approximately 115 Mbps (direct measurement of this number is difficult). We measured the write throughput of a file system on a physical SCSI disk (a Seagate ST31200W) as only 22 Mbps, but extrapolation from the CPU consumption to CPU saturation suggests a throughput of 110 Mbps can be achieved (in the presence of adequately fast disks and SCSI buses). Our current VISA throughput is only 65% of the estimated SCSI throughput.

Table 1: VISA Write Rates. Throughput in Mbps. (Columns: VISA configuration, throughput. First row: Myrinet UDP.)
The CPU utilization figures in tables 3 and 4 were measured using a kernel compiled with gprof profiling enabled. The kernel subsystem figures are calculated by assigning each kernel function profiled to one of several groups. The raw data and tables of which functions were assigned to which groups are available for verification. These numbers are used to estimate the potential performance gains for different optimizations.
It can be seen from these CPU utilization figures that, except for bcopy, the networking CPU cost is only about half the file system CPU cost and comparable to the “miscellaneous” kernel cost.
As described above, when using Myrinet under SunOS, write carries a penalty of an unnecessary data copy, and the 100bT ethernet driver appears to have a similar problem. We estimate the performance if this copy can be eliminated at approximately 89 Mbps on Myrinet, or approximately 81% of the maximum estimated SCSI throughput, as follows: 147.5 - 16.4 = 131.1 CPU seconds for the measured VISA UDP after removing the gprof overhead; 131.1 - 24.7 = 106.4 CPU seconds after removing the bcopy. Scaling by the measured bandwidth, 72 * 131/106 = 89 for our estimate of the bcopy-less CPU saturation point. The other numbers are estimated similarly.
For reasons explained in section 7.1, we have also predicted the performance using a UDP which consumes 40% fewer CPU cycles in conjunction with a bcopy-less write.

Our measurements show that sending 1GB using TCP consumed approximately 30 seconds more CPU on send than UDP. We added this 30 seconds to the 131 seconds (after removing gprof overhead) required to write 1GB on a file system built on a VISA-attached disk. From this we estimate the TCP throughput to be 58 Mbps. Approximately one-third of the increase in CPU is in calculating the TCP checksum; the rest is sequence management with acknowledgements and timer processing. The “faster TCP” estimate assumes elimination of checksum and bcopy, but no reduction in the other overhead.
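
The CPU-scaling estimates in this subsection follow one rule: since the transfer is CPU-limited, projected throughput is the measured throughput scaled by the ratio of measured to projected CPU seconds. The sketch below reproduces that arithmetic using only figures quoted in the text; the function and constant names are illustrative.

```python
def saturation_mbps(measured_mbps, cpu_seconds_measured, cpu_seconds_projected):
    """Scale a CPU-limited throughput by the change in CPU time per gigabyte."""
    return measured_mbps * cpu_seconds_measured / cpu_seconds_projected


MEASURED_MBPS = 72.0     # Myrinet UDP write rate, CPU-limited
PROFILED_SECONDS = 147.5 # CPU seconds to write 1GB with gprof enabled
GPROF_SECONDS = 16.4     # gprof profiling overhead
BCOPY_SECONDS = 24.7     # Myrinet driver bcopy
TCP_EXTRA_SECONDS = 30.0 # additional CPU for TCP over UDP on send
SCSI_ESTIMATE_MBPS = 110.0

baseline = PROFILED_SECONDS - GPROF_SECONDS          # 131.1 CPU seconds
without_bcopy = saturation_mbps(MEASURED_MBPS, baseline, baseline - BCOPY_SECONDS)
with_tcp = saturation_mbps(MEASURED_MBPS, baseline, baseline + TCP_EXTRA_SECONDS)

if __name__ == "__main__":
    print(f"bcopy-less UDP: {without_bcopy:.1f} Mbps "
          f"({100 * without_bcopy / SCSI_ESTIMATE_MBPS:.0f}% of estimated SCSI)")
    print(f"TCP: {with_tcp:.1f} Mbps")
```

The bcopy-less figure rounds to the 89 Mbps and 81% quoted above; the TCP figure comes out between 58 and 59 Mbps, agreeing with the 58 Mbps in the text to within rounding.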
We recorded the distribution of command sizes on one run of Bonnie writing one gigabyte. A total of 9,409 write commands were sent to the device for a total of 2,064,821 blocks (1.008GB), including format, newfs, mount, and 1GB of file write with metadata updates. Although the application always writes in chunks of 8KB, the operating system’s write-behind mechanism, called kluster, coalesces the smaller individual writes into larger ones when it can. As a result, more than 85% of the total data written is in commands larger than 100KB. This reduces the collective I/O operation overhead and wastes fewer disk revolutions than smaller operations.

Table 2: Write Rate Comparison. (Rows: tuned NFS; to FS buffer cache (VM pages); host memory; NIC memory via Sbus; SCSI through FS @19% CPU; UDP blast; TCP blast.)

Table 3: CPU Utilization to Write 1GB on UDP VISA. (Columns: time, % CPU. Rows: benchmark application; file system code; Myricom driver bcopy; other networking; gprof profiling code; miscellaneous kernel.)

Table 4: CPU Utilization to Write 1GB on an esp SCSI Bus. (Columns: time, % CPU. First row: benchmark application.)

Table 5: VISA Read Rates. Throughput in Mbps. (Rows: Myrinet UDP @85% CPU; Myrinet UDP @100% CPU; 100bT UDP @82% CPU; 100bT UDP @100% CPU (est.); without bcopy; faster UDP; faster TCP.)
6.2 Read Performance

For the read tests, a 25MB file is read sequentially, the buffer cache is flushed using a SunOS [...], and the file is reread. This process is repeated until 1GB of data has been read. The file system is clearly effectively detecting sequential activity and reading ahead of the process, with the result that almost all operations are 56KB. This slightly exceeds our protocol window, causing idle time and reducing our throughput.

Table 5 lists the measured and estimated throughput for several VISA configurations, and table 6 lists the measured and estimated throughput for related subsystems. Our VISA read throughput is 60 Mbps at 85% CPU utilization, from which we infer the CPU will saturate at 71 Mbps, essentially the same as our write rate. This is a lower percentage, 53%, of the SCSI potential

