`
`Robert Horst
`3ware, Inc.
`701 E. Middlefield Rd.
`Mountain View, CA 94043
`
`Abstract
`
`
This paper addresses a key issue that arises when attaching storage devices directly to IP networks: the perceived need for hardware acceleration of the TCP/IP networking stack. While many implicitly assume that acceleration is required, the evidence shows that this conclusion is not well founded. In the past, network accelerators have had mixed success, and the current economic justification for hardware acceleration is poor given the low cost of host CPU cycles. The I/O load for many applications is dominated by disk latency, not transfer rate, and hardware protocol accelerators have little effect on the I/O performance in these environments. Application benchmarks were run on an IP storage subsystem to measure performance and CPU utilization on Email, database, file serving, and backup applications. The results show that good performance can be obtained without protocol acceleration.
`
`1. Introduction
`
The growing popularity of gigabit Ethernet has prompted increasing interest in using standard IP networks to attach storage devices to servers. These Ethernet Storage Area Networks (E-SANs) have significant advantages in cost and management ease compared with Fibre Channel SANs. Some IP storage products are already on the market, and work to standardize the protocols is progressing in the IP Storage working group of the IETF [1].
Networks customized to storage networking, such as Fibre Channel, were developed largely due to the perception that standard networking protocols are too heavyweight for attaching storage. Conventional wisdom says that IP storage is impractical without special-purpose NICs to accelerate the TCP/IP protocol stack. This paper shows that the need for hardware acceleration is largely a myth. Several different lines of reasoning show that the future of storage networking will rely heavily on storage devices connected to servers without special-purpose hardware accelerators.
`
2. The Historical Argument

There are many historical examples of hardware accelerators used to offload processing tasks from the primary CPU. Some examples, such as graphics processors, have been successful, but the history of communications processors is filled with examples of unmet expectations.
Examples of front-end communications processors date from the early days of mainframe computing. In many systems, the primary CPU was accompanied by an I/O processor to offload low-level protocol operations. However, it has become increasingly difficult for this type of architecture to deliver real performance gains given the rapid pace of technology evolution.
A specific recent example is the Intel I2O (Intelligent I/O) initiative. The idea was to have a communications processor, such as an Intel i960, on the motherboard to serve as an I/O processor to offload and isolate the CPU from its attached I/O devices. At the time the initiative
`started, the i960 embedded processor was adequate to the
`task, but its performance did not increase at the same rate
`as the main CPU. If the performance does not keep up, at
`some point an accelerator becomes a decelerator.
`Somewhere in between, performance is about equal with
`or without the attached processor, but the development
`and support costs become a burden. The accelerator is
`usually a different CPU architecture than the main CPU,
`and it usually has a different software development
`environment. Maintaining two such environments is
`costly, and even if they were identical, there is overhead
`for inventing and testing the software interface between
`the processors. The software development cost eventually
`kills the front-end processor architecture, until the next
`generation of engineers rediscovers the idea and repeats
`the cycle.
`Some may argue that the problem was that the
`accelerators should have been optimized hardware instead
`of embedded programmable processors. Unfortunately,
`every protocol worthy of acceleration continues to evolve,
`and it is difficult to stay ahead of the moving target. The
new protocols proposed for IP storage, iSCSI and iFCP,
`are far from stable, and even after the standards have been
`formally approved, there will likely be a long series of
`enhancements and bug fixes.
`It seems extremely
`premature to commit hardware to accelerating these
`protocols.
There may be more cause for optimism for general-purpose TCP acceleration, but history has not been kind to companies attempting this idea either. Many companies were unsuccessful in getting products to market. One company, Alacritech, is currently marketing a product for fast Ethernet acceleration [2], but this product did not have a gigabit accelerator at the time when the general-purpose gigabit NIC market was developing. This example points out the difficulty in keeping up with the blistering progress in networking and CPU speeds. Once the NIC vendors have skimmed off the most beneficial ideas (such as checksum offload, interrupt coalescing, and jumbo frame support), there may not be enough performance gain for the special-purpose accelerator vendors to compensate for the added development and product costs.
`Accelerators for IPsec have shown a similar trend.
`They were initially popular, but are now disappearing
`because host-based software is improving and the CPUs
`are getting faster. Also, as processing needs are better
`understood, the most time consuming operations are
`gradually incorporated into standard CPUs and their
`support chips.
`
`3. The Economic Argument
`
The economic argument for protocol acceleration is based on the premise that computing power in the NIC is less expensive than computing power in the host. Examined closely, this premise is on shaky ground. Until the accelerator can be a single-chip ASIC, it will inevitably have multiple components, including a processor, memory, and network interface components. The size of the accelerator market may not warrant development of a fully integrated solution, and the parts cost may increase as a result. Low volume also affects the amortized development cost. Today, Fibre Channel host bus adapters sell in the range of $600 to $1000, while higher-volume Gigabit Ethernet card costs have fallen to less than $150. The 4-port 100BASE-T Alacritech TCP accelerator sells for a list price of $699, compared to a standard gigabit 1000BASE-T NIC that sells for under $150 and will soon be a standard feature of most servers.
The accelerator cost must be weighed against the cost of running the protocol in software on the main processor. The cost of using the main processors depends on assumptions about the incremental cost of processing and the amount of CPU required to implement a storage networking protocol. Today, the Dell PowerEdge 1400SC server is offered in single and dual processor versions. The cost to upgrade to a second 800 MHz CPU is $399. With this system, the cost of the second processor and 1000BASE-T card ($149) is much less than the $699 cost of the 4-port Alacritech NIC.
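To make the comparison concrete, the arithmetic reduces to a few lines. The sketch below uses only the list prices quoted above and, conservatively, treats the second CPU as if it were dedicated entirely to protocol processing, which if anything overstates the cost of the software path.

```python
# Rough cost comparison sketch using the example list prices quoted above.
# All figures are illustrative prices from the paper, not current data.

second_cpu = 399.0        # upgrade to a second 800 MHz CPU
gigabit_nic = 149.0       # standard 1000BASE-T NIC
accelerator_nic = 699.0   # 4-port 100BASE-T TCP accelerator (list price)

software_path = second_cpu + gigabit_nic   # run the storage protocol on the host
hardware_path = accelerator_nic            # offload to the accelerator NIC

print(f"software path: ${software_path:.0f}")   # $548
print(f"hardware path: ${hardware_path:.0f}")   # $699
print(f"savings with host-based protocol: ${hardware_path - software_path:.0f}")
```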
`
Comparing the merit of the alternatives requires an estimate of how much processing power (percent CPU utilization) it takes to run the storage and TCP protocols in the main processor. This percentage varies by operating system, because some systems have more efficient TCP and IP storage driver implementations that avoid extra data copies. The merits of different solutions also depend on the I/O workload and the storage protocol efficiency. Conventional wisdom says that it takes one high-speed processor of about 800 MHz to stream TCP continuously at Gbit rates. If this is the case, then running the TCP protocol would take the entire second CPU, but would still be a better deal than the accelerator NIC, even if that accelerator offloaded 100% of the protocol (not likely). Moreover, the balance shifts further toward the second CPU as real processing workloads are examined, because it is extremely rare for an application to simultaneously stream I/O at full speed and compute at full speed. Also, when a complete application is considered, the second processor can be used for other tasks at times when I/O workloads are low, and may accelerate the compute-intensive phases of the application.
`An I/O accelerator has other drawbacks as well. It
`takes up an extra valuable PCI slot in systems with built-
`in GbE. The accelerator reduces the choices in NICs, and
`may lack other features needed by the server, such as data
`encryption. There is usually a time lag in availability of
`new features and performance improvements in hardware
accelerators compared with software solutions. Over time, CPU performance will improve at a greater rate than the accelerator due to the larger market for general-purpose microprocessors.
Similar economic arguments apply to System Area Networks such as ServerNet, Scalable Coherent Interface (SCI), Giganet, Myrinet, and InfiniBand. While these
`networks have demonstrated greatly improved bandwidth
`and latency relative to Ethernet, none have yet been
`widely accepted in mainstream computing markets. They
`have had the same difficulty demonstrating enough
`benefit to mainstream applications to justify their extra
`cost.
`
`4. The Disk Argument
`
The preceding discussion assumed a system that spends 100% of its time transferring at the maximum I/O streaming rate. Real applications have much different behavior due to the mix of sequential and random I/O.
`Today’s disk drives can transfer sequential blocks at
`20-40 MB/s, but can perform only about 100 I/Os (seeks)
`per second [3]. If each seek transfers a 4 Kbyte block,
`then the random transfer rate is just 0.4 MB/s, or nearly
`100 times slower than the sequential rate. Most real
`applications access disk storage via a file system with
`data organized in noncontiguous pages, or via a database
with relatively small record sizes. These applications transfer at closer to the random rate than the sequential rate.
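A quick back-of-the-envelope check of that gap, using the nominal figures above (about 100 seeks per second, 4 KB per random transfer, and a 20-40 MB/s sequential rate), is sketched below.

```python
# Back-of-the-envelope check of the random vs. sequential gap, using the
# nominal drive figures quoted above (year-2001 class drives).

seeks_per_second = 100            # random I/Os (seeks) per second
block_kb = 4                      # 4 KB transferred per random I/O

random_mb_s = seeks_per_second * block_kb / 1024   # ~0.39 MB/s

for sequential_mb_s in (20, 40):
    ratio = sequential_mb_s / random_mb_s
    print(f"{sequential_mb_s} MB/s sequential is ~{ratio:.0f}x the random rate")
# -> ~51x and ~102x, i.e. the "nearly 100 times slower" figure at the high end
```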
When storage is connected through IP networks without acceleration, CPU consumption for the IP stack is determined primarily by the number of data copies and the total amount of data moved. When an application is doing mostly random I/O, there is little data moving through the IP stack, keeping CPU consumption low. This type of application gets little benefit from hardware acceleration.
On Oct. 6, 2000, Compaq set a new transaction processing record on the industry-standard TPC-C benchmark [4]. The record was 505,302 transactions per minute obtained using 2568 disk drives and 24 8-processor servers. Assuming about 0.5 I/Os/sec/tpmC [5], each disk performs 98 I/O accesses per second, keeping the disks very busy. Measured from the CPU side, each of the 192 CPUs performs 1315 I/Os per second. If each I/O averages 8 KB, the average data rate is just 10.3 MB/s per CPU. If this benchmark had been run with IP storage and no protocol acceleration, the relatively low data rate would have consumed only a few percent of the CPU time. The net result is that the low cost of IP storage subsystems should give better tpmC price/performance than SCSI or Fibre Channel solutions.
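The per-disk and per-CPU figures follow from simple arithmetic on the published result; a short sketch, with the 8 KB average I/O size as the stated assumption:

```python
# Reproduce the per-disk and per-CPU I/O arithmetic for the TPC-C example above.

tpmc = 505_302                  # transactions per minute (Compaq result, Oct. 2000)
disks = 2568
cpus = 24 * 8                   # 24 servers x 8 processors = 192 CPUs
ios_per_sec_per_tpmc = 0.5      # rule of thumb cited from [5]
io_size_bytes = 8 * 1024        # assume 8 KB per I/O

total_ios_per_sec = tpmc * ios_per_sec_per_tpmc          # ~252,651
ios_per_disk = total_ios_per_sec / disks                  # ~98 per disk
ios_per_cpu = total_ios_per_sec / cpus                    # ~1316 per CPU
mb_per_sec_per_cpu = ios_per_cpu * io_size_bytes / 2**20  # ~10.3 MB/s

print(f"{ios_per_disk:.0f} I/Os per disk, {ios_per_cpu:.0f} I/Os per CPU, "
      f"{mb_per_sec_per_cpu:.1f} MB/s per CPU")
```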
A similar case can be made for scientific applications. A 1997 study by Rich Martin measured the sensitivity of ten different scientific applications with respect to communication bandwidth and latency [6]. This study shows that many scientific applications do very little I/O and are much less sensitive to communications bandwidth than to latency or overhead, indicating that the applications are primarily CPU bound, not I/O bound. One might think CPU-bound applications are ideal for protocol offloading, but if the amount of I/O is small compared to the amount of computation, very little of the run time is subject to acceleration. Also, many scientific programs perform their I/O at the beginning and end, with the primary computation in between. During the I/O phases, the CPUs are often lightly loaded. The net result is that when IP storage is used as the backing store for scientific applications, network acceleration should show little performance gain.
`
5. Measurement results

When 3ware began to implement the first native IP storage products, the preceding arguments were the only ones available to guide the decision on whether to implement the client-side protocol in the main CPU or in a hardware accelerator. 3ware chose to develop a protocol over TCP/IP that could be implemented very efficiently in software.
IP storage products began shipping in late 2000, and include the Palisade 400 and Palisade 100 products. OS driver support includes Linux, Windows NT, Windows 2000, Solaris, and Macintosh [7,8]. Now that the products are complete, measurements of real applications validate the original design decisions. Tests of Email server, database, file server, and backup programs show that the CPU overhead is not excessive and that application performance is typically no different than with locally attached SCSI or Fibre Channel storage. Preliminary benchmark results for these applications are given below:
`
`5.1 Email Server
`
`Email comprises over half of the disk storage of many
`organizations. The Email data often resides on drives
`internal to the server, or on nearby SCSI RAID arrays.
`This type of configuration is prone to run out of storage
`quickly, and it is desirable to seamlessly scale the storage
`as needed. IP storage provides an ideal solution. A test
`of Microsoft Exchange was run to determine how well IP
`storage would serve this application.
`
`Test setup
Server: Dual 866 MHz Pentium III
`OS: Windows NT 4.0
`Mail server: Microsoft Exchange
`Storage: 3ware Palisade 400 IP Storage
`640 GB RAID 0 array
`Clients: 1400 clients simulated with
`5 client systems
`
`Results
`
`Response time: 187 ms avg. per mail message
`CPU Utilization: 13%
`
`These results are quite impressive, even measured
`against typical response time results with internal SCSI
`arrays. The low CPU utilization means that hardware
`acceleration of TCP or the IP storage protocol would have
`shown no significant performance difference.
`
`5.2 Database Server
`
IP storage will give scalability to existing and future database applications. Databases require the block-level access that is native to IP storage and not generally available in network-attached storage (NAS) boxes that communicate using a file protocol such as NFS or CIFS. A database asset tracking application was run on a database server connected through a Gigabit Ethernet switch to a collection of client machines:
`
`
`
`
`Test setup
`
`Server: Dual 866 MHz Xeon
`OS: Windows NT 4.0
`Database: Oracle 8
`Storage: 3ware Palisade 400 IP Storage
`640 GB RAID 0 array
`Clients: Database transactions simulated with
`14 client machines
`
`Results
`
`1.8 Million transactions per hour (500 TPS)
`Zero transaction timeouts
`
In this test, even higher transaction rates would be possible with more clients. The client load was not quite high enough to saturate the server.
These results show that database applications run very well on IP storage, and that they would gain little benefit from hardware acceleration. Very high transaction volumes can be supported, and no transaction timeouts were seen. This is a key point, as experienced database administrators know that some of the most expensive storage subsystems available today are prone to transaction timeout problems.
`
5.3 File Server
`
IP storage can provide scalable storage behind a file server. This type of file sharing takes advantage of standard servers acting as the file server, allowing administrators to use familiar operating system administration tools to manage the pool of shared storage. Small reads and writes were run on a Windows 2000 machine to show that the overhead for IP storage does not affect the ability to serve many simultaneous file requests.
`
`Test setup
`
Server: Dual 866 MHz Pentium III
OS: Windows 2000
I/O load generator: Iometer
Queue depth: 8
Storage: 3ware Palisade 400 IP Storage
640 GB RAID 0 array
`
`Results
`
Random 2K Reads:
806 IOPS
14% CPU utilization
9.9 ms average response
Random 2K Writes:
1760 IOPS
27% CPU utilization
4.6 ms average response
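As a consistency check, these measurements agree with Little's law (outstanding I/Os ≈ IOPS × average response time) at the configured queue depth of 8; a minimal sketch using the numbers above:

```python
# Little's law check: outstanding I/Os ~= IOPS * average response time.
# Numbers are the measured results above; queue depth was configured as 8.

tests = {
    "random 2K reads":  (806, 9.9e-3),   # (IOPS, avg response in seconds)
    "random 2K writes": (1760, 4.6e-3),
}

for name, (iops, resp_s) in tests.items():
    outstanding = iops * resp_s          # ~8 in both cases
    print(f"{name}: ~{outstanding:.1f} I/Os outstanding")
```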
`
This test uses Intel Iometer to simulate the load that would be seen by a file server; no clients were actually simulated. The low CPU utilization at the file server again shows that IP storage can be quite effective even without hardware acceleration. Performance is limited only by the number of seeks per second that can be done by the drives. The random write performance is higher than would be expected from eight physical drives, because this test takes advantage of caching in the IP storage unit.

5.4 Backup

Backup is often cited as the most important application that requires high streaming rates to or from a storage subsystem. However, in most installations, backups are not done during peak processing times, but in times when the system is lightly loaded or quiescent. Almost by definition, the CPU load during backup is not a major concern. A test of the Windows 2000 backup application was run to measure the time to back up the system files (C: drive) to either an IP Storage device or a network shared volume acting as a NAS device.

Test setup

Backup client:
500 MHz Pentium III running Windows 2000
Backed up data: 1.6 GB data on "C:" drive
Backup target:
IP Storage: 3ware Palisade 400 IP Storage
640 GB RAID 0 array
NAS: Windows 2000 file sharing
on 500 MHz Pentium III
`
`Results
`
IP storage backup: 3 minutes 3 seconds
NAS backup: 3 minutes 31 seconds
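For scale, the measured times translate into rough effective throughputs; the sketch below ignores file system and metadata overhead, so the absolute rates are only approximate.

```python
# Rough effective backup throughput from the measured times above.
# The 1.6 GB figure is the backed-up data size; overheads are ignored.

data_gb = 1.6
ip_storage_s = 3 * 60 + 3      # 3 minutes 3 seconds
nas_s = 3 * 60 + 31            # 3 minutes 31 seconds

for name, seconds in [("IP storage", ip_storage_s), ("NAS", nas_s)]:
    mb_per_s = data_gb * 1024 / seconds
    print(f"{name}: {mb_per_s:.1f} MB/s")   # ~9.0 vs ~7.8 MB/s

print(f"speedup: {nas_s / ip_storage_s:.2f}x")   # ~1.15x
```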
`
The backup was significantly faster on the IP storage unit, which could take advantage of multiple disk arms seeking simultaneously. In this test, the IP Storage unit was not busy and could have handled many simultaneous backups. Some other backup experiments were severely limited by the disk performance of the system being backed up.
It is clear that IP storage is well suited as a high-performance backup device. The huge advantage in access time relative to tape will make IP storage increasingly attractive for local and remote online backups [9].
`
`
`
`
`5.5 Other factors affecting performance
`
All of the preceding tests were configured with Gigabit Ethernet (1000BASE-T) and with RAID level zero. Additional tests were run to evaluate how those parameters affect performance.
Figures 1a and 1b show performance and CPU utilization on a workstation I/O load running on three different configurations of hardware RAID connected through IP. The RAID 0 configuration has data striped across 8 drives. RAID 10 has data striped across four mirrored pairs of drives, and RAID 5 rotates 64K parity blocks across the 8 drives, yielding 7 drives of useful capacity.
`
`Test setup
`
`
Server: Dual 1 GHz Pentium III
OS: Windows NT 4.0
I/O load generator: Iometer
Storage: 3ware Palisade 400 IP Storage
RAID 0, 5, and 10
Controller write caching disabled
Drive write caching enabled
Workstation Access Pattern:
Block size: 8 KB
80% Read, 20% Write
80% Random, 20% Sequential
`
[Figure 1a. Effect of RAID level on I/O performance of IP-attached storage.]
[Figure 1b. Effect of RAID level on CPU utilization of IP-attached storage.]
`
`In these tests, queue depth simulates the amount of I/O
`that the system has outstanding at a point in time. Greater
`queue depth allows more pipelining in the I/O system and
`results in better performance. Queue depth can be
increased either by one application issuing non-waited I/O, or through multiple applications issuing independent I/O operations.
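As a Unix-style illustration of the second approach (the paper's own tests used Iometer on Windows), the sketch below keeps several independent reads in flight with a thread pool; the device path, span, and block size are placeholders, not part of the test setup above.

```python
# Minimal sketch: keep several reads outstanding to raise effective queue depth.
# The device path, block size, and depth below are illustrative placeholders.

import os
import random
from concurrent.futures import ThreadPoolExecutor

DEVICE = "/dev/sdb"          # hypothetical block device under test
BLOCK = 8 * 1024             # 8 KB per I/O, as in the workstation pattern
DEPTH = 8                    # target queue depth
SPAN = 1 << 30               # confine random offsets to the first 1 GB

def one_read(_):
    fd = os.open(DEVICE, os.O_RDONLY)
    try:
        offset = random.randrange(0, SPAN // BLOCK) * BLOCK
        return len(os.pread(fd, BLOCK, offset))   # one random 8 KB read
    finally:
        os.close(fd)

# DEPTH worker threads keep roughly DEPTH I/Os in flight at once.
with ThreadPoolExecutor(max_workers=DEPTH) as pool:
    total = sum(pool.map(one_read, range(1000)))
print(f"read {total // 1024} KB with ~{DEPTH} I/Os outstanding")
```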
The performance graph in Figure 1a shows that RAID 0 and RAID 10 can perform more I/Os per second than RAID 5, mostly due to the parity calculations required to do updates on RAID 5. In all cases, the performance is determined primarily by the drives or disk controller, not by the TCP/IP stack or the small latencies introduced by the IP connection. The utilization graph in Figure 1b shows that CPU consumption closely tracks the I/O throughput. Total CPU utilization never exceeds 5% for these tests.
The effect of network speed is illustrated by the graphs in Figures 2a and 2b. This test uses the same setup as Figure 1, but tests both 100 Mbit and 1 Gbit network connections on an IP storage unit configured with 8 drives in RAID 0 mode. Iometer is used to generate a workload of sequential 256 KB writes.
[Figure 2a. Effect of network speed on I/O performance of IP-attached storage.]
[Figure 2b. Effect of network speed on CPU utilization of IP-attached storage.]

The sequential write workload shows a streaming write performance of 61.8 MB/s at a CPU utilization of 45% with the gigabit link, and 11.2 MB/s at 9% utilization for the 100 Mbit link.
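Those two data points can be reduced to a rough host efficiency figure, MB/s of storage traffic moved per percentage point of CPU, which is the quantity an accelerator would have to improve on; a quick sketch:

```python
# Host CPU efficiency implied by the sequential-write measurements above:
# MB/s of storage traffic moved per percentage point of CPU utilization.

measurements = {
    "1000BASE-T": (61.8, 45.0),   # (MB/s, % CPU)
    "100BASE-T":  (11.2, 9.0),
}

for link, (mb_s, cpu_pct) in measurements.items():
    print(f"{link}: {mb_s / cpu_pct:.2f} MB/s per % CPU")   # ~1.4 vs ~1.2
```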
The preceding data shows that performance may be limited at different points depending on the application and the system configuration. Applications with many short I/O accesses, such as database and Email, are primarily limited by the seek time of the disk drives. With more sequential applications, performance can be limited by the RAID processing of the controller or by a slow network link.
The large sequential workload shows
`a case where the 62 MB/s limit of the IP storage unit is
`encountered.
`Some cases may be limited by the CPU processing
`power of the host processor, especially systems running
`operating systems with
`less-optimized TCP stacks.
`However, the range of benchmarks shown here indicates
`that few real applications are likely to be affected by host
`CPU processing limits.
`
6. The future of IP acceleration

Many companies are now working on IP storage products that will conform to the emerging iSCSI specification. The Palisade IP storage unit will also support the iSCSI standard when it is complete. However, it is important to note that iSCSI has been developed under the assumption that it would never or rarely be implemented without hardware acceleration. That has led to some design decisions that could make it unsuitable for efficient software-only implementations.
In the future, all Ethernet ports may incorporate iSCSI acceleration hardware, but until that point, it will be important to have efficient software storage protocols that can take advantage of the LAN infrastructure already in place. This gives customers the best of both worlds: they can take advantage of IP storage immediately, but be assured that they can take advantage of future iSCSI hardware acceleration for the rare cases when hardware acceleration proves necessary.
`
`7. Conclusion
`
This paper is not meant to imply that there are no applications that can benefit from hardware protocol accelerators. There are likely to be some specific applications where the benefit exceeds the cost, but the measurement data demonstrates that those cases are likely to be the exception, not the rule. IP storage with software-implemented client-side protocols should work well in the majority of applications. The large advantages in cost, ease of management, and standard network hardware should outweigh any small performance gains from adding a protocol accelerator. Information technology managers may prefer to spend their energy and money on servers that stay on the commodity price/performance curve, and on storage products that leverage the same commodity volumes in drives and networks.
`
`
`
`
`8. Acknowledgements
`
The author acknowledges Ken Neas, Angus
`MacGreigor, Andrew Rosen and Chris Meyer for
`providing many of the measurement results, and the many
`hardware and software engineers who contributed to the
`design of the 3ware IP storage products. Thanks also to
`Jim Gray for his valuable comments.
`
`References
`
[1] J. Satran et al., "iSCSI," IPS working group of the IETF, Internet draft, http://www.ietf.org/internet-drafts/draft-ips-iscsi-07.txt.
[2] K. J. Bannan and R. Schenk, "Alacritech 100x4 Quad-Port Server Adapter: A NIC in time...," PC Magazine, April 18, 2001.
[3] L. Chung, J. Gray, B. Worthington, and R. Horst, "Study of Random and Sequential IO on Windows 2000," http://research.microsoft.com/BARC/Sequential-IO/.
[4] Transaction Processing Council, TPC-C results, http://www.tpc.org.
[5] C. Levine, "TPC Benchmarks," Western Institute of Computer Science, Stanford, CA, August 6, 1999.
[6] R. Martin, A. Vahdat, D. Culler, and T. Anderson, "Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture," Proceedings of the 1997 International Symposium on Computer Architecture (ISCA), June 1997.
[7] "The 3ware Network Storage Unit," Technical White Paper, 3ware Inc., http://www.3ware.com.
[8] R. Horst, "Storage Networking: The Killer Application for Gigabit Ethernet," DM Direct Business Intelligence Newsletter, http://www.dmreview.com, April 20, 2001.
[9] J. Gray and P. Shenoy, "Rules of Thumb in Data Engineering," Microsoft Research Technical Report MSR-TR-99-100, revised March 2000, http://research.microsoft.com/~Gray/.
`
`