`
`Yi-Chun Chu and Toby J. Teorey
`
`Electrical Engineering and Computer Science Department
`The University of Michigan
` Ann Arbor, MI 48109-2122, USA
`
`Abstract
`
The performance of host communication subsystems is an important research topic in computer networks.1 Performance metrics such as
`throughput, delay, and packet loss are important
indices for observing system behavior. Most
`research in this area is conducted by experimental
`measurement; far less attention is paid to the
analytic modeling approach. The well-known complexity and dynamic nature of the Transmission Control Protocol/Internet Protocol (TCP/IP) make the performance modeling of communication subsystems extremely difficult.
`The purpose of this study is to analyze and model
`the overhead in Unix communication subsystems.
`The overhead is caused by protocol processing as
`well as kernel functions for fair allocation of
`system resources. Our approach is to build
`analytic models of communication overhead for
sending and receiving a message. The analytic models can be applied to analyze the communication overhead for Internet information systems, such as Internet web servers, or software servers built on middleware, such as Distributed Computing Environment/Remote Procedure Call (DCE/RPC) servers, that require intensive network I/O.
`
`1 Introduction
`
`With recent advances in networking technology,
`many services that used to reside in a host system
are now provided in a distributed fashion.

1This work is funded by IBM Canada Ltd. Laboratory Centre for Advanced Studies, Toronto.

Distributed file service is a good example of this. Providing services across networks increases communication costs: requests and results must be transported between client and server machines.
`Intensive overhead caused by network I/O limits
`the capacity of many distributed servers. This
`makes the study of host communication sub-
`systems an important research topic.
`Earlier research closely examined the overhead
`generated in host communication subsystems
`[3,5,6]. The overhead is caused by both protocol-
`specific processing and operating system (OS)
`activities such as data movement, context switch-
`ing, and interrupt handling. Careful analysis of the
`overhead breakdown can improve the design of
`communication subsystems. However, it cannot
`reveal how server machines behave under heavy
`network traffic [10]. This question has received
`more attention recently because many distributed
`servers, such as the World-Wide Web (WWW)
`servers, have generally experienced performance
`problems in response time and service availability
`[11,12].
`To answer the above question, we need an ana-
`lytic solution to study the server system behavior
`under a varied network load. In this paper, we try
`to develop analytic models of communication
`overhead for both sending and receiving a mes-
`sage. These models can be applied to estimate the
`communication overhead in the distributed server
`systems.
`The rest of this paper is organized as follows. In
`Section 2, we introduce the software architecture
`of communication subsystems derived from the
`Berkeley Software Distribution (BSD) Unix. The
`overhead breakdown is organized according to the
software and protocol layers. The queueing delay in communication subsystems is also described here.
`
`1
`
`INTEL EX.1248.001
`
`
`
Figure 1. Software Architecture of Unix Communication Subsystems (the application layer sits above stream and datagram sockets; the socket layer contains the socket send and receive buffers; the protocol layer contains TCP/UDP and IP with the protocol input queue; the network interface layer contains the interface send queue)
`
In Section 3, analytic models for communi-
`cation overhead are developed, along with a
`detailed analysis of the critical path for sending
`and receiving a message. Both transport layer ser-
`vices, TCP and User Datagram Protocol (UDP),
`are considered in our study. In Section 4 we
`present two case studies: the Internet web server
`and the DCE/RPC server with our analytic mod-
`els. Conclusions and future work are outlined in
`Section 5.
`
`2 Unix Communication
`Subsystems
`
`The Unix communication subsystem is divided
`into three software layers: 1) the socket layer, 2)
`
`the protocol layer, and 3) the network-interface
`layer [8]. The software architecture is shown in
`Figure 1.
`The socket layer hides the complexity of net-
`work communication and provides an abstract
`interface similar to a generic I/O device. The pro-
`tocol layer covers protocol-specific processing in
`the transport layer (TCP/UDP) and the network
`layer (IP). The network-interface layer is mainly
concerned with link-layer encapsulation/decapsulation and driving the transmission media.
`The TCP/IP protocol specification puts no restric-
`tion on the layered structure of network software.
`Most implementations, however, put the code in
`the kernel with tightly integrated software layers
`for efficiency considerations [2].
`
`2
`
`INTEL EX.1248.002
`
`
`
`The specific communication subsystem we
`studied is a DEC Alpha AXP workstation running
OSF/1 1.0 [1]. The workstation is attached to a
`department LAN with an Ethernet adaptor. The
`implementation of the OSF/1 network software
`follows the design of the 4.3 BSD Reno release
`[8].
`
`2.1 The Processing Overhead
`
`In communication subsystems, processing over-
`head can be caused by data-touching operations,
`such as data movement and checksum computa-
`tion, as well as non-data-touching operations, such
`as context switching and interrupt handling.
`Before developing any analytic models to describe
`the system behavior, a thorough understanding of
the overhead is necessary. The major kinds of overhead found in communication subsystems are described below.
`
`Data Movement
`
`Data movement is a principal overhead in the
`communication subsystems. This overhead is sig-
`nificant because memory bandwidth has not kept
`pace with the speed of microprocessors [3,6].
`Generally, two data movements are needed in both
`sending and receiving paths. First, data has to be
`copied between user space and kernel space for
protection reasons. This work is done by the CPU, and we denote the two copies as Muk(m) and Mku(m), respectively (m is the size of the data to be moved).1 The
`other data movement is between kernel space and
`network adaptor buffer (which is in I/O space).
The real work can be done either by the CPU (programmed I/O, PIO) or by Direct Memory Access (DMA). Which one does the work depends on the hardware I/O architecture and on whether the network adaptor has DMA capability. In the
`system we studied, the data movement from ker-
`nel to adaptor is done by PIO, Mka(m); but the data
`movement in the reverse direction is done by
`DMA, Mak(m).
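To make the data-touching terms concrete, each can be modeled as a fixed per-call cost plus a per-byte cost. The sketch below illustrates this; all constants are hypothetical placeholders, not measurements from the system we studied.

```c
#include <stddef.h>

/* Illustrative linear cost model for data-touching operations.
 * All constants are hypothetical placeholders (microseconds),
 * not measured values from the DEC Alpha system. */
typedef struct {
    double fixed_us;     /* fixed per-operation cost */
    double per_byte_us;  /* incremental cost per byte moved */
} cost_t;

static const cost_t M_mm = { 10.0, 0.010 };  /* user<->kernel copy (CPU) */
static const cost_t M_ka = { 12.0, 0.015 };  /* kernel->adaptor, PIO (CPU) */
static const cost_t M_ak = {  8.0, 0.005 };  /* adaptor->kernel, DMA */

/* Overhead of moving m bytes under a given cost model. */
static double move_cost(cost_t c, size_t m)
{
    return c.fixed_us + c.per_byte_us * (double)m;
}
```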
`
1Since both Muk(m) and Mku(m) are memory-to-memory copies, we use Mmm(m) to represent either Muk(m) or Mku(m).

Checksum Computation

`Network communication relies on checksums to
`preserve end-to-end data integrity. In the Internet
`protocols, a 16-bit checksum field is used for error
detection in the IP header (20 bytes), the UDP datagram (12-byte IP pseudo header, 8-byte UDP header and UDP data), and the TCP segment (12-byte IP
`pseudo header, 20-byte TCP header and TCP data)
`[4,15]. We denote the overhead of checksum com-
`putation as CS(m). Both checksum computation
`and data movement are data-touching operations;
`the overhead, hence, grows linearly with the data
`size to be processed.
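For reference, the computation that CS(m) models is the standard 16-bit one's-complement Internet checksum (RFC 1071); a minimal sketch follows. The loop body is what makes the cost linear in m.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement Internet checksum over len bytes
 * (RFC 1071 style); the cost of this loop is what CS(m) models. */
uint16_t in_cksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;

    while (len > 1) {                 /* sum 16-bit words */
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)                     /* pad a trailing odd byte */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)                 /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```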
`
`Protocol-specific Processing
`
`Protocol-specific processing contributes different
`overhead in each protocol layer. Measurement
`results show that the overhead tends to be fixed if
`the checksum computation is not included [6].
`Therefore, we can use constants to represent the
`fixed part of overhead for protocol-specific pro-
`cessing in each layer, and we denote them sepa-
`rately as TCPin, TCPout, UDPin, UDPout, IPin, and
`IPout.
`
`Demultiplexing
`
`Demultiplexing is a table-lookup operation in the
`transport layer. It searches protocol control blocks
`(PCBs) for the socket connection associated with
an incoming packet. Most implementations derived from BSD Unix use a linked-list structure with a one-entry cache2 (for the latest lookup result) to improve performance [9]. The search cost depends on the number of socket connections in the system. Here we consider it part of the fixed overhead in the transport-layer input routines, TCPin and UDPin. However, it has been shown that this overhead can grow significantly in a busy Internet information server with peak connection counts of more than a thousand [10].

2The one-entry cache is called the "1-behind cache."
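The shape of this lookup can be sketched as follows; the structure and field names are simplified illustrations, not the actual BSD in_pcb definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative PCB demultiplexing with a one-entry ("1-behind")
 * cache; types and names are simplified, not the real BSD ones. */
struct pcb {
    uint32_t laddr, faddr;    /* local/foreign IP address */
    uint16_t lport, fport;    /* local/foreign port */
    struct pcb *next;         /* singly linked PCB list */
};

static struct pcb *pcb_cache; /* last successful lookup */

struct pcb *pcb_lookup(struct pcb *head, uint32_t laddr, uint16_t lport,
                       uint32_t faddr, uint16_t fport)
{
    struct pcb *p = pcb_cache;

    /* Fast path: most packets belong to the same connection as the
     * previous one, so try the cached entry first. */
    if (p && p->laddr == laddr && p->lport == lport &&
        p->faddr == faddr && p->fport == fport)
        return p;

    /* Slow path: linear search; the cost grows with the number of
     * connections, as noted for busy servers [10]. */
    for (p = head; p != NULL; p = p->next) {
        if (p->laddr == laddr && p->lport == lport &&
            p->faddr == faddr && p->fport == fport) {
            pcb_cache = p;
            return p;
        }
    }
    return NULL;
}
```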
`
`Interrupt Handling
`
`Network communication generates two device
`hardware interrupts: the receiving interrupt and
`the transmission-complete interrupt. The overhead
`for interrupt handling receives less attention than
other processing overhead.

Table 1. Overhead Breakdown in Software Layers and Network Adaptor

Layer                     Overhead       Description                                          Device
Socket Layer              Muk(m)         data copy from user space to kernel space            CPU
                          Mku(m)         data copy from kernel space to user space            CPU
Protocol Layer            CS(m)          checksum computation                                 CPU
                          TCPin/TCPout   TCP protocol-specific processing                     CPU
                          UDPin/UDPout   UDP protocol-specific processing                     CPU
                          IPin/IPout     IP protocol-specific processing                      CPU
Network Interface Layer   Mka(m)         data copy from kernel space to I/O space (adaptor)   CPU
                          Mak(m)         data copy from I/O space (adaptor) to kernel space   DMA
                          ETHout         link-layer processing                                CPU
                          Is             transmit complete interrupt                          CPU
                          Ir             receive interrupt and link-layer processing          CPU
Network Adaptor           Tx(m)          packet transmission time                             Adaptor
                          Rx(m)          packet reception time                                Adaptor

This is probably because of its asynchronous nature and the difficulty of measuring it. However, careful analysis of interrupt handling in the network device driver reveals how critically it affects performance during heavy network traffic [14]. In the system
we studied, the overhead of the receiving interrupt, denoted as Ir, covers the entire link-layer processing. It does not include the data movement overhead from network adaptor to kernel, Mak(m) (which is done by DMA). The overhead for the transmission-complete interrupt involves data movement from kernel to network adaptor (which is done by PIO) and the initiation of the next packet transmission. Hence, it is further divided into a fixed part, denoted as Is, and a variable part, denoted as Mka(m).
`
`Context Switch
`
Socket system calls for sending or receiving a message are synchronous. As a consequence, they block the currently running process and might cause a context switch. Incoming packet process-
`ing, which is driven by asynchronous interrupt,
`will “wakeup” the blocked receiving process at the
`final stage. Here we denote the fixed overhead for
`a context switch as C.
`
`Transmission Time
`
Transmission time depends on the speed of the transmission media to which the workstation is attached.
`Transmission speed can vary several orders of
`magnitude, e.g. from 10 Mb/s (Ethernet) to 622
`
`Mb/s (ATM). Packet transmission and reception
`are also data-touching operations; we denote the
`overhead as Tx(m) and Rx(m). Generally, it takes
`equal time to transmit or receive a packet.
`
`Others
`
Other kinds of overhead not listed above, such as mbuf allocation, are not significant to our analy-
`sis. As a result, we treat them as part of the fixed
`overhead in protocol-specific processing.
`
`2.2 Overhead Breakdown in
`Software Layers
`
`The breakdown of processing overhead in com-
`munication subsystems is categorized in Table 1.
`Overhead for data-touching operations and for
`non-data-touching operations is organized accord-
`ing to the software layers to which it applies.
The overhead breakdown helps us see where the
`overhead is generated. In Section 3, we use this
`table to develop analytic models of communica-
`tion overhead by detailed analysis of sending and
`receiving paths.
`
`2.3 Queueing Delay in
`Communication Subsystems
`
`The processing overhead described above does not
`account for the entire delay accumulated in com-
munication subsystems. There is a queueing delay introduced by buffers or queues within or between
`the software layers. These queues and buffers are
`also shown in Figure 1 and described below.
`
`Socket Send Buffer
`
`The socket send buffer holds data not sent yet or
`sent but not acknowledged by the receiving end.
`Since UDP does not provide flow control or reli-
`able message delivery, a UDP message is never
`placed into the socket send buffer.1 For TCP, the
`queueing delay is determined by its flow-control
algorithms, such as slow start and congestion avoidance.
`
`Socket Receive Buffer
`
`The socket receive buffer is used to hold data
`received, but not yet delivered to the application.
`The queueing delay, hence, depends on how
`quickly the receiving process can accept the data.
`For TCP, its flow-control algorithm will prevent
`the sender from sending more data when the
`buffer is full. For UDP, which has no flow control,
`any message that arrives when the buffer is full is
`simply dropped.
`
`Protocol Input Queue (IP Queue)
`
`The protocol input queue holds IP datagrams
`delivered by the network interface that are waiting
for protocol-layer input processing. This queue normally will not build up unless packets arrive at the network adaptor in bursts. The IP input routine is scheduled as an asynchronous software interrupt in the kernel. This software interrupt is posted by the receiving interrupt handler and has a lower priority than the device hardware interrupt. The IP input routine is usually scheduled to process incoming IP datagrams immediately after the receiving hardware interrupt returns.
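A minimal sketch of this two-stage hand-off is shown below; the queue and the software-interrupt primitive are simplified stand-ins for the kernel's own mechanisms, not its actual interfaces.

```c
#include <stddef.h>

/* Illustrative hand-off from the receive interrupt to the IP input
 * routine; names and primitives are simplified stand-ins. */
struct pkt { struct pkt *next; /* ... packet contents ... */ };

static struct { struct pkt *head, *tail; } ipintrq; /* protocol input queue */

static void ipintr(void)                  /* software-interrupt handler */
{
    struct pkt *p;
    while ((p = ipintrq.head) != NULL) {  /* drain the queue */
        ipintrq.head = p->next;
        if (!ipintrq.head) ipintrq.tail = NULL;
        /* ... IP input processing (IPin) would run here ... */
    }
}

/* Stub: a real kernel defers the handler to a lower priority so it
 * runs only after the hardware interrupt returns. */
static void post_soft_interrupt(void (*handler)(void)) { handler(); }

void rx_interrupt(struct pkt *p)          /* runs at hardware priority (Ir) */
{
    /* ... link-layer decapsulation happens here ... */
    p->next = NULL;
    if (ipintrq.tail) ipintrq.tail->next = p;   /* append to input queue */
    else              ipintrq.head = p;
    ipintrq.tail = p;
    post_soft_interrupt(ipintr);          /* schedule higher-layer processing */
}
```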
`
`Interface Send Queue
`
`Outgoing packets that wait to be transmitted by
`the network adaptor are placed in the interface
`send queue. The queueing delay depends on the
`
1Although a UDP message is never copied into the socket send buffer, the buffer size restricts the maximum size of UDP message that can be sent.
`
`Medium Access Control (MAC) protocol and the
`bandwidth of the transmission media.
`
`Developing analytic models to estimate delay
`accumulated in the communication subsystems is
`a challenging task. Several factors make it
`extremely complicated. First, incoming packet
`processing is divided into two stages in the kernel
`and scheduled as asynchronous activities with two
`different priorities. This applies to both TCP and
UDP. Second, the dynamics of transport-layer processing, as in TCP, are strongly influenced by its flow-control algorithms. This turns out to be an end-to-end issue: we must also consider how quickly the remote peer can accept packets, as well as the end-to-end network latency.
We cannot currently develop analytic models of the delay accumulated in communication
`subsystems because further study is required to
`capture the end-to-end dynamics in TCP. For now,
`we develop a mean value model of the overhead
`delay to be used as the service demands for queue-
`ing models of delay in the future.
`
`3 Analytic Model for Overall
`Communication Overhead
`
`In Section 2, we introduced the different catego-
`ries of overhead generated in communication sub-
`systems. In this section, we use them to develop
`analytic models of the overall overhead for send-
`ing and receiving a message. Since TCP has a
`much richer transport functionality than UDP
`does, it is impractical to use a single model to
describe both of them. Four overhead models, TCPsend(m), TCPrecv(m), UDPsend(m), and UDPrecv(m), are built, with m denoting the size of the message to be sent or received.
`
`3.1 Processing Overhead for
`Sending and Receiving a Packet
`in the Bottom Layer
`
`We analyze the bottom layer first because both
`TCP and UDP employ the same processing steps
`in this layer. The bottom layer corresponds to the
`link layer or the MAC sublayer in the Open Sys-
`tem Interconnection (OSI) reference model.
`
`5
`
`INTEL EX.1248.005
`
`
`
Table 2. Breakdown of Sending Overhead in Bottom Layers

Layer                     Calling Sequence             Processing Overhead
Network Interface Layer   ether_output(), enoutput()   Link-layer encapsulation
                          enstart(), en_senddone()     Data copy from kernel space to I/O space (by PIO)
Network Adaptor                                        Packet transmission
`
Table 3. Breakdown of Receiving Overhead in Bottom Layers

Layer                     Calling Sequence             Processing Overhead
Network Interface Layer   ether_input(), en_recv(),    Link-layer decapsulation
                          en_srecv() or en_lrecv()
Network Adaptor                                        Packet reception and data copy from I/O space
                                                       to kernel space (by DMA)
`
`Within the communication subsystems, the bot-
`tom layer is the network interface layer.
`For sending a packet, control enters the bottom
layer when IP makes an output request, ether_output(), to the interface chosen by the
`routing algorithm. The network interface layer
`encapsulates the datagram in its link-layer format
`and places the outgoing packet in the interface
`send queue. If the network adaptor is already
`active, the control returns directly; otherwise, it
`returns after starting the network adaptor for trans-
`mission by calling enstart(). After the transmis-
`sion is finished, the network adaptor generates a
`hardware interrupt for transmission completion.
`The interrupt-handling routine, en_senddone(),
`removes the next packet from the interface send
`queue, copies it to the adaptor buffer, and restarts
`the adaptor.
`The processing overhead, hence, is equal to:
`
`
`
ETHout + Mka(m) + Is + Tx(m)
`
`This formula includes overhead for link-layer
`encapsulation, ETHout, overhead for data move-
`ment from kernel to adaptor, Mka(m), interrupt
`service for transmission complete, Is, and packet
`transmission time, Tx(m). The data movement is
`done by the CPU in enstart(). It takes place before
`the control returns if the network adaptor is not
`active; otherwise, it occurs later during the inter-
`rupt service interval if the network adaptor is
`busy. The sending overhead breakdown in the bot-
`tom layer is shown in Table 2.
`
Upon receiving a packet, the network adaptor first DMAs the packet from adaptor to kernel. When the DMA is complete, a receiving hardware interrupt (SPLINET) is generated by the network adaptor, and control starts from the interrupt service routine, en_srecv() or en_lrecv(). The first step is link-layer decapsula-
`tion. Next, the packet is placed in the protocol
`input queue, and a software interrupt (SPLIMP) is
`posted to initiate higher-layer protocol processing
`later. Before the interrupt returns, it restarts the
`adaptor to receive the next packet.
`The processing overhead, hence, is equal to:
Rx(m) + Mak(m) + Ir
`
`This formula includes packet reception time,
`Rx(m), data movement from adaptor to kernel,
Mak(m), and interrupt service time, Ir. Since the entire link-layer processing is accomplished within the interrupt service interval, the interrupt
`service time includes the link-layer decapsulation
`overhead. The receiving overhead breakdown is
`shown in Table 3.
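The two bottom-layer expressions translate directly into code. In the sketch below every cost term is a hypothetical placeholder; a real application of the model would substitute measured values.

```c
/* Bottom-layer overhead per packet of m bytes, directly from the two
 * formulas above; every constant is a placeholder, not a measurement. */
static double ETH_out = 15.0, I_s = 40.0, I_r = 60.0;    /* us, hypothetical */
static double Mka(double m) { return 12.0 + 0.015 * m; } /* PIO copy */
static double Mak(double m) { return  8.0 + 0.005 * m; } /* DMA copy */
static double Tx(double m)  { return m * 8.0 / 10.0; }   /* 10 Mb/s Ethernet */
static double Rx(double m)  { return Tx(m); }            /* Tx and Rx take equal time */

double bottom_send(double m) { return ETH_out + Mka(m) + I_s + Tx(m); }
double bottom_recv(double m) { return Rx(m) + Mak(m) + I_r; }
```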
`
`3.2 Processing Overhead for
`Sending a UDP Message
`
`The UDP sending path is a sequence of kernel
`subroutine calls traversing down the software lay-
`ers. The overhead breakdown and calling
sequence in the upper software layers are shown in
`Table 4.
`
`6
`
`INTEL EX.1248.006
`
`
`
Table 4. Overhead Breakdown in UDP Sending Path

Layer            Calling Sequence          Processing Overhead
Socket Layer     sendmsg(), sosend()       Data copy from user space to kernel space
Protocol Layer   udp_output()              UDP checksum
                 ip_output()               IP header checksum and fragmentation of large UDP datagrams
`
Table 5. Overhead Breakdown in UDP Receiving Path

Layer            Calling Sequence          Processing Overhead
Socket Layer     recvmsg(), soreceive()    Data copy from kernel space to user space
Protocol Layer   udp_input()               UDP checksum
                 ipintr()                  IP header checksum and reassembly of IP fragments
`
Control enters the kernel from the system call sendmsg() in the socket interface. The only significant overhead in the socket layer is data movement from user space to kernel space. The UDP output routine, udp_output(), contributes a fixed protocol-specific overhead and the overhead for checksum computation. Processing overhead in the network layer can be complicated if fragmentation of a large IP datagram to fit the path Maximum Transmission Unit (MTU) is required. As a result, the total overhead below the UDP layer is the product of the fragmentation factor f = ⌈(m+8)/(MTU−20)⌉ and the sending overhead for an IP datagram of size MTU.

From the above analysis, we derive the total overhead for sending a UDP message of size m:

UDPsend(m) = Muk(m) + UDPout + CS(m+20)
           + f · {IPout + ETHout + Mka(MTU+14) + Is + Tx(MTU+14)}

This formula includes overhead for data movement from user space to kernel space, Muk(m), UDP protocol-specific overhead, UDPout, overhead for checksum computation, CS(m+20), and the total overhead below the UDP layer. Applying the fragmentation factor f = ⌈(m+8)/(MTU−20)⌉ and writing Muk(m) as the memory-to-memory copy Mmm(m), we get:

UDPsend(m) = Mmm(m) + UDPout + CS(m+20)
           + Mka(m + 8 + 34f) + Tx(m + 8 + 34f)
           + f · (IPout + ETHout + Is)
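The UDPsend(m) model can be evaluated numerically as follows; the fixed costs and per-byte rates are hypothetical placeholders standing in for measured service demands.

```c
#include <math.h>

/* UDPsend(m) from the simplified formula above.  All constants are
 * hypothetical placeholders (microseconds), not measured values. */
#define MTU 1500.0                      /* Ethernet MTU */

static double UDP_out = 25.0, IP_out = 30.0, ETH_out = 15.0, I_s = 40.0;
static double Mmm(double m) { return 10.0 + 0.010 * m; } /* user->kernel copy */
static double Mka(double m) { return 12.0 + 0.015 * m; } /* kernel->adaptor, PIO */
static double CS(double m)  { return  5.0 + 0.008 * m; } /* checksum */
static double Tx(double m)  { return m * 8.0 / 10.0; }   /* 10 Mb/s Ethernet */

double udp_send(double m)               /* m = UDP message size in bytes */
{
    double f = ceil((m + 8.0) / (MTU - 20.0));  /* fragmentation factor */

    return Mmm(m) + UDP_out + CS(m + 20.0)
         + Mka(m + 8.0 + 34.0 * f)              /* all fragment headers */
         + Tx(m + 8.0 + 34.0 * f)
         + f * (IP_out + ETH_out + I_s);
}
```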
`
3.3 Processing Overhead for Receiving a UDP Message

For the UDP receiving path, the control sequence consists of a hardware interrupt, a software interrupt, and “upcalls” to kernel subroutines that traverse up the software layers. The overhead breakdown and calling sequence in the upper software layers are shown in Table 5.

Control enters the kernel from the receiving hardware interrupt (SPLINET). Link-layer decapsulation is done in the interrupt service. Control enters the protocol layer through the software interrupt (SPLIMP), with the IP input routine, ipintr(), as its interrupt handler. For large UDP messages, the reassembly of IP fragments is accomplished in this routine. As a result, the total overhead below the UDP layer is the product of the fragmentation factor and the receiving overhead for an IP datagram of size MTU. The UDP input routine, udp_input(), contributes a fixed protocol-specific overhead and overhead for checksum computation. After that, the socket receiving routine, soreceive(), “wakes up” the receiving process blocked in the system call recvmsg(), and this introduces a context switch overhead. The only significant overhead in the socket layer is the data movement overhead from kernel space to user space.

From the above analysis, we derive the total overhead for receiving a UDP message of size m:
`
`7
`
`INTEL EX.1248.007
`
`·
`·
`@
`
`
UDPrecv(m) = ⌈(m+8)/(MTU−20)⌉ · {Rx(MTU+14) + Mak(MTU+14) + Ir + IPin}
           + UDPin + CS(m+20) + Mku(m) + C
`
This formula includes the total overhead below the UDP layer, UDP protocol-specific overhead, UDPin, overhead for checksum computation, CS(m+20), overhead for data movement from kernel space to user space, Mku(m), and context switch overhead, C. Applying the same fragmentation factor f, we get:

UDPrecv(m) = C + Mmm(m) + UDPin + CS(m+20)
           + Mak(m + 8 + 34f) + Rx(m + 8 + 34f)
           + f · (IPin + Ir)
`
`
`3.4 Processing Overhead for
`Sending a TCP Message
`
The TCP sending path is a sequence of kernel subroutine calls that traverses down the software layers. The calling sequence and overhead breakdown in the upper software layers are shown in Table 6. The overhead breakdown in the protocol layer differs from UDP in two ways. First, the breakdown of a large message to fit the path MTU is done in TCP instead of IP. Second, there is also overhead for receiving the TCP acknowledgments (ACKs) associated with the message sent.

Table 6. Overhead Breakdown in TCP Sending Path

Layer            Calling Sequence          Processing Overhead
Socket Layer     write(), sosend()         Data copy from user space to kernel space
Protocol Layer   tcp_output()              TCP checksum (message sent in units no larger than MSS)
                 ip_output()               IP header checksum

Table 7. Overhead Breakdown in TCP Receiving Path

Layer            Calling Sequence          Processing Overhead
Socket Layer     read(), soreceive()       Data copy from kernel space to user space
Protocol Layer   tcp_input()               TCP checksum
                 ipintr()                  IP header checksum

Before we derive the total overhead for sending a TCP message, we first consider the cost of sending a TCP segment and receiving an ACK. Following an analysis similar to that for UDP, the overhead for sending a TCP segment of size MSS is:

SEGsend = TCPout + CS(MSS+32) + IPout + ETHout + Mka(MTU+14) + Is + Tx(MTU+14)

The checksum computation includes the TCP header (20 bytes) and the IP pseudo header (12 bytes). Similarly, the overhead for receiving an ACK is:

ACKrecv = Rx(54) + Mak(54) + Ir + IPin + TCPin + CS(32)

where the ACK packet is 54 bytes long (20-byte TCP header, 20-byte IP header, and 14-byte Ethernet header).

To simplify our analysis, we assume there is no packet loss (no retransmission) and there is always an acknowledgment for each segment sent (no ACK compression). Hence, the total overhead for sending a TCP message of size m is equal to:

TCPsend(m) = Muk(m) + ⌈m/MSS⌉ · {SEGsend + ACKrecv}

If we let the segmentation factor g = ⌈m/MSS⌉, we get:
`
`8
`
`INTEL EX.1248.008
`
`·
`·
`@
`·
`
`
TCPsend(m) = Mmm(m) + Mka(m + 54g) + Mak(54g) + CS(m + 64g) + Tx(m + 108g)
           + g · (TCPin + TCPout + IPin + IPout + ETHout + Ir + Is)
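As with the UDP models, TCPsend(m) can be evaluated numerically; the constants below are hypothetical placeholders, and mss is a parameter since the MSS is negotiated per connection.

```c
#include <math.h>

/* TCPsend(m) from the simplified formula above; constants are the
 * same kind of hypothetical placeholders used in the UDP sketches. */
static double TCP_in = 35.0, TCP_out = 45.0, IP_in = 30.0, IP_out = 30.0;
static double ETH_out = 15.0, I_s = 40.0, I_r = 60.0;
static double Mmm(double m) { return 10.0 + 0.010 * m; }
static double Mka(double m) { return 12.0 + 0.015 * m; }
static double Mak(double m) { return  8.0 + 0.005 * m; }
static double CS(double m)  { return  5.0 + 0.008 * m; }
static double Tx(double m)  { return m * 8.0 / 10.0; }

double tcp_send(double m, double mss)   /* m = message size, mss = TCP MSS */
{
    double g = ceil(m / mss);           /* segmentation factor */

    return Mmm(m)
         + Mka(m + 54.0 * g) + Mak(54.0 * g)  /* segments out, ACKs in */
         + CS(m + 64.0 * g)
         + Tx(m + 108.0 * g)                  /* Tx and Rx lumped together */
         + g * (TCP_in + TCP_out + IP_in + IP_out + ETH_out + I_r + I_s);
}
```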
`
`3.5 Processing Overhead for
`Receiving a TCP Message
`
`For receiving a TCP message, the overhead break-
`down and calling sequence in the upper software
layers are shown in Table 7. Following an approach similar to that in the TCP sending path, we derive the
`total overhead for receiving a TCP message of
`size m:
`
TCPrecv(m) = C + Mku(m) + ⌈m/MSS⌉ · {SEGrecv + ACKsend}
`
`Applying the same segmentation factor, we get:
`
`)
`TCPrecv m(
` +(C Mmm m( ) Mak m 54g
`
`
`+
`+
`=
`)
`
`) CS m 64g+(
`
`) Tx m 108g+(
`Mka 54g(
`
`+
`+
`+
`(
`+
`+
`+
`+
`+
`TCPin TCPout
`IPin
`IPout ET Hout
`g
`)
`+
`+
`I r
`I s
`
`)
`
`3.6 Scheduling Issue in
`Communication Subsystems
`
`One important characteristic of communication
`subsystems is that packet reception receives a
`higher priority than does packet transmission. This
`imbalance is because incoming packet processing
`is interrupt-driven, which has a higher priority
`than do other kernel activities. There is a similar
`situation in input processing: the MAC layer has a
higher priority than the protocol layer. For server machines under a heavy network load, outgoing packets are likely to suffer reduced throughput from this imbalance in the scheduling of packet processing.
`Another important characteristic of communi-
`cation subsystems is that the OS has no way to
`control the load offered to it because it has no con-
`trol over the number of clients, or over their
`aggressiveness [10]. Although flow control can
`restrict the data that arrives over an existing con-
`nection, it cannot control the rate of requests for
`new connections.
`
`For an overloaded server, these two characteris-
`tics cause the OS to spend more time on incoming
`packet processing than on the rest of request ser-
`vices. The direct consequences for application
`performance are throughput drop and increased
`response time. The overloaded behavior of com-
`munication subsystems, therefore, deserves more
`investigation to achieve better application perfor-
`mance.
`
`4 Case Studies
`
`In this section, we present two case studies, the
`Internet web server and the DCE/RPC server, to
`analyze the communication overhead with the
`analytic models. Both applications require inten-
`sive network communication to handle the enor-
`mous request and reply messages. As a result, a
`careful analysis of the communication overhead
`can help us investigate the performance problems
`caused by heavy network load.
`
`4.1 The Internet Web Server
`
The World Wide Web (WWW) provides quick and easy access to a large variety of information across the Internet. For busy Internet web
`servers, the general performance problems experi-
`enced are long response time and short-term ser-
vice unavailability. A common solution for these problems is to off-load the enormous request load from a single machine to replicated servers [11]. Researchers have also identified inefficiency in the HyperText Transfer Protocol (HTTP)1 itself, which uses a separate TCP connection for each request. An enhanced HTTP, which uses a single TCP connection for all data exchange, reduces the response time caused by round-trip network latency [12]. In this subsection, we apply the ana-
`lytic models to examine the communication over-
`head caused by HTTP requests.
HTTP relies on TCP to provide reliable message delivery. The protocol itself is quite simple: a TCP connection is established for retrieving a remote document and torn down after the document is received.

1HTTP is an application-layer protocol for web clients and web servers to exchange data.
`
`9
`
`INTEL EX.1248.009
`
`·
`·
`·
`
`
We can divide an HTTP transaction into five steps:

 1. The client establishes a TCP connection to the server,
 2. The client sends a request message,
 3. The server retrieves the document according to the request,
 4. The server sends the document in a reply message, and
 5. The server tears down the TCP connection.

A schematic diagram of an HTTP transaction, from live data obtained with tcpdump, is shown in Figure 2. The first three packets are for TCP connection establishment. The request message is sent in three TCP segments and the reply message is sent in four TCP segments. The last four packets are for connection teardown. The rest of the packets are acknowledgments.

Figure 2. A Schematic HTTP Transaction

    client → server: SYN
    server → client: SYN+ACK
    client → server: ACK
    client → server: 1:537(536), 537:1073(536), 1073:1147(74)    (request)
    server → client: ACK(537)
    server → client: 1:513(512)+ACK(1147), 513:1025(512)         (reply)
    client → server: ACK(1025)
    server → client: 1025:1537(512), 1537:1853(316)+FIN
    client → server: ACK(1854), FIN
    server → client: ACK

At the server machine, the communication overhead caused by this HTTP transaction is approximately:

HTTPreq = SYNrecv + ACKsend + ACKrecv + TCPrecv(1146) + TCPsend(1852)
        + FINsend + ACKrecv + FINrecv + ACKsend

By applying the analytic models (from Sections 3.4 and 3.5) for sending and receiving a TCP message, we get:

HTTPreq = SYNrecv + ACKsend + ACKrecv
        + C + Mku(1146) + 3 · (SEGrecv + ACKsend)
        + Muk(1852) + 4 · (SEGsend + ACKrecv)
        + FINsend + ACKrecv + FINrecv + ACKsend

Our models overestimate the number of ACKs sent and received because of the delayed-acknowledgment effect (ACK compression) present in most TCP implementations. We estimate that it requires 15 hardware interrupts (8 Ir and 7 Is) to service this HTTP transaction. For a busy web server with a peak request rate of about 60 requests per second [11], this will generate approximately 900 interrupts per second. Research has shown that the cost of interrupt service is not trivial in modern computer systems. Therefore, avoiding congestive collapse or “livelock”1 is an important issue in communication subsystem design [10,13].
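As a quick sanity check on these figures, the interrupt arithmetic for the transaction in Figure 2 can be worked out directly:

```c
#include <stdio.h>

/* Interrupt arithmetic for the HTTP transaction in Figure 2:
 * 8 packets arrive at the server (8 Ir) and 7 are sent (7 Is). */
int main(void)
{
    int ir = 8, is = 7;            /* per-transaction interrupts */
    double req_per_sec = 60.0;     /* peak request rate from [11] */

    printf("interrupts per transaction: %d\n", ir + is);              /* 15 */
    printf("interrupts per second: %.0f\n", (ir + is) * req_per_sec); /* 900 */
    return 0;
}
```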
`
`4.2 The DCE/RPC
`
`The Open Software Foundation’s Distributed
`Computing Environment (OSF/DCE) is a plat-
`form that facilitates interoperability of distributed
`applications in a heterogeneous environment.
`DCE relies on the remote procedure call (RPC) as
`its communication paradigm to construct distrib-
`uted applications based on client/server architec-
`ture. Detailed analysis of the DCE/RPC has shown
`the significant communication overhead for trans-
`porting request and reply messages between client
`and server machines [7]. In this subsection, we
`apply the analytic models to analyze the commu-
`nication overhead of a null RPC.
`
`1A livelocked server spends most of its resources on
`non-productive operations, such as rejecting new con-
`nections or aborting partially-completed ones [10].
`
`
`10
`
`INTEL EX.1248.010
`
`·
`·
`
`
`The software layering of the DCE/RPC is shown
`in Figure 3. It is built above UDP/IP. Since there
`is no flow control or reliable message delivery in
`UDP, an extra RPC layer is needed to supplement
`the transport layer functions required. The proto-
`col software is implemented as a dynamically-
linked library, the RPC runtime. The flow-control mechanism in the RPC layer is a combined window scheme similar to TCP's. The initial window size is set to 4 KB and is then doubled after each acknowledgment is received, up to a maximum of 32 KB.
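The window-growth rule translates into a few lines of code; this sketch is ours, not RPC runtime source.

```c
/* DCE/RPC flow-control window: starts at 4 KB and doubles each time
 * an acknowledgment is received, capped at 32 KB. */
int rpc_window_kb(int acks_received)
{
    int w = 4;                       /* initial window size, KB */
    while (acks_received-- > 0 && w < 32)
        w *= 2;                      /* 4 -> 8 -> 16 -> 32 */
    return w;
}
```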
`
Figure 3. Software Layering of the DCE/RPC (the client application and stub sit above the RPC runtime, which runs over UDP and IP; the server side mirrors this stack, with the RPC request and reply paths crossing the RPC layer between the two runtimes)
`
`The major overhead intro