`
`Afterburner: Architectural Support
`for High-Performance Protocols
`
`Chris Dalton, Greg Watson, Dave Banks,
`Costas Calamvolris, Aled Edwards, John Lumley
`Networks & Communications Laboratories
`HP Laboratories Bristol
`HPL-93-46
`July, 1993
`
Keywords: network interfaces, TCP/IP, Gb/s networks, network protocols
`
Current workstations are often unable to make link-level bandwidth available to user applications. We argue that this poor performance is caused by unnecessary copying of data by the various network protocols. We describe three techniques that can reduce the number of copies performed, and we explore one - the single-copy technique - in further detail.

We present a novel network-independent card, called Afterburner, that can support a single-copy stack at rates up to 1 Gbit/s. We describe the modifications that were made to the current implementations of protocols in order to achieve a single copy between application buffers and the network card. Finally, we give the measured performance obtained by applications using TCP/IP and the Afterburner card for large data transfers.
`
`Internal Accession Date Only
`© Copyright Hewlett-Packard Company 1993
`
`
`
`
`
`
`Afterburner: Architectural Support for
`High-performance Protocols
`
`Many researchers have observed that while the link level rates of some networks are now
`in the Gbit/s range, the effective throughput between remote applications is usually an
`order of magnitude less. A number of components within computing systems have been
postulated as the cause of this imbalance. Several years ago the transport and network protocols came under great scrutiny as they were considered to be 'heavyweight' and thus
`computationally expensive. This line of thought encouraged many researchers to explore
`ways to execute protocols in parallel, or to design new 'lightweight' protocols. Other sources
`of problems were thought to be poor protocol implementations, high overheads associated
`with operating system functions, and a generally poor interface between applications and
`the network services.
`
Clark et al. [2] suggested that even heavyweight protocols, such as the widely used TCP/IP protocol combination, could be extremely efficient if implemented sensibly. More recently, Jacobson has shown that most TCP/IP packets can be processed by fewer than 100 instructions [4]. It is now widely believed that while a poor implementation will impede performance, protocols such as TCP are not inherent limiting factors.
`
`One reason many implementations fail to achieve high throughput is that they access user
`data several times between the instant the data are generated and the instant the data are
transmitted on the network. In the rest of this paper we analyse this behaviour in a widely-
`used implementation of TCP, and consider three proposals for improving its performance.
`We describe our experimental implementation of one of these proposals, which uses novel
`hardware together with a revised implementation of the protocol. To conclude, we present
`measurements of the system's performance.
`
`The bottleneck: copying data
`
We believe that the speed of protocol implementations in current workstations is limited not
`by their calculation rate, but by how quickly they can move data. This section first reviews
`the design of a popular protocol implementation, then examines its behaviour with reference
`to workstation performance.
`
`The conventional implementation
`
Our example is the HP-UX implementation of TCP/IP, which, like several others, is derived from the 4.3BSD system [7]. This overview focuses on how it treats data, and is rather brief.
`
Figure 1 shows the main stages through which the implementation moves data. On the
`left are listed the functions which move data being transmitted; on the right are those for
received data. Curved arrows represent copies from one buffer to another; straight arrows show other significant reads and writes.
`
[Figure 1: on the transmit side, data move from the Producer's program buffer via send() into a kernel buffer, then through tcp_output() and the driver to the interface; on the receive side, the driver and tcp_input() move incoming packets through a kernel buffer, from which recv() copies the data to the Consumer's program buffer.]

Figure 1: Data movements in a typical TCP/IP implementation
`
`Transmission Producer is a program which has a connection to another machine via a
`stream socket. It has generated a quantity of data in a buffer, and calls the send function to
`transmit it.
`
`Send begins by copying the data into a kernel buffer. The amount of data depends on
`the program - not on the network packet size - and it may be located anywhere in the
`program's data space. The copy allows Producer to reuse its buffer immediately, and gives the
`networking code the freedom to arrange the data into packets and manage their transmission
`as it sees fit.
`
Tcp_output gathers a quantity of data from the kernel buffer and begins to form it into
`a packet. Where possible, this is done using references rather than copying. However,
`tcp_output does have to calculate the packet's checksum and include it in a header; this
`entails reading the entire packet.
`
`Eventually, the network interface's device driver receives the list of headers and data pointers.
`It copies the data to the interface, which transmits it to the network.
`
`Reception The driver copies an incoming packet into a kernel buffer, then starts it moving
`through the protocol receive functions. Most of these only look at the headers.
`
Tcp_input, however, reads all the data in the packet to calculate a checksum to compare with
`the one in the header. It places valid data in a queue for the appropriate socket, again using
`pointers rather than copying.
`
`2
`
`
`
`WISTRON CORP. EXHIBIT 1027.003
`
`
`
`After/Jurner: Arcl1itectural Support for High-performance Protocols
`
`Some time later, the program Consumer calls the function recv, which copies data from the
`kernel buffer into a specified area. As with send, Consumer may request any amount of data,
`regardless of the network packet size, and direct the data anywhere in its data space.
`
`Where does the time go?
`
`The standard implementation of TCP /IP copies data twice and reads it once in moving it
`between the program and the network. Clearly, the rate at which a connection can convey
`data is limited by the rates at which the system can perform these basic operations.
`
As an example, consider a system on which the Producer program is sending a continuous stream of data using TCP. Our measurements show that an HP 9000/730 workstation can copy data from a buffer in cache to one not in the cache at around 50 Mbyte/s1, or 19 nanoseconds per byte. The rate for copying data from memory to the network interface
`is similar. The checksum calculation proceeds at around 127 Mbyte/s, or 7.6 ns per byte.
`All of these operations are limited by memory bandwidth, rather than processor speed.
`
`Each byte of an outgoing packet, then, takes at least 45.6 ns to process: the fastest this
implementation of TCP/IP can move data is about 21 Mbyte/s (176 Mbits/s). Overheads
`such as protocol handling and operating system functions will ensure it never realizes this
`rate.
`
`Several schemes for increasing TCP throughput try to eliminate the checksum calculation.
`Jacobson [5] has shown that some processors, including the HP 9000/700, are able to calculate
`the checksum while copying the data without reducing the copy rate. Others add support
`for the calculation to the interface hardware. Still others propose simply dispensing with the
`checksum in certain circumstances.
`
`Our figures, however, suggest that for transmission, the checksum calculation accounts for
`only about one-sixth of the total data manipulation time: getting rid of it increases the
`upper bound to around 25 Mbyte/s (211 Mbits/s). Each data copy, on the other hand,
`takes more than a third of the total. Eliminating one copy would increase the data handling
`rate to more than 36 Mbyte/s (301 Mbits/s), and removing both a copy and the checksum
`calculation would increase it to 50 Mbyte/s (421 Mbits/s). Clearly, there are considerable
`rewards for reducing the number of copies the stack performs.
`
`For a better idea of the effect the changes would have in practice, we need to include the
`other overheads incurred in sending packets. In particular, we need to consider the time
`taken by each call to send, and the time needed to process each packet in addition to moving
`the data. On a 9000/730, these are roughly 40 µs and 110 µs respectively. These times
`are large, but include overheads such as context switches, interrupts, and processing TCP
`acknowledgements.
`
1 We use the convention that Kbyte and Mbyte denote 2^10 and 2^20 bytes respectively, but Mbit and Gbit denote 10^6 and 10^9 bits.
`
`3
`
`
`
`WISTRON CORP. EXHIBIT 1027.004
`
`
`
`Afterburner: Architectural Support for High-performance Protocols
`
`Table 1 gives estimates of TCP throughput for three implementations: the conventional
`one, one without a separate checksum calculation ("two-copy" for short), and one using
`just a single copy operation. The estimates assume a stream transmission using 4 Kbyte
`packets, with each call to send also writing 4 Kbytes. Even with such small packets and
`large per-packet overheads, the single-copy approach is significantly faster.
`
Implementation    Time per packet (µs)                  Throughput
                  send()   packet   data    Total       (Mbyte/s)
Conventional        40       110     187     337          11.6
Two-copy            40       110     156     306          12.8
Single-copy         40       110      78     228          17.1
`
`Table 1: Estimated TCP transmission rates for three implementations
`
`Analysing the receiver in the same way gives similar results, as shown in table 2. The main
`differences from transmission are that copying data from the interface to memory is slower,
`at around 32 Mbyte/s, or 30 ns per byte, and that the overheads of handling an incoming
packet and the recv system call are also smaller, approximately 90 µs and 15 µs respectively.
`
Implementation    Time per packet (µs)                  Throughput
                  recv()   packet   data    Total       (Mbyte/s)
Conventional        15        90     256     361          10.8
Two-copy            15        90     193     298          13.1
Single-copy         15        90     124     229          17.1
`
`Table 2: Estimated TCP reception rates for three implementations
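
The arithmetic behind these estimates is simple enough to reproduce directly. The short program below recomputes the throughputs in tables 1 and 2 from the per-byte and per-packet costs quoted in the text; the function and variable names are ours, and the receive-side data-handling times are taken straight from table 2 rather than derived from a per-byte breakdown. The results round to the same figures as the tables.

    /* Illustrative only: recompute the estimates in tables 1 and 2 from the
     * measured HP 9000/730 costs quoted in the text.  Names are ours. */
    #include <stdio.h>

    #define PACKET_BYTES 4096.0              /* 4 Kbyte packets, one call each */
    #define MBYTE        (1024.0 * 1024.0)

    /* syscall overhead + per-packet protocol overhead + data handling, in us */
    static double throughput_mbyte_s(double syscall_us, double per_packet_us,
                                     double data_us)
    {
        double total_us = syscall_us + per_packet_us + data_us;
        return (PACKET_BYTES / (total_us * 1e-6)) / MBYTE;
    }

    int main(void)
    {
        /* Transmit-side data handling, ns/byte: copy 19, checksum 7.6, copy 19 */
        double tx_ns[] = { 19 + 7.6 + 19,    /* conventional                    */
                           19 + 19,          /* two-copy (no separate checksum) */
                           19 };             /* single-copy                     */
        /* Receive-side data-handling times per packet, us (from table 2) */
        double rx_us[] = { 256, 193, 124 };
        const char *name[] = { "conventional", "two-copy", "single-copy" };

        for (int i = 0; i < 3; i++)
            printf("Tx %-12s %4.1f Mbyte/s\n", name[i],
                   throughput_mbyte_s(40, 110, PACKET_BYTES * tx_ns[i] / 1000.0));
        for (int i = 0; i < 3; i++)
            printf("Rx %-12s %4.1f Mbyte/s\n", name[i],
                   throughput_mbyte_s(15, 90, rx_us[i]));
        return 0;
    }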
`
Before we consider the single-copy approach in more detail, we examine the trends in two rel-
`evant technologies: memory bandwidth and CPU performance. Memory bandwidth affects
`the transmission of every byte and, for large packets, is arguably the limiting factor. CPU
`performance determines the time to execute the protocols for each packet, but this effort
`is independent of the length of the packet. (A more detailed look at the effect of memory
`systems is given by Druschel et al. [3] in this issue.)
`
`Over the past few years main memory (Dynamic RAM) has been getting faster at the rate of
`about 7% per annum whereas CPU ratings in terms of instructions per second have increased
`by about 50% per annum. We believe that reducing the number of data copies in protocol
`implementations will yield significant benefits as long as this trend continues.
`
`4
`
`
`
`WISTRON CORP. EXHIBIT 1027.005
`
`
`
`Afterburner: Architec;tural Support for Hig/1-performauce Protocols
`
`Minimizing data movement
`
`Several suggestions have been made to reduce the number of times that application data
`must be accessed [3]. This section describes three of them: copy-on-write, page remapping
`and single-copy.
`
`Copy-on-write When a program sends data, the system makes the memory pages that
`contain the data read-only. The data go to the network interface directly from the program's
`buffer. The pages are made read-write again once the peer process has acknowledged the
`data.
`
`The program is able to continue, but if it tries to write to the same buffer before the data
`are sent, the Memory Manager blocks the program and the networking code copies the data
`into a system buffer. (In a variation called sleep-on-write, the Memory Manager forces the
`process to wait until the transmission is complete.)
`
`Copy-on-write needs changes to the system's Memory Manager as well as the networking
`code. For best performance, programs should be coded not to write to buffers that contain
`data in transit.
`
`Page remapping The system maintains a set of buffers for incoming data. The network
`interface splits incoming packets, placing the headers in one buffer and the data in another
`starting at a memory page boundary. When a program receives the data, if the buffer it
`supplies also starts on a page boundary, the Memory Manager exchanges it for the buffer
`containing the data by remapping the corresponding pages, i.e., by editing the system's
`virtual memory tables.
`
`Page remapping needs changes to both the memory management and networking code. In
addition, the network interface hardware has to be able to interpret incoming packets well
`enough to find the headers and data. Application programs must be written to use suitably
`aligned buffers.
`
Single-copy This technique works when both sending and receiving data. It needs a dedicated area of memory which the processor and network interface share without affecting
`each other's performance.
`
`When a program sends data, the networking code copies the data immediately into a buffer
`in the dedicated area. The various protocol handling routines prefix their headers to the
`data in the buffer, then the network interface transmits the whole packet in one operation.
`
`The interface places incoming packets in buffers in this area before informing the network
code of their arrival. The data remain in the dedicated buffer until a program asks to receive them, when they are copied into the program's buffer.
`
`The single-copy technique only affects the system's networking code. Significantly, user
`programs get the full benefit without being altered in any way.
`
All three techniques can reduce the number of copy operations needed by a protocol imple-
`mentation. All need hardware support of some kind, and their relative effectiveness depends
`on the characteristics of that support and of the data traffic being handled. In our view,
single-copy has three distinct advantages over the others for general TCP traffic. First,
`it affects only the networking code in the system. Second, it speeds up both sending and
`receiving data. Third, most existing programs benefit without being recoded or recompiled.
`
`The Afterburner Card
`
Van Jacobson has proposed WITLESS2 [4], a network interface designed to support single-copy implementations of protocols such as TCP. We previously built an FDDI interface, Medusa [1], as a test of the WITLESS architecture. The results were excellent, and we
`decided to adapt the design to support link rates up to 1 Gbit/s.
`
Afterburner is designed for the HP 700 series of workstations. It occupies a slot in the
`workstation's fast graphics bus, and is mapped into the processor's memory. Figure 2 shows
`Afterburner's architecture.
`
`The focus of the card is a buffer built from three-port Video RAMs (VRAMs). One port
`provides random access to the buffer for the workstation's CPU; the other two are high-speed
serial ports connected to fast I/O pipes. To the CPU, the VRAM is one large buffer, but
`Afterburner itself treats the VRAM as a set of distinct, equal-sized blocks. The block size
`is set by software when the card is initialised, and ranges from 2 Kbytes to 64 Kbytes.
`
`Sending and receiving data The CPU and Afterburner communicate with each other
`mainly through four FIFOs: two for transmission (Tx) and two for reception (Rx). An
`entry in one of these FIFOs is a descriptor which specifies a block of VRAM and tells how
`many words of information it contains. Descriptors in the Tx_Free and Rx_Free FIFOs
`identify blocks of VRAM available for use; those in the Tx_Ready and Rx_Ready indicate
`data waiting to be processed.
`
`To transmit a message, the CPU writes it into the VRAM starting at a block boundary, then
`puts the appropriate descriptor into the Tx_Ready FIFO. Afterburner takes the entry from
`the FIFO and streams the message from the VRAM to the Tx_Data FIFO. When finished,
`Afterburner places the block address into the Tx_Free FIFO.
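
To make the handshake concrete, the sketch below shows one plausible host-side view of a descriptor and of the transmit step. The structure layout, the assumption of 32-bit words, and the names ab_desc, ab_fifo_push, ab_fifo_pop and ab_vram are our own illustration, not the card's actual register interface.

    /* Illustrative host-side sketch of the descriptor FIFOs (assumed API,
     * not the real Afterburner register interface). */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct ab_desc {
        uint16_t block;      /* which VRAM block this descriptor names          */
        uint32_t words;      /* how many words of information the block holds   */
        unsigned continued;  /* next descriptor in the queue belongs to the same
                                message (see "Large messages" below)            */
    };

    struct ab_fifo;                                  /* opaque FIFO handle       */
    extern struct ab_fifo *tx_free, *tx_ready;       /* two of the four FIFOs    */
    extern struct ab_desc  ab_fifo_pop(struct ab_fifo *);
    extern void            ab_fifo_push(struct ab_fifo *, struct ab_desc);
    extern uint8_t        *ab_vram(uint16_t block);  /* memory-mapped VRAM block */

    /* Send one message that fits in a single VRAM block: write the data at a
     * block boundary, then hand the descriptor to the card via Tx_Ready.  When
     * the card has streamed the block to the Tx data FIFO, it returns the
     * descriptor through Tx_Free for re-use. */
    void ab_transmit(const void *msg, size_t len)
    {
        struct ab_desc d = ab_fifo_pop(tx_free);     /* claim a free block       */
        memcpy(ab_vram(d.block), msg, len);
        d.words = (uint32_t)((len + 3) / 4);         /* assuming 32-bit words    */
        d.continued = 0;
        ab_fifo_push(tx_ready, d);
    }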
`
[Figure 2: block diagram of the Afterburner card. The workstation interface connects the host I/O bus to the card's status and control registers and to a 1 Mbyte VRAM buffer memory; the VRAM's serial ports feed the Tx and Rx data FIFOs, which connect to the Link Adapter's Tx and Rx data paths, and a Link Adapter control interface carries interrupts, read/write and reset signals.]

Figure 2: Afterburner Block Diagram

2 Workstation Interface That's Low-cost, Efficient, Scalable, and Stupid

Similarly, when a message arrives on the Rx_Data FIFO, Afterburner takes the first entry from the Rx_Free FIFO, and streams the data into the corresponding block of VRAM. At
`the end of the message, Afterburner fills in the descriptor with the message's length, then
puts it in the Rx_Ready FIFO. When the CPU is ready, it takes the descriptor from the
`Rx_Ready FIFO, processes the data in the block, then returns the descriptor to the Rx_Free
`FIFO. The CPU has to prime the Rx_Free FIFO with some descriptors before Afterburner
`is able to receive data.
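
The receive side can be pictured in the same way. The sketch below reuses the assumed descriptor and FIFO helpers from the previous fragment, with a hypothetical deliver_packet routine standing in for the protocol input path.

    /* Illustrative receive-side sketch, reusing the assumed ab_desc, FIFO and
     * ab_vram helpers above; deliver_packet() stands in for protocol input. */
    extern struct ab_fifo *rx_free, *rx_ready;
    extern void deliver_packet(const uint8_t *data, uint32_t words);

    /* Give the card somewhere to put incoming packets: without descriptors in
     * Rx_Free, Afterburner cannot receive at all. */
    void ab_prime_rx(const uint16_t *blocks, int n)
    {
        for (int i = 0; i < n; i++) {
            struct ab_desc d = { .block = blocks[i], .words = 0, .continued = 0 };
            ab_fifo_push(rx_free, d);
        }
    }

    /* Service one received packet: the card has filled in the length, so the
     * host processes the block and then recycles the descriptor. */
    void ab_receive_one(void)
    {
        struct ab_desc d = ab_fifo_pop(rx_ready);
        deliver_packet(ab_vram(d.block), d.words);
        ab_fifo_push(rx_free, d);
    }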
`
Large messages Although Afterburner allows VRAM blocks up to 64 Kbytes in size, it
`provides a mechanism for handling large packets that is better suited to the wide range of
`message sizes in typical IP traffic. Packets can be built from an arbitrary number of VRAM
`blocks.
`
`In addition to specifying a block and the size of its payload, the descriptor in a FIFO contains
`a flag to indicate whether the next block in the queue belongs to the same message. To send
`a message larger than a VRAM block, the CPU writes the data in several free blocks - they
need not be contiguous - then puts the descriptors in the right order into the Tx_Ready
`FIFO, setting the "continued" flag in all but the last. Afterburner transmits the contents
of the blocks in order as a single message. Long incoming messages are handled in a similar way.
`
`Interrupts When Afterburner has received data, it has at some point to interrupt the host.
`Because interrupts can be expensive - several hundred instructions - it is important for the
`card to signal only when there is useful work to do. Afterburner provides several options.
`The simplest is to interrupt when Rx_Ready becomes non-empty. When long packets are
`common, the most useful is to interrupt when Rx_Ready contains a complete message, i.e.,
`Rx_Ready contains a block with the "continued" bit not set.
`
`The card is also able to interrupt the host when it has transmitted a block (the Tx_Free
FIFO becomes non-empty) or when it has transmitted all it had to do (the Tx_Ready FIFO
`becomes empty). Normally, the card would be configured to interrupt only for incoming
`data.
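
Continuing the same illustrative sketch, a multi-block transmission and the card's "complete message" condition might look roughly as follows; ab_block_size and the way the queue is inspected are again assumed names and simplifications, not the real interface.

    /* Illustrative continuation of the earlier sketch: a message spanning
     * several VRAM blocks, and the receive-interrupt condition described in
     * the text.  ab_block_size() and the FIFO helpers are assumed names. */
    extern uint32_t ab_block_size(void);  /* block size chosen at initialisation */

    void ab_transmit_large(const uint8_t *msg, size_t len)
    {
        size_t bsize = ab_block_size();
        while (len > 0) {
            struct ab_desc d = ab_fifo_pop(tx_free);
            size_t chunk = len < bsize ? len : bsize;
            memcpy(ab_vram(d.block), msg, chunk);
            d.words     = (uint32_t)((chunk + 3) / 4);
            d.continued = (len > chunk);   /* set on all but the last block  */
            ab_fifo_push(tx_ready, d);     /* blocks need not be contiguous  */
            msg += chunk;
            len -= chunk;
        }
    }

    /* "Interrupt when Rx_Ready contains a complete message": the condition
     * holds once a queued descriptor has the continued bit clear. */
    int ab_rx_complete_message(const struct ab_desc *queued, int n)
    {
        for (int i = 0; i < n; i++)
            if (!queued[i].continued)
                return 1;
        return 0;
    }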
`
`Link Adapters So far, we have not mentioned the connection to the physical network.
`When we began to design Afterburner there was no obvious choice for a network operating
`at up to 1 Gbit/s. Rather, there were several possibilities. So, Afterburner is not designed
`for a particular LAN: it has no MAC or Physical layer devices. Instead, it provides a simple
`plug-in interface to a number of "Link Adapters", each designed to connect to a particular
`network.
`
`The interface consists of three connectors. One provides a simple address and data bus
`for the host CPU and Link Adapter to communicate directly. The other two connect the
`Adapter to Afterburner's input and output streams. When Afterburner and an Adapter are
`mated, the combined unit fits into the workstation as a single card.
`
To date, three link adapters have been designed: one for HIPPI [8], one for ATM, and one
`for Jetstream, an experimental Gbit/s LAN developed at HP Labs in Bristol.
`
`One substantial benefit of the separation between Afterburner and the link adapter is that
`network interfaces do not need to be redesigned for each new generation of workstation.
`Only the Afterburner card needs to be redesigned and replaced, and typically, the redesign
`affects only the workstation interface.
`
A Single-copy Implementation of TCP/IP
`
This section describes an implementation of TCP/IP that uses the features provided by
`Afterburner to reduce the movement of data to a single copy. The changes we describe were
`made to the networking code in the 8.07 release of HP-UX, itself derived from that in the
`4.3BSD system.
`
`Ours is not a complete reimplementation of TCP, but simply adds a single-copy path to
the existing TCP code. We did this for practical reasons, but a side-effect is that protocol processing hasn't changed: any changes in performance are mainly from changes in data
`handling.
`
`The principles of the single-copy implementation are simple: put the data on the card as early
`as possible, leave it there as long as possible, and don't touch it in between. Figure 3 gives a
`very simple view of the single-copy route. Compared with the conventional implementation
(figure 1), the socket functions send and receive do most of the work, including the one data
`movement. The other protocol functions handle only a small amount of control information,
`represented by the dotted lines.
`
[Figure 3: on the transmit side, send() copies data from the Producer's program buffer directly into the interface buffer, while tcp_output() and the driver handle only control information; on the receive side, the driver and tcp_input() pass control information and recv() copies the data from the interface buffer to the Consumer's program buffer.]

Figure 3: The single-copy approach
`
In the remainder of this section, we give an overview of the main features of our single-copy stack compared with the standard one. We also discuss a number of issues which have
`emerged during the course of the work.
`
`Data structures - mbufs and clusters
`
`The networking code keeps data in objects called mbufs. An mbuf can hold about 100 bytes
`of data, but in another form, called a cluster, it can hold several Kilobytes. Most networking
`data structures, including packets under construction and the kernel buffer in figure 1, consist
`of linked lists, or chains, of mbufs and clusters. The system provides a set of functions for
handling mbufs, e.g., for making a copy of a chain, or for trimming the data in a chain to a
`particular size.
`
`Normally, clusters are fixed-size blocks in an area of memory reserved by the operating
`system. We enhanced the mbuf-handling code to treat blocks of Afterburner's VRAM buffer
`as clusters. Code is able to tell normal clusters from single-copy ones.
`
`Single-copy clusters carry additional information, for example, the checksum of the data in
`the cluster. However, the main difference in handling them is that, in general, code should
`not try to change their size nor move their contents. Either of these operations would imply
`having to make an extra pass over the data, either to copy it, or to recalculate a checksum.
`
`9
`
`
`
`WISTRON CORP. EXHIBIT 1027.010
`
`
`
`Afterburner: Architectural Support for High-performance Protocols
`
`This characteristic also has an effect on the behaviour of the protocol which we discuss
`further below.
`
`We have had to alter several of the mbuf functions in the kernel to take these differences
`into account. To support the use of the on-card memory as clusters, we have written a small
`number of functions. The most important is a special copy routine, functionally equivalent
to the BSD function bcopy. It is optimised for moving data over the I/O bus, and also
`optionally uses the card's built-in unit to calculate the IP checksum of the data it moves.
`Another function converts a single-copy cluster into a chain of normal clusters and mbufs; it
`also calculates the checksum.
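
The paper does not list this routine, but a portable software equivalent of a copy that folds in the Internet checksum can be sketched as follows. The real routine is tuned for the I/O bus and can use the card's checksum hardware instead; the version below merely shows the idea and assumes an even length and 16-bit alignment.

    /* Rough software sketch of a "copy and checksum" routine in the spirit of
     * the one described above: functionally a bcopy() that also returns the
     * Internet checksum of the bytes moved.  Assumes 'len' is even and the
     * buffers are 16-bit aligned; names are ours. */
    #include <stdint.h>
    #include <stddef.h>

    uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
    {
        const uint16_t *s = (const uint16_t *)src;
        uint16_t       *d = (uint16_t *)dst;
        uint32_t        sum = 0;

        for (size_t i = 0; i < len / 2; i++) {
            d[i] = s[i];                /* the single data copy              */
            sum += s[i];                /* one's-complement accumulation     */
        }
        while (sum >> 16)               /* fold carries back into 16 bits    */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;           /* caller complements the total when
                                           it builds the header checksum     */
    }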
`
`Sending data
`
`As before, Producer has already established a socket connection with a program on another
machine, and calls the socket send function to transmit its data.
`
`The socket layer - send Send decides, from information kept about the connection,
`to follow the single-copy course. It therefore obtains a single-copy cluster and copies the
`data from Producer's buffer into it, leaving just enough room at the beginning of the cluster
`for headers from the protocol functions. The amount of space needed is a property of the
`connection and is fixed when the connection is established. It depends on the TCP and IP
options in use3. The copy also calculates the checksum of the data, and send caches this in
`the cluster along with the length of the data and its position in the stream being sent.
`
`The data in the single-copy cluster are now physically on the interface card. Logically,
`however, the cluster is still in the the send socket buffer - the queue of data waiting to be
`transmitted on this connection. Often, the last mbuf in the queue is a single-copy cluster
`with some room in it, so, when possible, send tries to fill the cluster at the end of the socket
`buffer before obtaining a new one.
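
A rough sketch of this part of send is given below. The structure and helper names are ours (the real code manipulates BSD mbuf chains), and the refinement of topping up a partly filled cluster is left out.

    /* Sketch of the single-copy send path described above (structure and
     * function names are ours; the real code works on BSD mbuf chains). */
    #include <stdint.h>
    #include <stddef.h>

    struct sc_cluster {
        uint8_t *vram;       /* Afterburner VRAM block backing this cluster     */
        size_t   hdr_room;   /* space left at the front for TCP/IP/link headers;
                                fixed per connection when it is established     */
        size_t   len;        /* bytes of user data in the cluster               */
        uint32_t seq;        /* position of the data in the outgoing stream     */
        uint16_t cksum;      /* checksum of the data, cached for tcp_output     */
    };

    extern struct sc_cluster *sc_cluster_get(size_t hdr_room);
    extern uint16_t copy_and_checksum(void *dst, const void *src, size_t len);
    extern void sockbuf_append(struct sc_cluster *c);

    /* Single-copy portion of send(): move the user's data straight onto the
     * card and remember everything tcp_output will need later.  (The real
     * code first tries to fill the last, partly used cluster in the socket
     * buffer; that refinement is omitted here.) */
    void sc_send(const void *user_buf, size_t n, size_t hdr_room, uint32_t seq)
    {
        struct sc_cluster *c = sc_cluster_get(hdr_room);

        c->cksum = copy_and_checksum(c->vram + hdr_room, user_buf, n);
        c->len   = n;
        c->seq   = seq;
        sockbuf_append(c);            /* logically still in the socket buffer */
    }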
`
Tcp_output
`In general, the send socket buffer is a mixture of normal mbufs and both
`normal and single-copy clusters. To build a packet, tcp_output assembles a new chain of
`mbufs that are either copies of mbufs in the socket buffer or references to clusters there. It
`also ensures that the packet's data is either in normal mbufs and clusters or in a single-copy
`cluster - never both.
`
Tcp_output sets the size of normal packets based on its information about the connection,
`and collects only as much as it needs from the socket buffer. Conversely, it treats single-copy
clusters as indivisible units, and sets the size of the packet to be that of the cluster4.
`
3 In future, using Afterburner's ability to form packets from groups of VRAM blocks will remove the need to leave this space.
4 This can have undesirable effects, which are discussed under "Issues".
`
`10
`
`
`
`WISTRON CORP. EXHIBIT 1027.011
`
`
`
`Afterburner: Architectural Support for High-performance Protocols
`
`As in the normal stack, tcp_output builds the header in a separate (normal) mbuf, and prefixes
`it to the single-copy cluster. An important part of the header is the packet checksum, which
covers both data and header. With normal packets, tcp_output reads all the data in the
`packet to calculate the checksum. A single-copy cluster already contains the data's checksum,
`so the calculation involves only the header and some simple arithmetic.
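
The "simple arithmetic" is one's-complement addition. A sketch, with assumed helper names, of how tcp_output might fold the cached data checksum into the header checksum:

    /* Sketch of folding a cached data checksum into the header checksum with
     * one's-complement addition (names are ours). */
    #include <stdint.h>
    #include <stddef.h>

    extern uint16_t cksum_partial(const void *buf, size_t len);  /* folded sum,
                                                                    not complemented */

    static uint16_t add_1s_complement(uint32_t a, uint32_t b)
    {
        uint32_t sum = a + b;
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
    }

    /* TCP checksum for a single-copy packet: only the pseudo-header and the
     * TCP header are summed now; the data's contribution was computed when
     * send() copied it onto the card.  (The data start at an even offset in
     * the segment, so the cached sum can be added directly.) */
    uint16_t tcp_cksum_single_copy(const void *pseudo_hdr, size_t ph_len,
                                   const void *tcp_hdr, size_t th_len,
                                   uint16_t cached_data_sum)
    {
        uint16_t sum = add_1s_complement(cksum_partial(pseudo_hdr, ph_len),
                                         cksum_partial(tcp_hdr, th_len));
        sum = add_1s_complement(sum, cached_data_sum);
        return (uint16_t)~sum;        /* value placed in the checksum field */
    }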
`
When the TCP header is complete, tcp_output passes the packet to the ip_output and link-
`layer functions. These are the same for both normal and single-copy packets, so we shall not
`discuss them here.
`
`The device driver Here, the packet is sent to the network. With single-copy chains, all
`the driver has to do to complete the packet is to copy the various protocol headers at the
`beginning of the chain into the space the send function left at the beginning of the single-copy
`cluster. With chains of normal mbufs, the driver copies the contents of the mbuf chain onto
`the card, using a VRAM block from a small pool reserved for the driver. When ready, the
`driver constructs the descriptor for the packet and writes it to the Tx_Ready FIFO.
`
`Receiving data
`
`Receiving packets is more complicated than sending them. The sender knows everything
`about an outgoing packet except whether it will arrive safely, and it can reasonably expect
`its information to be accurate. The receiver, on the other hand, has to work everything
`out from the contents of the packet, and - until it knows better - it has to assume that
`information may be wrong or incomplete.
`
`The device driver To decide whether the packet should take the single-copy or normal
`route, the driver examines the incoming packet to discover its protocol type, the length of
`the packet, and the length of its headers. There are four cases:
`
`Non-IP packets. The driver copies the entire packet into a chain of normal mbufs.
`
`Small IP packets (less than 100 bytes). The driver creates a chain of two normal
mbufs: the first contains the link header; the second contains the whole IP packet.
`
Large TCP/IP packets. The driver creates a chain of three mbufs. The first two
`are normal, and contain the headers. The third is a single-copy cluster - the VRAM
`block containing the packet.
`
All other IP packets. The driver creates a chain of normal mbufs. The first contains
`the link header, the second, the IP and other headers. The remainder contain data.
`
`Small packets are treated specially for several reasons. Many such packets have only one or
two bytes of payload, e.g., single characters being typed or echoed during a remote login. It's quicker to process these packets as one mbuf in the conventional stack than it is to process
a single-copy chain. Also, copying in the data immediately frees the VRAM block for re-
`use. Because buffers on the card are a relatively scarce resource, this is important when the
`receiving application is very slow or when the transmitter is sending a rapid stream of short
`messages.
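
The four-way choice above can be summarised in a few lines of code. The sketch below is illustrative only; the threshold and the helper arguments stand in for the header inspection the real driver performs on the incoming packet.

    /* Sketch of the driver's four-way choice for incoming packets (argument
     * names are illustrative; the real code inspects the packet headers). */
    #include <stddef.h>

    enum rx_route {
        RX_NON_IP,        /* copy whole packet into normal mbufs              */
        RX_SMALL_IP,      /* < 100 bytes: link-header mbuf + whole IP packet  */
        RX_LARGE_TCP,     /* header mbufs + the VRAM block as a single-copy
                             cluster: the data stay on the card               */
        RX_OTHER_IP       /* normal mbuf chain: link hdr, IP hdrs, then data  */
    };

    enum rx_route classify(int is_ip, int is_tcp, size_t ip_len)
    {
        if (!is_ip)
            return RX_NON_IP;
        if (ip_len < 100)
            return RX_SMALL_IP;       /* frees the scarce VRAM block at once */
        if (is_tcp)
            return RX_LARGE_TCP;
        return RX_OTHER_IP;
    }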
`
`Tcp_input The first thing tcp_input normally does with an incoming packet is to calculate
`its checksum and compare it with the one in the TCP header. This checks the integrity of
`both header and data. It is possible, however, to defer the checksum calculation until later.
`The important thing is to ensure that an erroneous packet doesn't cause tcp_input to change
`the state of some connection.
`
`When it receives a single-copy packet, tcp_input checks the header for three things: that the
`packet is for an established connection; that the packet simply contains data, not control
`information that would change the state of the connection; and that the data in the packet
are the next in sequence on the connection. Tcp_input converts any packet that fails one of
`these tests into normal mbufs, calculating the checksum in the process, then processes it as
`usual.
`
`A single-copy packet that passed the tests is easy for tcp_input to handle. It calculates the
`checksum of the packet's TCP header, and stores it and a small amount of information from
`the header in the cluster, then appends the cluster to the appropriate receive socket buffer.
`Even if one is eventually found to be in error, it won't have changed the connection state.
`
These tests are slight extensions of ones already in the conventional stack. Tcp_input imple-
`ments a feature called "header prediction" that tests most fields in the TCP header against
a set of expected values. Packets which match can be processed quickly; all others,
`including those that alter the state of the connection or that require special processing, take
`a slower route. In typical stream connections, the only packets needing special treatment
`are those which establish or close the connection.
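
A compact way to express the three tests, in the spirit of the header-prediction check (structure, field and function names are illustrative):

    /* Sketch of the three single-copy tests described above. */
    #include <stdint.h>

    enum { TH_FIN = 0x01, TH_SYN = 0x02, TH_RST = 0x04,
           TH_PUSH = 0x08, TH_ACK = 0x10, TH_URG = 0x20 };

    struct tcp_pkt_view {
        uint32_t seq;        /* sequence number of the first data byte      */
        uint8_t  flags;      /* TCP flag bits                               */
        uint32_t data_len;   /* payload length                              */
    };

    struct tcp_conn {
        int      established;
        uint32_t rcv_nxt;    /* next sequence number expected               */
    };

    /* A packet stays on the single-copy path only if it is in-sequence data
     * for an established connection and carries no state-changing control
     * flags; anything else is converted to normal mbufs (computing its
     * checksum on the way) and handed to the ordinary slow path. */
    int single_copy_ok(const struct tcp_conn *tc, const struct tcp_pkt_view *p)
    {
        return tc->established
            && (p->flags & (TH_SYN | TH_FIN | TH_RST | TH_URG)) == 0
            && p->data_len > 0
            && p->seq == tc->rcv_nxt;
    }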
`
`The socket layer - recv This is the most intricate area of the single-copy code. As well
`as copying data from the socket buffer into a buffer in the program, recv has to verify the
`data are correct, acknowledge data, manage data the program has not yet asked for, and
`keep the information about the state of the connection up to date. To complicate matters,
`the receive socket buffer is a mixture of normal and single-copy mbufs.
`
`In the simplest case, the socket buffer contains one single-copy cluster, and the program asks
`recv for as much data as it can provide. Recv copies all the data from the cluster to the
`program's buffer, calculating the checksum as it does so. It then compares the result with
the checksum tcp_input placed in the cluster header. If the two match, it removes the cluster from the socket buffer and updates the socket and TCP control information. This causes TCP to acknowledge the new data in due course. Should the checksum test fail, recv restores the buffer as far as possible to its original condition