`
`AND SYSTEM FOR PROTOCOL PROCESSING
`
Provisional Patent Application Under 35 U.S.C. § 111(b)
`
Inventors:
`
`Laurence B. Boucher
`Stephen E. J. Blightman
`Peter K. Craft
`David A. Higgin
`Clive M. Philbrick
`Daryl D. Starr
`
Assignee:
`
`Alacritech Corporation
`
1 Background of the Invention
`
`Network processing as it exists today is a costly and inefficient use of system resources.
`A 200 MHz Pentium-Pro is typically consumed simply processing network data from a
100Mb/second network connection. The reasons that this processing is so costly are
`described here.
`
1.1 Too Many Data Moves
`
When a network packet arrives at a typical network interface card (NIC), the NIC moves
the data into pre-allocated network buffers in system main memory. From there the data
`is read into the CPU cache so that it can be checksummed (assuming of course that the
`protocol in use requires checksums. Some, like IPX, do not.). Once the data has been
`fully processed by the protocol stack, it can then be moved into its final destination in
`memory. Since the CPU is moving the data, and must read the destination cache line in
`before it can fill it and write it back out, this involves at a minimum 2 more trips across
`the system memory bus. In short, the best one can hope for is that the data will get
`moved across the system memory bus 4 times before it arrives in its final destination. It
`can, and does, get worse. If the data happens to get invalidated from system cache after it
`has been checksummed, then it must get pulled back across the memory bus before it can
`be moved to its final destination. Finally, on some systems, including Windows NT 4.0,
`the data gets copied yet another time while being moved up the protocol stack. In NT
`4.0, this occurs between the miniport driver interface and the protocol driver interface.
`This can add up to a whopping 8 trips across the system memory bus (the 4 trips
`described above, plus the move to replenish the cache, plus 3 more to copy from the
`miniport to the protocol driver). That's enough to bring even today's advanced memory
`busses to their knees.
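
The trip counting above can be checked with a small sketch (the function and its parameters are our own illustration, not part of any NIC design):

```c
#include <assert.h>

/* Illustrative tally of system-memory-bus crossings for one received
 * frame, following the counts in the text: the NIC's DMA write (1),
 * the CPU's read for checksumming (2), the read of the destination
 * cache line plus the write-back (3 and 4), optionally one re-read if
 * the data was invalidated from cache after checksumming, and the
 * NT 4.0 miniport-to-protocol-driver copy (3 more). */
static int bus_trips(int invalidated_after_checksum, int nt40_extra_copy)
{
    int trips = 4;                      /* best case described above  */
    if (invalidated_after_checksum)
        trips += 1;                     /* replenish the cache        */
    if (nt40_extra_copy)
        trips += 3;                     /* copy between driver layers */
    return trips;
}
```

The best case yields the 4 trips described above; with both penalties, the total reaches the whopping 8 trips.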
`
`Provisional Pat. App. of Alacritech, Inc.
`Inventors Laurence B. Boucher et al.
`Express Mail Label # EH756230105US
`
`Alacritech, Ex. 2023.001
`
`
`
1.2 Too Much Processing by the CPU
`
`In all but the original move from the NIC to system memory, the system CPU is
`responsible for moving the data. This is particularly expensive because while the CPU is
`moving this data it can do nothing else. While moving the data the CPU is typically
`stalled waiting for the relatively slow memory to satisfy its read and write requests. A
`CPU, which can execute an instruction every 5 nanoseconds, must now wait as long as
`several hundred nanoseconds for the memory controller to respond before it can begin its
`next instruction. Even today's advanced pipelining technology doesn't help in these
`situations because that relies on the CPU being able to do useful work while it waits for
`the memory controller to respond. If the only thing the CPU has to look forward to for
`the next several hundred instructions is more data moves, then the CPU ultimately gets
`reduced to the speed of the memory controller.
`
`Moving all this data with the CPU slows the system down even after the data has been
`moved. Since both the source and destination cache lines must be pulled into the CPU
cache when the data is moved, more than 3k of instructions and/or data resident in the
`CPU cache must be flushed or invalidated for every 1500 byte frame. This is of course
`assuming a combined instruction and data second level cache, as is the case with the
Pentium processors. After the data has been moved, the former residents of the cache will
`likely need to be pulled back in, stalling the CPU even when we are not performing
`network processing. Ideally a system would never have to bring network frames into the
`CPU cache, instead reserving that precious commodity for instructions and data that are
`referenced repeatedly and frequently.
`
`But the data movement is not the only drain on the CPU. There is also a fair amount of
`processing that must be done by the protocol stack software. The most obvious expense
`is calculating the checksum for each TCP segment (or UDP datagram). Beyond this,
`however, there is other processing to be done as well. The TCP connection object must
`be located when a given TCP segment arrives, IP header checksums must be calculated,
`there are buffer and memory management issues, and finally there is also the significant
`expense of interrupt processing which we will discuss in the following section.
`
`5
`
1.3 Too Many Interrupts
`
`A 64k SMB request (write or read-reply) is typically made up of 44 TCP segments when
`running over Ethernet (1500 byte MTU). Each of these segments may result in an
`interrupt to the CPU. Furthermore, since TCP must acknowledge all of this incoming
`data, it's possible to get another 44 transmit-complete interrupts as a result of sending out
`the TCP acknowledgements. While this is possible, it is not terribly likely. Delayed
`ACK timers allow us to acknowledge more than one segment at a time. And delays in
`interrupt processing may mean that we are able to process more than one incoming
`network frame per interrupt. Nevertheless, even if we assume 4 incoming frames per
interrupt, and an acknowledgement for every 2 segments (as is typical per the ACK-every-
`other-segment property of TCP), we are still left with 33 interrupts per 64k SMB request.
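
The interrupt arithmetic above works out as follows (a sketch with names of our own choosing):

```c
#include <assert.h>

/* Estimate interrupts for one large SMB request, per the text: receive
 * interrupts amortized over several frames per interrupt, plus one
 * transmit-complete interrupt per ACK sent (one ACK for every
 * segments_per_ack incoming segments). */
static int interrupts_per_request(int segments, int frames_per_interrupt,
                                  int segments_per_ack)
{
    int rx = (segments + frames_per_interrupt - 1) / frames_per_interrupt;
    int ack_complete = segments / segments_per_ack;
    return rx + ack_complete;
}
```

With 44 segments, 4 frames per interrupt, and an ACK every other segment, this gives 11 receive interrupts plus 22 transmit-complete interrupts: the 33 interrupts cited above.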
`
`Interrupts tend to be very costly to the system. Often when a system is interrupted,
`important information must be flushed or invalidated from the system cache so that the
interrupt routine instructions and needed data can be pulled into the cache. Since the
CPU will return to its prior location after the interrupt, it is likely that the information
flushed from the cache will immediately need to be pulled back into the cache.
`
`What's more, interrupts force a pipeline flush in today's advanced processors. While the
`processor pipeline is an extremely efficient way of improving CPU performance, it can
`be expensive to get going after it has been flushed.
`
`Finally, each of these interrupts results in expensive register accesses across the
`peripheral bus (PCI). This is discussed more in the following section.
`
1.4 Inefficient Use of the Peripheral Bus (PCI)
`
`We noted earlier that when the CPU has to access system memory, it may be stalled for
`several hundred nanoseconds. When it has to read from PCI, it may be stalled for many
`microseconds. This happens every time the CPU takes an interrupt from a standard NIC.
`The first thing the CPU must do when it receives one of these interrupts is to read the
NIC Interrupt Status Register (ISR) from PCI to determine the cause of the interrupt. The
`most troubling thing about this is that since interrupt lines are shared on PC-based
`systems, we may have to perform this expensive PCI read even when the interrupt is not
`meant for us!
`
There are other peripheral bus inefficiencies as well. Typical NICs operate using
descriptor rings. When a frame arrives, the NIC reads a receive descriptor from system
memory to determine where to place the data. Once the data has been moved to main
memory, the descriptor is then written back out to system memory with status about the
received frame. Transmit operates in a similar fashion. The CPU must notify the NIC
that it has a new transmit. The NIC will read the descriptor to locate the data, read the
data itself, and then write the descriptor back with status about the send. Typically on
transmits the NIC will then read the next expected descriptor to see if any more data
needs to be sent. In short, each receive or transmit frame results in 3 or 4 separate PCI
reads or writes (not counting the status register read).
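
The conventional descriptor-ring exchange just described might be modeled as follows (the structure layout is a generic sketch, not any particular NIC's format):

```c
#include <assert.h>
#include <stdint.h>

/* Generic receive descriptor: the NIC reads it from system memory to
 * find the buffer, then writes it back with length and status. */
struct rx_descriptor {
    uint64_t buffer_addr;   /* where the NIC should DMA the frame */
    uint16_t length;        /* filled in by the NIC on completion */
    uint16_t status;        /* done / error bits on write-back    */
};

/* PCI operations per frame, not counting the interrupt status register
 * read: a receive is descriptor read + data write + descriptor
 * write-back (3); a transmit adds a read of the next expected
 * descriptor (4). */
static int pci_ops_per_frame(int is_transmit)
{
    return is_transmit ? 4 : 3;
}
```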
`
2 Summary of the Invention
`
`Alacritech was formed with the idea that the network processing described above could
`be offloaded onto a cost-effective Intelligent Network Interface Card (INIC). With the
`Alacritech INIC, we address each of the above problems, resulting in the following
`advancements:
`1. The vast majority of the data is moved directly from the INIC into its final
`destination. A single trip across the system memory bus.
`2. There is no header processing, little data copying, and no checksumming required by
`the CPU. Because of this, the data is never moved into the CPU cache, allowing the
`system to keep important instructions and data resident in the CPU cache.
3. Interrupts are reduced to as few as 4 interrupts per 64k SMB read and 2 per 64k
`SMB write.
`4. There are no CPU reads over PCI and there are fewer PCI operations per receive or
`transmit transaction.
`
`In the remainder of this document we will describe how we accomplish the above.
`
`
2.1 Perform Transport Level Processing on the INIC
`
`In order to keep the system CPU from having to process the packet headers or checksum
`the packet, we must perform this task on the INIC. This is a daunting task. There are
`more than 20,000 lines of C code that make up the FreeBSD TCP/IP protocol stack.
`Clearly this is more code than could be efficiently handled by a competitively priced
`network card. Furthermore, as we've noted above, the TCP/IP protocol stack is
`complicated enough to consume a 200 MHz Pentium-Pro. Clearly in order to perform
`this function on an inexpensive card, we need special network processing hardware as
`opposed to simply using a general purpose CPU.
`
`2.1.1 Only Support TCP/IP
`
In this section we introduce the notion of a "context". A context is required to keep track
of state that spans many, possibly discontiguous, pieces of information. When
`processing TCP/IP data, there are actually two contexts that must be maintained. The
`first context is required to reassemble IP fragments. It holds information about the status
`of the IP reassembly as well as any checksum information being calculated across the IP
`datagram (UDP or TCP). This context is identified by the IP_ID of the datagram as well
`as the source and destination IP addresses. The second context is required to handle the
`sliding window protocol of TCP. It holds information about which segments have been
`sent or received, and which segments have been acknowledged, and is identified by the
`IP source and destination addresses and TCP source and destination ports.
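
The two contexts can be pictured as C structures (field names here are illustrative only; the actual context contents are defined in later sections):

```c
#include <assert.h>
#include <stdint.h>

/* First context: IP reassembly, identified by the IP_ID of the
 * datagram plus the source and destination IP addresses. */
struct ip_reassembly_ctx {
    uint32_t src_ip, dst_ip;
    uint16_t ip_id;
    uint32_t partial_checksum;   /* running checksum across fragments */
    uint32_t bytes_assembled;
};

/* Second context: the TCP sliding window, identified by the full
 * 4-tuple of IP addresses and TCP ports. */
struct tcp_ctx {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t snd_nxt, rcv_nxt;   /* next sequence to send / expect   */
    uint32_t snd_una;            /* oldest unacknowledged sequence   */
};

/* A segment belongs to a TCP context only if all four fields match. */
static int tcp_ctx_matches(const struct tcp_ctx *c, uint32_t sip,
                           uint32_t dip, uint16_t sp, uint16_t dp)
{
    return c->src_ip == sip && c->dst_ip == dip &&
           c->src_port == sp && c->dst_port == dp;
}
```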
`
`If we were to choose to handle both contexts in hardware, we would have to potentially
`keep track of many pieces of information. One such example is a case in which a single
64k SMB write is broken down into 44 1500-byte TCP segments, which are in turn
broken down into 131 576-byte IP fragments, all of which can come in any order (though
`the maximum window size is likely to restrict the number of outstanding segments
`considerably).
`
`Fortunately, TCP performs a Maximum Segment Size negotiation at connection
`establishment time, which should prevent IP fragmentation in nearly all TCP
`connections. The only time that we should end up with fragmented TCP connections is
`when there is a router in the middle of a connection which must fragment the segments to
`support a smaller MTU. The only networks that use a smaller MTU than Ethernet are
`serial line interfaces such as SLIP and PPP. At the moment, the fastest of these
`connections only run at 128k (ISDN) so even if we had 256 of these connections, we
would still only need to support 34Mb/sec, or a little over three 10bT connections' worth
`of data. This is not enough to justify any performance enhancements that the INIC
`offers. If this becomes an issue at some point, we may decide to implement the MTU
`discovery algorithm, which should prevent TCP fragmentation on all connections (unless
`an ICMP redirect changes the connection route while the connection is established).
`
`With this in mind, it seems a worthy sacrifice to not attempt to handle fragmented TCP
`segments on the INIC.
`
`UDP is another matter. Since UDP does not support the notion of a Maximum Segment
Size, it is the responsibility of IP to break down a UDP datagram into MTU-sized
packets. Thus, fragmented UDP datagrams are very common. The most common UDP
`application running today is NFSV2 over UDP. While this is also the most common
`version of NFS running today, the current version of Solaris being sold by Sun
`Microsystems runs NFSV3 over TCP by default. We can expect to see the NFSV2/UDP
`traffic start to decrease over the coming years.
`
`In summary, we will only offer assistance to non-fragmented TCP connections on the
`INIC.
`
`2.1.2 Don't handle TCP "exceptions"
`
`As noted above, we won't provide support for fragmented TCP segments on the INIC.
We have also opted to not handle TCP connection setup and breakdown. Here is a list of other
`TCP "exceptions" which we have elected to not handle on the INIC:
`
Fragmented Segments — Discussed above.
`
`Retransmission Timeout — Occurs when we do not get an acknowledgement for
`previously sent data within the expected time period.
`
`Out of order segments — Occurs when we receive a segment with a sequence number
`other than the next expected sequence number.
`
`FIN segment — Signals the close of the connection.
`
`Since we have now eliminated support for so many different code paths, it might seem
`hardly worth the trouble to provide any assistance by the card at all. This is not the case.
According to W. Richard Stevens and Gary Wright in their book "TCP/IP Illustrated
`Volume 2", TCP operates without experiencing any exceptions between 97 and 100
`percent of the time in local area networks. As network, router, and switch reliability
`improve this number is likely to only improve with time.
`
`
`
`
`2.1.3 Two modes of operation
`
`So the next question is what to do about the network packets that do not fit our criteria.
`The answer is to use two modes of operation: One in which the network frames are
`processed on the INIC through TCP and one in which the card operates like a typical
dumb NIC. We call these two modes fast-path and slow-path. In the slow-path case,
`network frames are handed to the system at the MAC layer and passed up through the
`host protocol stack like any other network frame. In the fast path case, network data is
`given to the host after the headers have been processed and stripped.
`
[Figure: INIC and host protocol stacks. The INIC column shows PHYSICAL, MAC, IP,
TCP, and NetBIOS layers connected to the Ethernet; the host column shows MAC, IP,
and TCP layers below the client's TDI interface. The fast-path crosses PCI above the
INIC's protocol layers directly to the client, while the slow-path crosses PCI at the MAC
layer and traverses the host protocol stack.]
`
The transmit case works in much the same fashion. In slow-path mode the packets are
given to the INIC with all of the headers attached. The INIC simply sends these packets
out as if it were a dumb NIC. In fast-path mode, the host gives raw data to the INIC,
which it must carve into MSS-sized segments, add headers to the data, perform
checksums on the segment, and then send it out on the wire.
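
The fast-path transmit carving reduces to a ceiling division (a sketch; 1460 is the conventional Ethernet TCP MSS, not a number stated in this section):

```c
#include <assert.h>

/* Number of MSS-sized segments the INIC must carve a raw send into. */
static unsigned segments_needed(unsigned bytes, unsigned mss)
{
    return (bytes + mss - 1) / mss;   /* ceiling division */
}
```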
`
`2.1.4 The TCB cache
`
`Consider a situation in which a TCP connection is being handled by the card and a
`fragmented TCP segment for that connection arrives. In this situation, it will be
`necessary for the card to turn control of this connection over to the host.
`
`This introduces the notion of a Transmit Control Block (TCB) cache. A TCB is a
`structure that contains the entire context associated with a connection. This includes the
`source and destination IP addresses and source and destination TCP ports that define the
`connection. It also contains information about the connection itself such as the current
`send and receive sequence numbers, and the first-hop MAC address, etc. The complete
`set of TCBs exists in host memory, but a subset of these may be "owned" by the card at
`any given time. This subset is the TCB cache. The INIC can own up to 256 TCBs at any
`given time.
`
`TCBs are initialized by the host during TCP connection setup. Once the connection has
`achieved a "steady-state" of operation, its associated TCB can then be turned over to the
`INIC, putting us into fast-path mode. From this point on, the INIC owns the connection
until either a FIN arrives signaling that the connection is being closed, or until an
exception occurs which the INIC is not designed to handle (such as an out of order
segment). When any of these conditions occurs, the INIC will then flush the TCB back to
`host memory, and issue a message to the host telling it that it has relinquished control of
`the connection, thus putting the connection back into slow-path mode. From this point
`on, the INIC simply hands incoming segments that are destined for this TCB off to the
`host with all of the headers intact.
`
`Note that when a connection is owned by the INIC, the host is not allowed to reference
`the corresponding TCB in host memory as it will contain invalid information about the
`state of the connection.
`
`2.1.5 TCP hardware assistance
`
`When a frame is received by the INIC, it must verify it completely before it even
`determines whether it belongs to one of its TCBs or not. This includes all header
`validation (is it IP, IPV4 or V6, is the IP header checksum correct, is the TCP checksum
`correct, etc). Once this is done it must compare the source and destination IP address and
`the source and destination TCP port with those in each of its TCBs to determine if it is
`associated with one of its TCBs. This is an expensive process. To expedite this, we have
`added several features in hardware to assist us. The header is fully parsed by hardware
`and its type is summarized in a single status word. The checksum is also verified
`automatically in hardware, and a hash key is created out of the IP addresses and TCP
`ports to expedite TCB lookup. For full details on these and other hardware optimizations,
`refer to the INIC Hardware Specification sections (Heading 8).
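
The hash-assisted TCB lookup might look like the following sketch (the mixing function here is our own illustration; the actual hash is specified in the hardware sections):

```c
#include <assert.h>
#include <stdint.h>

/* Fold the 4-tuple into a single byte so that a 256-entry TCB cache
 * can be probed without comparing every TCB's addresses and ports. */
static uint8_t tcb_hash(uint32_t src_ip, uint32_t dst_ip,
                        uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip ^ dst_ip;
    h ^= ((uint32_t)src_port << 16) | dst_port;
    h ^= h >> 16;                 /* fold upper bits down */
    h ^= h >> 8;
    return (uint8_t)h;            /* 0..255: one slot per cached TCB */
}
```

A hash only narrows the search; the full 4-tuple comparison above is still required to confirm the match.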
`
`With the aid of these and other hardware features, much of the work associated with TCP
`is done essentially for free. Since the card will automatically calculate the checksum for
`TCP segments, we can pass this on to the host, even when the segment is for a TCB that
`the INIC does not own.
`
`2.1.6 TCP Summary
`
By moving TCP processing down to the INIC we have offloaded a large amount of
work from the host. The host no longer has to pull the data into its cache to calculate the
`TCP checksum. It does not have to process the packet headers, and it does not have to
`generate TCP ACKs. We have achieved most of the goals outlined above, but we are not
`done yet.
`
`2.2 (cid:9) Transport Layer Interface
`
This section defines the INIC's relation to the host's transport layer interface (called TDI
`or Transport Driver Interface in Windows NT). For full details on this interface, refer to
`the Alacritech TCP (ATCP) driver specification (Heading 4).
`
`2.2.1 Receive
`
`Simply implementing TCP on the INIC does not allow us to achieve our goal of landing
`the data in its final destination. Somehow the host has to tell the INIC where to put the
data. This is a problem in that the host cannot do this without knowing what the data
actually is. Fortunately, NT has provided a mechanism by which a transport driver can
`"indicate" a small amount of data to a client above it while telling it that it has more data
`to come. The client, having then received enough of the data to know what it is, is then
`responsible for allocating a block of memory and passing the memory address or
`addresses back down to the transport driver, which is in turn responsible for moving the
`data into the provided location.
`
`We will make use of this feature by providing a small amount of any received data to the
`host, with a notification that we have more data pending. When this small amount of data
`is passed up to the client, and it returns with the address in which to put the remainder of
`the data, our host transport driver will pass that address to the INIC which will DMA the
`remainder of the data into its final destination.
`
Clearly there are circumstances in which this does not make sense. When a small amount
of data (500 bytes for example) arrives with a push flag set, indicating that the data must
be delivered to the client immediately, it does not make sense to deliver some of the data
`directly while waiting for the list of addresses to DMA the rest. Under these
`circumstances, it makes more sense to deliver the 500 bytes directly to the host, and
`allow the host to copy it into its final destination. While various ranges are feasible, it is
`currently preferred that anything less than a segment's (1500 bytes) worth of data will be
delivered directly to the host, while anything more will be delivered as a small piece,
which may be 128 bytes, while waiting to receive the destination memory address
before moving the rest.
`
The trick, then, is knowing when the data should be delivered to the client or not. As
`we've noted, a push flag indicates that the data should be delivered to the client
`immediately, but this alone is not sufficient. Fortunately, in the case of NetBIOS
`transactions (such as SMB), we are explicitly told the length of the session message in the
`NetBIOS header itself. With this we can simply indicate a small amount of data to the
`host immediately upon receiving the first segment. The client will then allocate enough
`memory for the entire NetBIOS transaction, which we can then use to DMA the
`remainder of the data into as it arrives. In the case of a large (56k for example) NetBIOS
`session message, all but the first couple hundred bytes will be DMA'd to their final
`destination in memory.
`
`But what about applications that do not reside above NetBIOS? In this case we can not
rely on a session level protocol to tell us the length of the transaction. Under these
circumstances we will buffer the data as it arrives until A) we have received some
predetermined number of bytes such as 8k, B) some predetermined period of time
passes between segments, or C) we get a push flag. When any of these conditions occurs,
we will indicate some or all of the data to the host depending on the amount of data
buffered. If the data buffered is greater than about 1500 bytes we must then also wait for
`the memory address to be returned from the host so that we may then DMA the
`remainder of the data.
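
The three buffering conditions can be sketched as a decision function (the 50 ms timeout value is an assumption for illustration; the text leaves the period unspecified):

```c
#include <assert.h>

enum flush_reason { NO_FLUSH, BYTES_THRESHOLD, SEGMENT_TIMEOUT, PUSH_FLAG };

/* Decide whether buffered non-NetBIOS receive data should be
 * indicated to the host, per conditions A, B, and C in the text. */
static enum flush_reason should_indicate(unsigned bytes_buffered,
                                         unsigned ms_since_last_segment,
                                         int push_flag_seen)
{
    if (push_flag_seen)                     /* condition C            */
        return PUSH_FLAG;
    if (bytes_buffered >= 8 * 1024)         /* condition A: e.g. 8k   */
        return BYTES_THRESHOLD;
    if (ms_since_last_segment >= 50)        /* condition B: assumed   */
        return SEGMENT_TIMEOUT;
    return NO_FLUSH;
}
```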
`
`2.2.2 Transmit
`
`The transmit case is much simpler. In this case the client (NetBIOS for example) issues a
`TDI Send with a list of memory addresses which contain data that it wishes to send along
with the length. The host can then pass this list of addresses and length off to the INIC.
`The INIC will then pull the data from its source location in host memory, as it needs it,
`until the complete TDI request is satisfied.
`
2.2.3 Effect on interrupts
`
Note that when we receive a large SMB transaction, for example, there are two
interactions between the INIC and the host: the first, in which the INIC indicates a small
amount of the transaction to the host, and the second, in which the host provides the
memory location(s) in which the INIC places the remainder of the data. This results in
only two interrupts from the INIC: the first when it indicates the small amount of data,
and the second after it has finished filling in the host memory given to it. This is a drastic
reduction from the 33 interrupts per 64k SMB request that we estimated at the beginning
of this section.
`
`On transmit, we actually only receive a single interrupt when the send command that has
`been given to the INIC completes.
`
`2.2.4 Transport Layer Interface Summary
`
`Having now established our interaction with Microsoft's TDI interface, we have achieved
`our goal of landing most of our data directly into its final destination in host memory.
We have also managed to transmit all data from its original location in host memory.
`And finally, we have reduced our interrupts to 2 per 64k SMB read and 1 per 64k SMB
`write. The only thing that remains in our list of objectives is to design an efficient host
`(PCI) interface.
`
2.3 Host (PCI) Interface
`
`In this section we define the host interface. For a more detailed description, refer to the
`"Host Interface Strategy for the Alacritech INIC" section (Heading 3).
`
`2.3.1 Avoid PCI reads
`
`One of our primary objectives in designing the host interface of the INIC was to
`eliminate PCI reads in either direction. PCI reads are particularly inefficient in that they
`completely stall the reader until the transaction completes. As we noted above, this could
`hold a CPU up for several microseconds, a thousand times the time typically required to
`execute a single instruction. PCI writes on the other hand, are usually buffered by the
`memory-bus<=>PCI-bridge allowing the writer to continue on with other instructions.
`This technique is known as "posting".
`
`2.3.1.1 Memory-based status register
`
`The only PCI read that is required by most NICs is the read of the interrupt status
`register. This register gives the host CPU information about what event has caused an
`interrupt (if any). In the design of our INIC we have elected to place this necessary status
`register into host memory. Thus, when an event occurs on the INIC, it writes the status
`register to an agreed upon location in host memory. The corresponding driver on the host
reads this local register to determine the cause of the interrupt. The interrupt lines are
held high until the host clears the interrupt by writing to the INIC's Interrupt Clear
Register. Shadow registers are maintained on the INIC to ensure that events are not lost.
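
The memory-based status register scheme can be sketched as follows (the variable and register names are hypothetical placeholders of our own):

```c
#include <assert.h>
#include <stdint.h>

/* The INIC DMA-writes event bits into this agreed-upon word in host
 * memory, so reading the ISR is a local memory access rather than a
 * stalling PCI read. */
static volatile uint32_t host_isr;

/* Host interrupt handler sketch: read the memory-resident status,
 * then clear the interrupt with a single posted PCI write (shown as a
 * comment because the register interface is hypothetical here). */
static uint32_t service_interrupt(void)
{
    uint32_t events = host_isr;   /* cheap: host memory, not PCI */
    host_isr = 0;
    /* write_pci(INIC_INTERRUPT_CLEAR_REG, events); -- posted write */
    return events;
}
```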
`
`2.3.1.2 Buffer Addresses are pushed to the INIC
`
`Since it is imperative that our INIC operate as efficiently as possible, we must also avoid
`PCI reads from the INIC. We do this by pushing our receive buffer addresses to the
`INIC. As mentioned at the beginning of this section, most NICs work on a descriptor
`queue algorithm in which the NIC reads a descriptor from main memory in order to
`determine where to place the next frame. We will instead write receive buffer addresses
`to the INIC as receive buffers are filled. In order to avoid having to write to the INIC for
every receive frame, we instead allow the host to pass off a page's worth (4k) of buffers in
`a single write.
`
`2.3.2 Support small and large buffers on receive
`
`In order to reduce further the number of writes to the INIC, and to reduce the amount of
`memory being used by the host, we support two different buffer sizes. A small buffer
`contains roughly 200 bytes of data payload, as well as extra fields containing status about
`the received data bringing the total size to 256 bytes. We can therefore pass 16 of these
`small buffers at a time to the INIC. Large buffers are 2k in size. They are used to
`contain any fast or slow-path data that does not fit in a small buffer. Note that when we
`have a large fast-path receive, a small buffer will be used to indicate a small piece of the
`data, while the remainder of the data will be DMA'd directly into memory. Large
`buffers are never passed to the host by themselves, instead they are always accompanied
`by a small buffer which contains status about the receive along with the large buffer
address. By operating in this manner, the driver must only maintain and process the small
`buffer queue. Large buffers are returned to the host by virtue of being attached to small
`buffers. Since large buffers are 2k in size they are passed to the INIC 2 buffers at a time.
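
The buffer arithmetic in this subsection works out as follows (a simple check of the stated sizes):

```c
#include <assert.h>

#define PAGE_BYTES      4096
#define SMALL_BUF_BYTES  256   /* ~200 bytes payload plus status fields */
#define LARGE_BUF_BYTES 2048

/* How many receive buffers the host can hand to the INIC per write:
 * 16 small buffers per page, or 2 large buffers at a time. */
static int buffers_per_page(int buf_bytes)
{
    return PAGE_BYTES / buf_bytes;
}
```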
`
`2.3.3 Command and response buffers
`
`In addition to needing a manner by which the INIC can pass incoming data to us, we also
`need a manner by which we can instruct the INIC to send data. Plus, when the INIC
`indicates a small amount of data in a large fast-path receive, we need a method of passing
`back the address or addresses in which to put the remainder of the data. We accomplish
`both of these with the use of a command buffer. Sadly, the command buffer is the only
`place in which we must violate our rule of only pushing data across PCI. For the
command buffer, we write the address of the command buffer to the INIC. The INIC then
`reads the contents of the command buffer into its memory so that it can execute the
`desired command. Since a command may take a relatively long time to complete, it is
`unlikely that command buffers will complete in order. For this reason we also maintain a
response buffer queue. Like the small and large receive buffers, a page's worth of response
buffers is passed to the INIC at a time. Response buffers are only 32 bytes, so we have to
replenish the INIC's supply of them relatively infrequently. The response buffer's only
purpose is to indicate the completion of the designated command buffer, and to pass
`status about the completion.
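
A 32-byte response buffer might be laid out as follows (the field names and padding split are assumptions; only the total size and purpose come from the text):

```c
#include <assert.h>
#include <stdint.h>

/* Response buffer sketch: identifies the completed command buffer and
 * carries its completion status, padded to the stated 32 bytes. */
struct inic_response {
    uint32_t command_id;     /* which command buffer completed */
    uint32_t status;         /* completion status              */
    uint8_t  reserved[24];   /* pad to 32 bytes                */
};
```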
`
`
`2.4 Examples
`
`In this section we will provide a couple of examples describing some of the differing data
`flows that we might see on the Alacritech INIC.
`
`2.4.1 Fast-path 56k NetBIOS session message
`
`Let's say a 56k NetBIOS session message is received on the INIC. The first segment will
`contain the NetBIOS header, which contains the total NetBIOS length. A small chunk of
`this first segment is provided to the host by filling in a small receive buffer, modifying
`the interrupt status register on the host, and raising the appropriate interrupt line. Upon
`receiving the interrupt, the host will read the ISR, clear it by writing back to the INIC's
`Interrupt Clear Register, and will then process its small receive buffer queue looking for
`receive buffers to be processed. Upon finding the small buffer, it will indicate the small
`amount of data up to the client to be processed by NetBIOS. It will also, if necessary,
replenish the receive buffer pool on the INIC by passing off a page's worth of small
`buffers. Meanwhile, the NetBIOS client will allocate a memory pool large enough to
`hold the entire NetBIOS message, and will pass this address or set of addresses down to
`the transport driver. The transport driver will allocate an INIC command buffer, fill it in
`with the list of addresses, set the command type to tell the INIC that this is where to put
`the receive data, and then pass the command off to the INIC by writing to the command
`register. When the INIC receives the command buffer, it will DMA the remainder of the
`NetBIOS data, as it is received, into the memory address or addresses designated by the
`host. Once the entire NetBIOS transaction is complete, the INIC will complete the
`command by writing to the response buffer with the appropriate status and command
`buffer identifier.
`
`In this example, we have two interrupts, and all but a couple hundred bytes are DMA'd
`directly to their final destination. On PCI we have two interrupt status register writes,
`two interrupt clear register writes, a command register write, a command read, and a
`response buffer write.
`
`With a standard NIC this would result in an estimated 30 interrupts, 30 interrupt register
`reads, 30 interrupt clear writes, and 58 descriptor reads and writes. Plus the data will get
`moved anywhere from 4 to 8 times across the system memory bus.
`
`2.4.2 Slow-path receive
`
If the INIC receives a frame that does not contain a TCP segment for one of its TCBs, it
`simply passes it to the host as if it were a dumb NIC.