`
`United States Patent and Trademark Office
`
`November 03, 2004
`
THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM
THE RECORDS OF THE UNITED STATES PATENT AND TRADEMARK
OFFICE OF THOSE PAPERS OF THE BELOW IDENTIFIED PATENT
APPLICATION THAT MET THE REQUIREMENTS TO BE GRANTED A
FILING DATE UNDER 35 USC 111.
`
APPLICATION NUMBER: 60/061,809
FILING DATE: October 14, 1997
`
`By Authority of the
COMMISSIONER OF PATENTS AND TRADEMARKS
`
`Certifying Officer
`
`
`
`
`
`
`
`
`
`
INTELLIGENT NETWORK INTERFACE CARD
`
`AND SYSTEM FOR PROTOCOL PROCESSING
`
Provisional Patent Application Under 35 U.S.C. § 111(b)
`
`Inventors:
`
`Laurence B. Boucher
`Stephen E. J. Blightman
`Peter K. Craft
David A. Higgen
`Clive M. Philbrick
`Daryl D. Starr
`
`Assignee:
`
Alacritech Corporation
`
`1 Background of the Invention
`
`Network processing as it exists today is a costly and inefficient use of system resources.
A 200 MHz Pentium-Pro is typically consumed simply processing network data from a
100Mb/second network connection. The reasons that this processing is so costly are
`described here.
`
`1.1 Too Many Data Moves
`
When a network packet arrives at a typical network interface card (NIC), the NIC moves
the data into pre-allocated network buffers in system main memory. From there the data
is read into the CPU cache so that it can be checksummed (assuming of course that the
protocol in use requires checksums. Some, like IPX, do not.). Once the data has been
fully processed by the protocol stack, it can then be moved into its final destination in
memory. Since the CPU is moving the data, and must read the destination cache line in
before it can fill it and write it back out, this involves at a minimum 2 more trips across
the system memory bus. In short, the best one can hope for is that the data will get
moved across the system memory bus 4 times before it arrives in its final destination. It
can, and does, get worse. If the data happens to get invalidated from system cache after it
has been checksummed, then it must get pulled back across the memory bus before it can
be moved to its final destination. Finally, on some systems, including Windows NT 4.0,
the data gets copied yet another time while being moved up the protocol stack. In NT
4.0, this occurs between the miniport driver interface and the protocol driver interface.
This can add up to a whopping 8 trips across the system memory bus (the 4 trips
described above, plus the move to replenish the cache, plus 3 more to copy from the
miniport to the protocol driver). That's enough to bring even today's advanced memory
busses to their knees.
`
`
`
`
`1.2 Too Much Processing by the CPU
`
`In all but the original move from the NIC to system memory, the system CPU is
`responsible for moving the data. This is particularly expensive because while the CPU is
`moving this data it can do nothing else. While moving the data the CPU is typically
`stalled waiting for the relatively slow memory to satisfy its read and write requests. A
CPU, which can execute an instruction every 5 nanoseconds, must now wait as long as
`several hundred nanoseconds for the memory controller to respond before it can begin its
next instruction. Even today's advanced pipelining technology doesn't help in these
`situations because that relies on the CPU being able to do useful work while it waits for
`the memory controller to respond. If the only thing the CPU has to look forward to for
`the next several hundred instructions is more data moves, then the CPU ultimately gets
`reduced to the speed of the memory controller.
`
`Moving all this data with the CPU slows the system down even after the data has been
`moved. Since both the source and destination cache lines must be pulled into the CPU
cache when the data is moved, more than 3k of instructions and/or data resident in the
`CPU cache must be flushed or invalidated for every 1500 byte frame. This is of course
`assuming a combined instruction and data second level cache, as is the case with the
`Pentium processors. After the data has been moved, the former resident of the cache will
`likely need to be pulled back in, stalling the CPU even when we are not performing
`network processing. Ideally a system would never have to bring network frames into the
`CPU cache, instead reserving that precious commodity for instructions and data that are
`referenced repeatedly and frequently.
`
`But the data movement is not the only drain on the CPU. There is also a fair amount of
`processing that must be done by the protocol stack software. The most obvious expense
`is calculating the checksum for each TCP segment (or UDP datagram). Beyond this,
however, there is other processing to be done as well. The TCP connection object must
`be located when a given TCP segment arrives, IP header checksums must be calculated,
`there are buffer and memory management issues, and finally there is also the significant
`expense of interrupt processing which we will discuss in the following section.
`
`1.3 Too Many Interrupts
`
A 64k SMB request (write or read-reply) is typically made up of 44 TCP segments when
running over Ethernet (1500 byte MTU). Each of these segments may result in an
interrupt to the CPU. Furthermore, since TCP must acknowledge all of this incoming
data, it's possible to get another 44 transmit-complete interrupts as a result of sending out
the TCP acknowledgements. While this is possible, it is not terribly likely. Delayed
ACK timers allow us to acknowledge more than one segment at a time. And delays in
interrupt processing may mean that we are able to process more than one incoming
network frame per interrupt. Nevertheless, even if we assume 4 incoming frames per
interrupt, and an acknowledgement for every 2 segments (as is typical per the ACK-every-
other-segment property of TCP), we are still left with 33 interrupts per 64k SMB request.
`
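The 33-interrupt figure follows directly from the assumptions above. A minimal sketch of the
arithmetic, restating those assumptions (44 segments per 64k request, 4 received frames
coalesced per interrupt, one acknowledgement per 2 segments), is shown below; it is purely
illustrative.

    /* Back-of-the-envelope interrupt count for one 64k SMB request,
     * restating the assumptions in the text above (illustrative only). */
    #include <stdio.h>

    int main(void)
    {
        int segments        = 44;   /* 64k payload over 1500-byte-MTU Ethernet */
        int frames_per_intr = 4;    /* incoming frames coalesced per interrupt */
        int segs_per_ack    = 2;    /* ACK-every-other-segment behavior of TCP */

        int rx_interrupts  = segments / frames_per_intr;   /* 11 receive interrupts   */
        int ack_interrupts = segments / segs_per_ack;      /* 22 transmit completions */

        printf("interrupts per 64k SMB request: %d\n",
               rx_interrupts + ack_interrupts);             /* prints 33 */
        return 0;
    }
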
`Interrupts tend to be very costly to the system. Often when a system is interrupted,
important information must be flushed or invalidated from the system cache so that the
interrupt routine instructions and needed data can be pulled into the cache. Since the
`
`
`
`
`
CPU will return to its prior location after the interrupt, it is likely that the information
flushed from the cache will immediately need to be pulled back into the cache.
`
`What's more, interrupts force a pipeline flush in today's advanced processors. While the
processor pipeline is an extremely efficient way of improving CPU performance, it can
`be expensive to get going after it has been flushed.
`
`Finally, each of these interrupts results in expensive register accesses across the
`peripheral bus (PCI). This is discussed more in the following section.
`
1.4 Inefficient Use of the Peripheral Bus (PCI)
`
We noted earlier that when the CPU has to access system memory, it may be stalled for
several hundred nanoseconds. When it has to read from PCI, it may be stalled for many
microseconds. This happens every time the CPU takes an interrupt from a standard NIC.
The first thing the CPU must do when it receives one of these interrupts is to read the
NIC Interrupt Status Register (ISR) from PCI to determine the cause of the interrupt. The
most troubling thing about this is that since interrupt lines are shared on PC-based
systems, we may have to perform this expensive PCI read even when the interrupt is not
meant for us!
`
There are other peripheral bus inefficiencies as well. Typical NICs operate using
descriptor rings. When a frame arrives, the NIC reads a receive descriptor from system
memory to determine where to place the data. Once the data has been moved to main
memory, the descriptor is then written back out to system memory with status about the
received frame. Transmit operates in a similar fashion. The CPU must notify the NIC
that it has a new transmit. The NIC will read the descriptor to locate the data, read the
data itself, and then write the descriptor back with status about the send. Typically on
transmits the NIC will then read the next expected descriptor to see if any more data
needs to be sent. In short, each receive or transmit frame results in 3 or 4 separate PCI
reads or writes (not counting the status register read).
`
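For reference, the descriptor-ring interaction just described can be sketched roughly as
follows. The structure layout, field names and ring size below are hypothetical and are not
taken from any particular NIC; the sketch only illustrates why each frame costs several PCI
operations.

    /* Hypothetical receive descriptor ring of a conventional NIC, to illustrate
     * why each frame costs several PCI reads and writes. */
    #include <stdint.h>

    #define RX_RING_SIZE 256

    struct rx_descriptor {
        uint64_t buffer_addr;   /* host memory address to DMA the frame into      */
        uint16_t length;        /* filled in by the NIC with the received length  */
        uint16_t status;        /* ownership bit, error flags, written by the NIC */
    };

    /* The host allocates the ring in system memory and hands its base address to
     * the NIC once.  For every received frame the NIC must then:
     *   1. read the next descriptor over PCI to find buffer_addr,
     *   2. DMA the frame data into that buffer,
     *   3. write the descriptor back over PCI with length and status.
     * Transmit behaves symmetrically, often with an extra read of the following
     * descriptor, which is how the 3 or 4 PCI operations per frame arise. */
    struct rx_descriptor rx_ring[RX_RING_SIZE];
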
`2 Summary of the Invention
`
`Alacritech was formed with the idea that the network processing described above could
`be offloaded onto a cost-effective Intelligent Network Interface Card (INIC). With the
`Alacritech INIC, we address each of the above problems, resulting in the following
`advancements:
1. The vast majority of the data is moved directly from the INIC into its final
destination. A single trip across the system memory bus.
2. There is no header processing, little data copying, and no checksumming required by
the CPU. Because of this, the data is never moved into the CPU cache, allowing the
system to keep important instructions and data resident in the CPU cache.
3. Interrupts are reduced to as little as 4 interrupts per 64k SMB read and 2 per 64k
SMB write.
`4. There are no CPU reads over PCI and there are fewer PCl operations per receive or
`transmit transaction.
`
`In the remainder of this document we will describe how we accomplish the above.
`
`
`
`
`
`2.1 Perform Transport Level Processing on the INIC
`
In order to keep the system CPU from having to process the packet headers or checksum
the packet, we must perform this task on the INIC. This is a daunting task. There are
more than 20,000 lines of C code that make up the FreeBSD TCP/IP protocol stack.
Clearly this is more code than could be efficiently handled by a competitively priced
network card. Furthermore, as we've noted above, the TCP/IP protocol stack is
complicated enough to consume a 200 MHz Pentium-Pro. Clearly in order to perform
this function on an inexpensive card, we need special network processing hardware as
opposed to simply using a general purpose CPU.
`
2.1.1 Only Support TCP/IP
`
`In this section we introduce the notion of a "context". A context is required to keep track
`of information that spans many, possibly discontiguous, pieces of information. When
`processing TCP/IP data, there are actually two contexts that must be maintained. The
`first context is required to reassemble IP fragments. It holds information about the status
`of the IP reassembly as well as any checksum information being calculated across the IP
datagram (UDP or TCP). This context is identified by the IP_ID of the datagram as well
as the source and destination IP addresses. The second context is required to handle the
`sliding window protocol of TCP. It holds information about which segments have been
`sent or received, and which segments have been acknowledged, and is identified by the
`IP source and destination addresses and TCP source and destination ports.
`
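The two contexts can be pictured roughly as the structures below. The field names are
illustrative only and are not the INIC's actual data layouts; the point is what each context
must carry and how each is keyed.

    /* Illustrative layouts of the two contexts described above (field names
     * are hypothetical, not the INIC's actual data structures). */
    #include <stdint.h>

    /* Context 1: IP reassembly, keyed by IP_ID plus source/destination address. */
    struct ip_reassembly_ctx {
        uint32_t src_ip, dst_ip;     /* key                                     */
        uint16_t ip_id;              /* key                                     */
        uint32_t bytes_received;     /* reassembly progress                     */
        uint16_t partial_checksum;   /* running TCP/UDP checksum over fragments */
    };

    /* Context 2: TCP sliding window, keyed by the usual 4-tuple. */
    struct tcp_ctx {
        uint32_t src_ip, dst_ip;     /* key                                     */
        uint16_t src_port, dst_port; /* key                                     */
        uint32_t snd_una;            /* oldest unacknowledged sequence number   */
        uint32_t snd_nxt;            /* next sequence number to send            */
        uint32_t rcv_nxt;            /* next sequence number expected           */
        uint16_t rcv_wnd;            /* advertised receive window               */
    };
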
If we were to choose to handle both contexts in hardware, we would have to potentially
keep track of many pieces of information. One such example is a case in which a single
64k SMB write is broken down into 44 1500-byte TCP segments, which are in turn
broken down into 131 576-byte IP fragments, all of which can come in any order (though
the maximum window size is likely to restrict the number of outstanding segments
considerably).
`
`Fortunately, TCP performs a Maximum Segment Size negotiation at connection
`establishment time, which should prevent IP fragmentation in nearly all TCP
connections. The only time that we should end up with fragmented TCP connections is
when there is a router in the middle of a connection which must fragment the segments to
support a smaller MTU. The only networks that use a smaller MTU than Ethernet are
serial line interfaces such as SLIP and PPP. At the moment, the fastest of these
connections only run at 128k (ISDN) so even if we had 256 of these connections, we
would still only need to support 34Mb/sec, or a little over three 10bT connections worth
`of data. This is not enough to justify any performance enhancements that the INIC
`offers. If this becomes an issue at some point, we may decide to implement the MTU
`discovery algorithm, which should prevent TCP fragmentation on all connections (unless
`an ICMP redirect changes the connection route while the connection is established).
`
`With this in mind, it seems a worthy sacrifice to not attempt to handle fragmented TCP
`segments on the INIC.
`
`UDP is another matter. Since UDP does not support the notion of a Maximum Segment
`Size, it is the responsibility of IP to break down a UDP datagram into MTU sized
`
`
`
`
`packets. Thus, fragmented UDP datagrams are very common. The most common UDP
`application running today is NFSV2 over UDP. While this is also the most common
version of NFS running today, the current version of Solaris being sold by Sun
Microsystems runs NFSV3 over TCP by default. We can expect to see the NFSV2/UDP
`traffic start to decrease over the coming years.
`
`In summary, we will only offer assistance to non-fragmented TCP connections on the
`INIC.
`
`2.1.2 Don't handle TCP "exceptions"
`
`As noted above, we won't provide support for fragmented TCP segments on the INIC.
We have also opted to not handle TCP connection setup and breakdown. Here is a list of other
TCP "exceptions" which we have elected to not handle on the INIC:
`
`Fragmented Segments - Discussed above.
`
`Retransmission Timeout - Occurs when we do not get an acknowledgement for
`previously sent data within the expected time period.
`
`Out of order segments - Occurs when we receive a segment with a sequence number
`other than the next expected sequence number.
`
`FIN segment - Signals the close of the connection.
`
`
Since we have now eliminated support for so many different code paths, it might seem
hardly worth the trouble to provide any assistance by the card at all. This is not the case.
According to W. Richard Stevens and Gary Wright in their book "TCP/IP Illustrated
Volume 2", TCP operates without experiencing any exceptions between 97 and 100
percent of the time in local area networks. As network, router, and switch reliability
improve, this number is likely to only improve with time.
`
`
`
`
`
`
exception occurs which the INIC is not designed to handle (such as an out of order
segment). When any of these conditions occur, the INIC will then flush the TCB back to
host memory, and issue a message to the host telling it that it has relinquished control of
the connection, thus putting the connection back into slow-path mode. From this point
on, the INIC simply hands incoming segments that are destined for this TCB off to the
host with all of the headers intact.
`
`Note that when a connection is owned by the INIC, the host is not allowed to reference
`the corresponding TCB in host memory as it will contain invalid information about the
`state of the connection.
`
`2.1.5 TCP hardware assistance
`
When a frame is received by the INIC, it must verify it completely before it even
determines whether it belongs to one of its TCBs or not. This includes all header
validation (is it IP, IPV4 or V6, is the IP header checksum correct, is the TCP checksum
correct, etc). Once this is done it must compare the source and destination IP address and
the source and destination TCP port with those in each of its TCBs to determine if it is
associated with one of its TCBs. This is an expensive process. To expedite this, we have
added several features in hardware to assist us. The header is fully parsed by hardware
and its type is summarized in a single status word. The checksum is also verified
automatically in hardware, and a hash key is created out of the IP addresses and TCP
ports to expedite TCB lookup. For full details on these and other hardware optimizations,
refer to the INIC Hardware Specification sections (Heading 8).
`
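A software view of this hash-assisted lookup might look like the following sketch. The hash
function, table size and names are placeholders rather than the INIC's actual algorithm; on
the INIC the hash arrives with the frame, already computed by hardware.

    /* Sketch of hash-assisted TCB lookup (hash function and sizes are
     * placeholders; the real INIC computes the hash in hardware). */
    #include <stdint.h>
    #include <stddef.h>

    #define TCB_HASH_BUCKETS 256

    struct tcb {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        struct tcb *hash_next;       /* chain of TCBs sharing a bucket */
        /* ... connection state ... */
    };

    static struct tcb *tcb_hash[TCB_HASH_BUCKETS];

    static unsigned tcb_hash_key(uint32_t sip, uint32_t dip,
                                 uint16_t sport, uint16_t dport)
    {
        /* placeholder mix of the IP addresses and TCP ports */
        return (sip ^ dip ^ sport ^ dport) % TCB_HASH_BUCKETS;
    }

    /* Called with fields already parsed and checksummed by hardware; only the
     * bucket chain needs an exact 4-tuple comparison. */
    struct tcb *tcb_lookup(uint32_t sip, uint32_t dip,
                           uint16_t sport, uint16_t dport)
    {
        struct tcb *t = tcb_hash[tcb_hash_key(sip, dip, sport, dport)];
        for (; t != NULL; t = t->hash_next)
            if (t->src_ip == sip && t->dst_ip == dip &&
                t->src_port == sport && t->dst_port == dport)
                return t;            /* fast path: frame belongs to this TCB */
        return NULL;                 /* not ours: hand the frame to the host */
    }
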
`
`With the aid of these and other hardware features, much of the work associated with TCP
`is done essentially for free. Since the card will automatically calculate the checksum for
`TCP segments, we can pass this on to the host, even when the segment is for a TCB that
`the INIC does not own.
`
`2.1.6 TCP Summary
`
By moving TCP processing down to the INIC we have offloaded the host of a large
amount of work. The host no longer has to pull the data into its cache to calculate the
`TCP checksum. It does not have to process the packet headers, and it does not have to
`generate TCP ACKs. We have achieved most of the goals outlined above, but we are not
`done yet.
`
2.2 Transport Layer Interface
`
This section defines the INIC's relation to the host's transport layer interface (called TDI
`or Transport Driver Interface in Windows NT). For full details on this interface, refer to
`the Alacritech TCP (ATCP) driver specification (Heading 4).
`
`2.2.1 Receive
`
Simply implementing TCP on the INIC does not allow us to achieve our goal of landing
the data in its final destination. Somehow the host has to tell the INIC where to put the
data. This is a problem in that the host can not do this without knowing what the data
`
`
`
`
`
`actually is. Fortunately, NT has provided a mechanism by which a transport driver can
`"indicate" a small amount of data to a client above it while telling it that it has more data
to come. The client, having then received enough of the data to know what it is, is then
`responsible for allocating a block of memory and passing the memory address or
addresses back down to the transport driver, which is in turn responsible for moving the
`data into the provided location.
`
`We will make use of this feature by providing a small amount of any received data to the
`host, with a notification that we have more data pending. When this small amount of data
is passed up to the client, and it returns with the address in which to put the remainder of
`the data, our host transport driver will pass that address to the INIC which will DMA the
`remainder of the data into its final destination.
`
Clearly there are circumstances in which this does not make sense. When a small amount
of data arrives (500 bytes for example) with a push flag set indicating that the data must be
delivered to the client immediately, it does not make sense to deliver some of the data
directly while waiting for the list of addresses to DMA the rest. Under these
circumstances, it makes more sense to deliver the 500 bytes directly to the host, and
allow the host to copy it into its final destination. While various ranges are feasible, it is
currently preferred that anything less than a segment's (1500 bytes) worth of data will be
delivered directly to the host, while anything more will be delivered as a small piece
which may be 128 bytes, while waiting until receiving the destination memory address
before moving the rest.
`
The trick then is knowing when the data should be delivered to the client or not. As
we've noted, a push flag indicates that the data should be delivered to the client
immediately, but this alone is not sufficient. Fortunately, in the case of NetBIOS
transactions (such as SMB), we are explicitly told the length of the session message in the
NetBIOS header itself. With this we can simply indicate a small amount of data to the
host immediately upon receiving the first segment. The client will then allocate enough
memory for the entire NetBIOS transaction, which we can then use to DMA the
remainder of the data into as it arrives. In the case of a large (56k for example) NetBIOS
session message, all but the first couple hundred bytes will be DMA'd to their final
destination in memory.
`
But what about applications that do not reside above NetBIOS? In this case we can not
rely on a session level protocol to tell us the length of the transaction. Under these
circumstances we will buffer the data as it arrives until A) we have received some
predetermined number of bytes such as 8k, or B) some predetermined period of time
passes between segments, or C) we get a push flag. After any of these conditions occurs,
we will then indicate some or all of the data to the host depending on the amount of data
buffered. If the data buffered is greater than about 1500 bytes we must then also wait for
the memory address to be returned from the host so that we may then DMA the
remainder of the data.
`
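The delivery policy described in the preceding paragraphs can be condensed into a small
decision routine such as the sketch below. The thresholds (1500 bytes, 8k, a roughly 128-byte
indication) come from the text above; the function and type names are ours and purely
illustrative.

    /* Condensed receive-delivery policy from the text above (names and the
     * exact threshold values used here are illustrative). */
    #include <stdbool.h>
    #include <stdint.h>

    #define SMALL_DELIVERY_LIMIT  1500   /* below this, just hand data to the host */
    #define INDICATE_BYTES         128   /* size of the initial indication         */
    #define BUFFERED_LIMIT        8192   /* condition A: enough data buffered      */

    enum rx_action {
        KEEP_BUFFERING,     /* no trigger yet; keep accumulating segments       */
        DELIVER_ALL,        /* pass everything up; host copies it itself        */
        INDICATE_AND_WAIT   /* indicate a small piece, wait for a destination
                               address, then DMA the remainder                  */
    };

    enum rx_action rx_decide(uint32_t bytes_buffered, bool push_flag,
                             bool inter_segment_timeout)
    {
        /* Conditions A (bytes), B (timeout) and C (push flag) trigger delivery. */
        if (bytes_buffered < BUFFERED_LIMIT && !inter_segment_timeout && !push_flag)
            return KEEP_BUFFERING;

        if (bytes_buffered < SMALL_DELIVERY_LIMIT)
            return DELIVER_ALL;            /* e.g. a 500-byte pushed request      */
        return INDICATE_AND_WAIT;          /* indicate ~128 bytes, DMA the rest   */
    }
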
`2.2.2 Transmit
`
`The transmit case is much simpler. In this case the client (NetBIOS for example) issues a
`TDI Send with a list of memory addresses which contain data that it wishes to send along
`
`
`
`
`
`with the length. The host can then pass this list of addresses and length off to the INIC.
`The INIC will then pull the data from its source location in host memory, as it needs it,
until the complete TDI request is satisfied.
`
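Conceptually, the transmit command handed to the INIC is little more than a scatter-gather
list of (address, length) pairs plus a total length. A minimal sketch with hypothetical field
names follows; the actual command format is not specified here.

    /* Hypothetical layout of a transmit command passed to the INIC: a list of
     * (address, length) pairs describing the client's data where it sits. */
    #include <stdint.h>

    #define MAX_TX_FRAGMENTS 16

    struct sg_entry {
        uint64_t host_addr;     /* physical address of a piece of the send data */
        uint32_t length;        /* number of bytes at that address              */
    };

    struct tx_command {
        uint32_t total_length;                /* length of the whole TDI send    */
        uint32_t fragment_count;              /* entries used in sg[]            */
        struct sg_entry sg[MAX_TX_FRAGMENTS]; /* data stays in host memory; the
                                                 INIC DMAs it as it needs it     */
    };
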
2.2.3 Effect on interrupts

Note that when we receive a large SMB transaction, for example, there are two
interactions between the INIC and the host: the first in which the INIC indicates a small
amount of the transaction to the host, and the second in which the host provides the
memory location(s) in which the INIC places the remainder of the data. This results in
only two interrupts from the INIC: the first when it indicates the small amount of data,
and the second after it has finished filling in the host memory given to it. This is a drastic
reduction from the 33 interrupts per 64k SMB request that we estimated at the beginning of this section.
`
`On transmit, we actually only receive a single interrupt when the send command that has
`been given to the INIC completes.
`
`2.2.4 Transport Layer Interface Summary
`
`Having now established our interaction with Microsoft's TDI interface, we have achieved
`our goal of landing most of our data directly into its final destination in host memory.
`We have also managed to transmit all data from its original location on host memory.
`And finally, we have reduced our interrupts to 2 per 64k SMB read and 1 per 64k SMB
`write. The only thing that remains in our list of objectives is to design an efficient host
`(PCI) interface.
`
`2.3 Host (PCI) Interface
`
`In this section we define the host interface. For a more detailed description, refer to the
`"Host Interface Strategy for the Alacritech JNIC" section (Heading 3).
`
`2.3.1 Avoid PCI reads
`
One of our primary objectives in designing the host interface of the INIC was to
eliminate PCI reads in either direction. PCI reads are particularly inefficient in that they
completely stall the reader until the transaction completes. As we noted above, this could
hold a CPU up for several microseconds, a thousand times the time typically required to
execute a single instruction. PCI writes, on the other hand, are usually buffered by the
memory-bus-to-PCI bridge, allowing the writer to continue on with other instructions.
This technique is known as "posting".
`
2.3.1.1 Memory-based status register

The only PCI read that is required by most NICs is the read of the interrupt status
register. This register gives the host CPU information about what event has caused the
interrupt (if any). In the design of our INIC we have elected to place this necessary status
register into host memory. Thus, when an event occurs on the INIC, it writes the status
register to an agreed upon location in host memory. The corresponding driver on the host
reads this local register to determine the cause of the interrupt. The interrupt lines are
`
`
`
`
held high until the host clears the interrupt by writing to the INIC's Interrupt Clear
Register. Shadow registers are maintained on the INIC to ensure that events are not lost.
`
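In driver terms, the interrupt handler ends up reading ordinary host memory rather than a
device register, along the lines of the sketch below. The variable names, and the detail of
writing the event bits back to the Interrupt Clear Register, are illustrative assumptions
rather than the actual driver code.

    /* Sketch of an interrupt handler using a status register that lives in host
     * memory (names and the clear-register usage are hypothetical). */
    #include <stdint.h>

    /* Set up during driver initialization. The shadow ISR is written by the
     * INIC via DMA at an agreed-upon host memory address. */
    volatile uint32_t *inic_isr_shadow;     /* host memory, no PCI read needed  */
    volatile uint32_t *inic_clear_reg;      /* mapped INIC register, write only */

    void inic_interrupt(void)
    {
        uint32_t events = *inic_isr_shadow; /* cheap read from local memory     */

        /* Writing the Interrupt Clear Register is posted, so it does not stall
         * the CPU the way a PCI read of a conventional ISR would.              */
        *inic_clear_reg = events;

        /* ... process receive and command-completion events in 'events' ... */
    }
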
`2.3.1.2 Buffer Addresses are pushed to the INIC
`
Since it is imperative that our INIC operate as efficiently as possible, we must also avoid
PCI reads from the INIC. We do this by pushing our receive buffer addresses to the
INIC. As mentioned at the beginning of this section, most NICs work on a descriptor
queue algorithm in which the NIC reads a descriptor from main memory in order to
determine where to place the next frame. We will instead write receive buffer addresses
to the INIC as receive buffers are filled. In order to avoid having to write to the INIC for
every receive frame, we instead allow the host to pass off a page's worth (4k) of buffers in
a single write.
`
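A sketch of replenishing the INIC with a page's worth of receive buffers in a single posted
write might look like the following; the register and function names are assumptions for
illustration.

    /* Sketch: replenish the INIC's small-buffer pool one 4k page at a time
     * (register and function names are hypothetical). */
    #include <stdint.h>

    #define PAGE_SIZE        4096
    #define SMALL_BUF_SIZE    256
    #define SMALL_PER_PAGE   (PAGE_SIZE / SMALL_BUF_SIZE)   /* 16 buffers */

    volatile uint64_t *inic_small_buf_reg;   /* write-only register on the INIC */

    /* One posted PCI write covers 16 receive buffers; the INIC carves the page
     * into 256-byte buffers itself, so no descriptor reads are ever needed.    */
    void replenish_small_buffers(uint64_t page_phys_addr)
    {
        *inic_small_buf_reg = page_phys_addr;
    }
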
`2.3.2 Support small and large buffers on receive
`
`In order to reduce further the number of writes to the INIC, and to reduce the amount of
`memory being used by the host, we support two different buffer sizes. A small buffer
`contains roughly 200 bytes of data payload, as well as extra fields containing status about
`the received data bringing the total size to 256 bytes. We can therefore pass 16 of these
`small buffers at a time to the INIC. Large buffers are 2k in size. They are used to
contain any fast or slow-path data that does not fit in a small buffer. Note that when we
have a large fast-path receive, a small buffer will be used to indicate a small piece of the
data, while the remainder of the data will be DMA'd directly into memory. Large
buffers are never passed to the host by themselves; instead they are always accompanied
by a small buffer which contains status about the receive along with the large buffer
address. By operating in this manner, the driver must only maintain and process the small
buffer queue. Large buffers are returned to the host by virtue of being attached to small
buffers. Since large buffers are 2k in size they are passed to the INIC 2 buffers at a time.
`
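The two buffer types might be pictured as follows. The exact split between status fields and
payload within the 256-byte small buffer is our guess at a layout for illustration, not the
actual format.

    /* Illustrative layouts of the two receive buffer types (the exact split of
     * status fields versus payload is assumed, not specified here). */
    #include <stdint.h>

    #define SMALL_BUF_SIZE   256
    #define SMALL_PAYLOAD    200      /* roughly 200 bytes of data payload      */
    #define LARGE_BUF_SIZE  2048

    struct small_buffer {
        uint64_t large_buf_addr;            /* set when the data spilled into an
                                               attached 2k large buffer          */
        uint16_t status;                    /* fast-path/slow-path, errors, etc. */
        uint16_t data_length;               /* bytes of payload in this buffer   */
        uint8_t  payload[SMALL_PAYLOAD];    /* headers or small frames in place  */
        uint8_t  reserved[44];              /* pads the structure to 256 bytes   */
    };

    /* Large buffers carry only data; status always arrives in a small buffer. */
    struct large_buffer {
        uint8_t payload[LARGE_BUF_SIZE];
    };
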
`2.3.3 Command and response buffers
`
In addition to needing a manner by which the INIC can pass incoming data to us, we also
`need a manner by which we can instruct the INIC to send data. Plus, when the INIC
`indicates a small amount of data in a large fast-path receive, we need a method of passing
`back the address or addresses in which to put the remainder of the data. We accomplish
`both of these with the use of a command buffer. Sadly, the command buffer is the only
`place in which we must violate our rule of only pushing data across PCI. For the
command buffer, we write the address of the command buffer to the INIC. The INIC then
`reads the contents of the command buffer into its memory so that it can execute the
`desired command. Since a command may take a relatively long time to complete, it is
`unlikely that command buffers will complete in order. For this reason we also maintain a
`response buffer queue. Like the small and large receive buffers, a page worth of response
`buffers is passed to the INIC at a time. Response buffers are only 32 bytes, so we have to
replenish the INIC's supply of them relatively infrequently. The response buffer's only
`purpose is to indicate the completion of the designated command buffer, and to pass
`status about the completion.
`
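Putting the pieces together, the command/response exchange can be sketched as below. The
command layout, the eight-entry address list, and the field names are illustrative; only the
32-byte response size and the general flow come from the text above.

    /* Sketch of the command/response queue pair described above (field names
     * and the command format are illustrative). */
    #include <stdint.h>

    /* Host-resident command buffer; only its address is written to the INIC,
     * which then reads the contents itself (the one unavoidable PCI read).    */
    struct inic_command {
        uint32_t command_id;       /* echoed back in the matching response       */
        uint32_t opcode;           /* e.g. transmit, or supply receive addresses */
        uint64_t addr_list[8];     /* destination or source addresses            */
        uint32_t length;           /* total bytes involved                       */
    };

    /* 32-byte response buffer, written by the INIC when a command completes.
     * Responses may arrive out of order, hence the echoed command_id.          */
    struct inic_response {
        uint32_t command_id;       /* which command buffer this completes        */
        uint32_t status;           /* success or error information               */
        uint8_t  reserved[24];     /* pads the response to 32 bytes              */
    };
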
`
`
`
`2.4 Examples
`
`In this section we will provide a couple of examples describing some of the differing data
flows that we might see on the Alacritech INIC.
`
2.4.1 Fast-path 56k NetBIOS session message
`
`Let's say a 56k NetBIOS session message is received on the INIC. The first segment will
`contain the NetBIOS header, which contains the total NetBIOS length. A small chunk of
`this first segment is provided to the host by filling in a small receive buffer, modifying
`the interrupt status register on the host, and raising the appropriate interrupt line. Upon
`receiving the interrupt, the host will read the ISR, clear it by writing back to the INIC's
`Interrupt Clear Register, and will then process its small receive buffer queue looking for
`receive buffers to be processed. Upon finding the small buffer, it will indicate the small
`amount of data up to the client to be processed by NetBIOS. It will also, if necessary,
replenish the receive buffer pool on the INIC by passing off a page's worth of small
buffers. Meanwhile, the NetBIOS client will allocate a memory pool large enough to
hold the entire NetBIOS message, and will pass this address or set of addresses down to
the transport driver. The transport driver will allocate an INIC command buffer, fill it in
with the list of addresses, set the command type to tell the INIC that this is where to put
`the receive data, and then pass the command off to the INIC by writing to the command
`register. When the INIC receives the command buffer, it will DMA the remainder of the
`NetBIOS data, as it is received, into the memory address or addresses designated by the
`host. Once the entire NetBIOS transaction is complete, the INIC will complete the
`command by writing to the response buffer with the appropriate status and command
`buffer identifier.
`
`In this example, we have two interrupts, and all but a couple hundred bytes are DMA'd
`directly to their final destination. On PCI we have two interrupt status register writes,
`two interrupt clear register writes, a command register write, a command read, and a
`response buffer write.
`
`With a standard NIC this would result in an estimated 30 interrupts, 30 interrupt register
`reads, 30 interrupt clear writes, and 58 descriptor reads and writes. Plus the data will get
`moved anywhere from 4 to 8 times across the system memory bus.
`
`2.4.2 Slow-path receive
`
If the INIC receives a frame that does not contain a TCP segment for one of its TCBs, it
`simply passes it to the host as if it were a dumb NIC. If the frame fits into a small buffer
(~200 bytes or less), then it simply fills in the small buffer with the data and notifies the
`host. Otherwise it places the data in a large buffer, writes the address of the large buffer
into a small buffer, and again notifies the host. The host, having received the interrupt
and found the completed small buffer, checks to see if the data is contained in the small
`buffer, and if not, locates the large buffer. Having found the data, the host will then pass
`the frame upstream to be processed by the standard protocol stack. It must also replenish
`the INIC's small and large receive buffer pool if necessary.
`
`
`
`
`
`With the INIC, this will result in one interrupt, one interrupt status register write and one
interrupt clear register write as well as a possible small and/or large receive buffer
`register write. The data will go through the normal path although if it is TCP data then
`the host will not have to perform the checksum.
`
`With a standard NIC this will result in a single interrupt, an interrupt status register read,
`an interrupt clear register write, and a descriptor read and write. The data will get
processed as it would by the INIC, except for a possible extra checksum.
`
`2.4.3 Fast-path 400 byte send
`
In this example, let's assume that the client has a small amount of data to send. It will
issue the TDI Send to the transport driver which will allocate a command buffer, fill it in
with the address of the 400 byte send, and set the command to indicate that it is a
`transmit. It will then pass the command off to the INIC by writing to the command
`register. The INIC will then DMA the 400 bytes into its own memory, prepare a frame
`with the appropriate checksums and headers, and send the frame