`
`IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
`
`Application of Laurence B. Boucher et al.
`
`Docket No: ALA-004
`
`Filing Date: August 27, 1998
`
`Express Mail Label No: EE237180694US
`
`Title:
`
`INTELLIGENT NETWORK INTERFACE DEVICE
`AND SYSTEM FOR ACCELERATED COMMUNICATION
`
`August 27, 1998
`
`Box Provisional Application
`Assistant Commissioner for Patents
`Washington, D.C. 20231
`
`Sir:
`
This is a request for filing the above-referenced, attached PROVISIONAL APPLICATION FOR PATENT under CFR 1.53(b)(2). The inventors are listed below:
`
Inventors                   Residence

Laurence B. Boucher         Saratoga, California
Stephen E. J. Blightman     San Jose, California
Peter K. Craft              San Francisco, California
David A. Higgin             Saratoga, California
Clive M. Philbrick          San Jose, California
Daryl D. Starr              Milpitas, California
`
The application contains a 173-page Specification which includes interspersed Drawings.
`
`Please address all correspondence to the address below.
`
`Enclosed please find a check in the amount of $150.00 to cover the Filing Fee.
`
`CERTIFICATE OF MAILING
`I hereby certify that this correspondence is being deposited with
`the United States Postal Service, Express Mail Post Office to
`Addressee, Label No. EE237180694US, addressed to: Box
`Provisional Application, Assistant Commissioner for Patents,
Washington, D.C. 20231, on August 27, 1998.
Date: 8-27-98
`
`Mark Lauer
`
`Respectfully submitted,
`
`Mark Lauer
`Reg. No. 36,578
`6850 Regional Street
`Suite 250
`Dublin, CA 94568
`Tel: (925) 556-3500
`Fax: (925) 803-8189
`
`INTELLIGENT NETWORK INTERFACE DEVICE
`
`AND SYSTEM FOR ACCELERATED COMMUNICATION
`
Provisional Patent Application Filed Under 35 U.S.C. § 111(b)
`
`Inventors:
`
`Laurence B. Boucher
`Stephen E. J. Blightman
`Peter K. Craft
`David A. Higgin
`Clive M. Philbrick
`Daryl D. Starr
`
`Assignee:
`
`Alacritech Corporation
`
`
`Background of the Invention
`Network processing as it exists today is a costly and inefficient use of system resources.
A 200 MHz Pentium-Pro is typically consumed simply processing network data from a
100Mb/second network connection. The reasons that this processing is so costly are described in
the next few pages.
`
1.1. Too Many Data Moves
When a network packet arrives at a typical network interface card (NIC), the NIC moves
`the data into pre-allocated network buffers in system main memory. From there the data is read
`into the CPU cache so that it can be checksummed (assuming of course that the protocol in use
`requires checksums. Some, like IPX, do not.). Once the data has been fully processed by the
`protocol stack, it can then be moved into its final destination in memory. Since the CPU is
`moving the data, and must read the destination cache line in before it can fill it and write it back
`out, this involves at a minimum 2 more trips across the system memory bus. In short, the best
`one can hope for is that the data will get moved across the system memory bus 4 times before it
`arrives in its final destination. It can, and does, get worse. If the data happens to get invalidated
`from system cache after it has been checksummed, then it must get pulled back across the
`memory bus before it can be moved to its final destination. Finally, on some systems, including
`Windows NT 4.0, the data gets copied yet another time while being moved up the protocol stack.
`In NT 4.0, this occurs between the miniport driver interface and the protocol driver interface.
`This can add up to a whopping 8 trips across the system memory bus (the 4 trips described
`above, plus the move to replenish the cache, plus 3 more to copy from the miniport to the
`protocol driver). That's enough to bring even today's advanced memory busses to their knees.
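
Purely as an illustration of the path just described (and not as part of the invention), the following C fragment sketches those four best-case trips across the memory bus for a single 1500-byte frame; the buffer and function names are hypothetical and the checksum is simplified.

    /*
     * Illustrative sketch of the conventional receive path; comments count
     * trips across the system memory bus for one frame.  Names are
     * hypothetical, not taken from any actual protocol stack.
     */
    #include <stdint.h>
    #include <string.h>

    #define FRAME_SIZE 1500

    static uint8_t nic_buffer[FRAME_SIZE];   /* pre-allocated network buffer (trip 1: NIC DMA) */
    static uint8_t dest_buffer[FRAME_SIZE];  /* final destination in memory                    */

    static uint32_t checksum(const uint8_t *p, int len)
    {
        uint32_t sum = 0;                    /* trip 2: frame read into the CPU cache */
        for (int i = 0; i < len; i++)
            sum += p[i];                     /* simplified stand-in for the TCP checksum */
        return sum;
    }

    int main(void)
    {
        (void)checksum(nic_buffer, FRAME_SIZE);
        /* Trips 3 and 4: the destination cache lines are read in and the data
         * is written back out when the CPU copies the frame to its final
         * location. */
        memcpy(dest_buffer, nic_buffer, FRAME_SIZE);
        return 0;
    }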
`
`
1.2. Too Much Processing by the CPU
`In all but the original move from the NIC to system memory, the system CPU is
`responsible for moving the data. This is particularly expensive because while the CPU is moving
`this data it can do nothing else. While moving the data the CPU is typically stalled waiting for
`the relatively slow memory to satisfy its read and write requests. A CPU, which can execute an
`instruction every 5 nanoseconds, must now wait as long as several hundred nanoseconds for the
`memory controller to respond before it can begin its next instruction. Even today's advanced
`pipelining technology doesn't help in these situations because that relies on the CPU being able
`to do useful work while it waits for the memory controller to respond. If the only thing the CPU
`has to look forward to for the next several hundred instructions is more data moves, then the
`CPU ultimately gets reduced to the speed of the memory controller.
`Moving all this data with the CPU slows the system down even after the data has been
`moved. Since both the source and destination cache lines must be pulled into the CPU cache
when the data is moved, more than 3k of instructions and/or data resident in the CPU cache must
`be flushed or invalidated for every 1500 byte frame. This is of course assuming a combined
`instruction and data second level cache, as is the case with the Pentium processors. After the
`data has been moved, the former resident of the cache will likely need to be pulled back in,
`stalling the CPU even when we are not performing network processing. Ideally a system would
`never have to bring network frames into the CPU cache, instead reserving that precious
`commodity for instructions and data that are referenced repeatedly and frequently.
`But the data movement is not the only drain on the CPU. There is also a fair amount of
`processing that must be done by the protocol stack software. The most obvious expense is
`calculating the checksum for each TCP segment (or UDP datagram). Beyond this, however,
`there is other processing to be done as well. The TCP connection object must be located when a
`given TCP segment arrives, IP header checksums must be calculated, there are buffer and
`memory management issues, and finally there is also the significant expense of interrupt
`processing which we will discuss in the following section.
`
1.3. Too Many Interrupts
`A 64k SMB request (write or read-reply) is typically made up of 44 TCP segments when
`running over Ethernet (1500 byte MTU). Each of these segments may result in an interrupt to
`the CPU. Furthermore, since TCP must acknowledge all of this incoming data, it's possible to
`get another 44 transmit-complete interrupts as a result of sending out the TCP
`acknowledgements. While this is possible, it is not terribly likely. Delayed ACK timers allow
`us to acknowledge more than one segment at a time. And delays in interrupt processing may
`mean that we are able to process more than one incoming network frame per interrupt.
Nevertheless, even if we assume 4 incoming frames per interrupt, and an acknowledgement for
`every 2 segments (as is typical per the ACK-every-other-segment property of TCP), we are still
`left with 33 interrupts per 64k SMB request.
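
The arithmetic behind that estimate can be restated directly; the constants below are the assumptions from the preceding paragraph, nothing more.

    /* Worked restatement of the interrupt estimate above. */
    #include <stdio.h>

    int main(void)
    {
        int segments           = 44;  /* 64k SMB request over 1500-byte MTU Ethernet   */
        int frames_per_intr    = 4;   /* assumed incoming frames handled per interrupt */
        int segs_per_ack       = 2;   /* ACK every other segment                       */

        int receive_interrupts = segments / frames_per_intr;  /* 11 */
        int ack_interrupts     = segments / segs_per_ack;     /* 22 transmit-complete  */

        printf("%d interrupts per 64k SMB request\n",
               receive_interrupts + ack_interrupts);           /* 33 */
        return 0;
    }
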
`Interrupts tend to be very costly to the system. Often when a system is interrupted,
`important information must be flushed or invalidated from the system cache so that the interrupt
`routine instructions, and needed data can be pulled into the cache. Since the CPU will return to
`its prior location after the interrupt, it is likely that the information flushed from the cache will
`immediately need to be pulled back into the cache.
`
`
`What's more, interrupts force a pipeline flush in today's advanced processors. While the
`processor pipeline is an extremely efficient way of improving CPU performance, it can be
`expensive to get going after it has been flushed.
`Finally, each of these interrupts results in expensive register accesses across the
`peripheral bus (PCI). This is discussed more in the following section.
`
1.4. Inefficient Use of the Peripheral Bus (PCI)
`We noted earlier that when the CPU has to access system memory, it may be stalled for
`several hundred nanoseconds. When it has to read from PCI, it may be stalled for many
microseconds. This happens every time the CPU takes an interrupt from a standard NIC. The
`first thing the CPU must do when it receives one of these interrupts is to read the NIC Interrupt
`Status Register (ISR) from PCI to determine the cause of the interrupt. The most troubling thing
`about this is that since interrupt lines are shared on PC-based systems, we may have to perform
`this expensive PCI read even when the interrupt is not meant for us!
There are other peripheral bus inefficiencies as well. Typical NICs operate using
`descriptor rings. When a frame arrives, the NIC reads a receive descriptor from system memory
`to determine where to place the data. Once the data has been moved to main memory, the
`descriptor is then written back out to system memory with status about the received frame.
Transmit operates in a similar fashion. The CPU must notify the NIC that it has a new transmit.
The NIC will read the descriptor to locate the data, read the data itself, and then write the
descriptor back with status about the send. Typically on transmits the NIC will then read
`next expected descriptor to see if any more data needs to be sent. In short, each receive or
`transmit frame results in 3 or 4 separate PCI reads or writes (not counting the status register
`read).
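
As a purely illustrative sketch (any particular commercial NIC's descriptor format will differ), a receive descriptor and the PCI operations it implies might look like this:

    #include <stdint.h>

    /* Hypothetical receive-descriptor layout for a conventional NIC. */
    struct rx_descriptor {
        uint64_t buffer_addr;    /* where the NIC should DMA the next frame      */
        uint16_t length;         /* filled in by the NIC once the frame arrives  */
        uint16_t status;         /* frame-received and error bits                */
    };

    /* Per received frame, a conventional NIC typically performs:
     *   1. a PCI read of the next rx_descriptor from system memory,
     *   2. PCI writes of the frame data into buffer_addr, and
     *   3. a PCI write of the descriptor back with length and status,
     * in addition to the host's PCI read of the interrupt status register. */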
`
`2.
`
`Summary of the Invention
`Alacritech was formed with the idea that the network processing described above could
`be offloaded onto a cost-effective Intelligent Network Interface Card (INIC). With the
Alacritech INIC, we address each of the above problems, resulting in the following
`advancements:
`
The vast majority of the data is moved directly from the INIC into its final destination: a single
trip across the system memory bus.
`
`There is no header processing, little data copying, and no checksumming required by the CPU.
`Because of this, the data is never moved into the CPU cache, allowing the system to keep
`important instructions and data resident in the CPU cache.
`
Interrupts are reduced to as few as 4 interrupts per 64k SMB read and 2 per 64k SMB write.
`
`There are no CPU reads over PCI and there are fewer PCI operations per receive or transmit
`transaction.
`
`The remainder of this document will describe how we accomplish the above.
`
`
2.1. Perform Transport Level Processing on the INIC
`In order to keep the system CPU from having to process the packet headers or checksum
`the packet, we must perform this task on the INIC. This is a daunting task. There are more than
`20,000 lines of C code that make up the FreeBSD TCP/IP protocol stack. Clearly this is more
`code than could be efficiently handled by a competitively priced network card. Furthermore, as
`we've noted above, the TCP/IP protocol stack is complicated enough to consume a 200 MHz
`Pentium-Pro. In order to perform this function on an inexpensive card, we need special network
`processing hardware as opposed to simply using a general purpose CPU.
`
2.1.1. Focus On TCP/IP
`In this section we introduce the notion of a "context". A context is required to keep track
`of information that spans many, possibly discontiguous, pieces of information. When processing
`TCP/IP data, there are actually two contexts that must be maintained. The first context is
`required to reassemble IP fragments. It holds information about the status of the IP reassembly
`as well as any checksum information being calculated across the IP datagram (UDP or TCP).
This context is identified by the IP_ID of the datagram as well as the source and destination IP
`addresses. The second context is required to handle the sliding window protocol of TCP. It
`holds information about which segments have been sent or received, and which segments have
`been acknowledged, and is identified by the IP source and destination addresses and TCP source
`and destination ports.
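
Purely for illustration, the two contexts might be represented roughly as follows; the field names are ours and are not the actual layout used by the INIC.

    #include <stdint.h>

    struct ip_reassembly_ctx {       /* identified by IP_ID plus IP addresses     */
        uint16_t ip_id;
        uint32_t ip_src, ip_dst;
        uint32_t bytes_received;     /* reassembly status                         */
        uint32_t partial_checksum;   /* checksum accumulated across the datagram  */
    };

    struct tcp_ctx {                 /* identified by the address/port 4-tuple    */
        uint32_t ip_src, ip_dst;
        uint16_t tcp_src_port, tcp_dst_port;
        uint32_t snd_nxt, rcv_nxt;   /* sliding-window send/receive state         */
        uint32_t snd_una;            /* highest acknowledged sequence number      */
    };
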
`If we were to choose to handle both contexts in hardware, we would have to potentially
`keep track of many pieces of information. One such example is a case in which a single 64k
SMB write is broken down into 44 1500-byte TCP segments, which are in turn broken down into
131 576-byte IP fragments, all of which can come in any order (though the maximum window
`size is likely to restrict the number of outstanding segments considerably).
`Fortunately, TCP performs a Maximum Segment Size negotiation at connection
`establishment time, which should prevent IP fragmentation in nearly all TCP connections. The
`only time that we should end up with fragmented TCP connections is when there is a router in
`the middle of a connection which must fragment the segments to support a smaller MTU. The
`only networks that use a smaller MTU than Ethernet are serial line interfaces such as SLIP and
`PPP. At the moment, the fastest of these connections only run at 128k (ISDN) so even if we had
256 of these connections, we would still only need to support 34Mb/sec, or a little over three
10bT connections' worth of data. This is not enough to justify any performance enhancements
`that the INIC offers. If this becomes an issue at some point, we may decide to implement the
`MTU discovery algorithm, which should prevent TCP fragmentation on all connections (unless
`an ICMP redirect changes the connection route while the connection is established). With this in
`mind, it seems a worthy sacrifice to not attempt to handle fragmented TCP segments on the
`INIC.
`
`SPX follows a similar framework as TCP, and so the expansion of the INIC to handle
`IPX/SPX messages is straightforward.
`UDP is another matter. Since UDP does not support the notion of a Maximum Segment
Size, it is the responsibility of IP to break down a UDP datagram into MTU-sized packets. Thus,
`fragmented UDP datagrams are very common. The most common UDP application running
today is NFSV2 over UDP. While this is also the most common version of NFS running today,
the current version of Solaris being sold by Sun Microsystems runs NFSV3 over TCP by default.
`We can expect to see the NFSV2/UDP traffic start to decrease over the coming years.
`In summary, a first embodiment which will be described in detail in this document offers
`assistance to non-fragmented TCP connections on the INIC, while extension of this design to
process other message protocols, such as SPX/IPX, is straightforward.
`
2.1.2. Don't handle TCP "exceptions"
`As noted above, we do not support fragmented TCP segments on the initial INIC
configuration. We have also opted to not handle TCP connection setup and breakdown. Here is a list
`of other TCP "exceptions" which we have elected to not handle on the INIC:
`Retransmission Timeout - Occurs when we do not get an acknowledgement for
`previously sent data within the expected time period.
`Out of order segments - Occurs when we receive a segment with a sequence number
`other than the next expected sequence number.
FIN segment - Signals the close of the connection.
`Since we have now eliminated support for so many different code paths, it might seem
`hardly worth the trouble to provide any assistance by the card at all. This is not the case.
According to W. Richard Stevens and Gary Wright in Volume 2 of their book "TCP/IP
`Illustrated", which is incorporated by reference herein, TCP operates without experiencing any
`exceptions between 97 and 100 percent of the time in local area networks. As network, router,
and switch reliability improve, this number is likely to only improve with time.
`
`2.1.3. Two modes of operation
`So the next question is what to do about the network packets that do not fit our criteria.
The answer is to use two modes of operation: one in which the network frames are processed on
the INIC through TCP and one in which the card operates like a typical dumb NIC. We call
these two modes fast-path and slow-path. In the slow-path case, network frames are handed to
the system at the MAC layer and passed up through the host protocol stack like any other
network frame. In the fast-path case, network data is given to the host after the headers have
been processed and stripped.
`
[Figure: Fast-path versus slow-path data flow between the client, the host protocol stack
(NetBIOS, TDI, TCP, IP, MAC) and the INIC's on-card stack (TCP, IP, MAC, PHYSICAL).
Slow-path frames cross the PCI bus and climb the host stack from the MAC layer; fast-path data
is handed to the host at the TDI level after the INIC has processed the headers. Both paths
connect to the Ethernet through the physical interface.]
`
`
The transmit case works in much the same fashion. In slow-path mode the packets are
given to the INIC with all of the headers attached. The INIC simply sends these packets out as if
it were a dumb NIC. In fast-path mode, the host gives raw data to the INIC, which must carve it
into MSS-sized segments, add headers to the data, perform checksums on each segment, and then
send it out on the wire.
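
For illustration only, the fast-path transmit behavior can be sketched as follows; the MSS value and the function names are assumptions made for the example.

    #include <stddef.h>
    #include <stdint.h>

    #define MSS 1460   /* assumed maximum segment size negotiated at connection setup */

    /* Stub standing in for the INIC's header generation, checksum, and MAC
     * transmit hardware. */
    static void transmit_segment(const uint8_t *seg, size_t len) { (void)seg; (void)len; }

    /* The host hands raw data to the INIC; the INIC carves it into MSS-sized
     * segments and sends each one out on the wire. */
    void fast_path_send(const uint8_t *data, size_t len)
    {
        for (size_t off = 0; off < len; off += MSS) {
            size_t seg = (len - off > MSS) ? MSS : (len - off);
            transmit_segment(data + off, seg);
        }
    }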
`
2.1.4. The CCB cache
`Consider a situation in which a TCP connection is being handled by the card and a
`fragmented TCP segment for that connection arrives. In this situation, the card turns control of
`this connection over to the host.
`This introduces the notion of a Communication Control Block (CCB) cache. A CCB is a
`structure that contains the entire context associated with a connection. This includes the source
`and destination IP addresses and source and destination TCP ports that define the connection. It
`also contains information about the connection itself such as the current send and receive
`sequence numbers, and the first-hop MAC address, etc. The complete set of CCBs exists in host
`memory, but a subset of these may be "owned" by the card at any given time. This subset is the
CCB cache. The INIC can own up to 256 CCBs at any given time.
`CCBs are initialized by the host during TCP connection setup. Once the connection has
`achieved a "steady-state" of operation, its associated CCB can then be turned over to the IN1C,
`putting us into fast-path mode. From this point on, the INIC owns the connection until either a
`FIN arrives signaling that the connection is being closed, or until an exception occurs which the
`INIC is not designed to handle (such as an out of order segment). When any of these conditions
`occur, the INIC will then flush the CCB back to host memory, and issue a message to the host
`telling it that it has relinquished control of the connection, thus putting the connection back into
`slow-path mode. From this point on, the INIC simply hands incoming segments that are destined
`for this CCB off to the host with all of the headers intact.
Note that when a connection is owned by the INIC, the host is not allowed to reference
`the corresponding CCB in host memory as it will contain invalid information about the state of
`the connection.
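
A CCB might be sketched roughly as follows; the field names and sizes are illustrative only and are not the actual CCB layout.

    #include <stdint.h>

    struct ccb {
        /* Connection identity */
        uint32_t ip_src, ip_dst;
        uint16_t tcp_src_port, tcp_dst_port;

        /* Connection state */
        uint32_t snd_nxt;           /* next sequence number to send             */
        uint32_t rcv_nxt;           /* next sequence number expected            */
        uint8_t  first_hop_mac[6];  /* MAC address of the first hop             */

        uint8_t  owned_by_inic;     /* set while the card owns this CCB; the
                                       host copy is invalid while this is set   */
    };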
`
2.1.5. TCP hardware assistance
When a frame is received by the INIC, the INIC must completely verify the frame before it even
determines whether it belongs to one of its CCBs or not. This includes all header validation (is it
`IP, IPV4 or V6, is the IP header checksum correct, is the TCP checksum correct, etc). Once this
`is done it must compare the source and destination IP address and the source and destination
`TCP port with those in each of its CCBs to determine if it is associated with one of its CCBs.
`This is an expensive process. To expedite this, we have added several features in hardware to
`assist us. The header is fully parsed by hardware and its type is summarized in a single status
`word. The checksum is also verified automatically in hardware, and a hash key is created out of
`the IP addresses and TCP ports to expedite CCB lookup. For full details on these and other
hardware optimizations, refer to the INIC hardware specification sections below.
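
As an illustration of the kind of hash key involved (the hash actually computed by the hardware may differ), a software equivalent might look like this:

    #include <stdint.h>

    /* Fold the IP addresses and TCP ports into a small key used to index the
     * CCB cache; the combining function shown is an assumption, not the
     * hardware's actual hash. */
    uint16_t ccb_hash_key(uint32_t ip_src, uint32_t ip_dst,
                          uint16_t port_src, uint16_t port_dst)
    {
        uint32_t h = ip_src ^ ip_dst;
        h ^= ((uint32_t)port_src << 16) | port_dst;
        h ^= h >> 16;                /* fold the 32-bit value down to 16 bits */
        return (uint16_t)h;
    }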
`
`
`With the aid of these and other hardware features, much of the work associated with TCP
`is done essentially for free. Since the card will automatically calculate the checksum for TCP
`segments, we can pass this on to the host, even when the segment is for a CCB that the INIC
`does not own.
`
`2.1.6. TCP Summary
By moving TCP processing down to the INIC we have offloaded a large amount of work
from the host. The host no longer has to pull the data into its cache to calculate the TCP
`checksum. It does not have to process the packet headers, and it does not have to generate TCP
`ACKs. We have achieved most of the goals outlined above, but we are not done yet.
`
2.2. Transport Layer Interface
This section defines the INIC's relation to the host's transport layer interface (called TDI,
or Transport Driver Interface, in Windows NT). For full details on this interface, refer to the
`Alacritech TCP (ATCP) driver specification below.
`
2.2.1. Receive
Simply implementing TCP on the INIC does not allow us to achieve our goal of landing
`the data in its final destination. Somehow the host has to tell the INIC where to put the data.
This is a problem in that the host cannot do this without knowing what the data actually is.
`Fortunately, NT has provided a mechanism by which a transport driver can "indicate" a small
`amount of data to a client above it while telling it that it has more data to come. The client,
`having then received enough of the data to know what it is, is then responsible for allocating a
`block of memory and passing the memory address or addresses back down to the transport
driver, which is in turn responsible for moving the data into the provided location.
`We will make use of this feature by providing a small amount of any received data to the
`host, with a notification that we have more data pending. When this small amount of data is
`passed up to the client, and it returns with the address in which to put the remainder of the data,
our host transport driver will pass that address to the INIC which will DMA the remainder of the
`data into its final destination.
Clearly there are circumstances in which this does not make sense. When a small amount
of data arrives (500 bytes for example) with a push flag set, indicating that the data must be
delivered to the client immediately, it does not make sense to deliver some of the data directly
while waiting for the list of addresses to DMA the rest. Under these circumstances, it makes more
sense to deliver the 500 bytes directly to the host, and allow the host to copy it into its final
destination. While various ranges are feasible, it is currently preferred that anything less than a
segment's worth of data (1500 bytes) will be delivered directly to the host, while anything more
will be indicated as a small piece (which may be 128 bytes), with the remainder moved only after
the destination memory address has been received.
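
This decision can be sketched as follows; the thresholds are the example values given above and the names are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    #define DIRECT_DELIVERY_LIMIT 1500   /* roughly one segment's worth of data      */
    #define INDICATION_SIZE        128   /* small piece indicated for large receives */

    /* Stub standing in for handing data up to the host driver. */
    static void deliver_to_host(const uint8_t *data, size_t len) { (void)data; (void)len; }

    void indicate_receive(const uint8_t *data, size_t len)
    {
        if (len < DIRECT_DELIVERY_LIMIT) {
            /* Small receive: deliver everything and let the host copy it to
             * its final destination. */
            deliver_to_host(data, len);
        } else {
            /* Large receive: indicate a small piece now; the remainder is
             * DMA'd directly to the destination address returned later by
             * the client. */
            deliver_to_host(data, INDICATION_SIZE);
        }
    }
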
`The trick then is knowing when the data should be delivered to the client or not. As
`we've noted, a push flag indicates that the data should be delivered to the client immediately, but
this alone is not sufficient. Fortunately, in the case of NetBIOS transactions (such as SMB), we
`are explicitly told the length of the session message in the NetBIOS header itself. With this we
can simply indicate a small amount of data to the host immediately upon receiving the first
segment. The client will then allocate enough memory for the entire NetBIOS transaction, which
`we can then use to DMA the remainder of the data into as it arrives. In the case of a large (56k
`for example) NetBIOS session message, all but the first couple hundred bytes will be DMA'd to
`their final destination in memory.
But what about applications that do not reside above NetBIOS? In this case we cannot
rely on a session level protocol to tell us the length of the transaction. Under these circumstances
we will buffer the data as it arrives until 1) we have received some predetermined number of
bytes such as 8k, or 2) some predetermined period of time passes between segments, or 3) we get
a push flag. After any of these conditions occurs we will then indicate some or all of the data to
the host depending on the amount of data buffered. If the data buffered is greater than about
1500 bytes we must then also wait for the memory address to be returned from the host so that
we may then DMA the remainder of the data.
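
These three conditions can be restated in a short sketch; the 8k threshold is the example value from the text and the names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define ACCUMULATE_LIMIT (8 * 1024)   /* example threshold: ~8k buffered */

    struct rx_accumulator {
        uint32_t bytes_buffered;          /* data held since the last indication        */
        bool     push_seen;               /* a segment arrived with the PSH flag set    */
        bool     gap_timer_expired;       /* predetermined time passed between segments */
    };

    bool should_indicate(const struct rx_accumulator *a)
    {
        return a->bytes_buffered >= ACCUMULATE_LIMIT ||
               a->gap_timer_expired ||
               a->push_seen;
    }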
`
`2.2.2. Transmit
`The transmit case is much simpler. In this case the client (NetBIOS for example) issues a
`TDI Send with a list of memory addresses which contain data that it wishes to send along with
`the length. The host can then pass this list of addresses and length off to the INIC. The INIC
`will then pull the data from its source location in host memory, as it needs it, until the complete
`TDI request is satisfied.
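
The shape of such a send request might be sketched as follows; the structure, its field names, and the entry limit are illustrative assumptions, not the actual command format.

    #include <stdint.h>

    #define MAX_SG_ENTRIES 16            /* assumed limit for this sketch */

    struct sg_entry {
        uint64_t host_addr;              /* address of one piece of the data in host memory */
        uint32_t length;
    };

    struct send_command {
        uint32_t        total_length;    /* total length given on the TDI Send */
        uint32_t        num_entries;
        struct sg_entry sg[MAX_SG_ENTRIES];
    };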
`
2.2.3. Effect on Interrupts
Note that when we receive a large SMB transaction, for example, there are two interactions
between the INIC and the host: the first, in which the INIC indicates a small amount of the
transaction to the host, and the second, in which the host provides the memory location(s) in
which the INIC places the remainder of the data. This results in only two interrupts from the
INIC: the first when it indicates the small amount of data and the second after it has finished
filling in the host memory given to it. This is a drastic reduction from the 33 interrupts per 64k
SMB request that we estimated at the beginning of this section.
`On transmit, we actually only receive a single interrupt when the send command that has
`been given to the INIC completes.
`
`2.2.4. Transport Layer Interface Summary
`Having now established our interaction with Microsoft's TDI interface, we have achieved
`our goal of landing most of our data directly into its final destination in host memory. We have
also managed to transmit all data from its original location in host memory. And finally, we
`have reduced our interrupts to 2 per 64k SMB read and 1 per 64k SMB write. The only thing
`that remains in our list of objectives is to design an efficient host (PCI) interface.
`
`2.3. Host (PCI) Interface
`In this section we define the host interface. For a more detailed description, refer to the
`"Host Interface Strategy for the Alacritech INIC" section (Heading 3).
`
`
2.3.1. Avoid PCI reads
`One of our primary objectives in designing the host interface of the INIC was to eliminate
`PCI reads in either direction. PCI reads are particularly inefficient in that they completely stall
`the reader until the transaction completes. As we noted above, this could hold a CPU up for
`several microseconds, a thousand times the time typically required to execute a single
instruction. PCI writes, on the other hand, are usually buffered by the memory-bus-to-PCI bridge,
allowing the writer to continue on with other instructions. This technique is known as "posting".
`
2.3.1.1. Memory-based status register
The only PCI read that is required by most NICs is the read of the interrupt status
`register. This register gives the host CPU information about what event has caused an interrupt
(if any). In the design of our INIC we have elected to place this necessary status register into
`host memory. Thus, when an event occurs on the INIC, it writes the status register to an agreed
`upon location in host memory. The corresponding driver on the host reads this local register to
`determine the cause of the interrupt. The interrupt lines are held high until the host clears the
`interrupt by writing to the INIC's Interrupt Clear Register. Shadow registers are maintained on
`the INIC to ensure that events are not lost.
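
A host-side sketch of this arrangement follows; the register names and the way the addresses are established are illustrative.

    #include <stdint.h>

    /* Agreed-upon status location in host memory and a pointer to the INIC's
     * Interrupt Clear Register; both are established at initialization
     * (illustrative names). */
    static volatile uint32_t  inic_status_word;
    static volatile uint32_t *inic_interrupt_clear;

    void inic_interrupt_handler(void)
    {
        uint32_t events = inic_status_word;   /* local memory read, not a PCI read     */

        *inic_interrupt_clear = events;       /* posted PCI write clears the interrupt */

        /* ...process the indicated events (receive buffers filled, commands
         * completed, and so on)... */
    }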
`
2.3.1.2. Buffer Addresses are pushed to the INIC
`Since it is imperative that our INIC operate as efficiently as possible, we must also avoid
PCI reads from the INIC. We do this by pushing our receive buffer addresses to the INIC. As
mentioned at the beginning of this section, most NICs work on a descriptor queue algorithm in
`which the NIC reads a descriptor from main memory in order to determine where to place the
`next frame. We will instead write receive buffer addresses to the INIC as receive buffers are
`filled. In order to avoid having to write to the INIC for every receive frame, we instead allow the
host to pass off a page's worth (4k) of buffers in a single write.
`
`""'
`
`2.3.2. Support small and large buffers on receive
`In order to reduce further the number of writes to the INIC, and to reduce the amount of
`memory being used by the host, we support two different buffer sizes. A small buffer contains
`roughly 200 bytes of data payload, as well as extra fields containing status about the received
data, bringing the total size to 256 bytes. We can therefore pass 16 of these small buffers at a
time to the INIC. Large buffers are 2k in size. They are used to contain any fast or slow-path
`data that does not fit in a small buffer. Note that when we have a large fast-path receive, a small
`buffer will be used to indicate a small piece of the data, while the remainder of the data will be
`DMA'd directly into memory. Large buffers are never passed to the host by themselves, instead
`they are always accompanied by a small buffer which contains status about the receive along
with the large buffer address. By operating in this manner, the driver must only maintain and
`process the small buffer queue. Large buffers are returned to the host by virtue of being attached
`to small buffers. Since large buffers are 2k in size they are passed to the INIC 2 buffers at a time.
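
The two buffer types might be sketched as follows; only the overall sizes (256 bytes and 2k) and the per-page counts come from the description above, while the field breakdown is illustrative.

    #include <stdint.h>

    struct small_rx_buffer {              /* 256 bytes total */
        uint64_t large_buffer_addr;       /* set when a 2k large buffer is attached  */
        uint16_t status;                  /* kind of receive, fast-path or slow-path */
        uint16_t payload_length;
        uint8_t  payload[200];            /* small receives land directly here       */
        uint8_t  reserved[44];            /* remaining status fields                 */
    };

    _Static_assert(sizeof(struct small_rx_buffer) == 256, "small buffer is 256 bytes");

    #define SMALL_BUFFERS_PER_PAGE (4096 / 256)   /* 16 passed per host write       */
    #define LARGE_BUFFER_SIZE       2048
    #define LARGE_BUFFERS_PER_PASS  2             /* large buffers passed 2 at a time */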
`
`2.3.3. Command and response buffers
In addition to needing a manner by which the INIC can pass incoming data to us, we also
need a manner by which we can instruct the INIC to send data. Plus, when the INIC indicates a
small amount of data in a large fast-path receive, we need a method of passing back the address
`or addresses in which to put the remainder of the data. We accomplish both of these with the use
`of a command buffer. Sadly, the command buffer is the only place in which we must violate our
rule of only pushing data across PCI. For the command buffer, we write the address of the command
`buffer to the INIC. The INIC then reads the contents of the command buffer into its memory so
`that it can execute the desired command. Since a command may take a relatively long time to
`complete, it is unlikely that command buffers will complete in order. For this reason we also
maintain a response buffer queue. Like the small and large receive buffers, a page's worth of
`response buffers is passed to the INIC at a time. Response buffers are only 32 bytes, so we have
to replenish the INIC's supply of them relatively infrequently. The response buffer's only
`purpose is to indicate the completion of the designated command buffer, and to pass status about
`the completion.
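
A response buffer might be sketched as follows; only the 32-byte size and its purpose come from the description above, while the fields shown are illustrative.

    #include <stdint.h>

    struct response_buffer {              /* 32 bytes */
        uint32_t command_id;              /* identifies the completed command buffer */
        uint32_t status;                  /* success or error information             */
        uint8_t  reserved[24];
    };

    _Static_assert(sizeof(struct response_buffer) == 32, "response buffer is 32 bytes");

    #define RESPONSE_BUFFERS_PER_PAGE (4096 / 32)   /* 128 handed to the INIC per page */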
`
2.4. Examples
`In this section we will provide a couple of examples describing some of the differing data
`flows that we might see on the INIC.
`
2.4.1. Fast-path 56k NetBIOS session message
`Let's say a 56k NetBIOS session message is received on the INIC. The first segment will
`contain the NetBIOS header, which contains the total NetBIOS length. A small chunk of this
`first segment is provided to the host by filling in a small receive buffer, modifying the interrupt
`status register on the host, and raising the appropriate interrupt line. Upon receiving the
`interrupt, the host will read the ISR, clear it by writing back to the INIC's Interrupt Clear
`Register, and will then process its small receive buffer queue looking for receive buffers to be
`processed. Upon finding the small buffer, it will indicate the small amount of data up to the
`client to be processed by NetBIOS. It will also, if necessary, replenish the receive buffer pool on
`the INIC by passing off a page