`
`United States Patent and Trademark Office
`
`November 03, 2004
`
THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM
THE RECORDS OF THE UNITED STATES PATENT AND TRADEMARK
OFFICE OF THOSE PAPERS OF THE BELOW IDENTIFIED PATENT
APPLICATION THAT MET THE REQUIREMENTS TO BE GRANTED A
FILING DATE UNDER 35 USC 111.

APPLICATION NUMBER: 60/061,809
FILING DATE: October 14, 1997

By Authority of the
COMMISSIONER OF PATENTS AND TRADEMARKS
`
`Certifying Officer
`
`ALA00138383
`
`
`
PTO/SB/16 (11/95) (Modified)
Approved for use through 01/31/98. OMB 0651-0037
Patent and Trademark Office; U.S. DEPARTMENT OF COMMERCE

PROVISIONAL APPLICATION FOR PATENT COVER SHEET
(Large Entity)

This is a request for filing a PROVISIONAL APPLICATION FOR PATENT under 37 CFR 1.53(b)(2).

Docket Number: ALA-001
INVENTOR(s)/APPLICANT(s)

LAST NAME   FIRST NAME   MIDDLE INITIAL   RESIDENCE (CITY AND EITHER STATE OR FOREIGN COUNTRY)
Boucher     Laurence     B.               Saratoga, California
Blightman   Stephen      E.J.             San Jose, California
Craft       Peter        K.               San Francisco, California
Higgin      David        A.               Saratoga, California
`
TITLE OF THE INVENTION (280 characters max)
INTELLIGENT NETWORK INTERFACE CARD AND SYSTEM FOR PROTOCOL PROCESSING
`
CORRESPONDENCE ADDRESS
Mark Lauer
6850 Regional Street, Suite 250
Dublin, CA 94568, USA
Tel: (510) 556-3500
Fax: (510) 803-8189
`
ENCLOSED APPLICATION PARTS (check all that apply)
[X] Specification: Number of Pages: 130
[X] Drawing(s): Number of Sheets:
[X] Other (specify): Drawings are included within specification

METHOD OF PAYMENT OF FILING FEES FOR THIS PROVISIONAL APPLICATION FOR PATENT (check one)
[X] A check or money order is enclosed to cover the filing fees
[ ] The Commissioner is hereby authorized to charge filing fees and credit Deposit Account Number

FILING FEE AMOUNT: $150.00
`
The invention was made by an agency of the United States Government or under a contract with an agency of the United States:
[X] No
[ ] Yes, the name of the U.S. Government agency and the Government contract number are:
`
Respectfully submitted,

SIGNATURE
Date: October 14, 1997
TYPED or PRINTED NAME: Mark Lauer
REGISTRATION NO. (if appropriate): 36,578
[X] Additional inventors are being named on separately numbered sheets attached hereto

USE ONLY FOR FILING A PROVISIONAL APPLICATION FOR PATENT
SEND TO: Box Provisional Application, Assistant Commissioner for Patents, Washington, DC 20231
`
`ALA001 38384
`
`
`
PROVISIONAL APPLICATION FOR PATENT COVER SHEET
(Large Entity)

INVENTOR(s)/APPLICANT(s)

LAST NAME   FIRST NAME   MIDDLE INITIAL   RESIDENCE (CITY AND EITHER STATE OR FOREIGN COUNTRY)
Philbrick   Clive        M.               San Jose, California
Starr       Daryl        D.               Milpitas, California

USE ONLY FOR FILING A PROVISIONAL APPLICATION FOR PATENT
SEND TO: Box Provisional Application, Assistant Commissioner for Patents, Washington, DC 20231

(Page 2 of 2)
`
`ALA00138385
`
`
`
CERTIFICATE OF MAILING BY "EXPRESS MAIL" (37 CFR 1.10)

Applicant(s): Laurence B. Boucher et al.
Docket No.: ALA-001
Serial No.:
Filing Date:
Examiner:
Group Art Unit:
Invention: INTELLIGENT NETWORK INTERFACE CARD AND SYSTEM FOR PROTOCOL PROCESSING

I hereby certify that this PROVISIONAL PATENT APPLICATION, COVER SHEET & CHECK FOR $150.00
(Identify type of correspondence)
is being deposited with the United States Postal Service "Express Mail Post Office to Addressee" service under
37 CFR 1.10 in an envelope addressed to: The Assistant Commissioner for Patents, Washington, D.C. 20231 on
October 14, 1997
(Date)

Mark Lauer
(Typed or Printed Name of Person Mailing Correspondence)

(Signature of Person Mailing Correspondence)

EH756230105US

Note: Each paper must have its own certificate of mailing.
`
`ALA00138386
`
`
`
INTELLIGENT NETWORK INTERFACE CARD

AND SYSTEM FOR PROTOCOL PROCESSING

Provisional Patent Application Under 35 U.S.C. § 111(b)
`
`Inventors:
`
`Laurence B. Boucher
`Stephen E. J. Blightman
`Peter K. Craft
`David A. Higgin
`Clive M. Philbrick
`Daryl D. Starr
`
`Assignee:
`
Alacritech Corporation
`
`1 Background of the Invention
`
Network processing as it exists today is a costly and inefficient use of system resources.
A 200 MHz Pentium-Pro is typically consumed simply processing network data from a
100Mb/second network connection. The reasons that this processing is so costly are
described here.
`
`1.1 Too Many Data Moves
`
When a network packet arrives at a typical network interface card (NIC), the NIC moves
the data into pre-allocated network buffers in system main memory. From there the data
is read into the CPU cache so that it can be checksummed (assuming of course that the
protocol in use requires checksums. Some, like IPX, do not.). Once the data has been
fully processed by the protocol stack, it can then be moved into its final destination in
memory. Since the CPU is moving the data, and must read the destination cache line in
before it can fill it and write it back out, this involves at a minimum 2 more trips across
the system memory bus. In short, the best one can hope for is that the data will get
moved across the system memory bus 4 times before it arrives in its final destination. It
can, and does, get worse. If the data happens to get invalidated from system cache after it
has been checksummed, then it must get pulled back across the memory bus before it can
be moved to its final destination. Finally, on some systems, including Windows NT 4.0,
the data gets copied yet another time while being moved up the protocol stack. In NT
4.0, this occurs between the miniport driver interface and the protocol driver interface.
This can add up to a whopping 8 trips across the system memory bus (the 4 trips
described above, plus the move to replenish the cache, plus 3 more to copy from the
miniport to the protocol driver). That's enough to bring even today's advanced memory
busses to their knees.
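The trip counting above can be condensed into a back-of-the-envelope tally. This is only a sketch of the path just described; the function and flag names are ours, not from any real driver.

```c
#include <assert.h>

/* Tally of system-memory-bus crossings for one received frame,
 * following the path described above.  Names are illustrative. */
static int bus_trips(int miniport_copy, int cache_invalidated)
{
    int trips = 1;      /* NIC DMAs the frame into network buffers    */
    trips += 1;         /* CPU reads the data into cache to checksum  */
    trips += 2;         /* read destination line in, write it back    */
    if (cache_invalidated)
        trips += 1;     /* checksummed data pulled back across the bus */
    if (miniport_copy)  /* NT 4.0 miniport-to-protocol copy:          */
        trips += 3;     /* read source, read dest line, write dest    */
    return trips;
}
```

The best case is the 4 trips described above; the NT 4.0 worst case reaches 8.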
`
Provisional Pat. App. of Alacritech, Inc.
Inventors Laurence B. Boucher et al.
Express Mail Label # EH756230105US
`
`ALA00138387
`
`
`
`1.2 Too Much Processing by the CPU
`
In all but the original move from the NIC to system memory, the system CPU is
responsible for moving the data. This is particularly expensive because while the CPU is
moving this data it can do nothing else. While moving the data the CPU is typically
stalled waiting for the relatively slow memory to satisfy its read and write requests. A
CPU, which can execute an instruction every 5 nanoseconds, must now wait as long as
several hundred nanoseconds for the memory controller to respond before it can begin its
next instruction. Even today's advanced pipelining technology doesn't help in these
situations because that relies on the CPU being able to do useful work while it waits for
the memory controller to respond. If the only thing the CPU has to look forward to for
the next several hundred instructions is more data moves, then the CPU ultimately gets
reduced to the speed of the memory controller.
`
Moving all this data with the CPU slows the system down even after the data has been
moved. Since both the source and destination cache lines must be pulled into the CPU
cache when the data is moved, more than 3k of instructions and/or data resident in the
CPU cache must be flushed or invalidated for every 1500 byte frame. This is of course
assuming a combined instruction and data second level cache, as is the case with the
Pentium processors. After the data has been moved, the former residents of the cache will
likely need to be pulled back in, stalling the CPU even when we are not performing
network processing. Ideally a system would never have to bring network frames into the
CPU cache, instead reserving that precious commodity for instructions and data that are
referenced repeatedly and frequently.
`
But the data movement is not the only drain on the CPU. There is also a fair amount of
processing that must be done by the protocol stack software. The most obvious expense
is calculating the checksum for each TCP segment (or UDP datagram). Beyond this,
however, there is other processing to be done as well. The TCP connection object must
be located when a given TCP segment arrives, IP header checksums must be calculated,
there are buffer and memory management issues, and finally there is also the significant
expense of interrupt processing, which we will discuss in the following section.
`
`1.3 Too Many Interrupts
`
A 64k SMB request (write or read-reply) is typically made up of 44 TCP segments when
running over Ethernet (1500 byte MTU). Each of these segments may result in an
interrupt to the CPU. Furthermore, since TCP must acknowledge all of this incoming
data, it's possible to get another 44 transmit-complete interrupts as a result of sending out
the TCP acknowledgements. While this is possible, it is not terribly likely. Delayed
ACK timers allow us to acknowledge more than one segment at a time. And delays in
interrupt processing may mean that we are able to process more than one incoming
network frame per interrupt. Nevertheless, even if we assume 4 incoming frames per
interrupt, and an acknowledgement for every 2 segments (as is typical per the ACK-every-
other-segment property of TCP), we are still left with 33 interrupts per 64k SMB request.
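The 33-interrupt estimate can be reproduced with a little arithmetic. A sketch, with an illustrative helper name; the parameters mirror the assumptions just stated.

```c
#include <assert.h>

/* Interrupt estimate for an SMB request: frames of frame_payload bytes,
 * frames_per_intr received frames coalesced per interrupt, and one
 * transmit-complete (ACK) interrupt per segs_per_ack segments. */
static int smb_interrupts(int request_bytes, int frame_payload,
                          int frames_per_intr, int segs_per_ack)
{
    int segments  = (request_bytes + frame_payload - 1) / frame_payload;
    int rx_intrs  = (segments + frames_per_intr - 1) / frames_per_intr;
    int ack_intrs = segments / segs_per_ack;
    return rx_intrs + ack_intrs;
}
```

With the text's assumptions (4 frames per interrupt, ACK every other segment) a 64k request costs 33 interrupts; the pessimistic one-frame, one-ACK case costs 88.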
`
Interrupts tend to be very costly to the system. Often when a system is interrupted,
important information must be flushed or invalidated from the system cache so that the
interrupt routine instructions and needed data can be pulled into the cache. Since the
`
`
`ALA00138388
`
`
`
`
CPU will return to its prior location after the interrupt, it is likely that the information
flushed from the cache will immediately need to be pulled back into the cache.
`
What's more, interrupts force a pipeline flush in today's advanced processors. While the
processor pipeline is an extremely efficient way of improving CPU performance, it can
be expensive to get going again after it has been flushed.
`
`Finally, each of these interrupts results in expensive register accesses across the
`peripheral bus (PCI). This is discussed more in the following section.
`
1.4 Inefficient Use of the Peripheral Bus (PCI)
`
We noted earlier that when the CPU has to access system memory, it may be stalled for
several hundred nanoseconds. When it has to read from PCI, it may be stalled for many
microseconds. This happens every time the CPU takes an interrupt from a standard NIC.
The first thing the CPU must do when it receives one of these interrupts is to read the
NIC Interrupt Status Register (ISR) from PCI to determine the cause of the interrupt. The
most troubling thing about this is that since interrupt lines are shared on PC-based
systems, we may have to perform this expensive PCI read even when the interrupt is not
meant for us!
`
There are other peripheral bus inefficiencies as well. Typical NICs operate using
descriptor rings. When a frame arrives, the NIC reads a receive descriptor from system
memory to determine where to place the data. Once the data has been moved to main
memory, the descriptor is then written back out to system memory with status about the
received frame. Transmit operates in a similar fashion. The CPU must notify the NIC
that it has a new transmit. The NIC will read the descriptor to locate the data, read the
data itself, and then write the descriptor back with status about the send. Typically on
transmits the NIC will then read the next expected descriptor to see if any more data
needs to be sent. In short, each receive or transmit frame results in 3 or 4 separate PCI
reads or writes (not counting the status register read).
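The descriptor-ring scheme described above can be pictured with a hypothetical descriptor layout; the field names are ours, and real NICs differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical receive descriptor for a conventional (dumb) NIC. */
struct rx_descriptor {
    uint64_t buffer_addr;   /* host address where the frame goes     */
    uint16_t buffer_len;    /* capacity of that buffer               */
    uint16_t frame_len;     /* written back by the NIC on completion */
    uint32_t status;        /* ownership bit, error flags, etc.      */
};

/* Per received frame the NIC makes three PCI accesses against this
 * ring: read the descriptor, DMA the frame data, write the descriptor
 * back with status.  The host's ISR read comes on top of that. */
enum { RX_PCI_OPS = 3 };
```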
`
`2 Summary of the Invention
`
Alacritech was formed with the idea that the network processing described above could
be offloaded onto a cost-effective Intelligent Network Interface Card (INIC). With the
Alacritech INIC, we address each of the above problems, resulting in the following
advancements:
1. The vast majority of the data is moved directly from the INIC into its final
destination: a single trip across the system memory bus.
2. There is no header processing, little data copying, and no checksumming required by
the CPU. Because of this, the data is never moved into the CPU cache, allowing the
system to keep important instructions and data resident in the CPU cache.
3. Interrupts are reduced to as few as 4 interrupts per 64k SMB read and 2 per 64k
SMB write.
4. There are no CPU reads over PCI and there are fewer PCI operations per receive or
transmit transaction.
`
`In the remainder of this document we will describe how we accomplish the above.
`
`
`ALA00138389
`
`
`
`
`2.1 Perform Transport Level Processing on the INIC
`
In order to keep the system CPU from having to process the packet headers or checksum
the packet, we must perform this task on the INIC. This is a daunting task. There are
more than 20,000 lines of C code that make up the FreeBSD TCP/IP protocol stack.
Clearly this is more code than could be efficiently handled by a competitively priced
network card. Furthermore, as we've noted above, the TCP/IP protocol stack is
complicated enough to consume a 200 MHz Pentium-Pro. Clearly, in order to perform
this function on an inexpensive card, we need special network processing hardware as
opposed to simply using a general purpose CPU.
`
2.1.1 Only Support TCP/IP
`
In this section we introduce the notion of a "context". A context is required to keep track
of information that spans many, possibly discontiguous, pieces of information. When
processing TCP/IP data, there are actually two contexts that must be maintained. The
first context is required to reassemble IP fragments. It holds information about the status
of the IP reassembly as well as any checksum information being calculated across the IP
datagram (UDP or TCP). This context is identified by the IP_ID of the datagram as well
as the source and destination IP addresses. The second context is required to handle the
sliding window protocol of TCP. It holds information about which segments have been
sent or received, and which segments have been acknowledged, and is identified by the
IP source and destination addresses and TCP source and destination ports.
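As a sketch, the two contexts might carry fields along these lines. The layouts are hypothetical; the real structures are defined elsewhere in this specification.

```c
#include <assert.h>
#include <stdint.h>

/* Context 1: IP reassembly, identified by IP_ID plus the source and
 * destination IP addresses. */
struct ip_reasm_ctx {
    uint32_t src_ip, dst_ip;
    uint16_t ip_id;
    uint16_t frags_pending;     /* reassembly status                 */
    uint32_t partial_csum;      /* running TCP/UDP checksum          */
};

/* Context 2: the TCP sliding window, identified by the full 4-tuple. */
struct tcp_window_ctx {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t snd_nxt, snd_una;  /* next to send / oldest unacked     */
    uint32_t rcv_nxt;           /* next expected receive sequence    */
};
```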
`
If we were to choose to handle both contexts in hardware, we would potentially have to
keep track of many pieces of information. One such example is a case in which a single
64k SMB write is broken down into 44 1500-byte TCP segments, which are in turn
broken down into 131 576-byte IP fragments, all of which can come in any order (though
the maximum window size is likely to restrict the number of outstanding segments
considerably).
`
Fortunately, TCP performs a Maximum Segment Size negotiation at connection
establishment time, which should prevent IP fragmentation in nearly all TCP
connections. The only time that we should end up with fragmented TCP connections is
when there is a router in the middle of a connection which must fragment the segments to
support a smaller MTU. The only networks that use a smaller MTU than Ethernet are
serial line interfaces such as SLIP and PPP. At the moment, the fastest of these
connections only run at 128k (ISDN), so even if we had 256 of these connections, we
would still only need to support 34Mb/sec, or a little over three 10bT connections worth
of data. This is not enough to justify the performance enhancements that the INIC
offers. If this becomes an issue at some point, we may decide to implement the MTU
discovery algorithm, which should prevent TCP fragmentation on all connections (unless
an ICMP redirect changes the connection route while the connection is established).
`
`With this in mind, it seems a worthy sacrifice to not attempt to handle fragmented TCP
`segments on the INIC.
`
`UDP is another matter. Since UDP does not support the notion of a Maximum Segment
`Size, it is the responsibility of IP to break down a UDP datagram into MTU sized
`
`
`ALA00138390
`
`
`
packets. Thus, fragmented UDP datagrams are very common. The most common UDP
application running today is NFSV2 over UDP. While this is also the most common
version of NFS running today, the current version of Solaris being sold by Sun
Microsystems runs NFSV3 over TCP by default. We can expect to see the NFSV2/UDP
traffic start to decrease over the coming years.
`
`In summary, we will only offer assistance to non-fragmented TCP connections on the
`INIC.
`
2.1.2 Don't handle TCP "exceptions"

As noted above, we won't provide support for fragmented TCP segments on the INIC.
We have also opted not to handle TCP connection setup and breakdown. Here is a list of
other TCP "exceptions" which we have elected not to handle on the INIC:
`
`Fragmented Segments - Discussed above.
`
`Retransmission Timeout - Occurs when we do not get an acknowledgement for
`previously sent data within the expected time period.
`
`Out of order segments - Occurs when we receive a segment with a sequence number
`other than the next expected sequence number.
`
`FIN segment - Signals the close of the connection.
`
`
Since we have now eliminated support for so many different code paths, it might seem
hardly worth the trouble to provide any assistance by the card at all. This is not the case.
According to W. Richard Stevens and Gary Wright in their book "TCP/IP Illustrated,
Volume 2", TCP operates without experiencing any exceptions between 97 and 100
percent of the time in local area networks. As network, router, and switch reliability
improve, this number is only likely to improve with time.
`
`
`ALA00138391
`
`
`
`
2.1.3 Two modes of operation
`
So the next question is what to do about the network packets that do not fit our criteria.
The answer is to use two modes of operation: one in which the network frames are
processed on the INIC through TCP, and one in which the card operates like a typical
dumb NIC. We call these two modes fast-path and slow-path. In the slow-path case,
network frames are handed to the system at the MAC layer and passed up through the
host protocol stack like any other network frame. In the fast-path case, network data is
given to the host after the headers have been processed and stripped.
`
[Figure: Fast-path vs. slow-path. On the INIC, the stack (PHYSICAL, MAC, IP, TCP, NetBIOS) is processed on the card, and fast-path data is handed across PCI to the client at the TDI level. In slow-path mode, frames cross PCI at the MAC level and travel up the host's own MAC, IP, and TCP layers.]
`
The transmit case works in much the same fashion. In slow-path mode the packets are
given to the INIC with all of the headers attached. The INIC simply sends these packets
out as if it were a dumb NIC. In fast-path mode, the host gives raw data to the INIC,
which must carve the data into MSS-sized segments, add headers to the segments,
perform checksums on the segments, and then send them out on the wire.
`
`2.1.4 The TCB cache
`
Consider a situation in which a TCP connection is being handled by the card and a
fragmented TCP segment for that connection arrives. In this situation, it will be
necessary for the card to turn control of this connection over to the host.
`
This introduces the notion of a Transmit Control Block (TCB) cache. A TCB is a
structure that contains the entire context associated with a connection. This includes the
source and destination IP addresses and source and destination TCP ports that define the
connection. It also contains information about the connection itself, such as the current
send and receive sequence numbers, the first-hop MAC address, etc. The complete
set of TCBs exists in host memory, but a subset of these may be "owned" by the card at
any given time. This subset is the TCB cache. The INIC can own up to 256 TCBs at any
given time.

TCBs are initialized by the host during TCP connection setup. Once the connection has
achieved a "steady-state" of operation, its associated TCB can then be turned over to the
INIC, putting us into fast-path mode. From this point on, the INIC owns the connection
until either a FIN arrives signaling that the connection is being closed, or until an
`
`ALA00138392
`
`
`
exception occurs which the INIC is not designed to handle (such as an out of order
segment). When any of these conditions occurs, the INIC will flush the TCB back to
host memory, and issue a message to the host telling it that it has relinquished control of
the connection, thus putting the connection back into slow-path mode. From this point
on, the INIC simply hands incoming segments that are destined for this TCB off to the
host with all of the headers intact.
`
`Note that when a connection is owned by the INIC, the host is not allowed to reference
`the corresponding TCB in host memory as it will contain invalid information about the
`state of the connection.
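The hand-back rules above can be condensed into a small decision helper. This is a sketch; the event names are ours.

```c
#include <assert.h>
#include <stdbool.h>

/* Receive events on a fast-path connection.  Anything other than an
 * in-order data segment pushes the connection back to slow path. */
enum rx_event {
    EV_IN_ORDER_SEG,      /* normal fast-path traffic             */
    EV_FIN,               /* connection is being closed           */
    EV_OUT_OF_ORDER_SEG,  /* exception the INIC does not handle   */
    EV_FRAGMENTED_SEG     /* likewise, per section 2.1.1          */
};

/* True when the INIC must flush the TCB back to host memory and tell
 * the host it has relinquished control of the connection. */
static bool must_flush_tcb(enum rx_event ev)
{
    return ev != EV_IN_ORDER_SEG;
}
```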
`
`2.1.5 TCP hardware assistance
`
When a frame is received by the INIC, it must verify it completely before it even
determines whether it belongs to one of its TCBs or not. This includes all header
validation (is it IP, IPv4 or IPv6, is the IP header checksum correct, is the TCP checksum
correct, etc.). Once this is done it must compare the source and destination IP address and
the source and destination TCP port with those in each of its TCBs to determine if it is
associated with one of its TCBs. This is an expensive process. To expedite this, we have
added several features in hardware to assist us. The header is fully parsed by hardware
and its type is summarized in a single status word. The checksum is also verified
automatically in hardware, and a hash key is created out of the IP addresses and TCP
ports to expedite TCB lookup. For full details on these and other hardware optimizations,
refer to the INIC Hardware Specification sections (Heading 8).
`
`
`With the aid of these and other hardware features, much of the work associated with TCP
`is done essentially for free. Since the card will automatically calculate the checksum for
`TCP segments, we can pass this on to the host, even when the segment is for a TCB that
`the INIC does not own.
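As an illustration of the hash-assisted lookup, an XOR-fold over the 4-tuple would do. The INIC's actual hash is defined in the hardware specification, so this particular function is only a stand-in.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative hash key over the connection 4-tuple, folded down to an
 * index into the (up to 256-entry) TCB cache. */
static uint32_t tcb_hash(uint32_t src_ip, uint32_t dst_ip,
                         uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip ^ dst_ip;
    h ^= ((uint32_t)src_port << 16) | dst_port;
    h ^= h >> 16;            /* fold high bits into the low bits */
    return h & 0xff;         /* 256 possible cache slots         */
}
```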
`
`2.1.6 TCP Summary
`
By moving TCP processing down to the INIC we have offloaded the host of a large
amount of work. The host no longer has to pull the data into its cache to calculate the
TCP checksum. It does not have to process the packet headers, and it does not have to
generate TCP ACKs. We have achieved most of the goals outlined above, but we are not
done yet.
`
2.2 Transport Layer Interface

This section defines the INIC's relation to the host's transport layer interface (called TDI,
or Transport Driver Interface, in Windows NT). For full details on this interface, refer to
the Alacritech TCP (ATCP) driver specification (Heading 4).
`
`2.2.1 Receive
`
Simply implementing TCP on the INIC does not allow us to achieve our goal of landing
the data in its final destination. Somehow the host has to tell the INIC where to put the
data. This is a problem in that the host cannot do this without knowing what the data
`
`
`ALA00138393
`
`
`
`
actually is. Fortunately, NT has provided a mechanism by which a transport driver can
"indicate" a small amount of data to a client above it while telling it that it has more data
to come. The client, having then received enough of the data to know what it is, is then
responsible for allocating a block of memory and passing the memory address or
addresses back down to the transport driver, which is in turn responsible for moving the
data into the provided location.
`
We will make use of this feature by providing a small amount of any received data to the
host, with a notification that we have more data pending. When this small amount of data
is passed up to the client, and it returns with the address in which to put the remainder of
the data, our host transport driver will pass that address to the INIC, which will DMA the
remainder of the data into its final destination.
`
Clearly there are circumstances in which this does not make sense. When a small amount
of data arrives (500 bytes for example) with a push flag set indicating that the data must be
delivered to the client immediately, it does not make sense to deliver some of the data
directly while waiting for the list of addresses to DMA the rest. Under these
circumstances, it makes more sense to deliver the 500 bytes directly to the host, and
allow the host to copy it into its final destination. While various ranges are feasible, it is
currently preferred that anything less than a segment's (1500 bytes) worth of data will be
delivered directly to the host, while anything more will be delivered as a small piece,
which may be 128 bytes, while waiting until receiving the destination memory address
before moving the rest.
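A sketch of that delivery policy, using the figures suggested above (the 1500-byte cutoff and 128-byte indication size; the names are ours):

```c
#include <assert.h>
#include <stddef.h>

#define DELIVER_DIRECT_MAX 1500  /* up to a segment's worth: hand it all up */
#define INDICATE_BYTES      128  /* larger receives: indicate this much     */

/* How many bytes to indicate to the host immediately; the remainder,
 * if any, waits for a destination address and is DMA'd there. */
static size_t bytes_to_indicate(size_t pending)
{
    if (pending <= DELIVER_DIRECT_MAX)
        return pending;          /* deliver directly; the host copies it */
    return INDICATE_BYTES;       /* indicate a sliver, DMA the rest      */
}
```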
`
The trick then is knowing when the data should be delivered to the client or not. As
we've noted, a push flag indicates that the data should be delivered to the client
immediately, but this alone is not sufficient. Fortunately, in the case of NetBIOS
transactions (such as SMB), we are explicitly told the length of the session message in the
NetBIOS header itself. With this we can simply indicate a small amount of data to the
host immediately upon receiving the first segment. The client will then allocate enough
memory for the entire NetBIOS transaction, which we can then use to DMA the
remainder of the data into as it arrives. In the case of a large (56k for example) NetBIOS
session message, all but the first couple hundred bytes will be DMA'd to their final
destination in memory.
`
But what about applications that do not reside above NetBIOS? In this case we cannot
rely on a session level protocol to tell us the length of the transaction. Under these
circumstances we will buffer the data as it arrives until A) we have received some
predetermined number of bytes such as 8k, or B) some predetermined period of time
passes between segments, or C) we get a push flag. When any of these conditions occurs,
we will then indicate some or all of the data to the host, depending on the amount of data
buffered. If the data buffered is greater than about 1500 bytes we must then also wait for
the memory address to be returned from the host so that we may then DMA the
remainder of the data.
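Conditions A through C can be folded into one predicate. A sketch; the 8k threshold is the example figure from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define ACCUM_LIMIT (8 * 1024)   /* condition A's example threshold */

/* True when buffered non-NetBIOS receive data should be indicated to
 * the host: enough bytes accumulated (A), a gap between segments (B),
 * or a push flag (C). */
static bool should_indicate(unsigned buffered_bytes, bool segment_gap,
                            bool push_flag)
{
    return buffered_bytes >= ACCUM_LIMIT || segment_gap || push_flag;
}
```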
`
`2.2.2 Transmit
`
`The transmit case is much simpler. In this case the client (NetBIOS for example) issues a
`TDI Send with a list of memory addresses which contain data that it wishes to send along
`
`
`ALA00138394
`
`
`
`
with the length. The host can then pass this list of addresses and the length off to the INIC.
The INIC will then pull the data from its source location in host memory, as it needs it,
until the complete TDI request is satisfied.
`
2.2.3 Effect on interrupts
`
Note that when we receive a large SMB transaction, for example, there are two
interactions between the INIC and the host: the first in which the INIC indicates a small
amount of the transaction to the host, and the second in which the host provides the
memory location(s) in which the INIC places the remainder of the data. This results in
only two interrupts from the INIC: the first when it indicates the small amount of data,
and the second after it has finished filling in the host memory given to it. This is a drastic
reduction from the 33 interrupts per 64k SMB request that we estimated at the beginning
of this section.
`
`On transmit, we actually only receive a single interrupt when the send command that has
`been given to the INIC completes.
`
`2.2.4 Transport Layer Interface Summary
`
`Having now established our interaction with Microsoft's TDI interface, we have achieved
`our goal of landing most of our data directly into its final destination in host memory.
`We have also managed to transmit all data from its original location on host memory.
`And finally, we have reduced our interrupts to 2 per 64k SMB read and 1 per 64k SMB
`write. The only thing that remains in our list of objectives is to design an efficient host
`(PCI) interface.
`
`2.3 Host (PCI) Interface
`
In this section we define the host interface. For a more detailed description, refer to the
"Host Interface Strategy for the Alacritech INIC" section (Heading 3).
`
`2.3.1 Avoid PCI reads
`
One of our primary objectives in designing the host interface of the INIC was to
eliminate PCI reads in either direction. PCI reads are particularly inefficient in that they
completely stall the reader until the transaction completes. As we noted above, this could
hold a CPU up for several microseconds, a thousand times the time typically required to
execute a single instruction. PCI writes, on the other hand, are usually buffered by the
memory-bus-to-PCI bridge, allowing the writer to continue on with other instructions.
This technique is known as "posting".
`
2.3.1.1 Memory-based status register
`
The only PCI read that is required by most NICs is the read of the interrupt status
register. This register gives the host CPU information about what event has caused the
interrupt (if any). In the design of our INIC we have elected to place this necessary status
register into host memory. Thus, when an event occurs on the INIC, it writes the status
register to an agreed upon location in host memory. The corresponding driver on the host
reads this local register to determine the cause of the interrupt. The interrupt lines are
`
`
`ALA00138395
`
`
`
held high until the host clears the interrupt by writing to the INIC's Interrupt Clear
Register. Shadow registers are maintained on the INIC to ensure that events are not lost.
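In driver terms the scheme looks roughly like this. A sketch only; the variable and register names are ours.

```c
#include <assert.h>
#include <stdint.h>

/* The INIC DMAs its status word to an agreed-upon host address, so the
 * interrupt handler reads plain memory instead of issuing a PCI read. */
static volatile uint32_t inic_status_shadow;  /* written by INIC via DMA */

static uint32_t read_isr(void)
{
    uint32_t status = inic_status_shadow;  /* fast host-memory read */
    /* The driver would then clear the interrupt with a posted PCI
     * write to the INIC's Interrupt Clear Register (not modeled). */
    return status;
}
```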
`
`2.3.1.2 Buffer Addresses are pushed to the INIC
`
Since it is imperative that our INIC operate as efficiently as possible, we must also avoid
PCI reads from the INIC. We do this by pushing our receive buffer addresses to the
INIC. As mentioned at the beginning of this section, most NICs work on a descriptor
queue algorithm in which the NIC reads a descriptor from main memory in order to
determine where to place the next frame. We will instead write receive buffer addresses
to the INIC as receive buffers are filled. In order to avoid having to write to the INIC for
every receive frame, we instead allow the host to pass off a page's worth (4k) of buffers in
a single write.
`
`2.3.2 Support small and large buffers on receive
`
In order to further reduce the number of writes to the INIC, and to reduce the amount of
memory being used by the host, we support two different buffer sizes. A small buffer
contains roughly 200 bytes of data payload, as well as extra fields containing status about
the received data, bringing the total size to 256 bytes. We can therefore pass 16 of these
small buffers at a time to the INIC. Large buffers are 2k in size. They are used to
contain any fast or slow-path data that does not fit in a small buffer. Note that when we
have a large fast-path receive, a small buffer will be used to indicate a small piece of the
data, while the remainder of the data will be DMA'd directly into memory. Large
buffers are never passed to the host by themselves; instead they are always accompanied
by a small buffer which contains status about the receive along with the large buffer
address. By operating in this manner, the driver must only maintain and process the small
buffer queue. Large buffers are returned to the host by virtue of being attached to small
buffers. Since large buffers are 2k in size they are passed to the INIC 2 buffers at a t
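The buffer arithmetic above can be checked directly. A sketch using the sizes given in the text; the constant names are ours.

```c
#include <assert.h>

/* Receive-buffer sizes from the text: 256-byte small buffers (~200
 * bytes payload plus status fields) and 2k large buffers, handed to
 * the INIC a 4k page's worth at a time. */
enum {
    HOST_PAGE_BYTES = 4096,
    SMALL_BUF_BYTES = 256,
    LARGE_BUF_BYTES = 2048,
    SMALL_PER_PAGE  = HOST_PAGE_BYTES / SMALL_BUF_BYTES,  /* 16 per write */
    LARGE_PER_PAGE  = HOST_PAGE_BYTES / LARGE_BUF_BYTES   /*  2 per write */
};
```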