Yocum et al.

(10) Patent No.: US 6,308,228 B1
(45) Date of Patent: Oct. 23, 2001

(54) SYSTEM AND METHOD OF ADAPTIVE MESSAGE PIPELINING

(75) Inventors: Kenneth G. Yocum; Jeffrey S. Chase, both of Durham, NC (US)

(73) Assignee: Duke University, Durham, NC (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/197,598

(22) Filed: Nov. 23, 1998

(51) Int. Cl.7: G06F 13/00

(52) U.S. Cl.: 710/52; 710/30; 710/131; 709/250

(58) Field of Search: 709/250; 710/52-57, 710/29-35, 129-131

(56) References Cited

U.S. PATENT DOCUMENTS
`
4,258,418 * 3/1981  Heath ............... 710/53
4,860,193 * 8/1989  Bentley et al. ...... 710/55
5,121,479 * 6/1992  O'Brien ............. 710/34
5,664,116 * 9/1997  Gaytan et al. ....... 709/234
5,671,445 * 9/1997  Gluyas et al. ....... 710/53
5,732,094 * 3/1998  Petersen et al. ..... 714/805
5,864,713 * 1/1999  Terry ............... 395/872
5,944,802 * 8/1999  Bello et al. ........ 710/52
5,991,835 * 11/1999 Mashimo et al. ...... 710/58
6,014,722 * 1/2000  Rudin et al. ........ 710/240
6,015,086 * 1/2000  Ritchie et al. ...... 231/2.5
6,094,695 * 7/2000  Kornher ............. 710/56
`
FOREIGN PATENT DOCUMENTS

9400935 * 1/1994 (WO)

OTHER PUBLICATIONS

Wang et al., "Modeling Communication Pipeline Latency", Proc. of ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Jun. 1998.

Yocum et al., "Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging", Proc. of Sixth IEEE International Symposium on High Performance Distributed Computing, Aug. 1997.

Dittia et al., "The APIC Approach to High Performance Network Interface Design: Protected DMA and Other Techniques", Proceedings of INFOCOM '97, Apr. 7-11, 1997.

Lauria et al., "Efficient Layering for High Speed Communication: Fast Messages 2.x", IEEE, Jul. 1998.

Tezuka et al., "PM: An Operating System Coordinated High Performance Communication Library", Real World Computing Partnership, 1997.

Prylli et al., "Modeling of a High Speed Network to Maximize Throughput Performance: the Experience of BIP over Myrinet", Sep. 26, 1997.

* cited by examiner

Primary Examiner—Christopher B. Shin
(74) Attorney, Agent, or Firm—Jenkins & Wilson, P.A.

(57) ABSTRACT

A system and method termed adaptive message pipelining in which two or more data paths are joined by a data movement device. The data movement device has a data path controller for each incoming and outgoing data path. Adaptive message pipelining is a dynamic policy which adjusts between a store-and-forward data delivery method and a cut-through delivery of data based upon network and I/O bus contention in order to decrease latency and maximize system throughput.

15 Claims, 8 Drawing Sheets
`
[Front-page figure: flowchart of the adaptive message pipelining policy. Wait for data on the incoming data path; initiate the incoming data path controller transfer of the frame to the DMD. If the frame transfer to the DMD is complete, wait until the outgoing data path controller is free and ready to send, then initiate the outgoing transfer of all remaining data of the frame on the DMD. Otherwise, once a minpulse amount of the frame's data has arrived, initiate the outgoing transfer of the total amount of the frame's data which has arrived.]

Oracle-Huawei-NetApp Ex. 1023, pg. 1
`
`
`
[Sheet 1 of 8, FIG. 1: flow of data through a data movement device (DMD) 100, from an incoming data path 110 to an outgoing data path 120.]
`
`
`
[Sheet 2 of 8, FIG. 2 (PRIOR ART): flowchart of store and forward with double buffering. Wait for data on the incoming data path; transfer the entire frame to the DMD; once the outgoing data path controller is free and ready to send, initiate the transfer of the frame to the outgoing data path and wait until that transfer finishes.]
`
`
`
[Sheet 3 of 8, FIG. 3 (PRIOR ART): flowchart of static fixed/variable fragmentation.]
`
`
`
[Sheet 4 of 8, FIG. 4 (PRIOR ART): flowchart of cut-through delivery.]
`
`
`
[Sheet 5 of 8, FIG. 5: flowchart of the adaptive message pipelining methodology of the present invention.]
`
`
`
[Sheet 6 of 8, FIG. 6A: send/receive DMA path through sending host memory 610, sending NIC memory 620, sender PCI I/O bus 630, Myrinet network link 640, receiving NIC memory 650, receiving host memory 660, and receiver PCI I/O bus 670. FIG. 6B: three-stage pipeline model; stage 1: sender PCI (HostTx), stage 2: network links (NetTx), stage 3: receiver PCI (HostRcv).]
`
`
`
[Sheet 7 of 8, FIGS. 7A-7C: timing graphs (HostTx, NetTx, HostRcv versus microseconds) comparing store and forward, optimal fixed, and optimal variable pipelining.]
`
`
`
[Sheet 8 of 8, FIG. 8: timing graph (HostTx, NetTx, HostRcv versus microseconds) illustrating adaptive message pipelining.]
SYSTEM AND METHOD OF ADAPTIVE MESSAGE PIPELINING

TECHNICAL FIELD

The present invention relates to a technique for reducing network latency of messages while delivering high throughput.

BACKGROUND AND RELATED ART
Low latency of messages in a network is typically linked to high end-to-end performance for applications. However, certain computer applications such as network memory or file access require low latency for large messages and high bandwidth or throughput under load in order to perform optimally. Reconciling these conflicting demands requires careful attention to data movement across data buses, network interfaces, and network links.

One method of achieving high data throughput is to send larger data packets and thus reduce per-packet overheads. On the other hand, a key technique for achieving low latency is to fragment data packets or messages and pipeline the fragments through the network, overlapping transfers on the network links and I/O buses. Since it is not possible to do both at once, messaging systems must select which strategy to use.

It is therefore desirable to automatically adapt a fragmentation policy along the continuum between low latency and high bandwidth based on the characteristics of system hardware, application behavior, and network traffic.
A number of prior art systems have used fragmentation/reassembly to reduce network latency on networks whose architecture utilizes a fixed delimited transmission unit such as ATM (asynchronous transfer mode). Fixed fragmentation is a common latency reduction scheme used in the following systems: APIC, Fast Messages (FM), Active Messages (AM), and Basic Interface for Parallelism (BIP). PM (to be discussed) appears to utilize a form of variable fragmentation only on packet transmission. Variable and hierarchical fragmentation were theoretically explored in Wang et al. In contrast, a technique termed cut-through delivery was developed in the Trapeze Myrinet messaging system. Cut-through delivery is a non-static variable fragmentation scheme, meaning fragment sizes can vary at each stage in a pipeline. The following prior-art discussion describes messaging software developed for a Myrinet gigabit network unless otherwise noted.

The design of the APIC (ATM Port Interconnect Controller) network interface card (NIC) specifies implementation of full AAL-5 segmentation and reassembly (SAR) on-chip. The APIC NIC uses fixed-size fragmentation at the cell granularity (48 bytes of data), so it does not store and forward entire frames. Moreover, APIC does not adapt to host/network architectures or to changing conditions on the host or network. (See generally, Dittia et al., The APIC Approach to High Performance Network Interface Design: Protected DMA and Other Techniques, Proceedings of INFOCOM '97, April 1997)
Fast Messages (FM) utilizes fixed-size fragmentation when moving data between the host, NIC, and network link in order to lower the latency of large messages. Though FM uses a streaming interface that allows a programmer to manually pipeline transfers in variably sized fragments to and from host API buffers, API data moves to and from the network interface card in fixed-size fragments. Thus, it is the programmer's task to pipeline packets by making multiple
calls to the application programming interface (API). FM lacks the ability to adapt automatically and transparently to changing host and network characteristics. (See generally, Lauria et al., Efficient Layering for High Speed Communication: Fast Messages 2.x, IEEE, July 1998)
Active Messages (AM) uses a fixed-size fragmentation scheme to reduce latency of medium to large packets. Active Messages, however, is non-adaptive and utilizes store and forward for non-bulk packets as a means for increasing throughput.
Basic Interface for Parallelism (BIP) performs static fixed-size fragmentation on the adapter. BIP, however, adjusts the fragment size depending on the size of the entire packet. When a packet is sent, fragment size is determined by a table look-up as indexed by the packet's length. BIP, while statically adaptive to packet size, does not adjust dynamically to changing host and network characteristics. (See generally, Prylli et al., Modeling of a High Speed Network to Maximize Throughput Performance: The Experience of BIP over Myrinet, September 1997)
The Real World Computing Partnership has developed a messaging package, also for Myrinet, called PM which implements fragmentation on the adapter for sending in a technique they term immediate sending. Double buffering is used for receiving. It is unclear from their current documents exactly what form of fragmentation constitutes immediate sending, but it appears to be a form of variable fragmentation. Moreover, their technique is limited since PM claims it is not possible to perform immediate sending on the reception of a packet. (See generally, Tezuka et al., PM: An Operating System Coordinated High Performance Communication Library, Real World Computing Partnership)

In a theoretical approach, Wang et al. examines variable sized and hierarchical fragmentation pipelining strategies. Hierarchical fragmentation is one scheme in which a fragmentation schedule may change in different pipeline stages; it is not a static pipelining method. The theory rests on different assumptions than the present invention, adaptive message pipelining. Wang et al. assumes that gi (fixed transfer overhead) and Gi (time per unit of data) values are fixed and previously known, so that both static and non-static pipeline schedules can be computed beforehand, and therefore are not adaptable to changing conditions. Neither does Wang et al. consider throughput as a goal in any of their studied pipelining strategies. (See generally, Modeling and Optimizing Communication Pipelines, Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS, June 1998)
Cut-through delivery, disclosed in a previously published paper, is a variable sized fragmentation scheme in which the schedules are not static for any stage of a pipeline. Cut-through delivery alone, however, is unable to adjust to inter-packet pipelining and therefore cannot extract the maximum bandwidth from the underlying system hardware. (Yocum et al., Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging, Proc. of Sixth IEEE International Symposium on High Performance Distributed Computing, August 1997)

The approach of the present invention differs from the prior art in several ways. First, pipelining is implemented on the network interface card (NIC), transparently to the hosts and host network software. Second, selection of transfer sizes at each stage is automatic, dynamic, and adaptive to congestion conditions encountered within the pipeline. The schedules, therefore, are variable and non-static. Third, the user of the API does not need to know anything about the
`
`
`
`
hardware or network characteristics or load in order to achieve both low latency and high bandwidth.

SUMMARY OF THE INVENTION

The present invention describes adaptive message pipelining (AMP), which is a scheme that reduces network latency of messages larger than minpulse, a pre-defined threshold amount of data, while delivering maximum throughput of high-speed networks and I/O buses under load. Adaptive message pipelining for the present invention is supported in part by the previously discussed local policy called cut-through delivery implemented within a network interface card (NIC). AMP, as opposed to ordinary cut-through delivery, automatically adjusts to congestion, yielding peak throughput for streams of messages without the need for a separate bulk data transfer mechanism.

The present invention's reference implementation was coded as a simple policy in firmware running on a Myrinet network adapter. This local policy, which will be referred to as AMP, combines low network latency with high throughput through careful pipelining of the movement of data between network links and host memory.

Adaptive message pipelining improves network performance by managing the data transfers involved in sending and receiving data. The benefits of adaptive message pipelining include a significant decrease in latency for medium to large packets when the network or I/O buses are lightly loaded, and maximized link and I/O bus utilization when the network or I/O buses are congested. Pipelining data within a frame (intra-packet) is a technique used to decrease latency. Inter-packet pipelining, i.e., pipelining between frames of data, commonly achieved through the technique of double buffering, maximizes the throughput of the connection. Between these two extremes is a continuum of pipelining alternatives. AMP adjusts or manages the fragmentation policy across the entire range, achieving its lowest latency as a result of intra-packet pipelining and its highest throughput as a result of inter-packet pipelining.

AMP utilizes the techniques of cut-through delivery and double buffering in a novel way within a local policy that manages the data transfers. The policy begins by transferring frames via cut-through delivery but, as conditions become more congested, falls out of cut-through delivery and into store and forward in order to deliver high bandwidth.

It is therefore the object of the present invention to adapt the management policy between intra-packet and inter-packet pipelining based on host and network characteristics. Thus, messaging systems can achieve both low latency and high bandwidth or throughput.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram illustrating the flow of data through a data movement device (DMD) as it receives data from an incoming data path and sends it out through an outgoing data path;

FIG. 2 is a block diagram illustrating the prior art methodology of store and forward with double buffering;

FIG. 3 is a block diagram illustrating the prior art methodology of static fixed/variable fragmentation;

FIG. 4 is a block diagram illustrating the prior art methodology of cut-through delivery;

FIG. 5 is a block diagram illustrating the adaptive message pipelining methodology of the present invention;

FIG. 6A is a schematic diagram illustrating four send and receive Direct Memory Access (DMA) transfers through DMA engines on the network interface card;

FIG. 6B is a schematic diagram illustrating transfers between stages in a three-stage pipeline model;

FIGS. 7A-7C are graphs illustrating how message pipelining alternatives (store and forward, static fixed, and static variable) decrease latency;

FIG. 8 is a graph illustrating the present invention of adaptive message pipelining; and

FIG. 9 is a graph illustrating a snapshot of AMP derived from the reference implementation in action for a stream of 8 KB frames.

DETAILED DESCRIPTION OF THE INVENTION

The present invention will now be described with reference to the accompanying drawings in which like reference numbers represent like elements. For reference purposes, a glossary of key terms is presented.

Adaptive Message Pipelining (AMP): A scheme for reducing network latency for messages larger than a predetermined size (minpulse) while delivering maximum throughput (bandwidth) of high-speed networks and I/O buses under load.

Buffer: An area for data to be transferred to and from with a data path controller. In AMP, buffering is on the data movement device or DMD.

Cut-Through Delivery: A variable sized fragmentation scheme in which schedules are not static for any stage of a pipeline.

Data Movement Device (DMD): Any device connected to two or more data paths, on which resides buffering for data and data path controllers. Examples include network interface cards, CPUs, network switches, etc.

Data Path Controller (DPC): Any device which controls the movement of data along a data path, e.g., DMA engine, programmed I/O controller, etc.

Data Path: Any path along which data travels, i.e., memory buses, I/O buses, network links, etc.

Delimited Fragmentation: Any fragmentation scheme in which each fragment of the original frame is also transferred as a frame. Also referred to by those skilled in the art as packetization.

Fixed Fragmentation: A fragmentation scheme in which each fragment is the same size.

Fragmentation Schemes: Any method by which a frame of data is fragmented into, usually, smaller sized transfers in order to pipeline the movement of data across multiple data paths.

Frame: A delimited unit of data of any length such as a packet, message, I/O block, etc.

Inter-Packet Pipelining: Pipelining between two separate packets or frames, commonly achieved through double buffering.

Intra-Packet Pipelining: Pipelining within one packet or frame, commonly achieved with fixed size fragmentation.

Minpulse: A pre-defined threshold amount of data. Cut-through delivery waits until a minpulse amount of data has arrived on the DMD before initiating an outgoing data path controller transfer.

Static Fragmentation: The schedule of fragments for a frame is the same for each stage in the pipeline. The schedule can either be fixed or variable.

Variable Fragmentation: A fragmentation scheme in which each fragment can be a different size.
`
Adaptive message pipelining (AMP) is a technique which minimizes latency and maximizes throughput when transferring data across multiple data paths. AMP is a unique combination of three techniques or policies: (i) store and forward, (ii) cut-through delivery, and (iii) double buffering. Store and forward is a data movement technique which requires that an entire frame of data completely finish a current stage transfer before beginning a transfer to the next
stage. The chief benefit of store and forward is increased throughput due to reduction in overheads. Cut-through delivery is a technique used to decrease latency in which a frame of data need not complete its current stage before parts of the frame can be transferred to the next stage. This is an attribute of cut-through delivery called intra-packet pipelining (defined above). Double buffering is a technique using two or more data buffers that permits a store and forward architecture to overlap the transfer and emission of two data frames simultaneously. This is sometimes referred to as inter-packet pipelining.
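The tradeoff among these policies can be sketched numerically. The following is a minimal latency model, assuming a fixed per-transfer overhead `o` and a per-byte cost `G` at every stage (illustrative symbols in the spirit of the Wang et al. discussion above, not parameters defined by the present invention):

```python
import math

def store_and_forward_latency(frame_bytes, stages, o, G):
    # Every stage waits for the whole frame: few transfers and few
    # overheads, but no overlap between stages.
    return stages * (o + frame_bytes * G)

def fixed_fragment_latency(frame_bytes, stages, o, G, frag_bytes):
    # The first fragment ripples through all stages; the remaining
    # fragments then drain out of the last stage back to back.
    n_frags = math.ceil(frame_bytes / frag_bytes)
    per_frag = o + frag_bytes * G
    return stages * per_frag + (n_frags - 1) * per_frag
```

With, say, an 8 KB frame, three stages, and illustrative costs, fragmenting into 1 KB pieces cuts latency substantially; shrinking fragments further, however, makes the overhead term dominate, which is exactly the throughput cost that store and forward avoids.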
FIG. 1 illustrates an example of data flow through a data movement device (DMD) 100 as it receives data from an incoming data path 110 and sends it out through an outgoing data path 120. While only one incoming and one outgoing data path have been illustrated in FIG. 1, a DMD can have multiple data paths. Adaptive message pipelining is essentially the policy of handling the data as it enters and exits the data movement device 100. A data movement device, for purposes of the present invention, is a device which joins data paths. A data path is any path along which data travels. Moreover, a data movement device possesses data buffer(s), data path controller(s), and a mechanism by which to be notified when a data path controller has completed a data transfer (interrupt or polling capabilities are sufficient). The data movement devices of the present invention also have the ability to utilize data that has arrived as a result of a transfer in progress. Another capability of the data movement device is the ability to monitor the progress of a data path controller's transfer, i.e., to note the amount of data transferred to date.

A data path controller is the mechanism by which data is transmitted across a data path. Data path controllers include, but are not limited to, direct memory access devices or programmed I/O devices. Data path controllers must be able to initiate concurrent transfers of data on data paths. Thus, data path controllers according to the present invention have the ability to (i) initiate and terminate concurrent transfers, with respect to the data movement device, (ii) notify the data movement device of a completed transfer, (iii) allow the data movement device to utilize data that has arrived as a result of a transfer in progress, (iv) monitor the progress of a transfer, and (v) delimit frames of data.
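The five abilities above can be captured in a small simulated controller object. This is a sketch only; the class and method names are hypothetical and do not come from the patent:

```python
class DataPathController:
    """Simulated data path controller (DPC); illustrative names only."""

    def __init__(self, name):
        self.name = name
        self.busy = False
        self.total = 0
        self.bytes_done = 0

    def initiate(self, total_bytes):
        # (i) start a transfer that proceeds concurrently with others
        self.busy, self.total, self.bytes_done = True, total_bytes, 0

    def advance(self, nbytes):
        # Simulation hook standing in for the hardware making progress.
        self.bytes_done = min(self.total, self.bytes_done + nbytes)
        if self.bytes_done == self.total:
            self.busy = False   # (ii) completion observable by poll/interrupt

    def progress(self):
        # (iv) monitor how much data has moved so far; the DMD may
        # (iii) use those already-arrived bytes while the transfer runs.
        return self.bytes_done

    def frame_complete(self):
        # (v) a finished transfer delimits the frame boundary.
        return not self.busy
```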
FIG. 2 is a block diagram illustrating the prior art methodology of store and forward with double buffering. A data movement device (DMD) checks to see if data is ready to be transferred on the incoming data path as illustrated by block 200. When there is data ready to be transferred, block 210 initiates a transfer of the entire frame of data to the data movement device once a data buffer becomes available. A frame is a delimited unit of data such as a packet, message, I/O request, etc., and can be any length. Next, the data movement device checks to see if the frame transfer to the DMD is complete as illustrated in block 220. If not, the data movement device waits until the frame transfer is complete. When complete, two events take place simultaneously as illustrated by the dotted line and solid line exiting block 220. One event tells the DMD to start over again at block 200 and check whether there is more data ready on the incoming data path. The other event causes the DMD to test the availability of the outgoing data path controller illustrated in block 230. Once the outgoing data path controller is free and ready to send, block 230 succeeds and the DMD transfers the frame on the outgoing link in block 240. The DMD then waits until the outgoing transfer is complete as illustrated in block 250 before returning to block 220 to await another frame of data to arrive on the DMD from the incoming data path.
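The loop of blocks 200-250 can be sketched in a few lines, modeling frames as byte strings and the DMD's buffers as a two-slot queue. This is a simulation of the policy, not the firmware, and the function name is illustrative:

```python
from collections import deque

def store_and_forward(incoming_frames, n_buffers=2):
    # Each frame must fully arrive on the DMD (blocks 200-220) before any
    # of it is sent (blocks 230-250); with two buffers the next frame can
    # arrive while the previous one drains (double buffering).
    dmd = deque()
    delivered = []
    for frame in incoming_frames:
        if len(dmd) == n_buffers:        # no buffer free: drain one first
            delivered.append(dmd.popleft())
        dmd.append(bytes(frame))         # whole-frame transfer to the DMD
    while dmd:                           # send the remaining buffered frames
        delivered.append(dmd.popleft())
    return delivered
```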
`
FIG. 3 is a block diagram illustrating the prior art methodology of static fixed and variable fragmentation. This process begins much the same as store and forward. A data movement device (DMD) checks to see if data is ready to be transferred on the incoming data path as illustrated by block 300. When there is data ready to be transferred, block 310 initiates the transfer of the frame of data to the DMD. In these schemes, however, the entire frame need not have arrived into the DMD prior to initiating a transfer from the DMD to the outgoing data path. Block 320 queries whether the incoming frame transfer to the DMD has completed. If it has, block 360 checks to make sure the data path controller for the outgoing path is free and ready to send. Block 370 initiates the transfer of as many individual fragments as make up the remaining data on the DMD, i.e., the DMD transmits the rest of the fragmentation schedule for that frame.

The DMD then waits for that transfer to complete before beginning the cycle anew. If block 320 discovers that the transfer to the DMD has not completed, then block 330 checks to see if the specified fragment size has been received. If not, the DMD waits until an amount of data equal to the fragment size has been received while checking for the end of the transfer to the DMD. When the amount of data meets the requisite fragment size, the DMD checks the status of the outgoing data path controller in block 340. If that data path controller is free and ready to send, then block 350 transfers a fragment to the outgoing data path.

The data movement device then returns to block 320 to see if enough new data has arrived for the next outgoing data path transfer. In static fixed/variable fragmentation the outgoing transfers are always equal to the scheduled fragment size (except perhaps the last transfer, as the size of the frame may not be evenly divisible by the fragment size).
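The resulting outgoing schedule for the fixed case can be sketched directly: fixed-size fragments, with only the final transfer allowed to be short. The function name is illustrative:

```python
def fixed_fragmentation_schedule(frame, frag_size):
    # Blocks 330-350 forward one fragment each time frag_size bytes have
    # arrived; the flush at end of frame may produce a short last fragment.
    return [frame[i:i + frag_size] for i in range(0, len(frame), frag_size)]
```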
FIG. 4 is a block diagram illustrating the prior art methodology of cut-through delivery. This process begins much the same as store and forward and static fixed/variable fragmentation. A data movement device (DMD) checks to see if data is ready to be transferred on the incoming data path as illustrated by block 400. When there is data ready to be transferred, block 410 initiates the transfer of the frame of data to the DMD. In these schemes, however, the entire frame need not have arrived into the DMD prior to initiating a transfer from the DMD to the outgoing data path. Block 420 queries whether the incoming transfer to the DMD has completed. At this point everything works the same as in FIG. 3 except for the following. If the transfer of the frame to the DMD has completed, then block 470 will issue a transfer to the outgoing data path of all remaining data of the frame on the DMD. If the incoming transfer of the frame has not completed, then block 430 checks to see if a minpulse amount of data of the frame, arriving on the DMD from the incoming data path, has been reached or exceeded. As in FIG. 3, block 440 waits until the outgoing data path is free and ready to send. Then block 450 initiates the transfer of one fragment whose size is equal to the amount of data of the frame that has arrived on the DMD, from the incoming data path, at that point in time.
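The schedule produced by blocks 420-470 can be sketched as follows. Here `arrivals` gives the cumulative bytes of the frame present on the DMD at successive polling instants; this is an illustrative simulation that assumes the outgoing controller is always ready when checked:

```python
def cut_through_schedule(arrivals, minpulse):
    # `arrivals` holds the cumulative bytes of the frame present on the
    # DMD at successive polling instants; the last entry is the full frame.
    sent = 0
    fragments = []
    for arrived in arrivals[:-1]:             # incoming transfer still running
        if arrived - sent >= minpulse:        # block 430: minpulse reached?
            fragments.append(arrived - sent)  # block 450: send what is here
            sent = arrived
    total = arrivals[-1]                      # block 420: incoming finished
    if total > sent:
        fragments.append(total - sent)        # block 470: flush the remainder
    return fragments
```

Note that fragment sizes vary with arrival timing, which is what makes cut-through delivery a non-static variable fragmentation scheme.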
FIG. 5 is a block diagram illustrating the methodology of the present invention, namely, adaptive message pipelining (AMP). Just as in store and forward, static fixed/variable fragmentation, and cut-through delivery, adaptive message pipelining performs the same two initial steps. The data movement device (DMD) checks to see if data is ready to be transferred on the incoming data path as illustrated by block 500. When there is data ready to be transferred, block 510 initiates a transfer of the frame of data to the data movement
`
device once a data buffer becomes available. Block 520 checks to see if the entire frame transfer has completed. At this point everything works the same as in FIG. 4 except for the following. When block 520 discovers that the entire frame transfer has completed, control passes to both blocks 560 and 500 (as indicated by the solid and dashed lines). When block 580 discovers that the last of the frame has completed its transfer to the outgoing data path, then control is given to block 520. These two key differences allow double buffering to be used with cut-through delivery. Data path controllers for moving data to the DMD from the incoming data path and for moving data from the DMD to the outgoing data path may now operate in parallel for either the same or different frames.
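Combining the two differences above with cut-through delivery gives, in rough outline, the adaptive behavior: when the outgoing path is idle, the frame streams out in minpulse-sized pulses; when the outgoing path stays busy, data accumulates on the DMD and the frame degenerates toward a single store-and-forward transfer. The following is a simulation-level sketch with hypothetical inputs, not the firmware policy itself:

```python
def amp_fragments(arrivals, minpulse, outgoing_busy):
    # `arrivals`: cumulative bytes of the frame on the DMD per instant;
    # `outgoing_busy[i]`: whether the outgoing controller is occupied then.
    sent, fragments = 0, []
    for arrived, busy in zip(arrivals[:-1], outgoing_busy):
        if not busy and arrived - sent >= minpulse:
            fragments.append(arrived - sent)   # cut-through pulse
            sent = arrived
    total = arrivals[-1]
    if total > sent:
        fragments.append(total - sent)         # remainder, or the whole
                                               # frame under congestion
    return fragments
```

An uncongested outgoing path yields several small fragments (low latency); a congested one yields one large transfer (fewer overheads, high bandwidth), with no explicit mode switch anywhere in the policy.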
Adaptive message pipelining's reference implementation exists within the Trapeze messaging system for Myrinet. It is implemented as a host-side API and a firmware program for Myrinet network interface cards. In this implementation of AMP, the network interface card (NIC) is the data movement device (DMD), the host's I/O bus is both an incoming and an outgoing data path, and the network link is both an incoming and an outgoing data path. Direct Memory Access (DMA) engines, which reside on the network interface card, possess sufficient capabilities for adaptive message pipelining. In particular, the DMA engines allow transfers to and from the network interface card to occur simultaneously. They also throw interrupts (on the network interface card) when transfers are complete, allow data arriving on the network interface card to be utilized during a transfer, allow the network interface card to view the number of bytes which have arrived so far, and can delimit outgoing transfers via tail-flits (to the network link) or PCI I/O bus transfer completions.
Adaptive message pipelining reduces latency by overlapping stages of data transfers for frames. In a typical network, such as Myrinet, four Direct Memory Access (DMA) transfers through DMA engines on the network interface card (NIC) are required to both send and receive a packet. This is illustrated in FIG. 6A. The first transfer is from host memory 610 to NIC memory 620 across the sending host's peripheral component interconnect (PCI) I/O bus 630. Second is the transfer from NIC memory 620 to the network link 640. The third and fourth transfers are symmetric to the second and first respectively, i.e., network link 640 to NIC memory 650, and NIC memory 650 to host memory 660 across the receiving host's PCI I/O bus 670. The corresponding network interface card DMA engines are referred to as HostTx, NetTx, NetRcv, and HostRcv respectively. Transfers through components that never introduce store-and-forward delays, such as the host/PCI bridges, the NetRcv engine, and the wormhole-routed Myrinet network switches, are ignored in this analysis. Network interface card firmware controls all transfers between stages in the resulting three-stage pipeline model illustrated in FIG. 6B.
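The three-stage model lends itself to the classic pipeline recurrence: a fragment can start in a stage only once that stage has finished its previous fragment and the fragment has left the preceding stage. The following sketch uses illustrative per-stage bandwidths in bytes per microsecond, not measured values from the patent:

```python
def pipelined_finish_time(fragments, bandwidths):
    # done[s] = time at which stage s finished its most recent fragment.
    done = [0.0] * len(bandwidths)
    for frag in fragments:
        leave_prev = 0.0                      # fragment is ready at t = 0
        for s, bw in enumerate(bandwidths):
            start = max(done[s], leave_prev)  # stage free AND data present
            done[s] = start + frag / bw
            leave_prev = done[s]
    return done[-1]
```

For an 8 KB frame, a single transfer (pure store and forward) takes the sum of the three stage times, while eight 1 KB fragments finish much earlier because the HostTx, NetTx, and HostRcv engines overlap.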
FIGS. 7A-7C illustrate how fragmentation decreases latency. The simplistic approach, illustrated in FIG. 7A, is to send the message as a single packet with store-and-forward d