(12) United States Patent
Yocum et al.

(10) Patent No.: US 6,308,228 B1
(45) Date of Patent: Oct. 23, 2001

(54) SYSTEM AND METHOD OF ADAPTIVE MESSAGE PIPELINING

(75) Inventors: Kenneth G. Yocum; Jeffrey S. Chase, both of Durham, NC (US)

(73) Assignee: Duke University, Durham, NC (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/197,598
(22) Filed: Nov. 23, 1998

(51) Int. Cl.7 ............ G06F 13/00
(52) U.S. Cl. ............ 710/52; 710/30; 710/131; 709/250
(58) Field of Search ............ 709/250; 710/52-57, 710/29-35, 129-131

(56) References Cited

U.S. PATENT DOCUMENTS

4,258,418 * 3/1981 Heath .......... 710/53
4,860,193 * 8/1989 Bentley et al. .......... 710/55
5,121,479 * 6/1992 O'Brien .......... 710/34
5,664,116 * 9/1997 Gaytan et al. .......... 709/234
5,671,445 * 9/1997 Gluyas et al. .......... 710/53
5,732,094 * 3/1998 Petersen et al. .......... 714/805
5,864,713 * 1/1999 Terry .......... 395/872
5,944,802 * 8/1999 Bello et al. .......... 710/52
5,991,835 * 11/1999 Mashimo et al. .......... 710/58
6,014,722 * 1/2000 Rudin et al. .......... 710/240
6,015,086 * 1/2000 Ritchie et al. .......... 231/2.5
6,094,695 * 7/2000 Kornher .......... 710/56

FOREIGN PATENT DOCUMENTS

9400935 * 1/1994 (WO)

OTHER PUBLICATIONS

Wang et al., "Modeling Communication Pipeline Latency", Proc. of ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS, Jun. 1998.
Yocum et al., "Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging", Proc. of Sixth IEEE International Symposium on High Performance Distributed Computing, Aug. 1997.
Dittia et al., "The APIC Approach to High Performance Network Interface Design: Protected DMA and Other Techniques", Proceedings of INFOCOM '97, Apr. 7-11, 1997.
Lauria et al., "Efficient Layering for High Speed Communication: Fast Messages 2.x", IEEE, Jul. 1998.
Tezuka et al., "PM: An Operating System Coordinated High Performance Communication Library", Real World Computing Partnership, 1997.
Prylli et al., "Modeling of a High Speed Network to Maximize Throughput Performance: the Experience of BIP over Myrinet", Sep. 26, 1997.

* cited by examiner

Primary Examiner: Christopher B. Shin
(74) Attorney, Agent, or Firm: Jenkins & Wilson, P.A.

(57) ABSTRACT

A system and method termed adaptive message pipelining in which two or more data paths are joined by a data movement device. The data movement device has a data path controller for each incoming and outgoing data path. Adaptive message pipelining is a dynamic policy which adjusts between a store and forward data delivery method and a cut-through delivery of data based upon network and I/O bus contention in order to decrease latency and maximize system throughput.

15 Claims, 8 Drawing Sheets
[Front-page representative figure: flowchart of the adaptive message pipelining method (see FIG. 5). Recoverable steps: Is data ready on incoming data path? Initiate incoming data path controller transfer of frame to DMD. Is frame transfer to DMD finished? If yes: is the outgoing data path controller free and ready to send? If so, initiate outgoing data path controller transfer of all remaining data of the frame on the DMD; wait until the transfer to the outgoing data path is finished. If the frame transfer is not finished: has minpulse been reached? If so, wait until the outgoing data path controller is free and ready to send, then initiate an outgoing data path controller transfer of the total amount of the frame's data which has arrived.]
[Drawing sheets 1 through 8 appear here as scanned images; only the following information is recoverable.]

Sheet 1, FIG. 1: A data movement device (DMD) 100 receiving data from an incoming data path 110 and sending it out through an outgoing data path 120.
Sheet 2, FIG. 2 (PRIOR ART): Flowchart of store and forward with double buffering.
Sheet 3, FIG. 3 (PRIOR ART): Flowchart of static fixed/variable fragmentation.
Sheet 4, FIG. 4 (PRIOR ART): Flowchart of cut-through delivery.
Sheet 5, FIG. 5: Flowchart of adaptive message pipelining.
Sheet 6, FIG. 6A: Sending host (CPU, memory 610, host/PCI bridge, PCI bus 630), sending NIC 620, Myrinet network 640, receiving NIC 650, and receiving host (memory 660, host/PCI bridge, PCI bus 670). FIG. 6B: Three-stage pipeline model; stage 1: sender PCI (HostTx), stage 2: network links (NetTx/NetRcv), stage 3: receiver PCI (HostRcv).
Sheet 7, FIGS. 7A-7C: Latency timelines in microseconds for HostTx, NetTx, and HostRcv under store and forward (FIG. 7A), static fixed fragmentation (FIG. 7B, "Optimal Fixed"), and static variable fragmentation (FIG. 7C, "Optimal Variable").
Sheet 8, FIG. 8: Latency timelines in microseconds (HostTx, NetTx, HostRcv) for adaptive message pipelining.
SYSTEM AND METHOD OF ADAPTIVE MESSAGE PIPELINING

TECHNICAL FIELD

The present invention relates to a technique for reducing network latency of messages while delivering high throughput.
BACKGROUND AND RELATED ART

Low latency of messages in a network is typically linked to high end-to-end performance for applications. However, certain computer applications such as network memory or file access require low latency for large messages and high bandwidth or throughput under load in order to perform optimally. Reconciling these conflicting demands requires careful attention to data movement across data buses, network interfaces, and network links.

One method of achieving high data throughput is to send larger data packets and thus reduce per-packet overheads. On the other hand, a key technique for achieving low latency is to fragment data packets or messages and pipeline the fragments through the network, overlapping transfers on the network links and I/O buses. Since it is not possible to do both at once, messaging systems must select which strategy to use.

It is therefore desirable to automatically adapt a fragmentation policy along the continuum between low latency and high bandwidth based on the characteristics of system hardware, application behavior, and network traffic.
A number of prior art systems have used fragmentation/reassembly to reduce network latency on networks whose architecture utilizes a fixed delimited transmission unit such as ATM (asynchronous transfer mode). Fixed fragmentation is a common latency reduction scheme used in the following systems: APIC, Fast Messages (FM), Active Messages (AM), and Basic Interface for Parallelism (BIP). PM (to be discussed) appears to utilize a form of variable fragmentation only on packet transmission. Variable and hierarchical fragmentation were theoretically explored in Wang et al. In contrast, a technique termed cut-through delivery was developed in the Trapeze Myrinet messaging system. Cut-through delivery is a non-static variable fragmentation scheme, meaning fragment sizes can vary at each stage in a pipeline. The following prior-art discussion describes messaging software developed for a Myrinet gigabit network unless otherwise noted.
The design of the APIC (ATM Port Interconnect Controller) network interface card (NIC) specifies implementation of full AAL-5 segmentation and reassembly (SAR) on-chip. The APIC NIC uses fixed-size fragmentation at the cell granularity (48 bytes of data), so it does not store and forward entire frames. Moreover, APIC does not adapt to host/network architectures or to changing conditions on the host or network. (See generally, Dittia et al., The APIC Approach to High Performance Network Interface Design: Protected DMA and Other Techniques, Proceedings of INFOCOM '97, April 1997.)
Fast Messages (FM) utilizes fixed-size fragmentation when moving data between the host, NIC, and network link in order to lower the latency of large messages. Though FM uses a streaming interface that allows a programmer to manually pipeline transfers in variably sized fragments to and from host API buffers, API data moves to and from the network interface card in fixed-size fragments. Thus, it is the programmer's task to pipeline packets by making multiple calls to the application programming interface (API). FM lacks the ability to adapt automatically and transparently to changing host and network characteristics. (See generally, Lauria et al., Efficient Layering for High Speed Communication: Fast Messages 2.x, IEEE, July 1998.)
Active Messages (AM) uses a fixed-size fragmentation scheme to reduce latency of medium to large packets. Active Messages, however, is non-adaptive and utilizes store and forward for non-bulk packets as a means for increasing throughput.
Basic Interface for Parallelism (BIP) performs static fixed-size fragmentation on the adapter. BIP, however, adjusts the fragment size depending on the size of the entire packet. When a packet is sent, fragment size is determined by a table look-up as indexed by the packet's length. BIP, while statically adaptive to packet size, does not adjust dynamically to changing host and network characteristics. (See generally, Prylli et al., Modeling of a High Speed Network to Maximize Throughput Performance: The Experience of BIP over Myrinet, September 1997.)
The Real World Computing Partnership has developed a messaging package, also for Myrinet, called PM, which implements fragmentation on the adapter for sending in a technique they term immediate sending. Double buffering is used for receiving. It is unclear from their current documents exactly what form of fragmentation constitutes immediate sending, but it appears to be a form of variable fragmentation. Moreover, their technique is limited since PM claims it is not possible to perform immediate sending on the reception of a packet. (See generally, Tezuka et al., PM: An Operating System Coordinated High Performance Communication Library, Real World Computing Partnership, 1997.)

In a theoretical approach, Wang et al. examines variable sized and hierarchical fragmentation pipelining strategies. Hierarchical fragmentation is one scheme in which a fragmentation schedule may change in different pipeline stages; it is not a static pipelining method. The theory rests on different assumptions than the present invention, adaptive message pipelining. Wang et al. assumes that the g (fixed transfer overhead) and G (time per unit of data) values for each stage are fixed and previously known, so that both static and non-static pipeline schedules can be computed beforehand, and therefore are not adaptable to changing conditions. Neither does Wang et al. consider throughput as a goal in any of their studied pipelining strategies. (See generally, Wang et al., Modeling and Optimizing Communication Pipelines, Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS, June 1998.)
Cut-through delivery, disclosed in a previously published paper, is a variable sized fragmentation scheme in which the schedules are not static for any stage of a pipeline. Cut-through delivery alone, however, is unable to adjust to inter-packet pipelining and therefore cannot extract the maximum bandwidth from the underlying system hardware. (Yocum et al., Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging, Proc. of Sixth IEEE International Symposium on High Performance Distributed Computing, August 1997.)
The approach of the present invention differs from the prior art in several ways. First, pipelining is implemented on the network interface card (NIC), transparently to the hosts and host network software. Second, selection of transfer sizes at each stage is automatic, dynamic, and adaptive to congestion conditions encountered within the pipeline. The schedules, therefore, are variable and non-static. Third, the user of the API does not need to know anything about the hardware or network characteristics or load in order to achieve both low latency and high bandwidth.

SUMMARY OF THE INVENTION

The present invention describes adaptive message pipelining (AMP), which is a scheme that reduces network latency of messages larger than minpulse, a pre-defined threshold amount of data, while delivering maximum throughput of high-speed networks and I/O buses under load. Adaptive message pipelining for the present invention is supported in part by the previously discussed local policy called cut-through delivery implemented within a network interface card (NIC). AMP, as opposed to ordinary cut-through delivery, automatically adjusts to congestion, yielding peak throughput for streams of messages without the need for a separate bulk data transfer mechanism.

The present invention's reference implementation was coded as a simple policy in firmware running on a Myrinet network adapter. This local policy, which will be referred to as AMP, combines low network latency with high throughput through careful pipelining of the movement of data between network links and host memory.

Adaptive message pipelining improves network performance by managing the data transfers involved in sending and receiving data. The benefits of adaptive message pipelining include a significant decrease in latency for medium to large packets when the network or I/O buses are lightly loaded, and maximized link and I/O bus utilization when the network or I/O buses are congested. Pipelining data within a frame (intra-packet) is a technique used to decrease latency. Inter-packet pipelining, i.e., pipelining between frames of data, commonly achieved through the technique of double buffering, maximizes the throughput of the connection. Between these two extremes is a continuum of pipelining alternatives. AMP adjusts or manages the fragmentation policy across the entire range, achieving its lowest latency as a result of intra-packet pipelining and its highest throughput as a result of inter-packet pipelining.

AMP utilizes the techniques of cut-through delivery and double buffering in a novel way within a local policy that manages the data transfers. The policy begins by transferring frames via cut-through delivery, but, as conditions become more congested, falls out of cut-through delivery and into store and forward in order to deliver high bandwidth.

It is therefore the object of the present invention to adapt the management policy between intra-packet and inter-packet pipelining based on host and network characteristics. Thus, messaging systems can achieve both low latency and high bandwidth or throughput.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram illustrating the flow of data through a data movement device (DMD) as it receives data from an incoming data path and sends it out through an outgoing data path;
FIG. 2 is a block diagram illustrating the prior art methodology of store and forward with double buffering;
FIG. 3 is a block diagram illustrating the prior art methodology of static fixed/variable fragmentation;
FIG. 4 is a block diagram illustrating the prior art methodology of cut-through delivery;
FIG. 5 is a block diagram illustrating the adaptive message pipelining methodology of the present invention;
FIG. 6A is a schematic diagram illustrating four send and receive Direct Memory Access (DMA) transfers through DMA engines on the network interface card;
FIG. 6B is a schematic diagram illustrating transfers between stages in a three-stage pipeline model;
FIGS. 7A-7C are graphs illustrating how message pipelining alternatives (store and forward, static fixed, and static variable) decrease latency;
FIG. 8 is a graph illustrating the present invention of adaptive message pipelining; and
FIG. 9 is a graph illustrating a snapshot of AMP derived from the reference implementation in action for a stream of 8 KB frames.
DETAILED DESCRIPTION OF THE INVENTION

The present invention will now be described with reference to the accompanying drawings in which like reference numbers represent like elements. For reference purposes, a glossary of key terms is presented.

Adaptive Message Pipelining (AMP): A scheme for reducing network latency for messages larger than a predetermined size (minpulse) while delivering maximum throughput (bandwidth) of high-speed networks and I/O buses under load.
Buffer: An area for data to be transferred to and from with a data path controller. In AMP, buffering is on the data movement device or DMD.
Cut-Through Delivery: A variable sized fragmentation scheme in which schedules are not static for any stage of a pipeline.
Data Movement Device (DMD): Any device connected to two or more data paths, on which resides buffering for data and data path controllers. Examples include network interface cards, CPUs, network switches, etc.
Data Path Controller (DPC): Any device which controls the movement of data along a data path, e.g., DMA engine, programmed I/O controller, etc.
Data Path: Any path along which data travels, i.e., memory buses, I/O buses, network links, etc.
Delimited Fragmentation: Any fragmentation scheme in which each fragment of the original frame is also transferred as a frame. Also referred to by those skilled in the art as packetization.
Fixed Fragmentation: A fragmentation scheme in which each fragment is the same size.
Fragmentation Schemes: Any method by which a frame of data is fragmented into, usually, smaller sized transfers in order to pipeline the movement of data across multiple data paths.
Frame: A delimited unit of data of any length such as a packet, message, I/O block, etc.
Inter-Packet Pipelining: Pipelining between two separate packets or frames, commonly achieved through double buffering.
Intra-Packet Pipelining: Pipelining within one packet or frame, commonly achieved with fixed size fragmentation.
Minpulse: A pre-defined threshold amount of data. Cut-through delivery waits until a minpulse amount of data has arrived on the DMD before initiating an outgoing data path controller transfer.
Static Fragmentation: The schedule of fragments for a frame is the same for each stage in the pipeline. The schedule can either be fixed or variable.
Variable Fragmentation: A fragmentation scheme in which each fragment can be a different size.
Adaptive message pipelining (AMP) is a technique which minimizes latency and maximizes throughput when transferring data across multiple data paths. AMP is a unique combination of three techniques or policies: (i) store and forward, (ii) cut-through delivery, and (iii) double buffering. Store and forward is a data movement technique which requires that an entire frame of data completely finish a current stage transfer before beginning a transfer to the next stage. The chief benefit of store and forward is increased throughput due to reduction in overheads. Cut-through delivery is a technique used to decrease latency in which a frame of data need not complete its current stage before parts of the frame can be transferred to the next stage. This is an attribute of cut-through delivery called intra-packet pipelining (defined above). Double buffering is a technique using two or more data buffers that permits a store and forward architecture to overlap the transfer and emission of two data frames simultaneously. This is sometimes referred to as inter-packet pipelining.
FIG. 1 illustrates an example of data flow through a data movement device (DMD) 100 as it receives data from an incoming data path 110 and sends it out through an outgoing data path 120. While only one incoming and one outgoing data path have been illustrated in FIG. 1, a DMD can have multiple data paths. Adaptive message pipelining is essentially the policy of handling the data as it enters and exits the data movement device 100. A data movement device, for purposes of the present invention, is a device which joins data paths. A data path is any path along which data travels. Moreover, a data movement device possesses data buffer(s), data path controller(s), and a mechanism by which to be notified when a data path controller has completed a data transfer (interrupt or polling capabilities are sufficient). The data movement devices of the present invention also have the ability to utilize data that has arrived as a result of a transfer in progress. Another capability of the data movement device is the ability to monitor the progress of a data path controller's transfer, i.e., to note the amount of data transferred to date.

A data path controller is the mechanism by which data is transmitted across a data path. Data path controllers include, but are not limited to, direct memory access devices or programmed I/O devices. Data path controllers must be able to initiate concurrent transfers of data on data paths. Thus, data path controllers according to the present invention have the ability to (i) initiate and terminate concurrent transfers, with respect to the data movement device, (ii) notify the data movement device of a completed transfer, (iii) allow the data movement device to utilize data that has arrived as a result of a transfer in progress, (iv) monitor the progress of a transfer, and (v) delimit frames of data.
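The five capabilities above map naturally onto a small controller interface. The following C sketch is illustrative only; the type and function names (dpc, dpc_start, dpc_done, dpc_bytes_arrived, dpc_delimit_frame) are assumptions for exposition and are not drawn from the patent or from any particular NIC firmware.

```c
/* Hypothetical data path controller (DPC) interface sketch.
 * All names and signatures are illustrative assumptions. */
#include <stddef.h>
#include <stdbool.h>

struct dpc;                                  /* one controller per data path */

/* (i)  initiate a (possibly concurrent) transfer on this controller      */
void   dpc_start(struct dpc *c, void *buf, size_t len);
/* (ii) completion notification: poll here, or wire up an interrupt       */
bool   dpc_done(const struct dpc *c);
/* (iii)/(iv) progress monitoring: bytes landed so far, usable right away */
size_t dpc_bytes_arrived(const struct dpc *c);
/* (v)  mark the end of a frame on the outgoing path                      */
void   dpc_delimit_frame(struct dpc *c);
```

Any device exposing these five operations, whether a DMA engine or a programmed I/O loop, could serve as a data path controller for the policies walked through below.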
FIG. 2 is a block diagram illustrating the prior art methodology of store and forward with double buffering. A data movement device (DMD) checks to see if data is ready to be transferred on the incoming data path as illustrated by block 200. When there is data ready to be transferred, block 210 initiates a transfer of the entire frame of data to the data movement device once a data buffer becomes available. A frame is a delimited unit of data such as a packet, message, I/O request, etc., and can be any length. Next, the data movement device checks to see if the frame transfer to the DMD is complete as illustrated in block 220. If not, the data movement device waits until the frame transfer is complete. When complete, two events take place simultaneously as illustrated by the dotted line and solid line exiting block 220. One event tells the DMD to start over again at block 200 and check whether there is more data ready on the incoming data path. The other event causes the DMD to test the availability of the outgoing data path controller illustrated in block 230. Once the outgoing data path controller is free and ready to send, block 230 succeeds, and the DMD transfers the frame on the outgoing link in block 240. The DMD then waits until the outgoing transfer is complete as illustrated in block 250 before returning to block 220 to await another frame of data to arrive on the DMD from the incoming data path.
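A minimal C sketch of this loop follows, with the two simultaneous events of block 220 collapsed into sequential code for brevity; the helper names and the buffer size are assumptions, not details from the patent.

```c
/* Store and forward with double buffering (FIG. 2), sequential sketch. */
#include <stddef.h>
#include <stdbool.h>

extern bool   incoming_data_ready(void);               /* block 200 */
extern size_t incoming_frame_len(void);
extern void   dpc_in_start(void *buf, size_t len);     /* block 210 */
extern bool   dpc_in_done(void);                       /* block 220 */
extern bool   dpc_out_free(void);                      /* block 230 */
extern void   dpc_out_start(void *buf, size_t len);    /* block 240 */
extern bool   dpc_out_done(void);                      /* block 250 */

enum { MAX_FRAME = 9000 };          /* assumed maximum frame size        */
static char buffers[2][MAX_FRAME];  /* two buffers: inter-packet overlap */

void store_and_forward_loop(void)
{
    for (int b = 0; ; b = 1 - b) {              /* alternate the buffers  */
        while (!incoming_data_ready()) ;        /* block 200              */
        size_t len = incoming_frame_len();
        dpc_in_start(buffers[b], len);          /* block 210: whole frame */
        while (!dpc_in_done()) ;                /* block 220              */
        /* FIG. 2 starts the next receive (block 200) here, in parallel
         * with blocks 230-250; this sketch runs them back to back.      */
        while (!dpc_out_free()) ;               /* block 230              */
        dpc_out_start(buffers[b], len);         /* block 240: whole frame */
        while (!dpc_out_done()) ;               /* block 250              */
    }
}
```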
FIG. 3 is a block diagram illustrating the prior art methodology of static fixed and variable fragmentation. This process begins much the same as store and forward. A data movement device (DMD) checks to see if data is ready to be transferred on the incoming data path as illustrated by block 300. When there is data ready to be transferred, block 310 initiates the transfer of the frame of data to the DMD. In these schemes, however, the entire frame need not have arrived into the DMD prior to initiating a transfer from the DMD to the outgoing data path. Block 320 queries whether the incoming frame transfer to the DMD has completed. If it has, block 360 checks to make sure the data path controller for the outgoing path is free and ready to send. Block 370 initiates the transfer of as many individual fragments as make up the remaining data on the DMD, i.e., the DMD transmits the rest of the fragmentation schedule for that frame. The DMD then waits for that transfer to complete before beginning the cycle anew.

If block 320 discovers that the transfer to the DMD has not completed, then block 330 checks to see if the specified fragment size has been received. If not, the DMD waits until an amount of data equal to the fragment size has been received while checking for the end of the transfer to the DMD. When the amount of data meets the requisite fragment size, the DMD checks the status of the outgoing data path controller in block 340. If that data path controller is free and ready to send, then block 350 transfers a fragment to the outgoing data path. The data movement device then returns to block 320 to see if enough new data has arrived for the next outgoing data path transfer. In static fixed/variable fragmentation the outgoing transfers are always equal to the scheduled fragment size (except perhaps the last transfer, as the size of the frame may not be evenly divisible by the fragment size).
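A condensed sketch of the static fixed fragmentation loop (blocks 320 through 370), assuming a hypothetical 1 KB fragment schedule and the same illustrative helper names as before; offsets index into the frame's buffer on the DMD.

```c
/* Static fixed fragmentation (FIG. 3), condensed sketch. */
#include <stddef.h>
#include <stdbool.h>

extern bool   dpc_in_done(void);           /* block 320                    */
extern size_t dpc_in_bytes_arrived(void);  /* progress of incoming transfer */
extern bool   dpc_out_free(void);          /* blocks 340/360               */
extern void   dpc_out_start(size_t off, size_t len); /* from DMD buffer    */

enum { FRAG_SIZE = 1024 };   /* every fragment the same scheduled size */

void static_fixed_fragmentation(size_t frame_len)
{
    size_t sent = 0;
    while (sent < frame_len) {
        if (dpc_in_done()) {                          /* block 320       */
            while (!dpc_out_free()) ;                 /* block 360       */
            dpc_out_start(sent, frame_len - sent);    /* block 370: rest */
            sent = frame_len;
        } else if (dpc_in_bytes_arrived() - sent >= FRAG_SIZE) { /* 330 */
            while (!dpc_out_free()) ;                 /* block 340       */
            dpc_out_start(sent, FRAG_SIZE);           /* block 350       */
            sent += FRAG_SIZE;
        }
    }
}
```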
FIG. 4 is a block diagram illustrating the prior art methodology of cut-through delivery. This process begins much the same as store and forward and static fixed/variable fragmentation. A data movement device (DMD) checks to see if data is ready to be transferred on the incoming data path as illustrated by block 400. When there is data ready to be transferred, block 410 initiates the transfer of the frame of data to the DMD. In these schemes, however, the entire frame need not have arrived into the DMD prior to initiating a transfer from the DMD to the outgoing data path. Block 420 queries whether the incoming transfer to the DMD has completed. At this point everything works the same as in FIG. 3 except for the following. If the transfer of the frame to the DMD has completed, then block 470 will issue a transfer to the outgoing data path of all remaining data of the frame on the DMD. If the incoming transfer of the frame has not completed, then block 430 checks to see if the minpulse amount of data, of the frame arriving on the DMD from the incoming data path, has been reached or exceeded. As in FIG. 3, block 440 waits until the outgoing data path is free and ready to send. Then block 450 initiates the transfer of one fragment whose size is equal to the amount of data of the frame that has arrived on the DMD, from the incoming data path, at that point in time.
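The same skeleton with one change captures cut-through delivery: instead of a fixed fragment size, each outgoing transfer carries however much of the frame has arrived once the minpulse threshold is met. A hedged sketch, with an assumed 512-byte minpulse:

```c
/* Cut-through delivery (FIG. 4), condensed sketch. The only change from
 * static fragmentation: once minpulse is reached, send everything that
 * has arrived so far, so fragment sizes vary at each step. */
#include <stddef.h>
#include <stdbool.h>

extern bool   dpc_in_done(void);
extern size_t dpc_in_bytes_arrived(void);
extern bool   dpc_out_free(void);
extern void   dpc_out_start(size_t off, size_t len);

enum { MINPULSE = 512 };     /* pre-defined threshold amount of data */

void cut_through_delivery(size_t frame_len)
{
    size_t sent = 0;
    while (sent < frame_len) {
        if (dpc_in_done()) {                          /* block 420       */
            while (!dpc_out_free()) ;
            dpc_out_start(sent, frame_len - sent);    /* block 470: rest */
            sent = frame_len;
        } else if (dpc_in_bytes_arrived() - sent >= MINPULSE) { /* 430  */
            size_t arrived = dpc_in_bytes_arrived();
            while (!dpc_out_free()) ;                 /* block 440       */
            dpc_out_start(sent, arrived - sent);      /* block 450       */
            sent = arrived;
        }
    }
}
```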
FIG. 5 is a block diagram illustrating the methodology of the present invention, namely, adaptive message pipelining (AMP). Just as in store and forward, static fixed/variable fragmentation, and cut-through delivery, adaptive message pipelining performs the same two initial steps. The data movement device (DMD) checks to see if data is ready to be transferred on the incoming data path as illustrated by block 500. When there is data ready to be transferred, block 510 initiates a transfer of the frame of data to the data movement device once a data buffer becomes available. Block 520 checks to see if the entire frame transfer has completed. At this point everything works the same as in FIG. 4 except for the following. When block 520 discovers that the entire frame transfer has completed, control passes to both blocks 560 and 500 (as indicated by the solid and dashed lines). When block 580 discovers that the last of the frame has completed its transfer to the outgoing data path, then control is given to block 520. These two key differences allow double buffering to be used with cut-through delivery. Data path controllers for moving data to the DMD from the incoming data path and for moving data from the DMD to the outgoing data path may now operate in parallel for either the same or different frames.
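Combining the previous sketch with double buffering gives an illustrative approximation of the AMP policy of FIG. 5. The overlap is what lets congestion push the policy toward store and forward: when the outgoing path stays busy, arrived data accumulates on the DMD, so each successful outgoing transfer carries a larger fragment, degenerating to whole frames under sustained contention. As before, helper names and constants are assumptions, not the patent's firmware.

```c
/* Adaptive message pipelining (FIG. 5), condensed sketch: cut-through
 * delivery plus double buffering across two DMD buffers. */
#include <stddef.h>
#include <stdbool.h>

extern bool   incoming_data_ready(void);            /* block 500 */
extern void   dpc_in_start(int buf);                /* block 510 */
extern bool   dpc_in_done(int buf);                 /* block 520 */
extern size_t dpc_in_bytes_arrived(int buf);
extern size_t frame_len(int buf);
extern bool   dpc_out_free(void);
extern void   dpc_out_start(int buf, size_t off, size_t len);

enum { MINPULSE = 512 };

void amp_loop(void)
{
    int b = 0;                 /* current buffer                      */
    bool next_started = false; /* receive already begun for 1 - b?    */
    for (;;) {
        if (!next_started) {
            while (!incoming_data_ready()) ;        /* block 500      */
            dpc_in_start(b);                        /* block 510      */
        }
        next_started = false;
        size_t sent = 0;
        while (sent < frame_len(b)) {
            if (dpc_in_done(b)) {                   /* block 520      */
                /* Dashed arc 520 -> 500: overlap the next frame's
                 * incoming transfer with this frame's outgoing one. */
                if (!next_started && incoming_data_ready()) {
                    dpc_in_start(1 - b);
                    next_started = true;
                }
                while (!dpc_out_free()) ;
                dpc_out_start(b, sent, frame_len(b) - sent); /* 560   */
                sent = frame_len(b);
            } else if (dpc_in_bytes_arrived(b) - sent >= MINPULSE) {
                size_t arrived = dpc_in_bytes_arrived(b);
                while (!dpc_out_free()) ;  /* under contention, more data
                                              accumulates before the next
                                              pass, so fragments grow    */
                dpc_out_start(b, sent, arrived - sent);
                sent = arrived;
            }
        }
        b = 1 - b;                                  /* block 580 -> 520 */
    }
}
```

Under light load the first minpulse-sized fragment leaves almost immediately (low latency); under heavy load the outgoing controller is rarely free, the frame accumulates on the DMD, and the transfers collapse toward whole-frame store and forward with double buffering (high throughput).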
Adaptive message pipelining's reference implementation exists within the Trapeze messaging system for Myrinet. It is implemented as a host-side API and a firmware program for Myrinet network interface cards. In this implementation of AMP, the network interface card (NIC) is the data movement device (DMD), the host's I/O bus is both an incoming and an outgoing data path, and the network link is both an incoming and outgoing data path. Direct Memory Access (DMA) engines, which reside on the network interface card, possess sufficient capabilities for adaptive message pipelining. In particular, the DMA engines allow transfers to and from the network interface card to occur simultaneously. They also throw interrupts (on the network interface card) when transfers are complete, allow data arriving on the network interface card to be utilized during a transfer, allow the network interface card to view the number of bytes which have arrived so far, and can delimit outgoing transfers via tail-flits (to the network link) or PCI I/O bus transfer completions.
Adaptive message pipelining reduces latency by overlapping stages of data transfers for frames. In a typical network, such as Myrinet, four Direct Memory Access (DMA) transfers through DMA engines on the network interface card (NIC) are required to both send and receive a packet. This is illustrated in FIG. 6A. The first transfer is from host memory 610 to NIC memory 620 across the sending host's peripheral component interconnect (PCI) I/O bus 630. Second is the transfer from NIC memory 620 to the network link 640. The third and fourth transfers are symmetric to the second and first respectively, i.e., network link 640 to NIC memory 650, and NIC memory 650 to host memory 660 across the receiving host's PCI I/O bus 670. The corresponding network interface card DMA engines are referred to as HostTx, NetTx, NetRcv, and HostRcv respectively. Transfers through components that never introduce store-and-forward delays, such as the host/PCI bridges, the NetRcv engine, and the wormhole-routed Myrinet network switches, are ignored in this analysis. Network interface card firmware controls all transfers between stages in the resulting three-stage pipeline model illustrated in FIG. 6B.
FIGS. 7A-7C illustrate how fragmentation decreases latency. The simplistic approach, illustrated in FIG. 7A, is to send the message as a single packet with store-and-forward delay in between each pipeline stage. Static fixed fragmentation is illustrated in FIG. 7B, and static variable fragmentation is illustrated in FIG. 7C.
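The latency effect can be made concrete with a standard pipeline cost model; the symbols follow the g (fixed per-transfer overhead) and G (time per unit of data) notation attributed to Wang et al. above, and the numbers are illustrative round values, not measurements from the patent.

```latex
% Store and forward over n stages for an L-byte frame,
% versus static fixed fragmentation into k fragments:
\[
  T_{\mathrm{sf}} = n\,(g + L\,G), \qquad
  T_{\mathrm{fixed}} = (n + k - 1)\,\bigl(g + \tfrac{L}{k}\,G\bigr)
\]
% Illustrative values: n = 3 stages, L = 8192 bytes,
% g = 10 us, G = 0.015 us/byte:
%   T_sf             = 3 (10 + 122.88)  ~ 399 us
%   T_fixed (k = 8)  = 10 (10 + 15.36)  ~ 254 us
% Pipelining overlaps the per-stage copies at the cost of paying the
% fixed overhead g on every fragment, which is why larger fragments
% (up to whole frames) win back throughput under load.
```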