Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging

Kenneth G. Yocum, Jeffrey S. Chase, Andrew J. Gallatin, Alvin R. Lebeck
Dept. of Computer Science
Duke University
Durham, NC 27708-0129
{grant, chase, gallatin, alvy}@cs.duke.edu

Abstract
New network technology continues to improve both the latency and bandwidth of communication in computer clusters. The fastest high-speed networks approach or exceed the I/O bus bandwidths of "gigabit-ready" hosts. These advances introduce new considerations for the design of network interfaces and messaging systems for low-latency communication.

This paper investigates cut-through delivery, a technique for overlapping host I/O DMA transfers with network traversal. Cut-through delivery significantly reduces the end-to-end latency of large messages, which are often critical for application performance.

We have implemented cut-through delivery in Trapeze, a new messaging substrate for network memory and other distributed operating system services. Our current Trapeze prototype is capable of demand-fetching 8K virtual memory pages in 200 µs across a Myrinet cluster of DEC AlphaStations.
1. Introduction

Advances in network technology continue to improve the latency and bandwidth of communication in computer clusters. The tighter coupling of cluster nodes creates new opportunities to reduce the running time of large-scale computations by using hardware resources across the cluster in a coordinated way.

* This work is supported in part by NSF grant CDA-95-12356, Duke University, and the Open Software Foundation. Jeff Chase is partially supported by NSF Career Award CCR-96-24857. Alvin Lebeck is partially supported by NSF Career Award MIP-97-02547.
Latency of network communication is an important factor in determining the effectiveness of cluster-based parallel computing, distributed file storage, network memory, and other resource sharing schemes. Network hardware and software are typically optimized for high-bandwidth continuous data transfer or low-latency exchanges of small messages of a few hundred bytes. However, in many instances, network communication involves large messages on the order of one to ten kilobytes. Latency of large messages is critical for page migration, block data transfer for parallel applications, or demand fetching of virtual memory pages or file blocks.

This paper describes the design and implementation of Trapeze, a new messaging system that delivers low latency for both large and small messages. Trapeze was designed primarily to handle page migration traffic in the Global Memory Service (GMS), a cooperative memory system for clusters. The Trapeze prototype consists of a messaging library and custom firmware for Myrinet, a high-performance cluster interconnect. GMS and the Myrinet platform are described in Section 2.

Trapeze uses several important buffering and DMA management optimizations to minimize the latency of large messages, in this case page transfers. Many of the optimizations used by Trapeze have been used in earlier fast messaging implementations. This paper gives an overview of Trapeze and focuses on a principal technique not described or evaluated elsewhere, called cut-through delivery. Cut-through delivery minimizes latency by aggressively overlapping the four DMA transfers needed to move a large message from one host to another across a network (Figure 1): (1) sender's host memory to adapter, (2) sender's adapter to network link, (3) network link to receiver's adapter, and (4) receiver's adapter to host memory.
[Figure 1. Four DMA Transfers for a Network Message: each host's memory connects through a bridge to its adapter, and the adapters meet across the interconnect; transfers (1) and (4) cross the host I/O buses, while transfers (2) and (3) cross the network link.]
Cut-through delivery eliminates network adapter store-and-forward latency by pipelining packets through the adapter, similar to the way cut-through switching [13] eliminates store-and-forward latency in high-performance network switches. With cut-through delivery, large messages flow through the network adapter; the adapter can place data at the sink port shortly after it arrives at the source, without waiting for the entire packet to be transferred onto the adapter. In this way, the DMA transfers from the source and to the sink are overlapped, on both the sending and receiving sides.

We do not claim that cut-through delivery is novel. Indeed, variants of cut-through delivery are used in low-cost network interfaces to reduce the need for expensive buffer space on the adapter. Instead, we argue that cut-through delivery is a fundamental issue in network interface design that has not been adequately explored. We show that cut-through delivery significantly reduces large-packet latencies, and that the expected benefit of cut-through delivery grows rapidly as the network link speed approaches I/O bus speeds, as is the case with modern high-speed networks. In our cluster, cut-through delivery reduces 8KB page transfer latencies by 43% (to 177 µs) after all other known optimizations have been applied. As technology advances, high-speed networks will drive I/O bus design, maintaining the relevance of this optimization. Finally, cut-through delivery is missing from other fast messaging systems we have seen.

This paper is organized as follows. Section 2 presents the motivation and background for low-latency page transfers, and sets Trapeze in context with other fast messaging systems. Section 3 outlines the implementation of cut-through delivery in Trapeze for Myrinet. Section 4 presents microbenchmark results to evaluate the Trapeze prototype on a Myrinet/Alpha cluster. We conclude in Section 5.

2. Overview of Trapeze

Trapeze was designed as a high-performance communication substrate for cluster operating system services such as cooperative virtual memory, rather than as a full-featured messaging system. This role has dictated the features of Trapeze and the character of the messaging interface. However, cut-through delivery and other optimizations used in Trapeze are applicable to any messaging system or application that is sensitive to large-message latency, including some parallel applications built using message passing (e.g., MPI [11]) or software distributed shared memory.

In this section we (1) discuss the importance of large-message latency for a specific system that uses Trapeze, (2) give an overview of Myrinet and the Trapeze messaging system, and (3) relate Trapeze to other low-latency messaging systems.

2.1. Importance of Latency for Network Memory

Our first use of Trapeze is as a messaging substrate for the Global Memory Service (GMS) [9], a Unix kernel facility that manages the memories of cluster nodes as a shared, distributed page cache. The GMS implementation supports remote paging [5, 10] and cooperative caching [6] of file blocks and virtual memory pages, unified at a low level of the operating system kernel. The purpose of GMS is not to support a shared memory abstraction, but rather to transparently improve the performance of data-intensive workloads. GMS coordinates memory usage across the cluster so that nodes can satisfy paging and file operations with memory-to-memory network transfers whenever possible, avoiding disk accesses. The key insight is that improvements in disk latency are limited by mechanical factors, whereas network performance has improved at a rapid rate. For example, even with a 100 Mb/s Ethernet interconnect, a 266 MHz AlphaStation 500 running a GMS-enhanced Digital Unix 4.0 kernel can demand-fetch a file block or virtual memory page from the memory of a peer in 1.2 ms, an order of magnitude faster than the average 12 ms access time of a local fast/wide Seagate Barracuda disk.

However, the performance of data-intensive applications under GMS is still dominated by communication latency. GMS may reduce I/O stall times by an order of magnitude or more, but a 266 MHz Alpha 21164 CPU could issue up to a million instructions in a millisecond spent idling for a remote page fault. Moreover, experience with GMS on several networks has shown that large message latencies are often higher than expected. High bandwidth is most easily achieved with continuous streams of packets that naturally pipeline in the network and adapters, but this does not translate into low-latency delivery of individual large messages sent as a single packet. For example, a 155 Mb/s ATM network can in principle deliver an 8K message in under 400 µs. In practice, the original GMS prototype measured page transfer times above a millisecond using high-quality ATM adapters. The GMS developers have explored pipelined subpage fetches [12] and other approaches to masking fetch latency.
2.2. Page Transfers on Myrinet
Our approach to minimizing page transfer costs is to use a custom messaging system for Myrinet [3], a high-speed wormhole-routed LAN. Our Myrinet configuration consists of 8-port crossbar switches and adapters (Network Interface Cards, or NICs) that attach to our AlphaStation hosts through the 32-bit PCI I/O bus.

The Myrinet NICs are programmable, providing a flexible interface to the host. The NICs include a 256K block of dual-ported static RAM and a custom CPU and link interface based on the third-generation LANai 4.1 chipset. The NIC SRAM is addressable from the host physical address space; the host and LANai interact by reading and writing locations of this shared memory. The behavior of the adapter is determined by a firmware program (a Myrinet Control Program, or MCP) that is written into the NIC SRAM from the host at startup. Myrinet messaging latencies are determined primarily by overhead in the MCP and host messaging software and the time to transfer data on the host I/O bus.

Although Myrinet can handle a large transfer as a single packet, experiments using the vendor-supplied firmware (supporting a host message interface called MyriAPI) showed that latency grew more steeply with message size than we had expected. A one-way 8K page transfer took over 400 µs at best, almost a factor of three higher than the minimum achievable latency. Sending page transfers as Internet datagrams over MyriAPI yielded demand-fault times of over 850 µs in the best case, even with a well-optimized network driver.

We investigated other message systems available for Myrinet in the research community, e.g., Active Messages (AM) [17] and Fast Messages (FM) [15]. Like MyriAPI, these systems support full-featured APIs designed primarily to meet the needs of parallel applications. While they reported superior performance for small messages, they did not support the DMA facilities we needed for zero-copy page fetches, relied on polling for detection of received messages, and were not available for LANai 4.1 or Alpha-based hosts. Finally, where large-message latency timings were reported, they were comparable to MyriAPI. Page transfers and other large messages can be sent using bulk transfer extensions (e.g., FM's "streaming messages" interface), which fragment transfers into pipelined packet streams, but this did not meet our goal of transferring a memory page and associated control information as a single packet with the lowest possible latency and overhead.

2.3. An Overview of Messaging with Trapeze

Our solution was to develop Trapeze, a simple, raw transport layer designed to provide low latency for both large and small messages. The Trapeze prototype consists of custom Myrinet firmware (Trapeze-MCP) and a host library that implements the messaging interface (Trapeze-API). To allow kernel-kernel communication, the host library is linked into the Digital Unix 4.0 kernel, which includes a driver to initialize the device and to field interrupts. The Trapeze-MCP manages the network link, coordinates message buffering and DMA, and optionally generates host interrupts for incoming packets.

The dominant design goal of Trapeze is to support low-latency, zero-copy transfers of virtual memory pages across the interconnect. Our current implementation is capable of demand-fetching 8K pages with latencies below 200 µs, about fifty times faster than the average access time for a high-quality disk. Trapeze implements several optimizations to achieve this goal. In this paper we limit our attention to the cut-through delivery technique, which pipelines individual large messages within the adapters, transparently to the hosts.

Focusing on page transfers allowed us to make several simplifications to streamline the messaging system and reduce our prototyping effort:

• Fixed-size message and payload buffers. A Trapeze message is a 128-byte control message with an optional attached payload of up to 8KB. Control message buffers are contiguous, aligned blocks of adapter memory; payload buffers are frames of host physical memory.

• No protection. Trapeze currently supports only a single logical communication endpoint or channel, intended for but not limited to kernel-to-kernel communication. We assume that the interconnect is secure and the hosts are trusted.

• Best-effort delivery. Packets are never dropped by the Myrinet network itself, which provides link-level flow control by backpressure from a bottleneck interface. However, to avoid deadlocking the interconnect, the MCP will drop received packets if the host fails to keep up with incoming traffic on the link. It is the responsibility of the message sender and receiver to coordinate end-to-end flow control and recovery from dropped messages, if any is needed.
Table 1 lists the Trapeze-API routines used for the experiments in this paper. These routines interact with the Trapeze-MCP through an endpoint structure containing a pair of buffer rings: a send ring for outgoing messages and a receive ring for incoming messages. Each ring is an array of 128-byte control message buffers in the NIC SRAM, managed as a circular producer/consumer queue. Each ring entry includes space for the message contents and header fields for control information.

The host Alpha processor accesses control message contents directly using programmed I/O. The Trapeze-API tpz_get_sendmsg and tpz_get_rcvmsg routines each return a pointer (msg_t) to a ring entry; the application (e.g., the GMS kernel module) has exclusive access to the buffer until it releases it with the corresponding tpz_release_sendmsg or tpz_release_rcvmsg routine. The intent is that the caller moves message data directly between processor registers and the valid locations of the buffer.

Payload frames can be attached to entries in either ring. The Trapeze-API attaches a payload frame (vm_page_t) by storing the frame physical address into the ring entry in a form that the MCP can use to request DMA to or from the frame. If a payload is attached to an outgoing send ring entry, the MCP sends the payload contents along with the message. Frames attached to the receive ring are used as buffers for incoming payloads. An incoming control message is deposited in the next available receive ring entry; any payload is transferred via DMA to the frame attached to that entry, or discarded if no frame is available.

Operations on Message Rings and Slots
  msg_t tpz_get_sendmsg()
      Allocate a send ring entry for an outgoing message; fail if no entry is
      available. The caller builds the message (including header) in the
      buffer with a sequence of store instructions.
  msg_t tpz_get_rcvmsg()
      Receive the next incoming message; fail if no message is pending. The
      caller reads the message contents with a sequence of load instructions.
  msg_t tpz_release_sendmsg(), msg_t tpz_release_rcvmsg()
      Release a ring entry. On the send ring, this sends the message in the
      entry.

Payload Operations
  tpz_attach_sendmsg(msg_t, vm_page_t, io_completion_t, caddr_t)
      Attach a payload frame to a send ring entry. The frame contents are
      sent with the message as a payload, and the completion routine is
      called, with its arguments, either when the send entry is re-used or
      when tpz_check_xmit_complete() is called.
  tpz_attach_rcvmsg(msg_t, vm_page_t)
      Attach a frame for a payload receive buffer to a receive ring slot.
  vm_page_t tpz_detach_sendmsg(msg_t), vm_page_t tpz_detach_rcvmsg(msg_t)
      Detach a payload buffer from a ring entry. Used to extract the source
      buffer for a send message, or to retrieve the payload received with an
      incoming message.
  void tpz_check_xmit_complete(void)
      Calls io_completion routines to free payload frames for transmitted
      packets.
  void tpz_set_payload_len(msg_t, int), int tpz_get_payload_len(msg_t)
      Set the length of the payload to be transmitted, or retrieve the
      payload length of a received packet.

Table 1. Trapeze Messaging API Subset
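To make the ring and payload discipline concrete, the following sketch shows how a GMS-style client might send and receive a page with the Table 1 routines. It is illustrative only: the NULL failure convention, the gms_hdr layout, the GMS_PAGE_REPLY opcode, and the alloc_frame/free_frame helpers are our assumptions, not part of the published Trapeze-API.

    #define PAGE_BYTES     8192
    #define GMS_PAGE_REPLY 1                      /* hypothetical message opcode */

    struct gms_hdr { int op; int request_id; };   /* hypothetical header layout */

    extern vm_page_t alloc_frame(void);           /* stand-ins for the host VM */
    extern void      free_frame(caddr_t arg);     /* system's frame management */

    void send_page(vm_page_t frame, int request_id)
    {
        msg_t msg = tpz_get_sendmsg();            /* allocate a send ring entry */
        if (msg == NULL)
            return;                               /* ring full: caller backs off */

        struct gms_hdr *hdr = (struct gms_hdr *)msg;
        hdr->op         = GMS_PAGE_REPLY;         /* PIO stores into NIC SRAM */
        hdr->request_id = request_id;

        tpz_attach_sendmsg(msg, frame, free_frame, (caddr_t)frame);
        tpz_set_payload_len(msg, PAGE_BYTES);
        tpz_release_sendmsg(msg);                 /* hands the entry to the MCP */
    }

    void poll_for_page(void)
    {
        msg_t msg = tpz_get_rcvmsg();             /* next incoming message */
        if (msg == NULL)
            return;                               /* nothing pending */

        if (tpz_get_payload_len(msg) == PAGE_BYTES) {
            vm_page_t frame = tpz_detach_rcvmsg(msg);  /* claim the payload */
            /* ... read header fields with PIO loads, hand `frame` to VM ... */
            tpz_attach_rcvmsg(msg, alloc_frame());     /* replenish the slot */
        }
        tpz_release_rcvmsg(msg);                  /* recycle the ring entry */
    }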
2.4. Related Messaging Systems

The basic structure of Trapeze is similar to AM, FM, and other fast messaging systems designed to minimize latency of small messages by eliminating operating system overheads and network protocol processing (e.g., Hamlyn [4], U-Net [1], SHRIMP [2], and others). Our work was also influenced by the Osiris [8] and FRPC work [16], which identify adapter design issues for low-latency messaging on high-speed networks.

Some of these systems include bulk data transfer facilities optimized for high bandwidth, but none of the published work describes cut-through delivery optimizations or identifies low latency for large packets as an explicit design goal. The FRPC and Osiris platforms had I/O buses offering much higher bandwidth than the network links, thus large-packet latency was limited by link bandwidth. Trapeze is similar in spirit to its predecessors, but it provides more specialized functionality and an explicit emphasis on optimizations for large packets on modern high-speed networks.
3. Cut-Through Delivery in Trapeze

The Trapeze-MCP uses cut-through delivery to ensure maximum utilization of the network link and the PCI I/O bus when transferring message payloads using DMA. Cut-through delivery simply means that the MCP always initiates DMA as soon as possible, in order to maximize overlapping of the DMA transfers needed to move a packet between the link and the host.

[Figure 2. Cut-Through Delivery vs. Store-and-Forward on the AlphaStation 500: timelines of the send DMA, network traversal, and receive DMA stages, with and without cut-through overlap.]

The MCP initiates DMA transfers by storing to LANai registers and then waiting for the DMA unit to signal completion by interrupting the LANai processor. There are four possible DMA operation types: host to NIC or NIC to host on the PCI bus, and link to NIC or NIC to link on the network interface. The Myrinet network link is bidirectional, but only one transfer at a time can take place on the PCI bus. Thus there are effectively three DMA functional units.
The Trapeze-MCP uses a resource-centered structure (Figure 3) to maximize utilization of the three DMA units. On each iteration through its main loop, the MCP asks: which of the three DMA engines are idle now, and how can they be used?

The alternative, functional view is best illustrated by the current LANai Active Messages prototype (LAM) [14]. LAM's sending and receiving sides execute as separate loops using a simple coroutine facility. Each iteration through the loop asks: is the current packet ready for the next stage of processing? If a coroutine detects that the previous step in processing a packet has not yet completed, it transfers control to the other coroutine using a special LANai punt instruction. The functional LAM structure achieves near-optimal pipelining of the DMA transfers for successive outgoing packets, but it does not provide any DMA overlap for individual packets on either the sending or receiving sides.
3.1. Cut-Through Delivery

The resource-centered approach of the Trapeze-MCP enables intra-packet DMA pipelining, to reduce the latency of individual packets. The NIC is viewed as a cut-through device rather than a store-and-forward device, overlapping the transfer of the message across the sending host I/O bus, the network, and the receiving host I/O bus. As shown in Figure 2, the pipelining of cut-through delivery can reduce message transfer time compared to store-and-forward delivery by hiding the latency of transfers on the host I/O bus, which is the bottleneck link in our configuration.

Cut-through delivery works for both incoming and outgoing packets as follows. On the sending side, the MCP initiates DMA of an outgoing packet onto the network link as soon as a sufficient number of bytes have arrived from host memory over the PCI bus. On the receiving side, the MCP initiates DMA of the incoming packet into host memory as soon as a sufficient number of bytes have been deposited in NIC SRAM from the network link. Cut-through delivery performs this pipelining on a single packet, rather than fragmenting it into smaller packets and incurring additional packet handling overheads.

Cut-through delivery must be implemented carefully to prevent the DMA out of the adapter from overrunning the DMA into the adapter, on either the sending or the receiving side. The Trapeze-MCP initiates the outgoing DMA as a series of shorter pulses. The MCP initiates an outgoing pulse as soon as a threshold number of incoming bytes (set by the pulse threshold parameter) have arrived and the outgoing DMA engine is available. The LANai exposes the status of active DMA transactions to the firmware through counter registers for each DMA engine; the Trapeze-MCP main resource loop queries these counters to trigger outgoing DMA pulses.
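As a concrete illustration of the pulse rule, the send side of the loop might look like the following sketch. This is our rendering, not the actual Trapeze-MCP source: PULSE_THRESHOLD, the byte counters, and the DMA helpers stand in for the LANai counter and control registers the firmware really reads and writes.

    #define PULSE_THRESHOLD 1280              /* near-optimal on our platform */

    struct packet {
        char *sram_buf;                       /* payload staging area in NIC SRAM */
        int   bytes_to_link;                  /* bytes already pulsed onto the link */
    };

    extern int  bytes_from_host(struct packet *);    /* host DMA progress counter */
    extern int  link_dma_idle(void);                 /* link DMA engine status */
    extern void start_link_dma(char *src, int len);  /* NIC SRAM -> network link */

    void poll_send_cut_through(struct packet *pkt)
    {
        int arrived = bytes_from_host(pkt);
        int sent    = pkt->bytes_to_link;

        /* Fire a pulse once a threshold's worth of new bytes is staged and
         * the outgoing engine is idle; because pulses always trail the host
         * DMA, the link-side transfer cannot overrun the host-side one. */
        if (arrived - sent >= PULSE_THRESHOLD && link_dma_idle()) {
            start_link_dma(pkt->sram_buf + sent, PULSE_THRESHOLD);
            pkt->bytes_to_link += PULSE_THRESHOLD;
        }
    }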
3.2. Discussion

The primary goal of cut-through delivery is to overlap I/O bus transfers with network transfers. This must be balanced against the bus cycles consumed to acquire the bus for a larger number of smaller transfers, which reduces the bus bandwidth available for transferring data. Historically, LANs achieved only a fraction of the I/O bus bandwidth, and cut-through delivery would have had a negligible effect on large message latency. Therefore, high-end LAN adapters may use store-and-forward delivery to enable larger bus transfers and attain maximum bandwidth for streams of packets. This can result in higher latencies for large messages, e.g., page transfers sent as individual AAL5 frames over an ATM network.

Ethernet and ATM network adapter designs may employ a form of cut-through delivery to reduce cost. A common design includes a receive FIFO that can hold a small amount of data (e.g., a few ATM cells), which is deposited into host buffers with DMA as new data arrives. This eliminates the need for enough adapter memory to buffer an entire packet or Protocol Data Unit. For Ethernet, cut-through delivery will not reduce latency significantly because the maximum transmission unit (MTU) is too small. For ATM adapters, cut-through delivery can overlap DMA of leading cells into the host with DMA of trailing cells into the adapter, reducing large-packet latency. However, this may also reduce bandwidth. The tradeoff is determined by the size of the FIFO or by more sophisticated balancing of batching and cut-through optimizations [7]. For Trapeze, the balance can be adjusted by varying the pulse thresholds.

In the next section we quantify the effects of cut-through delivery on large-packet latency and bandwidth, focusing on the following questions:

• How much does cut-through delivery improve large-packet latencies in a real system?
• How sensitive is the benefit of cut-through delivery to the DMA pulse sizes? How can the optimal pulse size for a specific platform be determined?
• What is the smallest message size that benefits from cut-through delivery?
• Will cut-through delivery continue to reduce large-packet latency as networks and I/O buses advance?

[Figure 3. Trapeze MCP Structure: the host DMA engine serves the Host Send Resource (#2), which fills the Send Buffer (#3), and the Host Recv Resource (C), which empties the Recv Buffer (B); the LANai Send Resource (#1) drains the Send Buffer onto the link, and the LANai Recv Resource (A) fills the Recv Buffer from the link. The MCP main loop polls the three engines:]

    while (1) {
        if (host DMA engine is free) {
            if (TO_NET)
                start Host Send Resource (#2) DMA to fill Send Buffer (#3);
            else if (TO_HOST)
                start Host Recv Resource (C) DMA to empty Recv Buffer (B);
        }
        if (LANai send DMA engine is free) {
            /* LANai Send Resource (#1) */
            if (!sent_ctrl)
                send the control message;
            else
                empty Send Buffer (#3) onto the link;
        }
        if (LANai recv DMA engine is free)
            start LANai Recv Resource (A) to fill Recv Buffer (B);
    }

4. Performance Analysis

This section presents the measured performance of cut-through delivery on our platform, and analytically derives expected benefits on other platforms with different I/O bus speeds. We first evaluate the critical design tradeoffs for cut-through delivery, then describe and evaluate a refinement called eager DMA that dynamically adapts DMA transfer sizes in response to the observed behavior of the I/O bus and network link. We then present performance numbers for a complete demand fetch of a page from a remote host's memory, including request latency and interrupt-based notification.

The timings in this section are taken from microbenchmarks of Trapeze on DEC AlphaStation 500 workstations with 266 MHz 21164 Alpha processors, 2MB of off-chip cache, 128MB of main memory, and LANai 4.1 Myrinet adapters. The I/O bus is a 33 MHz 32-bit PCI with a Digital 21171 PCI bridge.

4.1. Measured Benefits of Cut-Through Delivery

We first present timings for cut-through delivery using a range of pulse thresholds. The send threshold is the number of bytes that must arrive from the host before the sending NIC initiates DMA to the link. The receive threshold is the number of bytes that must arrive from the link before the receiving NIC initiates DMA to host memory.

The pulse thresholds are critical tuning parameters. Small thresholds result in smaller DMA transfers, which can decrease the effective bandwidth of the I/O bus due to the overhead of acquiring the bus for each transfer. On the other hand, higher thresholds yield smaller pipelining benefits for individual packets. Moreover, the optimal pulse threshold may depend on the relative speeds of the source and sink links. Our goal is to empirically determine the optimal value for both pulse thresholds on our platform.

Figure 4 shows the effect of send and receive thresholds on one-way page transfer latency. For these experiments, the DMA pulse size was fixed at the pulse threshold. The left-hand graph plots the send threshold on the x-axis; each line represents a different value for the receive pulse. The right-hand graph shows the same data with the receive thresholds on the x-axis. These timings represent one half the average of 50 host-to-host round-trip page transfers, each consisting of a 128-byte control message and an attached 8KB payload. The hosts poll the LANai memory (with an optimal backoff determined empirically) to detect message arrival. We observed less than a 10 µs variation from the average.
[Figure 4. Impact of Pulse Threshold on Trapeze Latency for 8K Payloads: two graphs of one-way latency in microseconds (195 to 305) versus the send pulse size (left) and the receive pulse size (right), for pulse sizes from 0 to 7680 bytes.]

As expected, page transfer latency is highest at both ends of the graphs, due to limited overlap of large pulses and the DMA setup overhead of small pulses. The upper right point of each graph represents the latency for store-and-forward; both send and receive pulse sizes are 8KB, so there is no pipelining benefit. The lowest points correspond to the optimal pulse thresholds on the sending and receiving sides. Cut-through delivery with optimal thresholds reduces page transfer latency from 308 µs to 195 µs, a 37% improvement.

Note that the two graphs have different shapes, indicating that optimal pulse sizes may not be equal on the sending and receiving sides. We believe this is due to an imbalance in the PCI performance of our platform. This imbalance is readily seen from Figure 5, which shows the DMA bandwidth for transferring data to and from the I/O bus as a function of DMA sizes ranging from 64 bytes to 8192 bytes (the base VM page size in Digital Unix 4.0). The timings were produced by Myricom's hswap MCP, using a cycle counter on the adapter. Peak DMA read bandwidth (transfers from main memory to the NIC) approaches 128 MB/s, the maximum achievable by the 32-bit PCI standard. However, the DMA write bandwidth (transfers from the NIC to main memory) is about half the read bandwidth, apparently due to a flaw in the 21171 PCI bridge. With these numbers (and assuming a gigabit network link), a store-and-forward transfer should take no less than 256 µs, plus control message latency. This is consistent with the 308 µs measured using Trapeze with store-and-forward delivery.

[Figure 5. DEC AlphaStation 500 DMA performance: DMA bandwidth in MB/s (up to 128) for reads from system memory to the LANai and writes from the LANai to system memory, as a function of DMA transfer size from 128 to 7680 bytes.]
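As a sanity check on that 256 µs bound (our arithmetic, using the Figure 5 bandwidths and treating a gigabit link as roughly 128 MB/s): with the two link-side DMAs overlapping each other, a store-and-forward transfer serializes the send-side bus DMA, the link traversal, and the receive-side bus DMA, so for an 8 KB payload

\[
t_{\mathrm{s\&f}} \;\gtrsim\; \frac{8192\,\mathrm{B}}{128\,\mathrm{MB/s}} \;+\; \frac{8192\,\mathrm{B}}{\sim\!128\,\mathrm{MB/s}} \;+\; \frac{8192\,\mathrm{B}}{64\,\mathrm{MB/s}} \;\approx\; 64 + 64 + 128 \;=\; 256\ \mu\mathrm{s}.
\]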
Although the I/O DMA bandwidth profile and imbalances may require different send and receive thresholds on some systems, on the AlphaStation 500 platform a single pulse size of 1280 bytes is near optimal. However, latency increases rapidly for adjacent pulse values.
4.2. Eager DMA

The upturn in the left-hand side of the graph in Figure 4 suggests a refinement to cut-through delivery. The upturn shows that although smaller pulse sizes extract the maximum benefit from pipelining, the overhead of the additional DMA transfers overshadows this effect. This is because the amount of data transferred with each pulse is fixed at the pulse threshold, even though more data may have arrived from the source by the time the sink DMA engine is idle.
[Figure 6. Trapeze One-Way Page Transfer Latency with Eager DMA (128 + 8K bytes): the same pair of graphs as Figure 4, with latencies ranging from 308 µs down to 178 µs across send and receive pulse sizes of 0 to 7680 bytes.]
We enhanced the Trapeze-MCP to combine the benefits of small and large thresholds using a technique we call eager DMA. Eager DMA treats the pulse threshold as a minimum unit of transfer, but moves all of the data available at the start of each DMA transfer. For example, on the receiving side, the MCP initiates DMA to the host as soon as the number of bytes received from the network link exceeds the threshold. When the host DMA pulse completes, the MCP checks the number of bytes that arrived from the network while it was in progress. If this amount exceeds the pulse threshold, the MCP immediately initiates a new DMA for all of the newly received data. Eager DMA achieves maximum transfer overlap with the smallest number of transfers.
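The receive-side rule might look like the following sketch, under the same naming assumptions as the earlier send-side fragment (the struct gains host_frame and bytes_to_host fields; the counter and DMA helpers remain illustrative stand-ins for LANai registers).

    extern int  bytes_from_link(struct packet *);   /* link DMA progress counter */
    extern int  host_dma_idle(void);                /* host DMA engine status */
    extern void start_host_dma(char *dst, char *src, int len); /* SRAM -> host */

    void poll_recv_eager(struct packet *pkt)
    {
        int arrived = bytes_from_link(pkt);   /* deposited in NIC SRAM so far */
        int moved   = pkt->bytes_to_host;     /* already DMA'd to host memory */
        int pending = arrived - moved;

        /* The threshold is only a minimum: once it is met and the engine is
         * idle, move everything pending, not just one pulse. A sub-threshold
         * tail is flushed when the packet completes (omitted here). */
        if (pending >= PULSE_THRESHOLD && host_dma_idle()) {
            start_host_dma(pkt->host_frame + moved, pkt->sram_buf + moved,
                           pending);
            pkt->bytes_to_host += pending;
        }
    }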
Figure 6 shows the performance of the enhanced Trapeze firmware, using the same experiment as in Figure 4. As expected, eager DMA delivers lower latency than static pulse sizes. The effect is most pronounced for small pulse thresholds, allowing Trapeze to achieve an average one-way page transfer latency of 177 µs with any of the following send/receive pulse threshold pairs: (384, 512), (512, 512), (512, 384). This is an additional 9% improvement over the previous minimum.

[...]livery. The benefit of cut-through delivery starts at 1KB payloads, reducing latency to 63 µs compared to 68 µs for store-and-forward. The relative advantage of cut-through delivery increases with payload size, with a 43% latency reduction for 8KB messages.

[Figure 7. Cut-Through vs. Store-and-Forward, Varying Payload Size: one-way latency in microseconds (42 to 302) versus payload size from 0 to 8192 bytes, for cut-through and store-and-forward delivery.]

4.4. Effect of I/O Bus Bandwidth

With optimal cut-through delivery, large-packet latency is determined by the time to move the packet across the bottleneck link. Since the network link DMAs are overlapped even with store-and-forward, cut-through delivery is most effective when the network link bandwidth approaches or exceeds the available I/O bus bandwidth. In this case, bus transfer time is a significant component of total latency using store-and-forward delivery. This is the case with Myrinet on our AlphaStations: the [...]
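The bottleneck argument can be summarized with a simple model (ours, added for illustration, not taken from the paper): with deep pipelining, latency approaches the time through the slowest stage, while store-and-forward sums all three stages,

\[
t_{\mathrm{cut\text{-}through}}(S) \;\approx\; \max\!\left(\frac{S}{B_{\mathrm{send}}},\ \frac{S}{B_{\mathrm{link}}},\ \frac{S}{B_{\mathrm{recv}}}\right) + \epsilon,
\qquad
t_{\mathrm{s\&f}}(S) \;\approx\; \frac{S}{B_{\mathrm{send}}} + \frac{S}{B_{\mathrm{link}}} + \frac{S}{B_{\mathrm{recv}}},
\]

where S is the payload size, the B terms are the send-side bus, link, and receive-side bus bandwidths, and ε absorbs pipeline fill and per-pulse overhead. On our platform the receive-side bus (about 64 MB/s) is the bottleneck, so the model predicts roughly 128 µs plus ε for an 8 KB page, broadly in line with the measured 177 µs.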
