Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging
`
Kenneth G. Yocum    Jeffrey S. Chase    Andrew J. Gallatin    Alvin R. Lebeck

Dept. of Computer Science
Duke University
Durham, NC 27708-0129
{grant, chase, gallatin, alvy}@cs.duke.edu
`
`Abstract
`
`New network technology continues to improve both
`the latency and bandwidth of communication in com-
`puter clusters. The fastest high-speed networks ap-
`proach or exceed the I/O bus bandwidths of “gigabit-
`ready” hosts. These advances introduce new considera-
`tions for the design of network interfaces and messaging
`systems for low-latency communication.
`This paper investigates cut-through delivery, a tech-
`nique for overlapping host I/O DMA transfers with net-
`work traversal. Cut-through delivery significantly re-
`duces end-to-end latency of large messages, which are
`often critical for application performance.
`We have implemented cut-through delivery in
`Trapeze, a new messaging substrate for network mem-
`ory and other distributed operating system services. Our
current Trapeze prototype is capable of demand-fetching
8K virtual memory pages in 200 µs across a Myrinet
cluster of DEC AlphaStations.
`
`1. Introduction
`
`Advances in network technology continue to improve
`the latency and bandwidth of communication in com-
`puter clusters. The tighter coupling of cluster nodes
`creates new opportunities to reduce the running time of
`large-scale computations by using hardware resources
`across the cluster in a coordinated way.
`
` This work is supported in part by NSF grant CDA-95-12356,
`Duke University, and the Open Software Foundation. Jeff Chase is par-
`tially supported by NSF Career Award CCR-96-24857. Alvin Lebeck
is partially supported by NSF Career Award MIP-97-02547.
`
`Latency of network communication is an impor-
`tant factor in determining the effectiveness of cluster-
`based parallel computing, distributed file storage, net-
`work memory, and other resource sharing schemes. Net-
`work hardware and software are typically optimized for
`high-bandwidth continuous data transfer or low-latency
`exchanges of small messages of a few hundred bytes.
`However, in many instances, network communication
`involves large messages on the order of one to ten kilo-
`bytes. Latency of large messages is critical for page mi-
`gration, block data transfer for parallel applications, or
`demand fetching of virtual memory pages or file blocks.
`This paper describes the design and implementation
`of Trapeze, a new messaging system that delivers low
`latency for both large and small messages. Trapeze was
`designed primarily to handle page migration traffic in
`the Global Memory Service (GMS), a cooperative mem-
`ory system for clusters. The Trapeze prototype consists
of a messaging library and custom firmware for Myrinet,
`a high-performance cluster interconnect. GMS and the
`Myrinet platform are described in Section 2.
`Trapeze uses several important buffering and DMA
`management optimizations to minimize the latency of
`large messages, in this case page transfers. Many of the
`optimizations used by Trapeze have been used in ear-
`lier fast messaging implementations. This paper gives
`an overview of Trapeze and focuses on a principal tech-
`nique not described or evaluated elsewhere—called cut-
`through delivery. Cut-through delivery minimizes la-
`tency by aggressively overlapping the four DMA trans-
`fers needed to move a large message from one host to
`another across a network (Figure 1): (1) sender’s host
`memory to adapter, (2) sender’s adapter to network link,
`(3) network link to receiver’s adapter, and (4) receiver’s
`adapter to host memory.
`
[Figure 1. Four DMA Transfers for a Network Message: each host's
processor (P) and memory connect through a bridge to its network
adapter; transfers (1) and (4) cross the hosts' I/O buses, while
transfers (2) and (3) cross the links to and from the interconnect.]

Cut-through delivery eliminates network adapter
store-and-forward latency by pipelining packets through
the adapter, similar to the way cut-through switch-
ing [13] eliminates store-and-forward latency in high-
performance network switches. With cut-through deliv-
ery, large messages flow through the network adapter;
the adapter can place data at the sink port shortly after
it arrives at the source, without waiting for the entire
packet to be transferred onto the adapter. In this way,
the DMA transfers from the source and to the sink are
overlapped, on both the sending and receiving sides.

We do not claim that cut-through delivery is novel.
Indeed, variants of cut-through delivery are used in low-
cost network interfaces to reduce the need for expen-
sive buffer space on the adapter. Instead, we argue that
cut-through delivery is a fundamental issue in network
interface design that has not been adequately explored.
We show that cut-through delivery significantly reduces
large-packet latencies, and that the expected benefit of
cut-through delivery grows rapidly as the network link
speed approaches I/O bus speeds, as is the case with
modern high-speed networks. In our cluster, cut-through
delivery reduces 8KB page transfer latencies by 43% (to
177 µs) after all other known optimizations have been
applied. As technology advances, high-speed networks
will drive I/O bus design, maintaining the relevance of
this optimization. Finally, cut-through delivery is miss-
ing from other fast messaging systems we have seen.

This paper is organized as follows. Section 2 presents
the motivation and background for low-latency page
transfers, and sets Trapeze in context with other fast
messaging systems. Section 3 outlines the implemen-
tation of cut-through delivery in Trapeze for Myrinet.
Section 4 presents microbenchmark results to evaluate
the Trapeze prototype on a Myrinet/Alpha cluster. We
conclude in Section 5.

2. Overview of Trapeze

Trapeze was designed as a high-performance com-
munication substrate for cluster operating system ser-
vices such as cooperative virtual memory, rather than as
a full-featured messaging system. This role has dictated
the features of Trapeze and the character of the messag-
ing interface. However, cut-through delivery and other
optimizations used in Trapeze are applicable to any mes-
saging system or application that is sensitive to large-
message latency, including some parallel applications
built using message passing (e.g., MPI [11]) or software
distributed shared memory.

In this section we (1) discuss the importance of large-
message latency for a specific system that uses Trapeze,
(2) give an overview of Myrinet and the Trapeze messag-
ing system, and (3) relate Trapeze to other low-latency
messaging systems.

2.1. Importance of Latency for Network Memory

Our first use of Trapeze is as a messaging substrate
for the Global Memory Service (GMS) [9], a Unix ker-
nel facility that manages the memories of cluster nodes
as a shared, distributed page cache. The GMS imple-
mentation supports remote paging [5, 10] and coopera-
tive caching [6] of file blocks and virtual memory pages,
unified at a low level of the operating system kernel.
The purpose of GMS is not to support a shared mem-
ory abstraction, but rather to transparently improve the
performance of data-intensive workloads. GMS coor-
dinates memory usage across the cluster so that nodes
can satisfy paging and file operations with memory-to-
memory network transfers whenever possible, avoiding
disk accesses. The key insight is that improvements in
disk latency are limited by mechanical factors, whereas
network performance has improved at a rapid rate. For
example, even with a 100 Mb/s Ethernet interconnect,
a 266 MHz AlphaStation 500 running a GMS-enhanced
Digital Unix 4.0 kernel can demand-fetch a file block or
virtual memory page from the memory of a peer in 1.2
ms, an order of magnitude faster than the average 12 ms
access time of a local fast/wide Seagate Barracuda disk.

However, the performance of data-intensive applica-
tions under GMS is still dominated by communication
latency. GMS may reduce I/O stall times by an order
of magnitude or more, but a 266 MHz Alpha 21164
CPU could issue up to a million instructions in a mil-
lisecond spent idling for a remote page fault. Moreover,
`experience with GMS on several networks has shown
`that large message latencies are often higher than ex-
`pected. High bandwidth is most easily achieved with
`continuous streams of packets that naturally pipeline in
`the network and adapters, but this does not translate
`into low-latency delivery of individual large messages
`sent as a single packet. For example, a 155 Mb/s ATM
network can in principle deliver an 8K message in un-
der 400 µs. In practice, the original GMS prototype
measured page transfer times above a millisecond us-
`ing high-quality ATM adapters. The GMS developers
`have explored pipelined subpage fetches [12] and other
`approaches to masking fetch latency.
`
2.2. Page Transfers on Myrinet
`
`Our approach to minimizing page transfer costs is to
`use a custom messaging system for Myrinet [3], a high-
`speed wormhole-routed LAN. Our Myrinet configura-
`tion consists of 8-port crossbar switches and adapters
`(Network Interface Cards or NICs) that attach to our Al-
`phaStation hosts through the 32-bit PCI I/O bus.
`The Myrinet NICs are programmable, providing a
`flexible interface to the host. The NICs include a 256K
`block of dual-ported static RAM and a custom CPU and
`link interface based on the third-generation LANai 4.1
`chipset. The NIC SRAM is addressable from the host
`physical address space; the host and LANai interact by
`reading and writing locations of this shared memory.
`The behavior of the adapter is determined by a firmware
`program (a Myrinet Control Program or MCP) that is
`written into the NIC SRAM from the host at startup.
`Myrinet messaging latencies are determined primarily
`by overhead in the MCP and host messaging software
`and the time to transfer data on the host I/O bus.
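
As a concrete, purely illustrative sketch of this shared-memory
interaction, the fragment below shows a host-side doorbell stored
into a mapped SRAM word and polled by the firmware. The names,
the doorbell location, and the mapping are our assumptions, not
definitions from Myricom or Trapeze.

    #include <stdint.h>

    #define NIC_SRAM_SIZE (256 * 1024)

    /* Set by the driver after mapping the adapter's SRAM into the
     * host physical address space; assumed to exist here. */
    extern volatile uint32_t *nic_sram;

    /* Host side: post work by storing to an agreed-upon SRAM word. */
    static inline void host_post(uint32_t slot)
    {
        nic_sram[0] = slot;        /* hypothetical doorbell word */
    }

    /* MCP side: the firmware main loop polls the same word. */
    static inline uint32_t mcp_poll(void)
    {
        return nic_sram[0];
    }
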
`Although Myrinet can handle a large transfer as a
`single packet, experiments using the vendor-supplied
`firmware (supporting a host message interface called
`MyriAPI) showed that latency grew more steeply with
message size than we had expected. A one-way 8K
page transfer took over 400 µs at best, almost a factor
of three higher than the minimum achievable latency.
Sending page transfers as Internet datagrams over Myr-
iAPI yielded demand-fault times of over 850 µs in the
`best case, even with a well-optimized network driver.
`We investigated other message systems available for
`Myrinet in the research community, e.g., Active Mes-
`sages (AM) [17] and Fast Messages (FM) [15]. Like
`MyriAPI, these systems support full-featured APIs de-
`signed primarily to meet the needs of parallel applica-
tions. While they reported superior performance for
small messages, they did not support the DMA facil-
ities we needed for zero-copy page fetches, relied on
polling for detection of received messages, and were
not available for LANai 4.1 or Alpha-based hosts. Fi-
nally, where large-message latency timings were re-
ported, they were comparable to MyriAPI. Page trans-
fers and other large messages can be sent using bulk
transfer extensions (e.g., FM's “streaming messages” in-
terface), which fragment transfers into pipelined packet
streams, but this did not meet our goal of transferring
a memory page and associated control information as a
single packet with the lowest possible latency and over-
head.

2.3. An Overview of Messaging with Trapeze

`Our solution was to develop Trapeze, a simple, raw
`transport layer designed to provide low latency for both
`large and small messages. The Trapeze prototype con-
`sists of custom Myrinet firmware (Trapeze-MCP) and
`a host library that implements the messaging interface
`(Trapeze-API). To allow kernel-kernel communication,
`the host library is linked into the Digital Unix 4.0 ker-
`nel, which includes a driver to initialize the device and
`to field interrupts. The Trapeze-MCP manages the net-
`work link, coordinates message buffering and DMA, and
`optionally generates host interrupts for incoming pack-
`ets.
`The dominant design goal of Trapeze is to sup-
`port low-latency, zero-copy transfers of virtual memory
`pages across the interconnect. Our current implemen-
`tation is capable of demand-fetching 8K pages with la-
tencies below 200 µs, about fifty times faster than the
`average access time for a high-quality disk. Trapeze im-
`plements several optimizations to achieve this goal. In
`this paper we limit our attention to the cut-through de-
`livery technique, which pipelines individual large mes-
`sages within the adapters, transparently to the hosts.
`Focusing on page transfers allowed us to make sev-
`eral simplifications to streamline the messaging system
`and reduce our prototyping effort:
`
• Fixed-size message and payload buffers. A
`Trapeze message is a 128-byte control message
`with an optional attached payload of up to 8KB.
`Control message buffers are contiguous, aligned
`blocks of adapter memory; payload buffers are
`frames of host physical memory.
`
• No protection. Trapeze currently supports only
`a single logical communication endpoint or chan-
nel, intended for but not limited to kernel-to-kernel
`communication. We assume that the interconnect
`is secure and the hosts are trusted.
`
• Best-effort delivery. Packets are never dropped
`by the Myrinet network itself, which provides link-
`level flow control by backpressure from a bottle-
`neck interface. However, to avoid deadlocking the
`interconnect, the MCP will drop received packets
`if the host fails to keep up with incoming traffic
`on the link.
`It is the responsibility of the mes-
`sage sender and receiver to coordinate end-to-end
`flow control and recovery from dropped messages,
`if any is needed.
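
For illustration, one simple scheme a client could layer above this
best-effort substrate is a credit-based window sized to the
receiver's posted receive-ring entries; the scheme and every name
below are ours, not part of Trapeze.

    /* Sender transmits only while it holds credits, so the
     * receiver's ring (and host attention) can never be outrun. */
    typedef struct {
        unsigned credits;   /* messages the peer can still absorb */
    } tpz_flow_t;

    static int can_send(tpz_flow_t *f)  { return f->credits > 0; }
    static void on_send(tpz_flow_t *f)  { f->credits--; }

    /* Receiver piggybacks credit refills on its own messages as it
     * re-posts receive-ring entries. */
    static void on_credit_refill(tpz_flow_t *f, unsigned n)
    {
        f->credits += n;
    }
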
`
`Table 1 lists the Trapeze-API routines used for the ex-
`periments in this paper. These routines interact with the
`Trapeze-MCP through an endpoint structure containing
`a pair of buffer rings: a send ring for outgoing messages
`and a receive ring for incoming messages. Each ring
`is an array of 128-byte control message buffers in the
`NIC SRAM, managed as a circular producer/consumer
`queue. Each ring entry includes space for the message
`contents and header fields for control information.
The host Alpha processor accesses control message
contents directly using programmed I/O. The Trapeze-
API tpz_get_sendmsg and tpz_get_rcvmsg rou-
tines each return a pointer (msg_t) to a ring entry;
the application (e.g., the GMS kernel module) has ex-
clusive access to the buffer until it releases it with
tpz_release_msg. The intent is that the caller moves
message data directly between processor registers and
the valid locations of the buffer.

Operations on Message Rings and Slots:

msg_t tpz_get_sendmsg()
    Allocate a send ring entry for an outgoing message; fail if
    no entry is available. The caller builds the message
    (including header) in the buffer with a sequence of store
    instructions.

msg_t tpz_get_rcvmsg()
    Receive the next incoming message; fail if no message is
    pending. The caller reads the message contents with a
    sequence of load instructions.

msg_t tpz_release_sendmsg()
msg_t tpz_release_rcvmsg()
    Release a ring entry. On the send ring, this sends the
    message in the entry.

Payload Operations:

tpz_attach_sendmsg(msg_t, vm_page_t, io_completion_t, caddr_t)
    Attach a payload frame to a ring entry. The frame contents
    shall be sent with the message as a payload, and the
    completion routine will be called, with its arguments, either
    when the send entry is re-used or when
    tpz_check_xmit_complete() is called.

tpz_attach_rcvmsg(msg_t, vm_page_t)
    Attach a frame for a payload receive buffer to a receive ring
    slot.

vm_page_t tpz_detach_sendmsg(msg_t)
vm_page_t tpz_detach_rcvmsg(msg_t)
    Detach a payload buffer from a ring entry. Used to extract
    the source buffer for a send message, or to retrieve the
    payload received with an incoming message.

void tpz_check_xmit_complete(void)
    Calls io_completion routines to free payload frames for
    transmitted packets.

void tpz_set_payload_len(msg_t, int)
int tpz_get_payload_len(msg_t)
    Set the length of the payload to be transmitted, or retrieve
    the payload length of a received packet.

Table 1. Trapeze Messaging API Subset

`Payload frames can be attached to entries in ei-
`ther ring. The Trapeze-API attaches a payload frame
`(vm page t) by storing the frame physical address into
`the ring entry in a form that the MCP can use to request
`DMA to or from the frame. If a payload is attached to
`an outgoing send ring entry, the MCP sends the payload
`contents along with the message. Frames attached to
`the receive ring are used as buffers for incoming pay-
`loads. An incoming control message is deposited in the
`next available receive ring entry; any payload is trans-
`ferred via DMA to the frame attached to that entry, or
`discarded if no frame is available.
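
Putting the pieces together, the sketch below shows how a client
such as the GMS kernel module might send a page reply using the
Table 1 routines. Only the tpz_* calls come from Table 1; the
header layout, the NULL-on-failure convention, the opcode, and
passing the ring entry to the release routine are our assumptions,
with the Trapeze types (msg_t, vm_page_t, io_completion_t, caddr_t)
presumed in scope from the Trapeze and kernel headers.

    #define GMS_PAGE_REPLY 1        /* hypothetical message opcode */
    #define TPZ_PAGE_SIZE  8192     /* base VM page size           */

    void gms_send_page(vm_page_t frame, io_completion_t done,
                       caddr_t arg)
    {
        msg_t m = tpz_get_sendmsg();  /* allocate a send ring entry */
        if (m == NULL)
            return;                   /* ring full: caller backs off */

        /* Build the 128-byte control message with programmed I/O
         * directly into NIC SRAM (field layout assumed). */
        m->hdr[0] = GMS_PAGE_REPLY;

        /* Attach the page frame; the MCP DMAs it as the payload and
         * invokes done(arg) once the send entry is recycled. */
        tpz_attach_sendmsg(m, frame, done, arg);
        tpz_set_payload_len(m, TPZ_PAGE_SIZE);

        tpz_release_sendmsg(m);       /* hand the packet to the MCP */
    }
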
`
`2.4. Related Messaging Systems
`
`The basic structure of Trapeze is similar to AM, FM,
`and other fast messaging systems designed to minimize
`latency of small messages by eliminating operating sys-
`tem overheads and network protocol processing (e.g.,
`Hamlyn [4], U-net [1], SHRIMP [2], and others). Our
`
`work was also influenced by the Osiris [8] and FRPC
`work [16], which identify adapter design issues for low-
`latency messaging on high-speed networks.
`Some of these systems include bulk data transfer fa-
`cilities optimized for high bandwidth, but none of the
`published work describes cut-through delivery optimiza-
`tions or identifies low latency for large packets as an ex-
`plicit design goal. The FRPC and Osiris platforms had
`I/O buses offering much higher bandwidth than the net-
`work links, thus large-packet latency was limited by link
`bandwidth. Trapeze is similar in spirit to its predeces-
`sors, but it provides more specialized functionality and
`an explicit emphasis on optimizations for large packets
`on modern high-speed networks.
`
`3. Cut-Through Delivery in Trapeze
`
`The Trapeze-MCP uses cut-through delivery to en-
`sure maximum utilization of the network link and the
`PCI I/O bus when transferring message payloads using
`DMA. Cut-through delivery simply means that the MCP
`always initiates DMA as soon as possible, in order to
`maximize overlapping of the DMA transfers needed to
`move a packet between the link and the host.
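
A rough latency model illustrates the payoff; the notation and the
model are ours, offered only as a sketch. For a payload of S bytes
moved in pulses of p bytes, with host I/O bus DMA read and write
bandwidths B_rd and B_wr and link bandwidth B_net:

    T_{sf} \approx \frac{S}{B_{rd}} + \frac{S}{B_{net}} + \frac{S}{B_{wr}}

    T_{ct} \approx \frac{S}{\min(B_{rd}, B_{net}, B_{wr})} + O\!\left(\frac{p}{B}\right)

Plugging in the measurements of Section 4 (S = 8 KB, B_rd ≈ 128
MB/s, B_wr ≈ 64 MB/s, B_net ≈ 1 Gb/s) gives T_sf ≈ 64 + 66 + 128 ≈
258 µs, versus a T_ct floor near 128 µs plus fill and drain terms,
roughly consistent with the measured 308 µs and 177-195 µs.
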
`
[Figure 2. Cut-Through Delivery vs. Store-and-Forward on the
AlphaStation 500: timelines of the send DMA, network traversal,
and receive DMA, serialized under store-and-forward delivery and
overlapped under cut-through delivery.]
`
`The MCP initiates DMA transfers by storing to
`LANai registers and then waiting for the DMA unit to
`signal completion by interrupting the LANai processor.
`There are four possible DMA operation types: host to
`NIC or NIC to host on the PCI bus, and link to NIC or
`NIC to link on the network interface. The Myrinet net-
`work link is bidirectional, but only one transfer at a time
`can take place on the PCI bus. Thus there are effectively
`three DMA functional units.
`The Trapeze-MCP uses a resource-centered structure
`(Figure 3) to maximize utilization of the three DMA
units. On each iteration through its main loop, the MCP
asks: which of the three DMA engines are idle now, and
how can they be used?

[Figure 3. Trapeze MCP Structure: the main loop polls the three DMA
resources — the LANai send DMA engine (#1), the host DMA engine
(send resource #2 / receive resource C), and the LANai receive DMA
engine (A) — moving data through the send buffer (#3) and receive
buffer (B) in NIC SRAM:

    while (1) {
        if (host DMA engine FREE)
            if (TO_NET)
                start host send resource (#2) to fill send buffer (#3)
            else if (TO_HOST)
                start host recv resource (C) to empty recv buffer (B)
        if (LANai send DMA engine FREE)
            start LANai send resource (#1):
                if (!Sent_Ctrl) send control message
                else empty send buffer (#3)
        if (LANai recv DMA engine FREE)
            start LANai recv resource (A) to fill recv buffer (B)
    }
]

`The alternative functional view is best illustrated
`by the current LANai Active Messages prototype
`(LAM) [14]. LAM’s sending and receiving sides ex-
`ecute as separate loops using a simple coroutine facil-
`ity. Each iteration through the loop asks: is the current
`packet ready for the next stage of processing? If a corou-
`tine detects that the previous step in processing a packet
`has not yet completed, it transfers control to the other
`coroutine using a special LANai punt instruction. The
`functional LAM structure achieves near-optimal pipelin-
`ing of the DMA transfers for successive outgoing pack-
`ets, but it does not provide any DMA overlap for indi-
`vidual packets on either the sending or receiving sides.
`
`3.1. Cut-Through Delivery
`
`The resource-centered approach of the Trapeze-MCP
`enables intra-packet DMA pipelining, to reduce the la-
`tency of individual packets. The NIC is viewed as a
`cut-through device rather than a store-and-forward de-
`vice, overlapping the transfer of the message across the
`sending host I/O bus, the network, and the receive host
`I/O bus. As shown in Figure 2, the pipelining of cut-
`through delivery can reduce message transfer time com-
`pared to store-and-forward delivery by hiding the la-
`tency of transfers on the host I/O bus, which is the bot-
`tleneck link in our configuration.
`Cut-through delivery works for both incoming and
`outgoing packets as follows. On the sending side, the
`MCP initiates DMA of an outgoing packet onto the net-
`
`work link as soon as a sufficient number of bytes have
`arrived from host memory over the PCI bus. On the re-
`ceiving side, the MCP initiates DMA of the incoming
`packet into host memory as soon as a sufficient num-
`ber of bytes have been deposited in NIC SRAM from
`the network link. Cut-through delivery performs this
`pipelining on a single packet, rather than fragmenting
`into smaller packets and incurring additional packet han-
`dling overheads.
`Cut-through delivery must be implemented carefully
`to prevent the DMA out of the adapter from overrunning
`the DMA into the adapter, on either the sending or re-
`ceiving sides. The Trapeze MCP initiates the outgoing
`DMA as a series of shorter pulses. The MCP initiates
`an outgoing pulse as soon as a threshold number of in-
`coming bytes (set by the pulse threshold parameter) have
`arrived and the outgoing DMA engine is available. The
`LANai exposes the status of active DMA transactions to
`the firmware through counter registers for each DMA
`engine; the Trapeze-MCP main resource loop queries
`these counters to trigger outgoing DMA pulses.
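
The sketch below restates this send-side check as it might appear in
the MCP main loop. The accessors standing in for the LANai DMA
counter and control registers (host_dma_bytes_arrived,
link_send_dma_idle, start_link_send_dma) are our inventions for
illustration, not real LANai register names.

    extern unsigned host_dma_bytes_arrived(void); /* counter reg   */
    extern int      link_send_dma_idle(void);     /* engine status */
    extern void     start_link_send_dma(unsigned off, unsigned len);

    static unsigned pulse_threshold = 1280; /* tunable; near-optimal
                                               on the AlphaStation
                                               500 (Section 4)      */
    static unsigned bytes_sent;             /* bytes pulsed to link */

    /* Called from each iteration of the MCP main loop. */
    void maybe_send_pulse(void)
    {
        unsigned arrived = host_dma_bytes_arrived();

        /* Fire a fixed-size pulse onto the link once a threshold's
         * worth of new payload bytes has arrived from host memory
         * and the outgoing DMA engine is free. */
        if (link_send_dma_idle() &&
            arrived - bytes_sent >= pulse_threshold) {
            start_link_send_dma(bytes_sent, pulse_threshold);
            bytes_sent += pulse_threshold;
        }
    }
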
`
`3.2. Discussion
`
`The primary goal of cut-through delivery is to overlap
`I/O bus transfers with network transfers. This must be
`balanced against the bus cycles consumed to acquire the
`bus for a larger number of smaller transfers, which re-
`duces the bus bandwidth available for transferring data.
`Historically, LANs achieved only a fraction of the I/O
bus bandwidth, and cut-through delivery would have
had a negligible effect on large message latency. Therefore,
`high-end LAN adapters may use store-and-forward de-
`livery to enable larger bus transfers and attain maximum
`bandwidth for streams of packets. This can result in
`higher latencies for large messages, e.g., page transfers
`sent as individual AAL5 frames over an ATM network.
`Ethernet and ATM network adapter designs may em-
`ploy a form of cut-through delivery to reduce cost. A
`common design includes a receive FIFO that can hold a
`small amount of data (e.g., a few ATM cells), which are
`deposited into host buffers with DMA as new data ar-
`rives. This eliminates the need for enough adapter mem-
`ory to buffer an entire packet or Protocol Data Unit.
`For Ethernet, cut-through delivery will not reduce la-
`tency significantly because the maximum transmission
`unit (MTU) is too small. For ATM adapters, cut-through
`delivery can overlap DMA of leading cells into the host
`with DMA of trailing cells into the adapter, reducing
`large packet latency. However, this may also reduce
`bandwidth. The tradeoff is determined by the size of
`the FIFO or by more sophisticated balancing of batch-
`ing and cut-through optimizations [7]. For Trapeze, the
`balance can be adjusted by varying the pulse thresholds.
`In the next section we quantify the effects of cut-
`through delivery on large packet latency and bandwidth,
`focusing on the following questions:
`
• How much does cut-through delivery improve
`large-packet latencies in a real system?
`
• How sensitive is the benefit of cut-through delivery
to the DMA pulse sizes? How can the optimal pulse
size for a specific platform be determined?

• What is the smallest message size that benefits from
cut-through delivery?

• Will cut-through delivery continue to reduce large-
packet latency as networks and I/O buses advance?

4. Performance Analysis

This section presents the measured performance of
cut-through delivery on our platform, and analytically
derives expected benefits on other platforms with dif-
ferent I/O bus speeds. We first evaluate the critical de-
sign tradeoffs for cut-through delivery, then describe and
evaluate a refinement called eager DMA that dynami-
cally adapts DMA transfer sizes in response to the ob-
served behavior of the I/O bus and network link. We
then present performance numbers for a complete de-
mand fetch of a page from a remote host's memory, in-
cluding request latency and interrupt-based notification.

The timings in this section are taken from mi-
crobenchmarks of Trapeze on DEC AlphaStation 500
workstations with 266 MHz 21164 Alpha processors,
2MB of off-chip cache, 128MB of main memory, and
LANai 4.1 Myrinet adapters. The I/O bus is a 33 MHz
32-bit PCI with a Digital 21171 PCI bridge.

4.1. Measured Benefits of Cut-Through Delivery

We first present timings for cut-through delivery us-
ing a range of pulse thresholds. The send threshold is
the number of bytes that must arrive from the host be-
fore the sending NIC initiates DMA to the link. The
receive threshold is the number of bytes that must arrive
from the link before the receiving NIC initiates DMA to
host memory.

The pulse thresholds are critical tuning parameters.
Small thresholds result in smaller DMA transfers, which
can decrease the effective bandwidth of the I/O bus due
to the overhead of acquiring the bus for each transfer. On
the other hand, higher thresholds yield smaller pipelin-
ing benefits for individual packets. Moreover, the opti-
mal pulse threshold may depend on the relative speeds
of source and sink links. Our goal is to empirically de-
termine the optimal value for both pulse thresholds on
our platform.

[Figure 4. Impact of Pulse Threshold on Trapeze Latency for 8K
Payloads: two graphs of one-way latency (195 to 305 µs) against
send pulse size (left) and receive pulse size (right), for pulse
sizes from 0 to 7680 bytes.]

Figure 4 shows the effect of send and receive thresh-
olds on one-way page transfer latency. For these experi-
ments, the DMA pulse size was fixed at the pulse thresh-
old. The left-hand graph plots the send threshold on the
`x-axis; each line represents a different value for the re-
`ceive pulse. The right-hand graph shows the same data
`with the receive thresholds on the x-axis. These timings
`represent one half the average of 50 host-to-host round-
`trip page transfers, each consisting of a 128-byte control
`message and an attached 8KB payload. The hosts poll
`the LANai memory (with an optimal backoff determined
`empirically) to detect message arrival. We observed less
than a 10 µs variation from the average.
`As expected, page transfer latency is highest at both
`ends of the graphs, due to limited overlap of large pulses
`and the DMA setup overhead of small pulses. The upper
`right point of each graph represents the latency for store-
`and-forward; both send and receive pulse sizes are 8KB,
`so there is no pipelining benefit. The lowest points cor-
`respond to the optimal pulse thresholds on the sending
`and receiving sides. Cut-through delivery with optimal
thresholds reduces page transfer latency from 308 µs to
195 µs, a 37% improvement.
`Note that the two graphs have different shapes, indi-
`cating that optimal pulse sizes may not be equal on the
`sending and receiving sides. We believe this is due to an
`imbalance in the PCI performance of our platform. This
`imbalance is readily seen from Figure 5, which shows
`the DMA bandwidth for transferring data to and from
`the I/O bus as a function of DMA sizes ranging from 64
`bytes to 8192 bytes (the base VM page size in Digital
`Unix 4.0). The timings were produced by Myricom’s
`hswap MCP, using a cycle counter on the adapter. Peak
`DMA read bandwidth (transfers from main memory to
`the NIC) approaches 128 MB/s, the maximum achiev-
`able by the 32-bit PCI standard. However, the DMA
`write bandwidth (transfers from the NIC to main mem-
`ory) is about half the read bandwidth, apparently due to a
flaw in the 21171 PCI bridge. With these numbers (and
assuming a gigabit network link), a store-and-forward
transfer should take no less than 256 µs, plus control
message latency. This is consistent with the 308 µs
measured using Trapeze with store-and-forward delivery.

Although the I/O DMA bandwidth profile and imbal-
ances may require different send and receive thresholds
on some systems, on the AlphaStation 500 a single pulse
size of 1280 bytes is near optimal. However, latency
increases rapidly at pulse values adjacent to this optimum.

[Figure 5. DEC AlphaStation 500 DMA performance: DMA bandwidth
(MB/s) for reads (system memory to LANai) and writes (LANai to
system memory), for transfer sizes from 64 to 8192 bytes.]

4.2. Eager DMA

The upturn in the left-hand side of the graph in Fig-
ure 4 suggests a refinement to cut-through delivery. The
upturn shows that although smaller pulse sizes extract
the maximum benefit from pipelining, the overhead of
the additional DMA transfers overshadows this effect.
This is because the amount of data transferred with each
pulse is fixed at the pulse threshold, even though more
data may have arrived from the source by the time the
sink DMA engine is idle.

We enhanced the Trapeze-MCP to combine the ben-
efits of small and large thresholds using a technique we
call eager DMA. Eager DMA treats the pulse threshold
as a minimum unit of transfer, but moves all the data
available at the start of each DMA transfer. For example,
on the receiving side, the MCP initiates DMA to the host
as soon as the number of bytes received from the net-
work link exceeds the threshold. When the host DMA
pulse completes, the MCP checks the number of bytes
that arrived from the network while it was in progress.
If this amount exceeds the pulse threshold, the MCP im-
mediately initiates a new DMA for all of the newly re-
ceived data. Eager DMA achieves maximum transfer
overlap with the smallest number of transfers.

[Figure 6. Trapeze One-Way Page Transfer Latency with Eager DMA
(128 + 8K bytes): one-way latency (178 to 308 µs) against send
pulse size (left) and receive pulse size (right).]

Figure 6 shows the performance of the enhanced
Trapeze firmware, using the same experiment as in Fig-
ure 4. As expected, eager DMA delivers lower latency
than static pulse sizes. The effect is most pronounced
for small pulse thresholds, allowing Trapeze to achieve
an average one-way page transfer latency of 177 µs with
any of the following send/receive pulse threshold pairs:
(384/512), (512/512), (512/384). This is an additional
9% improvement over the previous minimum.
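
The sketch below restates the eager refinement for the receiving
side. As before, the register accessors are hypothetical stand-ins
for the LANai counter and control registers, and the threshold
serves only as a minimum transfer unit.

    extern unsigned link_bytes_arrived(void); /* counter register  */
    extern int      host_dma_idle(void);      /* engine status     */
    extern void     start_host_dma(unsigned off, unsigned len);

    static unsigned recv_threshold = 512;  /* minimum pulse, bytes */
    static unsigned bytes_delivered;       /* bytes DMA'd to host  */

    /* Called from each iteration of the MCP main loop; the
     * sub-threshold tail of a packet is flushed when the packet
     * ends (omitted here for brevity). */
    void maybe_deliver_eagerly(void)
    {
        unsigned pending = link_bytes_arrived() - bytes_delivered;

        /* Unlike a fixed pulse, move everything that has
         * accumulated, provided at least a threshold's worth is
         * pending and the host DMA engine is free. */
        if (host_dma_idle() && pending >= recv_threshold) {
            start_host_dma(bytes_delivered, pending);
            bytes_delivered += pending;
        }
    }
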
`
4.3. Varying Payload Size

Figure 7 compares store-and-forward with cut-through
delivery across payload sizes. The benefit of cut-through
delivery starts at 1KB payloads, reducing latency to
63 µs compared to 68 µs for store-and-forward. The
relative advantage of cut-through delivery increases with
payload size, with a 43% latency reduction for 8KB
messages.

[Figure 7. Cut-Through vs. Store-and-Forward Varying Payload Size:
one-way latency (42 to 302 µs) against payload size (0 to 8192
bytes).]

4.4. Effect of I/O Bus Bandwidth

With optimal cut-through delivery, large-packet la-
tency is determined by the time to move the packet
across the bottleneck link. Since the network link DMAs
are overlapped even with store-and-forward, cut-through
delivery is most effective when the network link band-
width approaches or exceeds the available I/O bus band-
width. In this case, bus transfer time is a significant com-
ponent of total latency using store-and-forward delivery.
This is the case with Myrinet on our AlphaStations: the