`Proceedings of the USENIX Annual Technical Conference (NO 98)
`New Orleans, Louisiana, June 1998
`
`Cheating the I/O Bottleneck:
`Network Storage with Trapeze/Myrinet
`
`Darrell C. Anderson, Jeffrey S. Chase, Syam Gadde, Andrew J. Gallatin, and Kenneth G. Yocum,
`Duke University
`Michael J. Feeley,
`University of British Columbia
`
`For more information about USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: http://www.usenix.org/
`
`Cheating the I/O Bottleneck:
`Network Storage with Trapeze/Myrinet
`Darrell C. Anderson, Jeffrey S. Chase, Syam Gadde, Andrew J. Gallatin, and Kenneth G. Yocum
`Department of Computer Science
`Duke University
{anderson, chase, gadde, gallatin, grant}@cs.duke.edu
`Michael J. Feeley
`Department of Computer Science
`University of British Columbia
`feeley@cs.ubc.ca
`
`Abstract
`
`Recent advances in I/O bus structures (e.g., PCI), high-
`speed networks, and fast, cheap disks have signifi-
`cantly expanded the I/O capacity of desktop-class sys-
`tems. This paper describes a messaging system de-
`signed to deliver the potential of these advances for
`network storage systems including cluster file systems
and network memory. We describe gms_net, an RPC-
`like kernel-kernel messaging system based on Trapeze,
`a new firmware program for Myrinet network interfaces.
`We show how the communication features of Trapeze
and gms_net are used by the Global Memory Service
`(GMS), a kernel-based network memory system.
`The paper focuses on support for zero-copy page mi-
`gration in GMS/Trapeze using two RPC variants im-
`portant for peer-peer distributed services: (1) delegated
`RPC in which a request is delegated to a third party,
`and (2) nonblocking RPC in which replies are pro-
`cessed from the Trapeze receive interrupt handler. We
`present measurements of sequential file access from
`network memory in the GMS/Trapeze prototype on a
`Myrinet/Alpha cluster, showing the bandwidth effects
`of file system interfaces and communication choices.
`GMS/Trapeze delivers a peak read bandwidth of 96
`MB/s using memory-mapped file I/O.
`
`1 Introduction
`
`Two recent hardware advances boost the potential of
`cluster computing: switched cluster interconnects that
`can carry 1Gb/s or more of point-to-point bandwidth,
`and high-quality PCI bus implementations that can han-
`dle data streams at gigabit speeds. We are develop-
ing system facilities to realize the potential for high-speed data transfer over Myricom’s 1.28 Gb/s Myrinet LAN [2], and harness it for cluster file systems, network memory systems, and other distributed OS services that cooperatively share data across the cluster. Our broad goal is to use the power of the network to “cheat” the I/O bottleneck for data-intensive computing on workstation clusters.

This work is supported by the National Science Foundation under grants CCR-96-24857 and CDA-95-12356, equipment grants from Intel Corporation and Myricom, and a software grant from the Open Group.
`This paper describes use of the Trapeze messag-
`ing system [27, 5] for high-speed data transfer in a
`network memory system, the Global Memory Service
`(GMS) [14, 18]. Trapeze is a firmware program for
`Myrinet/PCI adapters, and an associated messaging li-
`brary for DEC AlphaStations running Digital Unix 4.0
`and Intel platforms running FreeBSD 2.2. Trapeze com-
`munication delivers the performance of the underlying
`I/O bus hardware, balancing low latency with high band-
`width. Since the Myrinet firmware is customer-loadable,
`any Myrinet network site with PCI-based machines can
`use Trapeze.
`GMS [14] is a Unix kernel facility that manages the
`memories of cluster nodes as a shared, distributed page
`cache. GMS supports remote paging [8, 15] and co-
`operative caching [10] of file blocks and virtual mem-
`ory pages, unified at a low level of the Digital Unix 4.0
`kernel (a FreeBSD port is in progress). The purpose of
`GMS is to exploit high-speed networks to improve per-
`formance of data-intensive workloads by replacing disk
`activity with memory-memory transfers across the net-
`work whenever possible. The GMS mechanisms man-
`age the movement of VM pages and file blocks between
`each node’s local page cache — the file buffer cache
`and the set of resident virtual pages — and the network
`memory global page cache.
`This paper deals with the communication mecha-
`nisms and network performance of GMS systems us-
`ing Trapeze/Myrinet, with particular focus on the sup-
`port for zero-copy read-ahead and write-behind of se-
`quentially accessed files. Cluster file systems that stripe
data across multiple servers are typically limited by
`the bandwidth of the network and communication sys-
`tem [23, 16, 1]. We measure synthetic bandwidth tests
`that access files in network memory, in order to deter-
`mine the maximum bandwidth achievable through the
`file system interface by any network storage system us-
`ing Trapeze. The current GMS/Trapeze prototype can
`read files from network memory at 96 MB/s on an Al-
`phaStation/Myrinet network. Since these speeds ap-
`proach the physical limits of the hardware, unnecessary
`overheads (e.g., copying) can have significant effects on
`performance. These overheads can occur in the file ac-
`cess interface as well as in the messaging system. We
`evaluate three file access interfaces, including two that
`use the Unix mmap system call to eliminate copying.
`Central to GMS is an RPC-like messaging facility
(gms_net) that works with the Trapeze interface to sup-
`port the messaging patterns and block migration traffic
`characteristic of GMS and other network storage ser-
`vices. This includes a mix of asynchronous and re-
`quest/response messaging (RPC) that is peer-to-peer in
`the sense that each “client” may also act as a “server”.
`The support for RPC includes two variants important
`for network storage: (1) delegated RPCs in which re-
`quests are delegated to third parties, and (2) nonblock-
`ing RPC in which the replies are processed by contin-
`uation procedures executing from an interrupt handler.
`These features are important for peer-to-peer network
`storage services: the first supports directory lookups for
`fetched data, and the second supports lightweight asyn-
`chronous calls, which are useful for prefetching. When
using these features, GMS and gms_net cooperate with
`Trapeze to unify buffering of migrated pages, eliminat-
`ing all page copies by sending and receiving directly
`from the file buffer cache and local VM page cache.
`This paper is organized as follows. Section 2 gives an
`overview of the Trapeze network interface and the fea-
`tures relevant to GMS communication. Section 3 deals
with the gms_net messaging layer for Trapeze, focus-
`ing on the RPC variants and zero-copy handling of page
`transfers. Section 4 presents performance results from
`the GMS/Trapeze prototype. We conclude in Section 5.
`
`2 High-Speed Data Transfer with Trapeze
`
`The Trapeze messaging system consists of two compo-
`nents: a messaging library that is linked into programs
`using the package, and a firmware program that runs on
`the Myrinet network interface card (NIC). The Trapeze
`firmware and the host interact by exchanging commands
`and data through a block of memory on the NIC, which
`is addressable in the host’s physical address space using
`programmed I/O. The firmware defines the interface be-
`tween the host CPU and the network device; it interprets
`commands issued by the host and controls the movement
`
`of data between the host and the network link. The host
`accesses the network using macros and procedures in the
`Trapeze library, which defines the lowest level API for
`network communication across the Myrinet.
`
Figure 1: Using Trapeze for TCP/IP and for kernel-kernel messaging for network memory. (Diagram: user applications sit above the kernel; inside the kernel, the file/VM system and Global Memory Service use the RPC-like message layer, while sockets and TCP/IP use the network driver, and both paths run over the Trapeze API and the Trapeze/Myrinet NIC attached by PCI.)
`
`Like other network interfaces based on Myrinet (e.g.,
`Hamlyn [4], VMMC-2 [13], Active Messages [9],
`FM [21]), Trapeze can be used as a memory-mapped
`network interface for user applications, e.g., parallel pro-
`grams. However, Trapeze was designed primarily to
`support fast kernel-to-kernel messaging alongside con-
`ventional TCP/IP networking. The Trapeze distribu-
`tion includes a network device driver that allows the
`native TCP/IP protocol stack to use a Trapeze network
alongside the gms_net layer. Figure 1 depicts this struc-
`ture. The kernel-to-kernel messaging layer is intended
`for GMS and other services that assume mutually trust-
`ing kernels.
`
`2.1 Trapeze Overview
`
`Trapeze messages are short control messages (maximum
`128 bytes) with optional attached payloads typically
`containing application data not interpreted by the mes-
`sage system, e.g., a file block, a virtual memory page,
`or a TCP segment. Each message can have at most one
`payload attached to it. Separation of control messages
`and bulk data transfer is common to a large number of
`messaging systems since the V system [6].
`A Trapeze control message and its payload (if any) are
`sent as a single packet on the network. Since Myrinet has
`no fixed maximum packet size (MTU), the maximum
`payload size of a Trapeze network is configurable, and is
`typically set to the virtual memory page size (4K or 8K).
`The Trapeze MTU is the maximum control message size
`plus the payload size.
`Payloads are sent and received using DMA to/from
`aligned buffers residing anywhere in host memory. The
`host attaches a payload to an outgoing message using a
`Trapeze macro that stores the payload’s DMA address
`and length into designated fields of the send ring entry.
`On the receiving side, Trapeze deposits the payload into
`a host memory buffer before delivering the control mes-
`sage.
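
To make the control message/payload split concrete, the following C sketch models how a host might attach an aligned page frame as the single payload of an outgoing message. The structure layout and the tpz_ names are illustrative assumptions, not the actual Trapeze API.

    #include <stdint.h>
    #include <string.h>

    #define TPZ_CTRL_BYTES 128        /* maximum control message size */
    #define TPZ_PAYLOAD_BYTES 8192    /* configured payload size, e.g., the 8K VM page size */

    /* One send ring entry as the host might see it (illustrative layout only). */
    struct tpz_send_entry {
        uint8_t  ctrl[TPZ_CTRL_BYTES];  /* control message, written with programmed I/O */
        uint64_t payload_dma;           /* DMA address of the attached payload frame */
        uint32_t payload_len;           /* payload length in bytes; 0 means no payload */
        volatile uint32_t ready;        /* set last, telling the firmware to send */
    };

    /* Attach an aligned frame (VM page, file block, TCP segment) as the payload. */
    static void tpz_attach_payload(struct tpz_send_entry *e, uint64_t frame_dma, uint32_t len)
    {
        e->payload_dma = frame_dma;     /* the NIC will DMA the frame from host memory */
        e->payload_len = len;
    }

    /* Form the control message and hand the entry to the firmware. */
    static void tpz_send(struct tpz_send_entry *e, const void *ctrl, size_t ctrl_len)
    {
        memcpy(e->ctrl, ctrl, ctrl_len);
        e->ready = 1;
    }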
`
Figure 2: NIC Memory Structures for a Trapeze endpoint. (Diagram: NIC memory holds the send ring, the receive ring, and the incoming payload table, whose entries refer to payload frames in host memory.)
`
`The data structures in NIC memory include an end-
`point structure shared with the host. A Trapeze endpoint
`(shown in Figure 2) includes two message rings, one for
`sending and one for receiving. Each message ring is a
`circular array of 128-byte control message buffers and
`related state, managed as a producer/consumer queue.
`From the perspective of a host CPU, the NIC produces
`incoming messages in the receive ring and consumes
`outgoing messages in the send ring. The host sends a
`message by forming it in the next free send ring entry
`and setting a bit to indicate that the message is ready
`to send. When a message arrives from the network, the
`firmware deposits it into the next free receive ring entry,
`sets a bit to inform the host that the message is ready to
`consume, and optionally signals the host with an inter-
`rupt.
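
The producer/consumer discipline on the message rings can be sketched as follows; the ring size, the field names, and the single "full" bit per entry are illustrative assumptions rather than the real endpoint layout.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define RING_ENTRIES 64                 /* illustrative ring size */

    struct ring_entry {
        uint8_t msg[128];                   /* 128-byte control message buffer in NIC memory */
        volatile uint32_t full;             /* producer sets when a message is ready; consumer clears */
    };

    struct msg_ring {
        struct ring_entry e[RING_ENTRIES];  /* circular array managed as a producer/consumer queue */
        uint32_t next;                      /* this side's cursor into the ring */
    };

    /* Host send path: form the message in the next free send ring entry and set the
     * bit that tells the firmware it is ready to send; returns 0 if the ring is full. */
    static int host_send(struct msg_ring *send_ring, const void *msg, size_t len)
    {
        struct ring_entry *e = &send_ring->e[send_ring->next];
        if (e->full)
            return 0;                       /* NIC has not consumed this entry yet */
        memcpy(e->msg, msg, len);           /* programmed I/O into NIC memory */
        e->full = 1;
        send_ring->next = (send_ring->next + 1) % RING_ENTRIES;
        return 1;
    }

    /* Host receive path: consume messages the firmware has produced; in the kernel
     * this runs from the Trapeze receive interrupt handler. */
    static void host_receive(struct msg_ring *recv_ring, void (*deliver)(uint8_t *msg))
    {
        while (recv_ring->e[recv_ring->next].full) {
            deliver(recv_ring->e[recv_ring->next].msg);
            recv_ring->e[recv_ring->next].full = 0;   /* return the entry to the NIC */
            recv_ring->next = (recv_ring->next + 1) % RING_ENTRIES;
        }
    }
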
`Handling of incoming messages is interrupt-driven
`when Trapeze is used from within the kernel. Each ker-
nel protocol module using Trapeze (i.e., gms_net and the
`IP network driver) registers a receiver interrupt handler
`upcalled from the Trapeze interrupt handler.
`Trapeze is designed to optimize handling of payloads
`as well as to deliver good performance for small mes-
`sages.
`In a network memory system, page fault stall
`time is determined primarily by the time to transfer the
`requested page on the network. On the other hand,
`bursts of page transfers (e.g., for read-ahead for se-
`quential access) require high bandwidth. The Trapeze
`firmware employs a message pipelining technique called
`cut-through delivery [27] to balance low payload latency
`with high bandwidth under load. With this technique,
`the one-way raw Trapeze latency for a 4K page trans-
fer is 70 μs on 300 MHz Pentium-II/440LX systems with
`LANai 4.1 M2M-PCI32 Myrinet adapters. On these sys-
`tems, Trapeze delivers 112 MB/s for a stream of 8K pay-
`loads; with 64K payloads, Trapeze can use over 95% of
`the peak bandwidth of the I/O bus, achieving 126 MB/s
`of user-to-user point-to-point bandwidth.1
`
1These bandwidth numbers define a “megabyte” as one million bytes. All other bandwidth numbers in this paper define 1MB as 1024*1024 bytes.
`
2.2 Unified Buffering for In-Kernel Trapeze
`
`All kernel-based Trapeze protocol modules share a com-
`mon pool of receive buffers allocated from the virtual
`memory page frame pool; the maximum payload size is
`set to the virtual memory page size. Since Digital Unix
`allocates its file block buffers from the virtual memory
`page frame pool as well, this allows unified buffering
`among the network, file, and VM systems. For example,
`the system can send any virtual memory page or cached
`file block out to the network by attaching it as a payload
`to an outgoing message. Similarly, every incoming pay-
`load is deposited in an aligned physical frame that can
be mapped into a user process or hashed into the file cache.
`Since file caching and virtual memory management are
`reasonably unified, we often refer to the two subsystems
`collectively as “the file/VM system”, and use the term
`“page” to include file blocks.
`The TCP/IP stack can also benefit from the unified
`buffering of Trapeze payloads to reduce copying over-
`head by payload remapping (similar to [11, 3, 17]). On
`a normal transmission, IP message data is copied from
`a user memory buffer into an mbuf chain [20] on the
`sending side; on the receiving side, the driver copies the
`header into a small mbuf, points a BSD-style external
`mbuf at the payload buffer, and passes the chain through
`the IP stack to the socket layer, which copies the payload
`into user memory and frees the kernel buffer. We have
`modified the Digital Unix socket layer to avoid copying
`when size and alignment properties allow. On the send-
`ing side, the socket layer builds mbuf chains by pinning
`the user buffer frames, marking them copy-on-write, ref-
`erencing them with external mbufs, and passing them
`through the TCP/IP stack to the network driver, which
`attaches them to outgoing messages as payloads. On
`the receiving side, the socket layer unmaps the frames
`of the user buffer, replaces them with the kernel pay-
`load buffer frames, and frees the user frames. With pay-
`load remapping, AlphaStations running the standard net-
`perf TCP benchmark over Trapeze sustain point-to-point
bandwidth of 87 MB/s.2

2Measured Alcor (266 MHz AS 500) to Miata (500 MHz PWS 500au), 8320-byte MTU, 1M netperf transfers, socket buffers at 1M, software TCP checksums disabled (hardware CRC only): 732 Mb/s.
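
A rough model of the send-side zero-copy decision is sketched below. The simplified page_ref and ext_mbuf types and the pinning helper are stand-ins for the Digital Unix VM and mbuf machinery, not its real interfaces; the point is only the size/alignment test and the copy-on-write pinning described above.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 8192

    struct ext_mbuf  { void *data; size_t len; };          /* external mbuf referencing a frame */
    struct page_ref  { void *frame; int pinned; int cow; };

    static void pin_and_mark_cow(struct page_ref *p) { p->pinned = 1; p->cow = 1; }

    /* Decide whether a user buffer can be sent without copying: page-aligned,
     * page-sized pieces are pinned, marked copy-on-write, and referenced by
     * external mbufs; anything else falls back to the normal copying path. */
    static int try_zero_copy_send(void *ubuf, size_t len,
                                  struct ext_mbuf *chain, struct page_ref *refs, size_t max)
    {
        if (((uintptr_t)ubuf % PAGE_SIZE) != 0 || (len % PAGE_SIZE) != 0)
            return 0;                                      /* alignment/size rules out remapping */
        size_t npages = len / PAGE_SIZE;
        if (npages > max)
            return 0;
        for (size_t i = 0; i < npages; i++) {
            refs[i].frame = (char *)ubuf + i * PAGE_SIZE;
            pin_and_mark_cow(&refs[i]);                    /* a later write faults to a private copy */
            chain[i].data = refs[i].frame;                 /* external mbuf points at the frame */
            chain[i].len  = PAGE_SIZE;
        }
        return 1;                                          /* chain goes to the driver as payloads */
    }
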
`Since outgoing payload frames attached to the send
`ring may be owned by the file/VM system, they must
`be protected from modification or reuse while a trans-
`mit is in progress. Trapeze notifies the system that it is
`safe to overwrite an outgoing frame by upcalling a spec-
`ified transmit completion handler routine. For example,
`when an IP send on a user frame completes, Trapeze up-
calls the completion routine, which unpins the frame and releases its copy-on-write protection.
`
`
`However, to reduce overhead Trapeze does not gener-
`ate transmit-complete interrupts. Instead, Trapeze saves
`the handler pointer in host memory and upcalls the han-
`dler only when the send ring entry is reused for another
`send. Since messages may be sent from interrupt han-
`dlers, a completion routine could be called in the context
`of an interrupt handler that happened to reuse the same
`send ring entry as the original message. For this rea-
`son, completion handlers must not block, and the struc-
`tures they manipulate must be protected by disabling
`interrupts. Since completion upcalls may be arbitrar-
`ily delayed, the Trapeze API includes a routine to poll
`all pending transmits and call their handlers if they have
`completed.
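
The lazy completion-upcall scheme might look roughly like the following sketch, where the handler saved for a send ring entry runs only when the entry is reused or when pending transmits are polled. The structure and names are assumptions for illustration, not the Trapeze implementation.

    #include <stddef.h>

    typedef void (*tpz_complete_fn)(void *arg);

    /* Per-entry completion state kept in host memory (illustrative). */
    struct send_slot {
        tpz_complete_fn handler;   /* run when it is safe to reuse the payload frame */
        void *arg;
        int   pending;             /* a send with this handler is outstanding */
        volatile int nic_done;     /* set once the NIC has finished DMA for this entry */
    };

    /* Called when the host claims a send ring entry for a new message: reuse of the
     * entry proves the previous transmit has completed, so run its handler now.
     * There are no transmit-complete interrupts; handlers must not block, and the
     * structures they touch must be protected by disabling interrupts. */
    static void slot_reuse(struct send_slot *s)
    {
        if (s->pending && s->handler)
            s->handler(s->arg);    /* e.g., unpin the frame, clear copy-on-write */
        s->pending = 0;
    }

    /* Poll all pending transmits and run handlers for those that have completed,
     * since completion upcalls may otherwise be delayed arbitrarily. */
    static void tpz_poll_transmits(struct send_slot *slots, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (slots[i].pending && slots[i].nic_done)
                slot_reuse(&slots[i]);
    }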
`
`2.3 Incoming Payload Table
`
`The benefits of high-speed networking are easily over-
`shadowed by processing costs and copying overhead
`in the hosts. To support zero-copy communication, a
`Trapeze receiver can designate a region of memory as
`the receive buffer space for a specific incoming payload
`identified by a tag field. When the message arrives, the
`firmware recognizes the tag and deposits the payload di-
`rectly into the waiting buffer. Handling of tagged pay-
`loads is governed by a third structure in NIC memory,
`the incoming payload table (IPT).
`GMS uses the Trapeze IPT for copy-free handling of
`fetched pages in RPC replies, as described in Section 3.
`Ordinarily, Trapeze payloads are received into buffers
`attached by the host to the receive ring entries; since the
`firmware places messages in the ring in the order they
`arrive, the host cannot know in advance which generic
`buffer will be selected to receive any given payload, and
`the payload may need to be copied within the host if it
`cannot be remapped. Early demultiplexing with the IPT
`avoids this copy.
`To set up an IPT mapping, the host calls a Trapeze
`API routine to allocate a free entry in the IPT, initialize it
`with the DMA address of the designated payload buffer,
`and return a tag value (payload token) consisting of an
`IPT index and a protection key. The payload token is a
`weak form of capability that can be passed in a message
`to another node; any node that knows the token can use
`it to tag a message and transmit a payload into the buffer.
`When the firmware receives a tagged message from the
`network, it validates the key against the indexed IPT en-
`try before initiating a DMA into the designated receive
`buffer. The receiving host may cancel the IPT entry at
`any time (e.g., request timeout); similarly, the firmware
`protects against dangling tokens and duplicate messages
`by cancelling the entry when a matching message is re-
`ceived. If the key is not valid, the NIC drops the payload
`
`and delivers the control portion with a payload length of
`zero, so the receive message handler can recognize and
`handle the error.
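
A minimal sketch of the IPT mechanism follows, assuming an illustrative table size and token layout (the real firmware structures differ). It shows an entry being allocated on the receiver and the firmware-side check performed when a tagged message arrives.

    #include <stdint.h>
    #include <stdlib.h>

    #define IPT_ENTRIES 256                  /* illustrative table size */

    struct ipt_entry {
        uint64_t buf_dma;                    /* DMA address of the designated receive buffer */
        uint32_t key;                        /* protection key checked against the message tag */
        int      valid;
    };

    /* A payload token pairs an IPT index with a protection key; it is a weak
     * capability that can be passed in a message to another node. */
    struct payload_token { uint32_t index; uint32_t key; };

    static struct ipt_entry ipt[IPT_ENTRIES];

    /* Host side: map a waiting buffer and return the token to embed in a request. */
    static struct payload_token ipt_alloc(uint64_t buf_dma)
    {
        for (uint32_t i = 0; i < IPT_ENTRIES; i++) {
            if (!ipt[i].valid) {
                ipt[i].buf_dma = buf_dma;
                ipt[i].key = (uint32_t)rand();       /* hard-to-guess key */
                ipt[i].valid = 1;
                return (struct payload_token){ i, ipt[i].key };
            }
        }
        return (struct payload_token){ IPT_ENTRIES, 0 };   /* table full */
    }

    /* Firmware side: validate the key, cancel the entry (guarding against duplicates
     * and dangling tokens), and return the buffer address; 0 means drop the payload
     * and deliver the control message with a payload length of zero. */
    static uint64_t ipt_claim(struct payload_token t)
    {
        if (t.index >= IPT_ENTRIES || !ipt[t.index].valid || ipt[t.index].key != t.key)
            return 0;
        ipt[t.index].valid = 0;
        return ipt[t.index].buf_dma;
    }
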
`At present, the IPT maps only a few megabytes of
`host memory, enough for the reply payloads of all out-
`standing requests (e.g., outstanding page fetches). This
`is a modest approach that meets our needs, relative to
`more ambitious approaches that indirect through TLB-
`like structures on the NIC [13, 26, 7]. We have con-
`sidered a larger IPT with support for multiple transfers
`to the same buffer at different offsets, as in Hamlyn’s
`sender-based memory management [4], but we have not
`found a need for these features in our current uses of
`Trapeze.
`
`3 Page Transfers in GMS/Trapeze
`
`This section outlines a Trapeze-based kernel-kernel
`RPC-like messaging layer designed to support coop-
`erative cluster services. The package is derived from
`the original RPC package for the Global Memory Ser-
vice [14] (gms_net), extended to use Trapeze and to sup-
`port a richer set of communication styles, primarily for
`asynchronous prefetching at high bandwidth [24]. Al-
`though the package is generic, we draw on GMS exam-
`ples to motivate its features and to illustrate their use.
`Since many aspects of RPC and messaging systems
`are well-understood, we focus on those aspects that ben-
`efit from the Trapeze features discussed in the previ-
`ous section.
`In particular, we explain the features for
`transferring pages (or file blocks) efficiently within the
`RPC framework, and their use by the protocol operations
`most critical for GMS performance: page fetches (get-
`page) from the global page cache to a local page cache,
`and page pushes or evictions (putpage or movepage)
`from a local cache to the global cache.
`Section 3.2 discusses the zero-copy handling of
`fetched pages using the Trapeze incoming payload ta-
`ble (IPT); Sections 3.3 and 3.4 extend the zero-copy re-
`ply scheme to delegated and nonblocking RPC variants
`useful in GMS and other peer-to-peer network services.
`We illustrate use of nonblocking RPC to extend standard
`read-ahead for files and virtual memory to GMS; this al-
`lows processes to access data from network memory or
`storage servers at close to network bandwidth.
`
`3.1 Basic Mechanisms
`
The gms_net messaging layer includes basic support for
`typed messages, stub procedures, dispatching to ser-
`vice procedures based on message types, and matching
`replies with requests. The Trapeze receiver interrupt
handler directs incoming messages to gms_net by up-
calling a registered service routine; the service routine
hands off incoming requests to a server thread. However, gms_net is not a true RPC system: many protocol messages do not produce replies, and there is no support for automatic stub generation. The package is best thought of as a library of procedures and macros used by the messaging stubs to build and decode messages and to direct their flow through the system. It is designed for messages with relatively simple arguments and bulk data payloads (e.g., file blocks) that are not interpreted by the message handlers themselves.
To send a message, a stub allocates a message buffer with gms_net_makebuf, calls routines and macros to build the message, e.g., by pushing data items into the message, and sends the message to a destination with gms_net_sendto. In Trapeze, gms_net_makebuf returns a pointer to a send ring entry, and gms_net_sendto releases it. Messages are typed by an operation code and a request/reply bit. Incoming requests are dispatched by using the operation code to index into a vector of registered server-side stubs. Incoming replies are handled directly by the receiver interrupt handler, either by waking up a waiting thread or by calling a reply continuation procedure as described in Section 3.4.
An important function of the RPC layer is to match incoming replies with requests. If a reply is expected, the caller makes an entry in a call record table before sending the request message, and places a reply token containing a unique call record ID into the outgoing request message. After sending the request, the calling thread or process may block on the call record entry. When the server side generates a reply, it places a copy of the reply token in the reply message. When the reply arrives, the receiver interrupt handler decodes the reply token and retrieves the call record. The call record includes all information needed to process the reply, e.g., by awakening the calling thread or process.
To transfer a page or file block in a request or reply, the stub attaches the memory frame to the message buffer as a payload. The system is inhibited from reusing the frame or overwriting it until the frame contents have been transferred to the network adapter using DMA (Section 2.2). On the receiving side, Trapeze uses DMA to deposit each received payload into a memory frame designated by the receiver.
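
The reply-matching machinery just described can be sketched as follows; the call record and reply token layouts are illustrative assumptions rather than the actual gms_net structures.

    #include <stdint.h>

    #define MAX_CALLS 64

    /* Reply token carried in each request that expects a reply (illustrative layout);
     * for page fetches it also carries the Trapeze payload token of Section 3.2. */
    struct reply_token { uint32_t call_id; uint32_t payload_token; };

    struct call_record {
        int   in_use;
        int   done;
        void *reply_frame;                            /* target frame mapped through the IPT, if any */
        void (*continuation)(struct call_record *);   /* NULL for blocking calls (see Section 3.4) */
    };

    static struct call_record calls[MAX_CALLS];

    /* Client side, before sending the request: reserve a call record for the reply. */
    static uint32_t call_record_alloc(void)
    {
        for (uint32_t i = 0; i < MAX_CALLS; i++)
            if (!calls[i].in_use) { calls[i].in_use = 1; calls[i].done = 0; return i; }
        return MAX_CALLS;                             /* none free */
    }

    /* Receiver interrupt path: decode the reply token echoed by the server, find the
     * call record, and either wake the blocked caller or run its continuation. */
    static void handle_reply(struct reply_token t)
    {
        if (t.call_id >= MAX_CALLS || !calls[t.call_id].in_use)
            return;                                   /* stale or reused record; ignore */
        struct call_record *c = &calls[t.call_id];
        c->done = 1;
        if (c->continuation)
            c->continuation(c);
        /* else: wake the thread or process sleeping on this call record */
    }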
`
3.2 Zero-Copy Reply Handling

GMS performance depends on efficient handling of getpage replies containing page payloads. When the virtual memory system or file system initiates a page fetch, it first selects the target page frame according to its policies for page replacement and other factors such as page coloring. The goal of the GMS getpage client stub is to arrange to transfer the incoming page directly to the waiting frame using DMA.
The RPC system uses the Trapeze incoming payload table (IPT) described in Section 2.3 for this purpose. The client-side getpage stub calls Trapeze to allocate an IPT entry and obtain a payload token, which is added to the reply token for the call. When the server-side getpage stub generates a reply, it attaches the frame containing the requested page as a payload, extracts the Trapeze payload token, and tags the outgoing reply message by placing the token in its Trapeze header. Back on the client side, the Trapeze firmware recognizes the tag in the message header as the reply payload begins to arrive on the adapter; once the tag is decoded and validated, the firmware initiates DMA of the message payload into the waiting frame.

Figure 3: GMS getpage operation through a directory site using a delegated RPC. (Diagram: the requesting node sends a getpage request to the page’s directory site, which delegates it to the caching site holding page A; the caching site replies with the page directly to the requesting node.)

3.3 Delegated RPC

Unlike traditional RPC, some GMS protocol operations involve more than one server. To fetch a remote page, for example, the getpage operation must first locate the page’s caching site. To keep track of pages, GMS uses a distributed hash directory for pages potentially sharable by multiple nodes [14]. A requesting node locates a page’s directory site by applying a globally replicated hash function to a unique page identifier. It then issues a getpage RPC to the directory site, which looks up the page in its portion of the directory, and forwards the request to the caching site. The caching site completes the three-way RPC by returning the page directly to the requester. We call this type of operation a delegated RPC.
The key idea behind delegated RPC is to allow the reply token to be passed from node to node until the call is complete; the last node in the RPC sequence then uses the reply token to complete the RPC and reply to the original requester. The delegation is transparent to the requester, which creates its reply token and includes it in the request message exactly as for an ordinary RPC.
`
`With delegated RPC, however, a remote procedure can
`either return a reply or delegate the request to another
`peer by generating a new request message with a copy
`of the original reply token. Each node has the same op-
`tion of either replying to the original requester or del-
`egating the request to yet another peer. Note that each
`delegation call is unlike a normal RPC in that the dele-
`gated procedure never replies to its immediate caller; it
`either forwards the request again or replies directly to the
`node that initiated the delegated RPC. Most importantly,
`the zero-copy reply handling scheme is preserved, since
`the Trapeze payload token is embedded within the re-
`ply token, and so is available to the node that ultimately
`generates the reply.
`Figure 3 shows how GMS getpage uses delegated
`RPC. The directory site for the requested page delegates
`the request to the caching site, forwarding the original
`request parameters and reply token. The caching site
`generates a reply, attaches the requested page as a pay-
`load, and tags the message with the payload token, as
`described in the previous section. It then sends the reply
`directly to the original requester, where it is handled in
`the same way as a direct reply.
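
The delegation path can be sketched with hypothetical message formats and a stand-in send routine; the essential point, as described above, is that the original reply token travels with each delegated request so that the final node can reply directly to the requester.

    #include <stdint.h>

    /* Illustrative formats, not the actual gms_net wire format. */
    struct reply_token { uint32_t requester; uint32_t call_id; uint32_t payload_token; };

    struct getpage_req {
        uint64_t page_id;
        struct reply_token reply;      /* created once by the original requester */
    };

    static void send_to(uint32_t node, const void *msg, uint64_t len)
    { (void)node; (void)msg; (void)len; }   /* stand-in for a gms_net send */

    /* Directory-site stub: never replies to its caller; it forwards the request,
     * with the original reply token, to the caching site named in its directory. */
    static void getpage_at_directory(const struct getpage_req *req, uint32_t caching_site)
    {
        struct getpage_req fwd = *req;          /* same page id, same reply token */
        send_to(caching_site, &fwd, sizeof fwd);
    }

    /* Caching-site stub: attach the page frame as the reply payload, tag the message
     * with the payload token embedded in the reply token, and send the reply straight
     * to the node that initiated the delegated RPC. */
    static void getpage_at_caching_site(const struct getpage_req *req, const void *page_frame)
    {
        (void)page_frame;                       /* attached as the tagged payload */
        send_to(req->reply.requester, &req->reply, sizeof req->reply);
    }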
`
`3.4 Read-Ahead with Nonblocking RPC
`
Most file systems implement read-ahead
`to prefetch file data before it is requested. This signif-
`icantly improves bandwidth for sequential file access,
`which is easy to detect and exploit [20]. We have ex-
`tended GMS to support read-ahead in order to meet our
`goal of delivering files from network memory at full net-
`work bandwidth. GMS read-ahead hides fetch latency
`and delivers peak bandwidth by pipelining the network
`with a continuous stream of page fetches.
`
`3.4.1 Nonblocking RPC
`
`Any form of prefetching imposes new demands on the
`communication layer and is highly dependent on its per-
`formance. In particular, prefetching requests are RPC
`calls that generate replies, but the replies must be han-
`dled asynchronously and outside of the issuing thread
`context, so as not to block the issuing thread while the re-
`quest is pending. NFS client implementations typically
`solve this problem by handing off read-ahead calls to a
`system I/O daemon that can wait for RPC replies with-
`out affecting user processes [22]. This solution requires
`a context switch to the I/O daemon for each request and
`response.
`To reduce context switching overhead, GMS/Trapeze
`implements read-ahead and prefetching using nonblock-
ing RPCs. To implement nonblocking RPC, gms_net supplements the call record with support for continuation procedures invoked directly from the receiver in-
`terrupt handler to process the reply. These continua-
tions are similar to those of Draves et al. [12], but they ex-
`ecute at interrupt time with no associated thread con-
`text. Also, each nonblocking RPC call may have sev-
`eral continuations; the issuing stub pushes pointers to
`these continuation procedures and their arguments onto
`a continuation stack linked to the call record returned by
gms_net_makebuf. When the reply arrives, the gms_net
`receiver interrupt handler locates the call record for the
`reply as before, pops the continuations from the stack,
`and calls them in order with their arguments. Like other
`interrupt handling code, continuation procedures are not
`permitted to sleep.
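
The continuation stack attached to a nonblocking call record might be modeled as follows; the fixed depth and the function signature are illustrative assumptions.

    #include <stddef.h>

    #define MAX_CONTS 4

    typedef void (*continuation_fn)(void *arg);

    /* A nonblocking call record carries a small stack of continuations pushed by the
     * issuing stub; the receiver interrupt handler pops and runs them when the reply
     * arrives. Continuations execute with no thread context and must never sleep. */
    struct nb_call_record {
        continuation_fn fn[MAX_CONTS];
        void           *arg[MAX_CONTS];
        int             depth;
    };

    /* Issuing stub: push a continuation and its argument onto the call record. */
    static int push_continuation(struct nb_call_record *c, continuation_fn fn, void *arg)
    {
        if (c->depth >= MAX_CONTS)
            return 0;
        c->fn[c->depth] = fn;
        c->arg[c->depth] = arg;
        c->depth++;
        return 1;
    }

    /* Interrupt-time dispatch, called by the gms_net receiver interrupt handler. */
    static void run_continuations(struct nb_call_record *c)
    {
        while (c->depth > 0) {
            c->depth--;
            c->fn[c->depth](c->arg[c->depth]);
        }
    }
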
Continuations in gms_net nonblocking RPC are re-
`lated to callbacks in Rover’s QRPC [19]. In Rover, asyn-
`chronous RPC calls are used to allow applications to tol-
`erate slow and unreliable mobile networks, whereas in
`GMS/Trapeze their purpose is to support pipelined RPC
`operations (e.g., prefetching) on a fast and reliable clus-
`ter interconnect.
`
`3.4.2 Read-Ahead from Network Memory
`
`GMS/Trapeze activates sequential read-ahead when the
`file/VM system determines that accesses are sequential,
`and that subsequent pages are resident in the global
`cache but not in the local cache.
`It issues nonblock-
`ing RPCs to prefetch the next N pages for some config-
`urable depth N; these requests are issued in the context
`of the user process accessing the data. Each prefetch re-
`quest is an ordinary getpage operation; to the receiver,
`they are indistinguishable from synchronous fetch re-
`quests. The caller allocates the target page frame be-
`fore the prefetch, maps it through the IPT as described
`in Section 3.2, and includes a reply token in the message.
`The server generates a reply as described in Section 3.2,
`possibly delegating the request to a peer as described in
`Section 3.3.
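
A sketch of the read-ahead loop follows; the helper routines are trivial stand-ins (our names) for the frame allocator, the IPT mapping call of Section 3.2, and the nonblocking getpage stub.

    #include <stdint.h>
    #include <stdlib.h>

    #define READAHEAD_DEPTH 8            /* illustrative prefetch depth N */

    /* Stand-ins so the sketch is self-contained; in the kernel these are real calls. */
    static void *alloc_target_frame(void) { return malloc(8192); }
    static uint32_t ipt_map_frame(void *frame) { (void)frame; return 0; }
    static void getpage_nonblocking(uint64_t page, uint32_t payload_token,
                                    void (*inject)(void *frame))
    { (void)page; (void)payload_token; (void)inject; }
    static void inject_into_page_cache(void *frame) { (void)frame; }

    /* Issued in the context of the user process once accesses look sequential and
     * the next pages are known to be in the global cache but not the local cache. */
    static void gms_readahead(uint64_t next_page)
    {
        for (uint64_t i = 0; i < READAHEAD_DEPTH; i++) {
            void *frame = alloc_target_frame();     /* chosen per the usual replacement policy */
            uint32_t token = ipt_map_frame(frame);  /* the reply payload lands here, zero-copy */
            getpage_nonblocking(next_page + i, token, inject_into_page_cache);
        }
    }
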
`A record of each pending prefetch request is hashed
`into the local page directory so that the frame can be lo-
`cated if a process references the page before the prefetch
`completes. If a process references a page with a pending
`prefetch, the process is put to sleep on a call record until
`the read-ahead catches up. If no process blocks await-
`ing completion, nonblocking RPCs do not have specific
`timeouts. However, call records for nonblocking RPCs
`are maintained as an LRU cache, so each call record
`eventually reports failure and is reused if no reply ar-
`rives.
`When each prefetch reply arrives, Trapeze transfers
`the payload into the waiting frame and interrupts the
`host. The interrupt handler uses the reply token to locate
the call record, which holds a pointer to the continuation
`handler for asynchronous prefetch. The interrupt han-
`dler invokes the continuation, which “injects” (hashes)
`the frame into the local page cache, and enters it into
`other structures as required, e.g., an LRU list. Note that
`prefetched pages are not copied.
`
`3.4.3 Deferred Continuations
`
`Since continuations execute from the receiver interrupt
`handler, they must be synchronized with any kernel code
`that accesses the same data structures. For example, a
`file prefetch “inject” continuation could corrupt the in-
`ternal file/VM data structures if it interrupts a process
`that was operating on those structures in kernel mode.
`An obvious solution is to disable receive interrupts for
`every operation on any data structure that is shared with
`a continuation procedure, but this would require signifi-
`cant reengineering of existing kernel code that does not
`expect to be interrupted.
`We use an optimistic approach that defers execution
`of continuations in the rare instances when races occur.
`Continuation procedures are boolean functions that val-
`idate their preconditions by probing the state of relevant
`kernel locks before executing. If any needed locks are
`held, this indicates that an operation was in progress
`when the interrupt was delivered, and the continuation
`cannot execute.
`In this case, the continuation returns
`false with no side effects, and is placed on a deferred
`continuations queue serviced by a kernel daemon thread.
`Deferred continuations incur higher latency and over-
`head, but they execute safely. This technique is similar
`to optimistic active messages [25].
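
The optimistic run-or-defer logic might be sketched as follows, with a trivial lock probe standing in for the real checks against file/VM locks; the queue and node types are illustrative.

    #include <stddef.h>

    /* Continuations are boolean functions that probe the kernel locks they need; if a
     * lock is held, they return 0 with no side effects and are queued for a daemon. */
    typedef int (*continuation_fn)(void *arg);

    struct deferred { continuation_fn fn; void *arg; struct deferred *next; };

    static struct deferred *deferred_queue;            /* serviced by a kernel daemon thread */

    static int file_vm_lock_held(void) { return 0; }   /* stand-in for the real lock probes */

    /* Example continuation: inject a prefetched frame into the local page cache only
     * if no conflicting file/VM operation was in progress when the interrupt arrived. */
    static int inject_continuation(void *frame)
    {
        if (file_vm_lock_held())
            return 0;             /* race detected: defer instead of corrupting state */
        (void)frame;              /* hash the frame into the page cache, LRU lists, ... */
        return 1;
    }

    /* Interrupt-time dispatch: run the continuation optimistically; on failure, queue
     * it for the daemon thread (higher latency and overhead, but safe). */
    static void run_or_defer(continuation_fn fn, void *arg, struct deferred *node)
    {
        if (!fn(arg)) {
            node->fn = fn;
            node->arg = arg;
            node->next = deferred_queue;
            deferred_queue = node;
        }
    }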
`
`4 Performance
`
This section presents performance measurements of gms_net and sequential file access using the GMS/Trapeze prototype in Digital Unix 4.0. We
`measure all the RPC variants presented in order to
illustrate the costs and benefits of the gms_net and
`Trapeze mechanisms discussed in the previous sections.
`The file access tests are intended to show the rate at
`which a GMS/Trapeze client can source and sink data
`to network storage servers using these communication
`mechanisms for payload transfer and asynchronous
`read-ahead. A secondary goal is to show the effect of
`the operating system kernel interface chosen to read or
`write file data at these speeds, which are close to the
`limits of the hard