The following paper was originally published in the
Proceedings of the USENIX Annual Technical Conference (NO 98)
New Orleans, Louisiana, June 1998

Cheating the I/O Bottleneck:
Network Storage with Trapeze/Myrinet

Darrell C. Anderson, Jeffrey S. Chase, Syam Gadde, Andrew J. Gallatin, and Kenneth G. Yocum, Duke University
Michael J. Feeley, University of British Columbia

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: http://www.usenix.org/

Cheating the I/O Bottleneck:
Network Storage with Trapeze/Myrinet

Darrell C. Anderson, Jeffrey S. Chase, Syam Gadde, Andrew J. Gallatin, and Kenneth G. Yocum
Department of Computer Science
Duke University
{anderson, chase, gadde, gallatin, grant}@cs.duke.edu

Michael J. Feeley
Department of Computer Science
University of British Columbia
feeley@cs.ubc.ca

Abstract

Recent advances in I/O bus structures (e.g., PCI), high-speed networks, and fast, cheap disks have significantly expanded the I/O capacity of desktop-class systems. This paper describes a messaging system designed to deliver the potential of these advances for network storage systems, including cluster file systems and network memory. We describe gms_net, an RPC-like kernel-kernel messaging system based on Trapeze, a new firmware program for Myrinet network interfaces. We show how the communication features of Trapeze and gms_net are used by the Global Memory Service (GMS), a kernel-based network memory system. The paper focuses on support for zero-copy page migration in GMS/Trapeze using two RPC variants important for peer-to-peer distributed services: (1) delegated RPC, in which a request is delegated to a third party, and (2) nonblocking RPC, in which replies are processed from the Trapeze receive interrupt handler. We present measurements of sequential file access from network memory in the GMS/Trapeze prototype on a Myrinet/Alpha cluster, showing the bandwidth effects of file system interfaces and communication choices. GMS/Trapeze delivers a peak read bandwidth of 96 MB/s using memory-mapped file I/O.

1 Introduction

Two recent hardware advances boost the potential of cluster computing: switched cluster interconnects that can carry 1 Gb/s or more of point-to-point bandwidth, and high-quality PCI bus implementations that can handle data streams at gigabit speeds. We are developing system facilities to realize the potential for high-speed data transfer over Myricom’s 1.28 Gb/s Myrinet LAN [2], and harness it for cluster file systems, network memory systems, and other distributed OS services that cooperatively share data across the cluster. Our broad goal is to use the power of the network to “cheat” the I/O bottleneck for data-intensive computing on workstation clusters.

[This work is supported by the National Science Foundation under grants CCR-96-24857 and CDA-95-12356, equipment grants from Intel Corporation and Myricom, and a software grant from the Open Group.]

This paper describes use of the Trapeze messaging system [27, 5] for high-speed data transfer in a network memory system, the Global Memory Service (GMS) [14, 18]. Trapeze is a firmware program for Myrinet/PCI adapters, and an associated messaging library for DEC AlphaStations running Digital Unix 4.0 and Intel platforms running FreeBSD 2.2. Trapeze communication delivers the performance of the underlying I/O bus hardware, balancing low latency with high bandwidth. Since the Myrinet firmware is customer-loadable, any Myrinet network site with PCI-based machines can use Trapeze.

GMS [14] is a Unix kernel facility that manages the memories of cluster nodes as a shared, distributed page cache. GMS supports remote paging [8, 15] and cooperative caching [10] of file blocks and virtual memory pages, unified at a low level of the Digital Unix 4.0 kernel (a FreeBSD port is in progress). The purpose of GMS is to exploit high-speed networks to improve performance of data-intensive workloads by replacing disk activity with memory-memory transfers across the network whenever possible. The GMS mechanisms manage the movement of VM pages and file blocks between each node’s local page cache — the file buffer cache and the set of resident virtual pages — and the network memory global page cache.

This paper deals with the communication mechanisms and network performance of GMS systems using Trapeze/Myrinet, with particular focus on the support for zero-copy read-ahead and write-behind of sequentially accessed files. Cluster file systems that stripe data across multiple servers are typically limited by the bandwidth of the network and communication system [23, 16, 1]. We measure synthetic bandwidth tests that access files in network memory, in order to determine the maximum bandwidth achievable through the file system interface by any network storage system using Trapeze. The current GMS/Trapeze prototype can read files from network memory at 96 MB/s on an AlphaStation/Myrinet network. Since these speeds approach the physical limits of the hardware, unnecessary overheads (e.g., copying) can have significant effects on performance. These overheads can occur in the file access interface as well as in the messaging system. We evaluate three file access interfaces, including two that use the Unix mmap system call to eliminate copying.

Central to GMS is an RPC-like messaging facility (gms_net) that works with the Trapeze interface to support the messaging patterns and block migration traffic characteristic of GMS and other network storage services. This includes a mix of asynchronous and request/response messaging (RPC) that is peer-to-peer in the sense that each “client” may also act as a “server”. The support for RPC includes two variants important for network storage: (1) delegated RPC, in which requests are delegated to third parties, and (2) nonblocking RPC, in which replies are processed by continuation procedures executing from an interrupt handler. These features are important for peer-to-peer network storage services: the first supports directory lookups for fetched data, and the second supports lightweight asynchronous calls, which are useful for prefetching. When using these features, GMS and gms_net cooperate with Trapeze to unify buffering of migrated pages, eliminating all page copies by sending and receiving directly from the file buffer cache and local VM page cache.

This paper is organized as follows. Section 2 gives an overview of the Trapeze network interface and the features relevant to GMS communication. Section 3 deals with the gms_net messaging layer for Trapeze, focusing on the RPC variants and zero-copy handling of page transfers. Section 4 presents performance results from the GMS/Trapeze prototype. We conclude in Section 5.

2 High-Speed Data Transfer with Trapeze

The Trapeze messaging system consists of two components: a messaging library that is linked into programs using the package, and a firmware program that runs on the Myrinet network interface card (NIC). The Trapeze firmware and the host interact by exchanging commands and data through a block of memory on the NIC, which is addressable in the host’s physical address space using programmed I/O. The firmware defines the interface between the host CPU and the network device; it interprets commands issued by the host and controls the movement of data between the host and the network link. The host accesses the network using macros and procedures in the Trapeze library, which defines the lowest-level API for network communication across the Myrinet.

[Figure 1: Using Trapeze for TCP/IP and for kernel-kernel messaging for network memory. The figure shows user applications, the file/VM system, the Global Memory Service, and the sockets/TCP/IP stack in the kernel, layered over the RPC-like message layer and network driver, which use the Trapeze API to reach the Trapeze/Myrinet NIC across the PCI bus.]

Like other network interfaces based on Myrinet (e.g., Hamlyn [4], VMMC-2 [13], Active Messages [9], FM [21]), Trapeze can be used as a memory-mapped network interface for user applications, e.g., parallel programs. However, Trapeze was designed primarily to support fast kernel-to-kernel messaging alongside conventional TCP/IP networking. The Trapeze distribution includes a network device driver that allows the native TCP/IP protocol stack to use a Trapeze network alongside the gms_net layer. Figure 1 depicts this structure. The kernel-to-kernel messaging layer is intended for GMS and other services that assume mutually trusting kernels.

2.1 Trapeze Overview

Trapeze messages are short control messages (maximum 128 bytes) with optional attached payloads, typically containing application data not interpreted by the message system, e.g., a file block, a virtual memory page, or a TCP segment. Each message can have at most one payload attached to it. Separation of control messages from bulk data transfer is common to a large number of messaging systems since the V system [6].

A Trapeze control message and its payload (if any) are sent as a single packet on the network. Since Myrinet has no fixed maximum packet size (MTU), the maximum payload size of a Trapeze network is configurable, and is typically set to the virtual memory page size (4K or 8K). The Trapeze MTU is the maximum control message size plus the payload size.

Payloads are sent and received using DMA to/from aligned buffers residing anywhere in host memory. The host attaches a payload to an outgoing message using a Trapeze macro that stores the payload’s DMA address and length into designated fields of the send ring entry. On the receiving side, Trapeze deposits the payload into a host memory buffer before delivering the control message.

[Figure 2: NIC memory structures for a Trapeze endpoint: a send ring and a receive ring of control-message entries, an incoming payload table whose entries hold protection keys, and payload buffers in host memory frames.]

The data structures in NIC memory include an endpoint structure shared with the host. A Trapeze endpoint (shown in Figure 2) includes two message rings, one for sending and one for receiving. Each message ring is a circular array of 128-byte control message buffers and related state, managed as a producer/consumer queue. From the perspective of a host CPU, the NIC produces incoming messages in the receive ring and consumes outgoing messages in the send ring. The host sends a message by forming it in the next free send ring entry and setting a bit to indicate that the message is ready to send. When a message arrives from the network, the firmware deposits it into the next free receive ring entry, sets a bit to inform the host that the message is ready to consume, and optionally signals the host with an interrupt.
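
To make the producer/consumer protocol concrete, the following C sketch shows one plausible shape for a send ring entry and the host-side send path. The structure layout and all names here (tpz_ring_entry, tpz_send, and so on) are illustrative assumptions, not the actual Trapeze definitions.

```c
#include <stdint.h>
#include <string.h>

#define RING_ENTRIES   64      /* entries per ring (illustrative)     */
#define CTRL_MSG_BYTES 128     /* max control message size (Sec. 2.1) */

/* Hypothetical layout of one send ring entry in NIC memory. */
struct tpz_ring_entry {
    uint8_t  ctrl[CTRL_MSG_BYTES];  /* control message buffer          */
    uint32_t payload_dma;           /* DMA address of payload, if any  */
    uint32_t payload_len;           /* payload length in bytes         */
    volatile uint32_t owned_by_nic; /* ready bit: host sets, NIC clears */
};

struct tpz_send_ring {
    struct tpz_ring_entry e[RING_ENTRIES];
    unsigned next_free;             /* host-side producer index        */
};

/* Attach a payload by storing its DMA address and length into the
 * designated fields of the entry, mirroring the macro in the text.  */
static void tpz_attach_payload(struct tpz_ring_entry *re,
                               uint32_t dma, uint32_t len)
{
    re->payload_dma = dma;
    re->payload_len = len;
}

/* Send: form the message in the next free entry, then set the ready
 * bit so the NIC firmware will consume it. Returns -1 if the ring
 * is full (the entry is still owned by the NIC).                    */
static int tpz_send(struct tpz_send_ring *r,
                    const void *msg, size_t len,
                    uint32_t payload_dma, uint32_t payload_len)
{
    struct tpz_ring_entry *re = &r->e[r->next_free];

    if (re->owned_by_nic || len > CTRL_MSG_BYTES)
        return -1;
    memcpy(re->ctrl, msg, len);
    if (payload_len)
        tpz_attach_payload(re, payload_dma, payload_len);
    re->owned_by_nic = 1;   /* hand the entry to the firmware */
    r->next_free = (r->next_free + 1) % RING_ENTRIES;
    return 0;
}
```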

Handling of incoming messages is interrupt-driven when Trapeze is used from within the kernel. Each kernel protocol module using Trapeze (i.e., gms_net and the IP network driver) registers a receiver interrupt handler upcalled from the Trapeze interrupt handler.

Trapeze is designed to optimize handling of payloads as well as to deliver good performance for small messages. In a network memory system, page fault stall time is determined primarily by the time to transfer the requested page on the network. On the other hand, bursts of page transfers (e.g., for read-ahead for sequential access) require high bandwidth. The Trapeze firmware employs a message pipelining technique called cut-through delivery [27] to balance low payload latency with high bandwidth under load. With this technique, the one-way raw Trapeze latency for a 4K page transfer is 70 µs on 300 MHz Pentium-II/440LX systems with LANai 4.1 M2M-PCI32 Myrinet adapters. On these systems, Trapeze delivers 112 MB/s for a stream of 8K payloads; with 64K payloads, Trapeze can use over 95% of the peak bandwidth of the I/O bus, achieving 126 MB/s of user-to-user point-to-point bandwidth.[1]

[1] These bandwidth numbers define a “megabyte” as one million bytes. All other bandwidth numbers in this paper define 1 MB as 1024*1024 bytes.

2.2 Unified Buffering for In-Kernel Trapeze

All kernel-based Trapeze protocol modules share a common pool of receive buffers allocated from the virtual memory page frame pool; the maximum payload size is set to the virtual memory page size. Since Digital Unix allocates its file block buffers from the virtual memory page frame pool as well, this allows unified buffering among the network, file, and VM systems. For example, the system can send any virtual memory page or cached file block out to the network by attaching it as a payload to an outgoing message. Similarly, every incoming payload is deposited in an aligned physical frame that can be mapped into a user process or hashed into the file cache. Since file caching and virtual memory management are reasonably unified, we often refer to the two subsystems collectively as “the file/VM system”, and use the term “page” to include file blocks.

The TCP/IP stack can also benefit from the unified buffering of Trapeze payloads, which reduces copying overhead by payload remapping (similar to [11, 3, 17]). On a normal transmission, IP message data is copied from a user memory buffer into an mbuf chain [20] on the sending side; on the receiving side, the driver copies the header into a small mbuf, points a BSD-style external mbuf at the payload buffer, and passes the chain through the IP stack to the socket layer, which copies the payload into user memory and frees the kernel buffer. We have modified the Digital Unix socket layer to avoid copying when size and alignment properties allow. On the sending side, the socket layer builds mbuf chains by pinning the user buffer frames, marking them copy-on-write, referencing them with external mbufs, and passing them through the TCP/IP stack to the network driver, which attaches them to outgoing messages as payloads. On the receiving side, the socket layer unmaps the frames of the user buffer, replaces them with the kernel payload buffer frames, and frees the user frames. With payload remapping, AlphaStations running the standard netperf TCP benchmark over Trapeze sustain point-to-point bandwidth of 87 MB/s.[2]
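
As a rough illustration of the sending-side path just described, the sketch below builds an external-mbuf chain over pinned, copy-on-write user frames instead of copying the data. Every helper routine here (pin_user_frame, mbuf_set_external, and so on) is a hypothetical stand-in; the paper does not expose the actual Digital Unix interfaces.

```c
#include <stddef.h>

/* Hypothetical sketch of the zero-copy send path: reference each
 * page-aligned user frame with an external mbuf rather than copying.
 * All declarations below are illustrative, not real kernel APIs.    */
struct mbuf;                           /* opaque BSD-style mbuf      */

extern struct mbuf *mbuf_get(void);
extern void mbuf_set_external(struct mbuf *m, void *frame, size_t len,
                              void (*free_fn)(void *));
extern void mbuf_append(struct mbuf **chain, struct mbuf *m);

extern void *pin_user_frame(void *uaddr);      /* wire the page      */
extern void  mark_copy_on_write(void *frame);  /* guard user writes  */
extern void  unpin_and_uncow(void *frame);     /* transmit-complete  */

struct mbuf *build_zero_copy_chain(void *ubuf, size_t len, size_t pgsz)
{
    struct mbuf *chain = NULL;
    for (size_t off = 0; off < len; off += pgsz) {
        void *frame = pin_user_frame((char *)ubuf + off);
        mark_copy_on_write(frame);
        struct mbuf *m = mbuf_get();
        /* The frame is released (unpinned, COW cleared) by the
         * Trapeze transmit-completion upcall, not here (Sec. 2.2). */
        mbuf_set_external(m, frame, pgsz, unpin_and_uncow);
        mbuf_append(&chain, m);
    }
    return chain;   /* passed down the TCP/IP stack to the driver */
}
```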

Since outgoing payload frames attached to the send ring may be owned by the file/VM system, they must be protected from modification or reuse while a transmit is in progress. Trapeze notifies the system that it is safe to overwrite an outgoing frame by upcalling a specified transmit completion handler routine. For example, when an IP send on a user frame completes, Trapeze upcalls the completion routine, which unpins the frame and releases its copy-on-write protection.

[2] Measured Alcor (266 MHz AS 500) to Miata (500 MHz PWS 500au), 8320-byte MTU, 1M netperf transfers, socket buffers at 1M, software TCP checksums disabled (hardware CRC only): 732 Mb/s.

However, to reduce overhead, Trapeze does not generate transmit-complete interrupts. Instead, Trapeze saves the handler pointer in host memory and upcalls the handler only when the send ring entry is reused for another send. Since messages may be sent from interrupt handlers, a completion routine could be called in the context of an interrupt handler that happened to reuse the same send ring entry as the original message. For this reason, completion handlers must not block, and the structures they manipulate must be protected by disabling interrupts. Since completion upcalls may be arbitrarily delayed, the Trapeze API includes a routine to poll all pending transmits and call their handlers if they have completed.
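
A minimal sketch of this lazy completion scheme, under assumed names and layout, might look as follows; the poll routine corresponds to the Trapeze API routine mentioned above.

```c
typedef void (*tpz_complete_fn)(void *arg);

/* Hypothetical per-slot completion state saved in host memory. */
struct tpz_send_slot {
    tpz_complete_fn handler;   /* saved at send time              */
    void           *arg;       /* e.g., the attached frame        */
    volatile int    done;      /* set by firmware when DMA ends   */
};

/* Run the saved handler for a slot whose transmit has finished.
 * Called when the slot is reused for another send, or from the
 * poll routine below. Handlers must not block: they may run from
 * an interrupt handler that happened to reuse the slot.           */
static void tpz_reap_slot(struct tpz_send_slot *s)
{
    if (s->handler && s->done) {
        tpz_complete_fn h = s->handler;
        s->handler = 0;
        h(s->arg);             /* e.g., unpin frame, clear COW    */
    }
}

/* Poll all pending transmits so completions, which are otherwise
 * upcalled only on slot reuse, are not delayed arbitrarily.       */
static void tpz_poll_completions(struct tpz_send_slot *ring, int n)
{
    for (int i = 0; i < n; i++)
        tpz_reap_slot(&ring[i]);
}
```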

2.3 Incoming Payload Table

The benefits of high-speed networking are easily overshadowed by processing costs and copying overhead in the hosts. To support zero-copy communication, a Trapeze receiver can designate a region of memory as the receive buffer space for a specific incoming payload identified by a tag field. When the message arrives, the firmware recognizes the tag and deposits the payload directly into the waiting buffer. Handling of tagged payloads is governed by a third structure in NIC memory, the incoming payload table (IPT).

GMS uses the Trapeze IPT for copy-free handling of fetched pages in RPC replies, as described in Section 3. Ordinarily, Trapeze payloads are received into buffers attached by the host to the receive ring entries; since the firmware places messages in the ring in the order they arrive, the host cannot know in advance which generic buffer will be selected to receive any given payload, and the payload may need to be copied within the host if it cannot be remapped. Early demultiplexing with the IPT avoids this copy.

To set up an IPT mapping, the host calls a Trapeze API routine to allocate a free entry in the IPT, initialize it with the DMA address of the designated payload buffer, and return a tag value (payload token) consisting of an IPT index and a protection key. The payload token is a weak form of capability that can be passed in a message to another node; any node that knows the token can use it to tag a message and transmit a payload into the buffer. When the firmware receives a tagged message from the network, it validates the key against the indexed IPT entry before initiating a DMA into the designated receive buffer. The receiving host may cancel the IPT entry at any time (e.g., on request timeout); similarly, the firmware protects against dangling tokens and duplicate messages by cancelling the entry when a matching message is received. If the key is not valid, the NIC drops the payload and delivers the control portion with a payload length of zero, so the receive message handler can recognize and handle the error.

At present, the IPT maps only a few megabytes of host memory, enough for the reply payloads of all outstanding requests (e.g., outstanding page fetches). This is a modest approach that meets our needs, relative to more ambitious approaches that indirect through TLB-like structures on the NIC [13, 26, 7]. We have considered a larger IPT with support for multiple transfers to the same buffer at different offsets, as in Hamlyn’s sender-based memory management [4], but we have not found a need for these features in our current uses of Trapeze.
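
The sketch below illustrates one plausible encoding of the payload token and the firmware-side check; the entry count, the token packing, and all routine names are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdlib.h>

#define IPT_ENTRIES 256            /* maps a few MB of host memory */

struct ipt_entry {
    uint32_t dma_addr;             /* designated receive buffer    */
    uint32_t key;                  /* protection key for the entry */
    int      valid;
};

/* A payload token: IPT index plus protection key, weak-capability
 * style. Packing both into one 32-bit word is an assumption.      */
static uint32_t make_token(uint16_t idx, uint16_t key)
{
    return ((uint32_t)idx << 16) | key;
}

/* Host side: allocate an entry for an expected reply payload. */
static uint32_t ipt_alloc(struct ipt_entry *ipt, uint32_t buf_dma)
{
    for (uint16_t i = 0; i < IPT_ENTRIES; i++) {
        if (!ipt[i].valid) {
            ipt[i].dma_addr = buf_dma;
            ipt[i].key   = (uint16_t)rand();   /* fresh key */
            ipt[i].valid = 1;
            return make_token(i, (uint16_t)ipt[i].key);
        }
    }
    return 0;   /* table full */
}

/* Firmware side: validate the tag of an incoming message. On a
 * match, cancel the entry (defeating dangling tokens and duplicate
 * messages) and DMA the payload to the waiting buffer; otherwise
 * drop the payload and deliver the control message with length 0. */
static int ipt_claim(struct ipt_entry *ipt, uint32_t token,
                     uint32_t *dma_out)
{
    uint16_t idx = token >> 16, key = token & 0xffff;
    if (idx >= IPT_ENTRIES || !ipt[idx].valid || ipt[idx].key != key)
        return 0;                  /* invalid: drop payload */
    *dma_out = ipt[idx].dma_addr;
    ipt[idx].valid = 0;            /* single use */
    return 1;
}
```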

3 Page Transfers in GMS/Trapeze

This section outlines a Trapeze-based kernel-kernel RPC-like messaging layer designed to support cooperative cluster services. The package is derived from the original RPC package for the Global Memory Service [14] (gms_net), extended to use Trapeze and to support a richer set of communication styles, primarily for asynchronous prefetching at high bandwidth [24]. Although the package is generic, we draw on GMS examples to motivate its features and to illustrate their use.

Since many aspects of RPC and messaging systems are well understood, we focus on those aspects that benefit from the Trapeze features discussed in the previous section. In particular, we explain the features for transferring pages (or file blocks) efficiently within the RPC framework, and their use by the protocol operations most critical for GMS performance: page fetches (getpage) from the global page cache to a local page cache, and page pushes or evictions (putpage or movepage) from a local cache to the global cache.

Section 3.2 discusses the zero-copy handling of fetched pages using the Trapeze incoming payload table (IPT); Sections 3.3 and 3.4 extend the zero-copy reply scheme to delegated and nonblocking RPC variants useful in GMS and other peer-to-peer network services. We illustrate the use of nonblocking RPC to extend standard read-ahead for files and virtual memory to GMS; this allows processes to access data from network memory or storage servers at close to network bandwidth.

3.1 Basic Mechanisms

The gms_net messaging layer includes basic support for typed messages, stub procedures, dispatching to service procedures based on message types, and matching replies with requests. The Trapeze receiver interrupt handler directs incoming messages to gms_net by upcalling a registered service routine; the service routine hands off incoming requests to a server thread. However, gms_net is not a true RPC system: many protocol messages do not produce replies, and there is no support for automatic stub generation. The package is best thought of as a library of procedures and macros used by the messaging stubs to build and decode messages and to direct their flow through the system. It is designed for messages with relatively simple arguments and bulk data payloads (e.g., file blocks) that are not interpreted by the message handlers themselves.

To send a message, a stub allocates a message buffer with gms_net_makebuf, calls routines and macros to build the message, e.g., by pushing data items into the message, and sends the message to a destination with gms_net_sendto. In Trapeze, gms_net_makebuf returns a pointer to a send ring entry, and gms_net_sendto releases it. Messages are typed by an operation code and a request/reply bit. Incoming requests are dispatched by using the operation code to index into a vector of registered server-side stubs. Incoming replies are handled directly by the receiver interrupt handler, either by waking up a waiting thread or by calling a reply continuation procedure as described in Section 3.4.

An important function of the RPC layer is to match incoming replies with requests. If a reply is expected, the caller makes an entry in a call record table before sending the request message, and places a reply token containing a unique call record ID into the outgoing request message. After sending the request, the calling thread or process may block on the call record entry. When the server side generates a reply, it places a copy of the reply token in the reply message. When the reply arrives, the receiver interrupt handler decodes the reply token and retrieves the call record. The call record includes all information needed to process the reply, e.g., by awakening the calling thread or process.

To transfer a page or file block in a request or reply, the stub attaches the memory frame to the message buffer as a payload. The system is inhibited from reusing the frame or overwriting it until the frame contents have been transferred to the network adapter using DMA (Section 2.2). On the receiving side, Trapeze uses DMA to deposit each received payload into a memory frame designated by the receiver.

3.2 Zero-Copy Reply Handling

GMS performance depends on efficient handling of getpage replies containing page payloads. When the virtual memory system or file system initiates a page fetch, it first selects the target page frame according to its policies for page replacement and other factors such as page coloring. The goal of the GMS getpage client stub is to arrange to transfer the incoming page directly to the waiting frame using DMA.

The RPC system uses the Trapeze incoming payload table (IPT) described in Section 2.3 for this purpose. The client-side getpage stub calls Trapeze to allocate an IPT entry and obtain a payload token, which is added to the reply token for the call. When the server-side getpage stub generates a reply, it attaches the frame containing the requested page as a payload, extracts the Trapeze payload token, and tags the outgoing reply message by placing the token in its Trapeze header. Back on the client side, the Trapeze firmware recognizes the tag in the message header as the reply payload begins to arrive on the adapter; once the tag is decoded and validated, the firmware initiates DMA of the message payload into the waiting frame.
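
Combining the call-record and IPT mechanisms, a client-side getpage stub could be structured as in the following sketch. Only gms_net_makebuf and gms_net_sendto are named in the text; every other routine, field, and signature here is an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical synchronous client-side getpage stub. */
struct call_record;

extern void *gms_net_makebuf(void);                 /* from the text */
extern void  gms_net_sendto(void *msg, int node);   /* from the text */

extern struct call_record *call_record_alloc(uint32_t *reply_token);
extern void     call_record_wait(struct call_record *cr);
extern uint32_t ipt_map_frame(void *frame);   /* Section 2.3 token  */
extern void     msg_push(void *msg, const void *data, size_t len);

void getpage(int directory_site, uint64_t page_id, void *target_frame)
{
    uint32_t reply_token, payload_token;
    struct call_record *cr = call_record_alloc(&reply_token);

    /* Map the preselected target frame so the firmware can DMA the
     * reply payload straight into it, with no copy on this host.
     * In the real system the payload token is carried inside the
     * reply token; pushing it separately is a simplification.      */
    payload_token = ipt_map_frame(target_frame);

    void *msg = gms_net_makebuf();            /* send ring entry    */
    msg_push(msg, &page_id, sizeof page_id);
    msg_push(msg, &reply_token, sizeof reply_token);
    msg_push(msg, &payload_token, sizeof payload_token);
    gms_net_sendto(msg, directory_site);

    /* Synchronous variant: sleep until the receiver interrupt
     * handler matches the reply token and wakes this thread.       */
    call_record_wait(cr);
}
```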

[Figure 3: GMS getpage operation through a directory site using a delegated RPC. The requesting node sends its request to the page’s directory site, which delegates it to the caching site; the caching site replies with page A directly to the requester.]

3.3 Delegated RPC

Unlike traditional RPC, some GMS protocol operations involve more than one server. To fetch a remote page, for example, the getpage operation must first locate the page’s caching site. To keep track of pages, GMS uses a distributed hash directory for pages potentially sharable by multiple nodes [14]. A requesting node locates a page’s directory site by applying a globally replicated hash function to a unique page identifier. It then issues a getpage RPC to the directory site, which looks up the page in its portion of the directory and forwards the request to the caching site. The caching site completes the three-way RPC by returning the page directly to the requester. We call this type of operation a delegated RPC.

The key idea behind delegated RPC is to allow the reply token to be passed from node to node until the call is complete; the last node in the RPC sequence then uses the reply token to complete the RPC and reply to the original requester. The delegation is transparent to the requester, which creates its reply token and includes it in the request message exactly as for an ordinary RPC.

With delegated RPC, however, a remote procedure can either return a reply or delegate the request to another peer by generating a new request message with a copy of the original reply token. Each node has the same option of either replying to the original requester or delegating the request to yet another peer. Note that each delegation call is unlike a normal RPC in that the delegated procedure never replies to its immediate caller; it either forwards the request again or replies directly to the node that initiated the delegated RPC. Most importantly, the zero-copy reply handling scheme is preserved, since the Trapeze payload token is embedded within the reply token, and so is available to the node that ultimately generates the reply.

Figure 3 shows how GMS getpage uses delegated RPC. The directory site for the requested page delegates the request to the caching site, forwarding the original request parameters and reply token. The caching site generates a reply, attaches the requested page as a payload, and tags the message with the payload token, as described in the previous section. It then sends the reply directly to the original requester, where it is handled in the same way as a direct reply.
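
The directory-site logic reduces to a small choice, sketched below: reply directly, or forward a new request that carries a verbatim copy of the original reply token (which embeds the payload token). The message fields and helper routines are illustrative assumptions.

```c
#include <stdint.h>

/* Hypothetical directory-site handler for getpage: either answer
 * from local memory or delegate to the caching site, passing the
 * original reply token through unchanged.                          */
struct gms_request {
    uint64_t page_id;
    uint32_t reply_token;   /* embeds the Trapeze payload token */
    int      requester;     /* node that initiated the RPC      */
};

extern int   directory_lookup(uint64_t page_id);   /* caching site  */
extern void *local_frame_for(uint64_t page_id);    /* or NULL       */
extern void  send_reply(int node, uint32_t reply_token, void *frame);
extern void  forward_request(int node, const struct gms_request *rq);

void getpage_at_directory(const struct gms_request *rq)
{
    void *frame = local_frame_for(rq->page_id);
    if (frame) {
        /* We cache the page ourselves: reply directly, tagging the
         * message with the payload token inside rq->reply_token.   */
        send_reply(rq->requester, rq->reply_token, frame);
        return;
    }
    /* Delegate: the caching site will reply straight to the
     * requester; we never reply to our immediate caller.           */
    int caching_site = directory_lookup(rq->page_id);
    forward_request(caching_site, rq);
}
```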

3.4 Read-Ahead with Nonblocking RPC

Most file systems implement read-ahead to prefetch file data before it is requested. This significantly improves bandwidth for sequential file access, which is easy to detect and exploit [20]. We have extended GMS to support read-ahead in order to meet our goal of delivering files from network memory at full network bandwidth. GMS read-ahead hides fetch latency and delivers peak bandwidth by pipelining the network with a continuous stream of page fetches.

3.4.1 Nonblocking RPC

Any form of prefetching imposes new demands on the communication layer and is highly dependent on its performance. In particular, prefetching requests are RPC calls that generate replies, but the replies must be handled asynchronously and outside of the issuing thread context, so as not to block the issuing thread while the request is pending. NFS client implementations typically solve this problem by handing off read-ahead calls to a system I/O daemon that can wait for RPC replies without affecting user processes [22]. This solution requires a context switch to the I/O daemon for each request and response.

To reduce context switching overhead, GMS/Trapeze implements read-ahead and prefetching using nonblocking RPCs. To implement nonblocking RPC, gms_net supplements the call record with support for continuation procedures invoked directly from the receiver interrupt handler to process the reply. These continuations are similar to those of Draves et al. [12], but they execute at interrupt time with no associated thread context. Also, each nonblocking RPC call may have several continuations; the issuing stub pushes pointers to these continuation procedures and their arguments onto a continuation stack linked to the call record returned by gms_net_makebuf. When the reply arrives, the gms_net receiver interrupt handler locates the call record for the reply as before, pops the continuations from the stack, and calls them in order with their arguments. Like other interrupt handling code, continuation procedures are not permitted to sleep.
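
A hedged sketch of the continuation stack follows; the fixed depth and all names are assumptions, but the push/pop discipline mirrors the description above.

```c
/* Hypothetical continuation stack attached to a call record. */
typedef int (*continuation_fn)(void *arg, void *reply);

#define MAX_CONTINUATIONS 4   /* assumed depth */

struct call_record {
    continuation_fn fn[MAX_CONTINUATIONS];
    void           *arg[MAX_CONTINUATIONS];
    int             depth;
};

/* Issuing stub: push a continuation and its argument. */
static int push_continuation(struct call_record *cr,
                             continuation_fn fn, void *arg)
{
    if (cr->depth == MAX_CONTINUATIONS)
        return -1;
    cr->fn[cr->depth]  = fn;
    cr->arg[cr->depth] = arg;
    cr->depth++;
    return 0;
}

/* Receiver interrupt handler: pop and run the continuations in
 * order. They execute with no thread context and must not sleep. */
static void run_continuations(struct call_record *cr, void *reply)
{
    while (cr->depth > 0) {
        cr->depth--;
        cr->fn[cr->depth](cr->arg[cr->depth], reply);
    }
}
```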

Continuations in gms_net nonblocking RPC are related to callbacks in Rover’s QRPC [19]. In Rover, asynchronous RPC calls are used to allow applications to tolerate slow and unreliable mobile networks, whereas in GMS/Trapeze their purpose is to support pipelined RPC operations (e.g., prefetching) on a fast and reliable cluster interconnect.

3.4.2 Read-Ahead from Network Memory

GMS/Trapeze activates sequential read-ahead when the file/VM system determines that accesses are sequential, and that subsequent pages are resident in the global cache but not in the local cache. It issues nonblocking RPCs to prefetch the next N pages for some configurable depth N; these requests are issued in the context of the user process accessing the data. Each prefetch request is an ordinary getpage operation; to the receiver, they are indistinguishable from synchronous fetch requests. The caller allocates the target page frame before the prefetch, maps it through the IPT as described in Section 3.2, and includes a reply token in the message. The server generates a reply as described in Section 3.2, possibly delegating the request to a peer as described in Section 3.3.

A record of each pending prefetch request is hashed into the local page directory so that the frame can be located if a process references the page before the prefetch completes. If a process references a page with a pending prefetch, the process is put to sleep on a call record until the read-ahead catches up. If no process blocks awaiting completion, nonblocking RPCs do not have specific timeouts. However, call records for nonblocking RPCs are maintained as an LRU cache, so each call record eventually reports failure and is reused if no reply arrives.

When each prefetch reply arrives, Trapeze transfers the payload into the waiting frame and interrupts the host. The interrupt handler uses the reply token to locate the call record, which holds a pointer to the continuation handler for asynchronous prefetch. The interrupt handler invokes the continuation, which “injects” (hashes) the frame into the local page cache, and enters it into other structures as required, e.g., an LRU list. Note that prefetched pages are not copied.
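
The read-ahead loop can be summarized by the following sketch, in which the depth constant and all helper routines are illustrative assumptions; inject_frame stands for the “inject” continuation described above.

```c
#include <stdint.h>

#define READAHEAD_DEPTH 8   /* the configurable depth N (assumed) */

extern void *alloc_target_frame(void);
extern int   inject_frame(void *frame, void *reply); /* continuation */
extern void  record_pending_prefetch(uint64_t page_id, void *frame);
extern void  getpage_nonblocking(uint64_t page_id, void *frame,
                                 int (*cont)(void *, void *));

/* Issue nonblocking getpage RPCs for the next N pages, each with a
 * preallocated target frame mapped through the IPT (inside
 * getpage_nonblocking) and an inject continuation to hash the frame
 * into the local page cache when the reply arrives.                */
void start_readahead(uint64_t next_page)
{
    for (int i = 0; i < READAHEAD_DEPTH; i++) {
        uint64_t id = next_page + i;
        void *frame = alloc_target_frame();
        /* Hash the pending prefetch into the local page directory
         * so a faulting process can find the frame and wait on it. */
        record_pending_prefetch(id, frame);
        getpage_nonblocking(id, frame, inject_frame);
    }
}
```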

3.4.3 Deferred Continuations

Since continuations execute from the receiver interrupt handler, they must be synchronized with any kernel code that accesses the same data structures. For example, a file prefetch “inject” continuation could corrupt the internal file/VM data structures if it interrupts a process that was operating on those structures in kernel mode. An obvious solution is to disable receive interrupts for every operation on any data structure that is shared with a continuation procedure, but this would require significant reengineering of existing kernel code that does not expect to be interrupted.

We use an optimistic approach that defers execution of continuations in the rare instances when races occur. Continuation procedures are boolean functions that validate their preconditions by probing the state of relevant kernel locks before executing. If any needed locks are held, this indicates that an operation was in progress when the interrupt was delivered, and the continuation cannot execute. In this case, the continuation returns false with no side effects, and is placed on a deferred continuations queue serviced by a kernel daemon thread. Deferred continuations incur higher latency and overhead, but they execute safely. This technique is similar to optimistic active messages [25].
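
The optimistic scheme can be sketched as a boolean continuation plus a fallback queue, as below; locks_available and the deferral routine are hypothetical stand-ins for the kernel lock probes and daemon queue described in the text.

```c
/* Optimistic deferral, sketched: the continuation validates its
 * preconditions by probing (not acquiring) the relevant locks; if
 * any are held, it returns false with no side effects and is queued
 * for a kernel daemon thread instead.                              */
extern int  locks_available(void);       /* probe relevant locks    */
extern void do_inject(void *frame);      /* the real injection work */
extern void defer_continuation(int (*fn)(void *), void *arg);

static int inject_continuation(void *frame)
{
    if (!locks_available())
        return 0;         /* race detected: no side effects */
    do_inject(frame);     /* safe: no conflicting kernel operation */
    return 1;
}

/* Called from the receiver interrupt handler on a prefetch reply. */
static void on_reply(void *frame)
{
    if (!inject_continuation(frame))
        defer_continuation(inject_continuation, frame);
}
```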

4 Performance

This section presents performance measurements of gms_net and sequential file access using the GMS/Trapeze prototype in Digital Unix 4.0. We measure all the RPC variants presented, in order to illustrate the costs and benefits of the gms_net and Trapeze mechanisms discussed in the previous sections. The file access tests are intended to show the rate at which a GMS/Trapeze client can source and sink data to network storage servers using these communication mechanisms for payload transfer and asynchronous read-ahead. A secondary goal is to show the effect of the operating system kernel interface chosen to read or write file data at these speeds, which are close to the limits of the hardware.
