A Case for High Performance Computing with Virtual Machines

Wei Huang†    Jiuxing Liu‡    Bulent Abali‡    Dhabaleswar K. Panda†

† Computer Science and Engineering
The Ohio State University
Columbus, OH 43210
{huanwei, panda}@cse.ohio-state.edu

‡ IBM T. J. Watson Research Center
19 Skyline Drive
Hawthorne, NY 10532
{jl, abali}@us.ibm.com

ABSTRACT

Virtual machine (VM) technologies are experiencing a resurgence in both industry and research communities. VMs offer many desirable features such as security, ease of management, OS customization, performance isolation, checkpointing, and migration, which can be very beneficial to the performance and the manageability of high performance computing (HPC) applications. However, very few HPC applications currently run in a virtualized environment due to the performance overhead of virtualization. Further, using VMs for HPC also introduces additional challenges such as management and distribution of OS images.

In this paper we present a case for HPC with virtual machines by introducing a framework which addresses the performance and management overhead associated with VM-based computing. Two key ideas in our design are: Virtual Machine Monitor (VMM) bypass I/O and scalable VM image management. VMM-bypass I/O achieves high communication performance for VMs by exploiting the OS-bypass feature of modern high speed interconnects such as InfiniBand. Scalable VM image management significantly reduces the overhead of distributing and managing VMs in large scale clusters. Our current implementation is based on the Xen VM environment and InfiniBand. However, many of our ideas are readily applicable to other VM environments and high speed interconnects.

We carry out a detailed analysis of the performance and management overhead of our VM-based HPC framework. Our evaluation shows that HPC applications can achieve almost the same performance as those running in a native, non-virtualized environment. Therefore, our approach holds promise to bring the benefits of VMs to HPC applications with very little degradation in performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICS'06 June 28-30, Cairns, Queensland, Australia.
Copyright 2006 ACM 1-59593-282-8/06/0006 ...$5.00.

1. INTRODUCTION

Virtual machine (VM) technologies were first introduced
in the 1960s [9], but are experiencing a resurgence in both industry and research communities. A VM environment provides virtualized hardware interfaces to VMs through a Virtual Machine Monitor (VMM), also called a hypervisor. VM technologies allow running different guest VMs on a physical box, with each guest VM possibly running a different guest operating system. They can also provide secure and portable environments to meet the demanding requirements of computing resources in modern computing systems.

Recently, network interconnects such as InfiniBand [16], Myrinet [24] and Quadrics [31] have emerged which provide very low latency (less than 5 μs) and very high bandwidth (multiple Gbps). Due to these characteristics, they are becoming strong players in the field of high performance computing (HPC). As evidenced by the Top 500 Supercomputer list [35], clusters, which are typically built from commodity PCs connected through high speed interconnects, have become the predominant architecture for HPC over the past decade.

Although originally more focused on resource sharing, current virtual machine technologies provide a wide range of benefits such as ease of management, system security, performance isolation, checkpoint/restart and live migration. Cluster-based HPC can take advantage of these desirable features of virtual machines, which is especially important as ultra-scale clusters pose additional challenges for the performance, scalability, management, and administration of these systems [29, 17].

In spite of these advantages, VM technologies have not yet been widely adopted in the HPC area. This is due to the following challenges:

• Virtualization overhead: To ensure system integrity, the virtual machine monitor (VMM) has to trap and process privileged operations from the guest VMs. This overhead is especially visible for I/O virtualization, where the VMM or a privileged host OS has to intervene in every I/O operation. This added overhead is not favored by HPC applications, where communication performance may be critical. Moreover, memory consumption in a VM-based environment is also a concern because a physical box usually hosts several guest virtual machines.

• Management efficiency: Though it is possible to utilize VMs as static computing environments and run applications in pre-configured systems, high performance computing cannot fully benefit from VM technologies
unless there exists a management framework which helps to map VMs to physical machines, dynamically distribute VM OS images to physical machines, and boot up and shut down VMs with low overhead in cluster environments.

In this paper, we take on these challenges and propose a VM-based framework for HPC which addresses various performance and management issues associated with virtualization. In this framework, we reduce the overhead of network I/O virtualization through VMM-bypass I/O [18]. The concept of VMM-bypass extends the idea of OS-bypass [39, 38], which takes the shortest path for time-critical operations through user-level communication. An example of VMM-bypass I/O was presented in Xen-IB [18], a prototype we developed to virtualize InfiniBand under Xen [6]. By bypassing the VMM for time-critical communication operations, Xen-IB provides virtualized InfiniBand devices for Xen virtual machines with near-native performance. Our framework also provides the flexibility of using customized kernels/OSes for individual HPC applications. It also allows building very small VM images which can be managed very efficiently. With detailed performance evaluations, we demonstrate that high performance computing jobs can run as efficiently in our Xen-based cluster as in a native, non-virtualized InfiniBand cluster. Although we focus on InfiniBand and Xen, we believe that our framework can be readily extended to other high-speed interconnects and other VMMs. To the best of our knowledge, this is the first study to adopt VM technologies for HPC in modern cluster environments equipped with high speed interconnects.

In summary, the main contributions of our work are:

• We propose a framework which allows high performance computing applications to benefit from the desirable features of virtual machines. To demonstrate the framework, we have developed a prototype system using Xen virtual machines on an InfiniBand cluster.

• We describe how the disadvantages of virtual machines, such as virtualization overhead, memory consumption, and management issues, can be addressed using current technologies with our framework.

• We carry out detailed performance evaluations of the overhead of using virtual machines for high performance computing (HPC). This evaluation shows that our virtualized InfiniBand cluster is able to deliver almost the same performance for HPC applications as a non-virtualized InfiniBand cluster.

The rest of the paper is organized as follows: To further justify our motivation, we start with more discussion on the benefits of VMs for HPC in Section 2. In Section 3, we provide the background knowledge for this work. We identify the key challenges for VM-based computing in Section 4. Then, we present our framework for VM-based high performance computing in Section 5 and carry out detailed performance analysis in Section 6. In Section 7, we discuss several issues in our current implementation and how they can be addressed in the future. Last, we discuss related work in Section 8 and conclude the paper in Section 9.

2. BENEFITS OF VIRTUAL MACHINES FOR HPC APPLICATIONS

With the deployment of large scale clusters for HPC applications, management and scalability issues on these clusters are becoming increasingly important. Virtual machines can greatly benefit cluster computing in such systems, especially in the following aspects:

• Ease of management: A system administrator can view a virtual machine based cluster as consisting of a set of virtual machine templates and a pool of physical resources [33]. To create the runtime environment for certain applications, the administrator needs only to pick the correct template and instantiate the virtual machines on physical nodes. VMs can be shut down and brought up much more easily than real physical machines, which eases the task of system reconfiguration. VMs also provide clean solutions for live migration and checkpoint/restart, which help deal with hardware problems such as upgrades and failures, which happen frequently in large-scale systems.

• Customized OS: Currently most clusters use general purpose operating systems, such as Linux, to meet a wide range of requirements from various user applications. Although researchers have suggested that light-weight OSes customized for each type of application can potentially gain performance benefits [10], this has not yet been widely adopted because of management difficulties. However, with VMs, it is possible to highly customize the OS and the run-time environment for each application. For example, the kernel configuration, system software parameters, loaded modules, as well as memory and disk space can be changed conveniently.

• System security: Some applications, such as system level profiling [27], may require additional kernel services to run. In a traditional HPC system, this requires either the service to run for all applications, or users to be trusted with privileges for loading additional kernel modules, which may lead to compromised system integrity. With VM environments, it is possible to allow normal users to execute such privileged operations on VMs. Because the resource integrity of the physical machine is controlled by the virtual machine monitor instead of the guest operating system, faulty or malicious code in a guest OS may in the worst case crash a virtual machine, which can be easily recovered.

3. BACKGROUND

In this section, we provide the background information for our work. Our implementation of VM-based computing is based on the Xen VM environment and InfiniBand. Therefore, we start by introducing Xen in Section 3.1 and InfiniBand in Section 3.2. In Section 3.3, we introduce Xenoprof, a profiling toolkit for Xen used in this paper.

3.1 Overview of the Xen Virtual Machine Monitor

Xen is a popular high performance VMM originally developed at the University of Cambridge. It uses paravirtualization [41], which requires that the operating systems running on top of Xen be explicitly ported to the Xen architecture, but brings higher performance. However, Xen does not require changes to the
application binary interface (ABI), so existing user applications can run without any modification.

[Figure 1: The structure of the Xen hypervisor, hosting three XenoLinux operating systems]

Figure 1 (courtesy [30]) illustrates the structure of a physical machine running Xen. The Xen hypervisor (the VMM) is at the lowest level and has direct access to the hardware. The hypervisor runs at the most privileged processor level. Above the hypervisor are the Xen domains (VMs). Guest OSes running in guest domains are prevented from directly executing privileged processor instructions. A special domain called Domain0 (or Dom0), which is created at boot time, is allowed to access the control interface provided by the hypervisor and uses it to create, terminate or migrate other guest domains (User Domains, or DomUs).

In Xen, domains communicate with each other through shared pages and event channels, which provide an asynchronous notification mechanism between domains. A "send" operation on one side of an event channel causes an event to be received by the destination domain, which may in turn cause an interrupt. If a domain wants to send data to another, the typical scheme is for the source domain to grant the destination domain access to local memory pages and then send a notification event. These shared pages are then used to transfer data.

Most existing device drivers assume they have complete control of the device, so there cannot be multiple instantiations of such drivers in different domains for a single device. To ensure manageability and safe access, device virtualization in Xen follows a split device driver model [12]. Each device driver is expected to run in an Isolated Device Domain (or IDD, typically the same as Dom0), which also hosts a backend driver, running as a daemon and serving the access requests from DomUs. The guest OS in a DomU uses a frontend driver to communicate with the backend. The split driver organization provides security: misbehaving code in one DomU will not result in the failure of other domains.
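
To make the split driver interaction concrete, the following sketch (in C) shows the shared-page plus event-channel pattern from a frontend's point of view. It is only an illustrative sketch, not Xen's actual driver code: grant_foreign_access(), send_event(), the single-slot "ring" and the request layout are hypothetical placeholders for the real grant-table and event-channel primitives and ring protocols that Xen drivers use.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical request descriptor placed on the shared ring page. */
    struct io_request {
        uint32_t op;          /* operation code, e.g. read or write        */
        uint32_t len;         /* length of the data transfer               */
        uint32_t data_gref;   /* grant reference for the data buffer page  */
    };

    /* Hypothetical stand-ins for Xen's grant-table and event-channel calls. */
    extern uint32_t grant_foreign_access(int backend_domid, void *page, int readonly);
    extern void send_event(int event_channel_port);

    /* Frontend side: publish one request to the backend (e.g. Dom0) and notify it. */
    void frontend_submit(void *shared_ring_page, int backend_domid,
                         int event_port, void *data_page, uint32_t len)
    {
        struct io_request req = {
            .op        = 1,  /* write */
            .len       = len,
            /* Grant the backend access to the page holding the data. */
            .data_gref = grant_foreign_access(backend_domid, data_page, 0),
        };

        /* Place the descriptor on the shared page (simplified: a single slot). */
        memcpy(shared_ring_page, &req, sizeof(req));

        /* Asynchronous notification: the backend receives the event (possibly as
         * a virtual interrupt), reads the descriptor, and uses the granted page
         * to move the data on behalf of the guest. */
        send_event(event_port);
    }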
3.2 InfiniBand
InfiniBand [16] is an emerging interconnect offering high performance and features such as OS-bypass. InfiniBand host channel adapters (HCAs) are the equivalent of network interface cards (NICs) in traditional networks. A queue-based model is presented to the consumers. A Queue Pair (QP) consists of a send work queue and a receive work queue. Communication instructions are described in Work Queue Requests (WQRs), or descriptors, and are submitted to the QPs. Submitted WQRs are executed by the HCA and the completions are reported through a Completion Queue (CQ) as Completion Queue Entries (CQEs). InfiniBand requires that all buffers involved in communication be registered before they can be used in data transfers. Upon successful registration, a pair of local and remote keys is returned; they are used later for local and remote (RDMA) accesses.

Initiating data transfers (posting WQRs) and detecting the completion of work requests (polling for completions) are time-critical tasks that need to be performed by the application in an OS-bypass manner. In the Mellanox [21] approach, which represents a typical implementation of the InfiniBand specification, these operations are done by ringing a doorbell. Doorbells are rung by writing to the registers that form the User Access Region (UAR). The UAR is memory-mapped directly from a physical address space provided by the HCA. It allows access to HCA resources from privileged as well as unprivileged mode. Posting a work request involves putting the descriptor (WQR) into a QP buffer and writing to the doorbell in the UAR, which is completed without the involvement of the operating system. CQ buffers, where the CQEs are located, can also be directly accessed from the process virtual address space. These OS-bypass features make it possible for InfiniBand to provide very low communication latency.
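
As a concrete illustration of this OS-bypass data path, the sketch below uses the user-level verbs of the OpenIB Gen2 stack (libibverbs) to register a buffer, post an RDMA write, and poll for its completion entirely from user space. It is a minimal sketch under the assumption that the protection domain pd, queue pair qp, completion queue cq, remote address and remote key have already been obtained during the (privileged) connection setup phase; buffer registration is shown inline here but would normally be done once at setup.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Post one RDMA write and busy-poll the CQ for its completion. */
    static int rdma_write_and_wait(struct ibv_pd *pd, struct ibv_qp *qp,
                                   struct ibv_cq *cq, void *buf, size_t len,
                                   uint64_t remote_addr, uint32_t rkey)
    {
        /* Registration pins the buffer and returns the local/remote keys. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = (uint32_t) len,
            .lkey   = mr->lkey,
        };

        struct ibv_send_wr wr, *bad_wr;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        /* Posting the descriptor rings the UAR doorbell; no OS involvement. */
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;

        /* The CQE appears in the user-mapped CQ buffer; poll for it directly. */
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;   /* busy poll */

        ibv_dereg_mr(mr);
        return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }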

[Figure 2: Architectural overview of the OpenIB Gen2 stack]

OpenIB Gen2 [26] is a popular IB stack provided by the OpenIB community. Xen-IB, presented in this paper, is based on the OpenIB Gen2 driver. Figure 2 illustrates the architecture of the Gen2 stack.

3.3 Xenoprof: Profiling in Xen VM Environments

Xenoprof [2] is an open source system-wide statistical profiling toolkit developed by HP for the Xen VM environment. The toolkit enables coordinated profiling of multiple VMs in a system to obtain the distribution of hardware events such as clock cycles, cache and TLB misses, etc.

Xenoprof is an extension of OProfile [27], the statistical profiling toolkit for Linux systems. For example, OProfile can collect the PC value whenever the clock cycle counter reaches a specific count, to generate the distribution of time spent in various routines at the Xen domain level. Through multi-domain coordination, Xenoprof is able to determine the distribution of execution time spent in the various Xen domains and the Xen hypervisor (VMM).

In this paper, we use Xenoprof as an important profiling tool to identify the bottlenecks of virtualization overhead for HPC applications running in Xen environments.

4. CHALLENGES FOR VM-BASED COMPUTING

In spite of the advantages of virtual machines, they have hardly been adopted in cluster computing. Performance overhead and management issues are the major bottlenecks for VM-based computing. In this section, we identify the key challenges in reducing the virtualization overhead and enhancing the management efficiency of VM-based computing.

4.1 Performance Overhead

The performance overhead of virtualization includes three main aspects: CPU virtualization, memory virtualization and I/O virtualization.

Recently, high performance VM environments such as Xen and VMware have been able to achieve low-cost CPU and memory virtualization [6, 40]. Most instructions can be executed natively in the guest VMs except for a few privileged operations, which are trapped by the virtual machine monitor. This overhead is not a big issue because applications in the HPC area seldom call these privileged operations.

I/O virtualization, however, poses a more difficult problem. Because I/O devices are usually shared among all VMs in a physical machine, the virtual machine monitor (VMM) has to verify that accesses to them are legal. Currently, this requires the VMM or a privileged domain to intervene on every I/O access from guest VMs [34, 40, 12]. The intervention leads to longer I/O latency and higher CPU overhead due to context switches between the guest VMs and the VMM.

[Figure 3: NAS Parallel Benchmarks]

Figure 3 shows the performance of the NAS [25] Parallel Benchmark suite on 4 processes with MPICH [3] over TCP/IP. The processes run in 4 Xen guest domains (DomUs), which are hosted on 4 physical machines. We compare the relative execution time with the same test on 4 nodes in the native environment. For communication intensive benchmarks such as IS and CG, the applications running in the virtualized environment perform 12% and 17% worse, respectively. The EP benchmark, however, which is computation intensive with only a few barrier synchronizations, maintains the native level of performance.

Table 1 shows the distribution of execution time collected by Xenoprof. We find that for CG and IS, around 30% of the execution time is consumed by the isolated device domain (Dom0 in our case) and the Xen VMM for processing the network I/O operations. On the other hand, for the EP benchmark, 99% of the execution time is spent natively in the guest domain (DomU) due to a very low volume of communication.

        Dom0     Xen    DomU
  CG   16.6%   10.7%   72.7%
  IS   18.1%   13.1%   68.8%
  EP    0.6%    0.3%   99.0%
  BT    6.1%    4.0%   89.9%
  SP    9.7%    6.5%   83.8%

Table 1: Distribution of execution time for NAS

We notice that if we run two DomUs per node to utilize the dual CPUs, Dom0 starts competing for CPU resources with the user domains when processing I/O requests, which leads to even larger performance degradation compared to the native case.

This example identifies I/O virtualization as the main bottleneck of virtualization, and leads to the observation that I/O virtualization with near-native performance would allow us to achieve application performance in VMs that rivals native environments.

4.2 Management Efficiency

VM technologies decouple the computing environments from the physical resources. As mentioned in Section 2, this different view of computing resources brings numerous benefits such as ease of management, the ability to use customized OSes, and flexible security rules. Although with VM environments many of the administration tasks can be completed without restarting the physical systems, VM images must be dynamically distributed to the physical computing nodes and VMs instantiated each time the administrator reconfigures the cluster. Since VM images are often housed on storage nodes which are separate from the computing nodes, this poses additional challenges for VM image management and distribution. Especially for large scale clusters, the benefits of VMs cannot be achieved unless VM images can be distributed and instantiated efficiently.

5. A FRAMEWORK FOR HIGH PERFORMANCE COMPUTING WITH VIRTUAL MACHINES

In this section, we propose a framework for VM-based cluster computing for HPC applications. We start with a high-level view of the design, explaining the key techniques used to address the challenges of VM-based computing. Then we introduce our framework and its major components. Last, we present a prototype implementation of the framework: an InfiniBand cluster based on Xen virtual machines.

5.1 Design Overview

As mentioned in the last section, reducing the I/O virtualization overhead and enhancing the management efficiency are the key issues for VM-based computing.

To reduce the I/O virtualization overhead, we use a technique termed Virtual Machine Monitor bypass (VMM-bypass) I/O, which was proposed in our previous work [18]. VMM-bypass I/O extends the idea of OS-bypass I/O to the context of VM environments. The overall idea of VMM-bypass I/O is illustrated in Figure 4. A guest module in the VM manages all the privileged accesses, such as creating the virtual access points (i.e., UARs for InfiniBand) and mapping them into the address space of user processes. Guest modules in a guest VM cannot directly access the device hardware. Thus a backend module provides such accesses for guest modules.
[Figure 4: VMM-Bypass I/O]

This backend module can reside either in the device domain (as in Xen) or in the virtual machine monitor, as in other VM environments such as VMware ESX Server. Communication between guest modules and the backend module is achieved through the inter-VM communication schemes provided by the VM environment. After the initial setup process, communication operations can be initiated directly from the user process. By removing the VMM from the critical path of communication, VMM-bypass I/O is able to achieve near-native I/O performance in VM environments.
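
The resulting division of labor can be summarized with the short sketch below. The function names are hypothetical placeholders rather than the actual Xen-IB API; the point is only that privileged setup requests are forwarded to the backend module over the inter-VM channel, while data-path operations touch nothing but user-space, memory-mapped resources.

    #include <stdint.h>

    struct vhca_context;   /* opaque handle returned by the (virtual) device setup */

    /* Hypothetical stand-in: forwards the privileged request to the backend module. */
    extern struct vhca_context *backend_rpc_open_hca(void);

    /* Setup path: privileged, goes through the backend in the device domain/VMM.
     * The backend performs the real device operation (e.g. allocating a UAR) and
     * maps the resulting pages into the guest user process. */
    struct vhca_context *guest_open_device(void)
    {
        return backend_rpc_open_hca();
    }

    /* Data path: VMM-bypass, executed entirely inside the user process. */
    static inline void post_send_bypass(volatile uint32_t *uar_doorbell,
                                        uint32_t *qp_buffer,
                                        const uint32_t *descriptor,
                                        uint32_t desc_words)
    {
        /* Write the work request descriptor into the memory-mapped QP buffer... */
        for (uint32_t i = 0; i < desc_words; i++)
            qp_buffer[i] = descriptor[i];

        /* ...then ring the doorbell through the memory-mapped UAR. No trap into
         * the guest OS, the backend module, or the VMM is required. */
        *uar_doorbell = 1;
    }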
To achieve the second goal, improving the efficiency of VM image management, our approach has three aspects: customizing small kernels/OSes for HPC applications to reduce the size of the VM images that need to be distributed, developing fast and scalable distribution schemes for large-scale clusters, and caching VM images on the computing nodes. Reducing the VM image size is possible because the customized OS/kernel only needs to support a particular type of HPC application and thus needs very few services. We discuss the detailed schemes in Section 5.3.

5.2 A Framework for VM-Based Computing

[Figure 5: Framework for Cluster with Virtual Machine Environment]

Figure 5 illustrates the framework that we propose for a cluster running in a virtual machine environment. The framework takes a similar approach as batch scheduling, which is a common paradigm for modern production cluster computing environments. We use a highly modularized design so that we can focus on the overall framework and leave detailed issues such as resource matching and scheduling algorithms, which may vary across environments, to the implementations.

The framework consists of the following five major parts:

• Front-end Nodes: End users submit batch job requests to the management module. Besides the application binaries and the number of processes required, a typical job request also includes the runtime environment requirements of the application, such as the required OS type, kernel version, system shared libraries, etc. Users can execute privileged operations such as loading kernel modules in their jobs, which is a major advantage over a normal cluster environment where users only have access to a restricted set of non-privileged operations. Users can be allowed to submit customized VMs containing special purpose kernels/OSes or to update VMs which they previously submitted. As discussed in Section 2, hosting user-customized VMs, just like running user-space applications, will not compromise the system integrity. The detailed rules for submitting such user-customized VM images depend on the cluster administration policies and we will not discuss them further in this paper.

• Physical Resources: These are the cluster computing nodes connected via high speed interconnects. Each of the physical nodes hosts a VMM and perhaps a privileged host OS (such as Xen Dom0) to take charge of the start-up and shut-down of the guest VMs. To avoid unnecessary kernel noise and OS overhead, the host OS/VMM may only provide the limited services necessary to host the VM management software and the different backend daemons that enable physical device access from the guest VMs. For HPC applications, a physical node will typically host no more VM instances than the number of CPUs/cores. This is different from the requirements of other computing areas such as high-end data centers, where tens of guest VMs can be hosted in a single physical box.

• Management Module: The management module is the key part of our framework. It maintains the mapping between VMs and the physical resources. The management module receives user requests from the front-end node and determines whether there are enough idle VM instances that match the user requirements. If there are not enough VM instances available, the management module will query the VM image manager to find the VM image matching the runtime requirements of the submitted user applications. This matching process can be similar to those found in previous research on resource matching in Grid environments, such as the Matchmaker in Condor [37]. The management module will then schedule physical resources, distribute the image, and instantiate the VMs on the allocated computing nodes. A good scheduling methodology will not only consider the merits that have been studied in current cluster environments [8], but also take into account the cost of distributing and instantiating VMs, where reuse of already instantiated VMs is preferred.

• VM Image Manager: The VM image manager manages the pool of VM images, which are stored in the storage system. To accomplish this, the VM image manager hosts a database containing all the VM image information, such as the OS/kernel type, special user libraries installed, etc. It provides the management module with information so that the correct VM images are
selected to match the user requirements. The VM images can be created by the system administrator, who understands the requirements of the most commonly used high performance computing applications, or they can be submitted by users if their applications have special requirements.

• Storage: The performance of the storage system as well as the cluster file system is critical in this framework. VM images, which are stored on the storage nodes, may need to be transferred to the computing nodes at runtime, utilizing the underlying file system. To reduce the management overhead, high performance storage and file systems are desirable.

5.3 An InfiniBand Cluster with the Xen VM Environment

To further demonstrate our ideas of VM-based computing, we present the case of an InfiniBand cluster with the Xen virtual machine environment. Our prototype implementation does not address the problems of resource matching and node scheduling, which we believe have been well studied in the literature. Instead, we focus on reducing the virtualization overhead and the VM image management cost.

5.3.1 Reducing Virtualization Overhead

[Figure 6: Xen-IB Architecture: Virtualizing InfiniBand with VMM-bypass I/O]

As mentioned in Section 5.1, the I/O virtualization overhead can be significantly reduced with VMM-bypass I/O techniques. Xen-IB [18] is a VMM-bypass I/O prototype implementation we developed to virtualize the InfiniBand device for Xen. Figure 6 illustrates the general design of Xen-IB. Xen-IB involves the device domain for privileged InfiniBand operations such as initializing the HCA, creating queue pairs and completion queues, etc. However, Xen-IB is able to expose the User Access Regions to guest VMs and allow direct DMA operations. Thus, Xen-IB executes time-critical operations such as posting work requests and polling for completions natively, without sacrificing system integrity. With the very low I/O virtualization overhead of Xen-IB, HPC applications have the potential to achieve near-native performance in VM-based computing environments.

Additional memory consumption is also a part of the virtualization overhead. In the Xen virtual machine environment, the memory overhead mainly consists of the memory footprints of three components: the Xen VMM, the privileged domain, and the operating systems of the guest domains. The memory footprint of the virtual machine monitor is examined in detail in [6]; it can be as small as 20KB of state per guest domain. The OS in Dom0 only needs to support the VM management software and physical device accesses. And the customized OSes in the DomUs provide the "minimum" services needed for HPC applications. Thus their memory footprints can be reduced to a small value. Simply by removing unnecessary kernel options and OS services for HPC applications, we are able to reduce the memory usage of the guest OS in a DomU to around 23MB when the OS is idle, a reduction of more than two-thirds compared with running a normal Linux AS4 OS. We believe it can be reduced further with more careful tuning.

5.3.2 Reducing VM Image Management Cost

As previously mentioned, we take three approaches to minimize the VM image management overhead: making the virtual machine images as small as possible, developing scalable schemes to distribute the VM images to different computing nodes, and VM image caching.

Minimizing VM Images: In the literature, there are many research efforts on customized OSes [14, 20, 22, 4], whose purpose is to develop infrastructures for parallel resource management and to enhance the ability of the OS to support systems with a very large number of processors. In our Xen-based computing environment, our

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket