US007484208B1
`
(12) United States Patent
Nelson

(10) Patent No.: US 7,484,208 B1
(45) Date of Patent: Jan. 27, 2009
`
`(54) VIRTUAL MACHINE MIGRATION
`
`(76)
`
`Inventor: Michael Nelson, 888 Forest La., Alamo,
`CA (US) 94507
`
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 425 days.
`
`(21) Appl. No.: 10/319,217
`
(22) Filed: Dec. 12, 2002
`
`(51)
`
`Int. Cl.
`G06F 9/455
`(2006.01)
`G06F 12100
`(2006.01)
`(52) U.S. Cl. ............................................... 718/1; 711/6
`(58) Field of Classification Search ..................... 718/1;
`709/224; 711/153, 6
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`6,075,938 A *
`6,698,017 Bl *
`2004/0010787 Al *
`
`6/2000 Bugnion et al. ............... 703/27
`2/2004 Adamovits et al. .......... 717/168
`1/2004 Traut et al ...................... 718/1
`
OTHER PUBLICATIONS

Theimer, Marvin M., Lantz, Keith A., and Cheriton, David R., "Preemptable Remote Execution Facilities for the V-System," Association for Computing Machinery, pp. 2-12, Dec. 1985.
`* cited by examiner
`
`Primary Examiner-Li B Zhen
`
`(57)
`
`ABSTRACT
`
A source virtual machine (VM) hosted on a source server is migrated to a destination VM on a destination server without first powering down the source VM. After optional pre-copying of the source VM's memory to the destination VM, the source VM is suspended and its non-memory state is transferred to the destination VM; the destination VM is then resumed from the transferred state. The source VM memory is either paged in to the destination VM on demand, or is transferred asynchronously by pre-copying and write-protecting the source VM memory, and then later transferring only the modified pages after the destination VM is resumed. The source and destination servers preferably share common storage, in which the source VM's virtual disk is stored; this avoids the need to transfer the virtual disk contents. Network connectivity is preferably also made transparent to the user by arranging the servers on a common subnet, with virtual network connection addresses generated from a common name space of physical addresses.
`
`4 Claims, 3 Drawing Sheets
`
[Cover-page figure, corresponding to FIG. 3: migration of a VM from the source server 1000 (Source VM 1200, Source Kernel 1600) to the destination server 1002 (Server Daemon 1300, Destination VM 1202, Destination Kernel 1602) via the migration module 2000, with steps 1. Migrate prepare; 2. Create VM; 3. Ready; 3A. Wait for migration; 4. Ready; 5. Suspend and migrate; 6. Begin save; 7. Pre-copy memory; 8. Save state; 9. End save; 10. Restore state; 11. End restore; 12. Page in; 13. Done.]
`
`
`
`
`~
`
`0 ....
`....
`.....
`rJJ =(cid:173)
`
`('D
`('D
`
`1,0
`0
`0
`N
`~--..J
`N
`?
`~
`~
`
`~ = ~
`
`~
`~
`~
`•
`00
`~
`
`'N = 00 = "'""'
`
`~
`00
`~
`--..l
`
`d r.,;_
`
`700
`
`~~ 130 1~160~170
`~ IMM1J 11 ME~ORY I ~ ~ I NET
`
`AND DRIVERS
`
`MODULES
`KERNEL
`
`LOADABLE
`
`s10
`
`~ 650_,,. I HANDLER E}csoa
`~ J INTERRUPT
`I SCHEDULER I
`~ WORLDS I
`
`616 612
`
`KERNEL
`
`614
`
`C
`
`HARDWARE
`SYSTEM
`
`100
`
`600
`
`I • • • I VM
`
`I I VCPU(S) I VMEM VDISK I VDEVICE(S) I
`
`210:::,_ ~4@ 230:::,_
`
`1 I 200n
`
`220--1 GUEST OS 224-1 DRIVERS I
`
`y
`
`VM
`
`200
`
`f
`
`1000
`
`FIG. 1 ---------------------------.
`
`----------._
`
`---
`
`860 I
`
`~ MEMORY I I • • • I VMM
`
`I 300n
`I
`
`EMULATORS
`
`MIGRATION J DEVICE
`
`VMM
`
`MGMT
`
`350
`
`330
`
`360
`
`3061
`
`I
`
`420
`cos ~
`
`
`
`
[Drawing Sheet 2 of 3, FIG. 2: a plurality of users 900, 902, 904, ..., 906 access, via the network 700, a farm of servers 1000, 1002, ..., 1004, each hosting virtual machines and running its own kernel.]
`
`
`
`
[Drawing Sheet 3 of 3, FIG. 3: the migration message flow between the destination server 1002 (Server Daemon 1300, Destination VM 1202, Destination Kernel 1602) and the source server 1000 (Source VM 1200, Source Kernel 1600), coordinated by the migration module 2000: 1. Migrate prepare; 2. Create VM; 3. Ready; 3A. Wait for migration; 4. Ready; 5. Suspend and migrate; 6. Begin save; 7. Pre-copy memory; 8. Save state; 9. End save; 10. Restore state; 11. End restore; 12. Page in; 13. Done.]
`
`
`
`
`VIRTUAL MACHINE MIGRATION
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
This invention relates to a computer architecture, in particular, to an architecture that coordinates the operation of multiple virtual machines.
2. Description of the Related Art
The advantages of virtual machine technology have become widely recognized. Among these advantages is the ability to run multiple virtual machines on a single host platform. This makes better use of the capacity of the hardware, while still ensuring that each user enjoys the features of a "complete," isolated computer.
`General Virtualized Computer System
As is well known in the field of computer science, a virtual machine (VM) is a software abstraction-a "virtualization"-of an actual physical computer system. FIG. 1 illustrates, in part, the general configuration of a virtual machine 200, which is installed as a "guest" on a "host" hardware platform 100.
As FIG. 1 shows, the hardware platform 100 includes one or more processors (CPUs) 110, system memory 130, and a storage device, which will typically be a disk 140. The system memory will typically be some form of high-speed RAM, whereas the disk (one or more) will typically be a non-volatile, mass storage device. The hardware 100 will also include other conventional mechanisms such as a memory management unit (MMU) 150, various registers 160, and any conventional network connection device 170 (such as a network adapter or network interface card-"NIC") for transfer of data between the various components of the system and a network 700, which may be any known public or proprietary local or wide-area network such as the Internet, an internal enterprise network, etc.
Each VM 200 will typically include at least one virtual CPU 210, a virtual disk 240, a virtual system memory 230, a guest operating system (which may simply be a copy of a conventional operating system) 220, and various virtual devices 230, in which case the guest operating system ("guest OS") will include corresponding drivers 224. All of the components of the VM may be implemented in software using known techniques to emulate the corresponding components of an actual computer.
If the VM is properly designed, then it will not be apparent to the user that any applications 260 running within the VM are running indirectly, that is, via the guest OS and virtual processor. Applications 260 running within the VM will act just as they would if run on a "real" computer, except for a decrease in running speed that will be noticeable only in exceptionally time-critical applications. Executable files will be accessed by the guest OS from the virtual disk or virtual memory, which will simply be portions of the actual physical disk or memory allocated to that VM. Once an application is installed within the VM, the guest OS retrieves files from the virtual disk just as if they had been pre-stored as the result of a conventional installation of the application. The design and operation of virtual machines is well known in the field of computer science.
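For illustration only, the components just listed might be gathered into a per-VM configuration structure along the following lines; the structure and field names below are assumptions made for this sketch, not definitions taken from the patent.

    #include <stddef.h>

    /* Hypothetical per-VM configuration mirroring the components
     * described above (VCPU 210, virtual memory 230, virtual disk 240,
     * virtual devices, guest OS 220 with its drivers 224).            */
    struct vcpu;                     /* virtual CPU state              */
    struct vdevice;                  /* an emulated virtual device     */

    struct vm_config {
        size_t          num_vcpus;
        struct vcpu    *vcpus;       /* at least one virtual CPU             */
        void           *vmem;        /* virtual system memory                */
        size_t          vmem_bytes;
        const char     *vdisk_path;  /* backing storage for the virtual disk */
        struct vdevice *vdevices;    /* devices presented to the guest OS    */
        size_t          num_vdevices;
    };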
Some interface is usually required between a VM and the underlying host platform (in particular, the CPU), which is responsible for actually executing VM-issued instructions and transferring data to and from the actual memory and storage devices. A common term for this interface is a "virtual machine monitor" (VMM), shown as component 300. A VMM is usually a thin piece of software that runs directly on top of a host, or directly on the hardware, and virtualizes all the resources of the machine. Among other components, the VMM therefore usually includes device emulators 330, which may constitute the virtual devices (230) that the VM 200 addresses. The interface exported to the VM is then the same as the hardware interface of the machine, so that the guest OS cannot determine the presence of the VMM. The VMM also usually tracks and either forwards (to some form of operating system) or itself schedules and handles all requests by its VM for machine resources, as well as various faults and interrupts.
Although the VM (and thus the user of applications running in the VM) cannot usually detect the presence of the VMM, the VMM and the VM may be viewed as together forming a single virtual computer. They are shown in FIG. 1 as separate components for the sake of clarity.
`Virtual and Physical Memory
As in most modern computers, the address space of the memory 130 is partitioned into pages (for example, in the Intel x86 architecture) or regions (for example, in the Intel IA-64 architecture). Applications then address the memory 130 using virtual addresses (VAs), which include virtual page numbers (VPNs). The VAs are then mapped to physical addresses (PAs) that are used to address the physical memory 130. (VAs and PAs have a common offset from a base address, so that only the VPN needs to be converted into a corresponding physical page number, or PPN.) The concepts of VPNs and PPNs, as well as the way in which the different page numbering schemes are implemented and used, are described in many standard texts, such as "Computer Organization and Design: The Hardware/Software Interface," by David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc., San Francisco, Calif., 1994, pp. 579-603 (Chapter 7.4, "Virtual Memory"). Similar mappings are used in region-based architectures or, indeed, in any architecture where relocatability is possible.
An extra level of addressing indirection is typically implemented in virtualized systems in that a VPN issued by an application 260 in the VM 200 is remapped twice in order to determine which page of the hardware memory is intended. The first mapping is provided by a mapping module within the guest OS 220, which translates the guest VPN (GVPN) into a corresponding guest PPN (GPPN) in the conventional manner. The guest OS therefore "believes" that it is directly addressing the actual hardware memory, but in fact it is not.
Of course, a valid address to the actual hardware memory must ultimately be generated. A memory management module 350 in the VMM 300 therefore performs the second mapping by taking the GPPN issued by the guest OS 220 and mapping it to a hardware (or "machine") page number PPN that can be used to address the hardware memory 130. This GPPN-to-PPN mapping is typically done in the main system-level software layer (such as the kernel 600 described below), depending on the implementation. From the perspective of the guest OS, the GVPN and GPPN might be virtual and physical page numbers just as they would be if the guest OS were the only OS in the system. From the perspective of the system software, however, the GPPN is a page number that is then mapped into the physical memory space of the hardware memory as a PPN.
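As an informal illustration only, the two-stage translation described above might be sketched as follows in C; the function and type names are hypothetical and chosen for clarity, not taken from the patent.

    #include <stdint.h>

    typedef uint64_t vpn_t;   /* guest virtual page number (GVPN)     */
    typedef uint64_t gppn_t;  /* guest "physical" page number (GPPN)  */
    typedef uint64_t ppn_t;   /* machine/hardware page number (PPN)   */

    /* First mapping: maintained by the guest OS page tables.         */
    extern gppn_t guest_translate(vpn_t gvpn);

    /* Second mapping: maintained by the VMM (or the kernel).         */
    extern ppn_t vmm_translate(gppn_t gppn);

    /* A guest virtual page number is remapped twice before it names
     * a real page of machine memory.                                 */
    ppn_t gvpn_to_ppn(vpn_t gvpn)
    {
        gppn_t gppn = guest_translate(gvpn);  /* GVPN -> GPPN */
        return vmm_translate(gppn);           /* GPPN -> PPN  */
    }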
`System Software Configurations in Virtualized Systems
In some systems, such as the Workstation product of VMware, Inc., of Palo Alto, Calif., the VMM is co-resident at system level with a host operating system. Both the VMM and the host OS can independently modify the state of the host processor, but the VMM calls into the host OS via a driver and a dedicated user-level application to have the host OS perform certain I/O operations on behalf of the VM. The virtual computer in this configuration is thus fully hosted in that it runs on an existing host hardware platform and together with an existing host OS.
In other implementations, a dedicated kernel takes the place of and performs the conventional functions of the host OS, and virtual computers run on the kernel. FIG. 1 illustrates a kernel 600 that serves as the system software for several VM/VMM pairs 200/300, ..., 200n/300n. Compared with a system in which VMMs run directly on the hardware platform, use of a kernel offers greater modularity and facilitates provision of services that extend across multiple VMs (for example, for resource management). Compared with the hosted deployment, a kernel may offer greater performance because it can be co-developed with the VMM and be optimized for the characteristics of a workload consisting of VMMs. The ESX Server product of VMware, Inc., has such a configuration.
A kernel-based virtualization system of the type illustrated in FIG. 1 is described in U.S. patent application Ser. No. 09/877,378 ("Computer Configuration for Resource Management in Systems Including a Virtual Machine"), which is incorporated here by reference. The main components of this system and aspects of their interaction are, however, outlined below.
At boot-up time, an existing operating system 420 may be at system level and the kernel 600 may not yet even be operational within the system. In such a case, one of the functions of the OS 420 may be to make it possible to load the kernel 600, after which the kernel runs on the native hardware and manages system resources. In effect, the kernel, once loaded, displaces the OS 420. Thus, the kernel 600 may be viewed either as displacing the OS 420 from the system level and taking its place, or as residing at a "sub-system level." When interposed between the OS 420 and the hardware 100, the kernel 600 essentially turns the OS 420 into an "application," which has access to system resources only when allowed by the kernel 600. The kernel then schedules the OS 420 as if it were any other component that needs to use system resources.
The OS 420 may also be included to allow applications unrelated to virtualization to run; for example, a system administrator may need such applications to monitor the hardware 100 or to perform other administrative routines. The OS 420 may thus be viewed as a "console" OS (COS). In this case, the kernel 600 preferably also provides a remote procedure call (RPC) mechanism 614 to enable communication between, for example, the VMM 300 and any applications 800 installed to run on the COS 420.
`Worlds
The kernel 600 handles not only the various VMM/VMs, but also any other applications running on the kernel, as well as the COS 420 and even the hardware CPU(s) 110, as entities that can be separately scheduled. In this disclosure, each schedulable entity is referred to as a "world," which contains a thread of control, an address space, machine memory, and handles to the various device objects that it is accessing. Worlds, represented in FIG. 1 within the kernel 600 as module 612, are stored in a portion of the memory space controlled by the kernel. Each world also has its own task structure, and usually also a data structure for storing the hardware state currently associated with the respective world.
There will usually be different types of worlds: 1) system worlds, which are used for idle worlds, one per CPU, and a helper world that performs tasks that need to be done asynchronously; 2) a console world, which is a special world that runs in the kernel and is associated with the COS 420; and 3) virtual machine worlds.
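Purely as an illustrative sketch, a world descriptor of the kind described above might look like the following C structure; the names and fields are assumptions made for this example, not definitions from the patent.

    /* Hypothetical per-world bookkeeping kept by the kernel.          */
    enum world_type { WORLD_SYSTEM, WORLD_CONSOLE, WORLD_VM };

    struct cpu_state;  /* saved register state; see the world-switch sketch below */

    struct world {
        enum world_type   type;        /* system, console, or VM world    */
        void             *addr_space;  /* the world's address space       */
        void             *machine_mem; /* machine memory backing it       */
        void            **dev_handles; /* handles to open device objects  */
        void             *task;        /* the world's task structure      */
        struct cpu_state *saved_state; /* hardware state at last switch   */
    };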
`
`4
`Worlds preferably run at the most-privileged level (for
`example, in a system with the Intel x86 architecture, this will
`be level CPL0), that is, with full rights to invoke any privi(cid:173)
`leged CPU operations. A VMM, which, along with its VM,
`5 constitutes a separate world, therefore may use these privi(cid:173)
`leged instructions to allow it to run its associated VM so that
`it performs just like a corresponding "real" computer, even
`with respect to privileged operations.
`Switching Worlds
When the world that is running on a particular CPU (which may be the only one) is preempted by or yields to another world, then a world switch has to occur. A world switch involves saving the context of the current world and restoring the context of the new world such that the new world can begin executing where it left off the last time that it was running.
The first part of the world switch procedure that is carried out by the kernel is that the current world's state is saved in a data structure that is stored in the kernel's data area. Assuming the common case of an underlying Intel x86 architecture, the state that is saved will typically include: 1) the exception flags register; 2) general purpose registers; 3) segment registers; 4) the instruction pointer (EIP) register; 5) the local descriptor table register; 6) the task register; 7) debug registers; 8) control registers; 9) the interrupt descriptor table register; 10) the global descriptor table register; and 11) the floating point state. Similar state information will need to be saved in systems with other hardware architectures.
After the state of the current world is saved, the state of the new world can be restored. During the process of restoring the new world's state, no exceptions are allowed to take place because, if they did, the state of the new world would be inconsistent upon restoration of the state. The same state that was saved is therefore restored. The last step in the world switch procedure is restoring the new world's code segment and instruction pointer (EIP) registers.
When worlds are initially created, the saved state area for the world is initialized to contain the proper information such that when the system switches to that world, enough of its state is restored to enable the world to start running. The EIP is therefore set to the address of a special world start function. Thus, when a running world switches to a new world that has never run before, the act of restoring the EIP register will cause the world to begin executing in the world start function.
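The following C-style sketch illustrates the save/restore sequence just described; it is a simplified illustration under assumed names (cpu_state, save_context, restore_context), not the patent's actual implementation, and the register list is abbreviated.

    #include <stdint.h>

    /* Abbreviated saved-state area; a full x86 version would also hold
     * segment, descriptor-table, task, debug, control and FPU state.  */
    struct cpu_state {
        uint32_t eflags;     /* exception flags register   */
        uint32_t gprs[8];    /* general purpose registers  */
        uint32_t eip;        /* instruction pointer        */
        /* ... remaining architectural state ...           */
    };

    struct world;                                      /* see previous sketch */
    extern void save_context(struct cpu_state *);      /* assumed helpers     */
    extern void restore_context(const struct cpu_state *);
    extern struct cpu_state *state_of(struct world *);

    /* Save the outgoing world, then restore the incoming one.
     * Exceptions must not occur while the new state is loaded;
     * the code segment and EIP are restored last.               */
    void world_switch(struct world *from, struct world *to)
    {
        save_context(state_of(from));
        restore_context(state_of(to));  /* ends by reloading CS:EIP */
    }

    /* A newly created world's saved EIP points at a world-start
     * function, so its first switch-in begins execution there.  */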
Switching from and to the COS world requires additional steps, which are described in U.S. patent application Ser. No. 09/877,378, mentioned above. Understanding this process is not necessary for understanding the present invention, however, so further discussion is omitted.
`Memory Management in Kernel-Based System
The kernel 600 includes a memory management module 616 that manages all machine memory that is not allocated exclusively to the COS 420. When the kernel 600 is loaded, the information about the maximum amount of memory available on the machine is available to the kernel, as well as information about how much of it is being used by the COS. Part of the machine memory is used for the kernel 600 itself and the rest is used for the virtual machine worlds.
Virtual machine worlds use machine memory for two purposes. First, memory is used to back portions of each world's memory region, that is, to store code, data, stacks, etc., in the VMM page table. For example, the code and data for the VMM 300 is backed by machine memory allocated by the kernel 600. Second, memory is used for the guest memory of the virtual machine. The memory management module may
`
`50
`
`40
`
`45
`
`Microsoft Ex. 1011, p. 6
`Microsoft v. Daedalus Blue
`IPR2021-00832
`
`
`
`US 7,484,208 Bl
`
`6
Device Access in Kernel-Based System
In the preferred embodiment of the invention, the kernel 600 is responsible for providing access to all devices on the physical machine. In addition to other modules that the designer may choose to load into the kernel, the kernel will therefore typically include conventional drivers as needed to control access to devices. Accordingly, FIG. 1 shows within the kernel 600 a module 610 containing loadable kernel modules and drivers.
Kernel File System
In the ESX Server product of VMware, Inc., the kernel 600 includes a fast, simple file system, referred to here as the VM kernel file system (VMKFS), that has proven itself to be particularly efficient for storing virtual disks 240, which typically comprise a small number of large (at least 1 GB) files. By using very large file system blocks, the file system is able to keep the amount of metadata (that is, the data that indicates where data blocks are stored on disk) needed to access all of the data in a file to an arbitrarily small size. This allows all of the metadata to be cached in main memory so that all file system reads and writes can be done without any extra metadata reads or writes.
The VMKFS in ESX Server takes up only a single disk partition. When it is created, it sets aside space for the file system descriptor, space for file descriptor information, including the file name, space for block allocation information, and space for block pointer blocks. The vast majority of the partition's space is used for data blocks, whose size is set when the file system is created. The larger the partition size, the larger the block size should be in order to minimize the size of the metadata.
As mentioned earlier, the main advantage of the VMKFS is that it ensures that all metadata may be cached in high-speed, main system memory. This can be done by using large data block sizes, with small block pointers. Since virtual disks are usually at least one gigabyte in size, using large block sizes on the order of 64 Megabytes will cause virtually no wasted disk space and all metadata for the virtual disk can be cached simultaneously in system memory.
Besides being able to always keep file metadata cached in memory, the other key to high-performance file I/O is to reduce the number of metadata updates. Note that the only reason why the VMKFS metadata will need to be updated is if a file is created or destroyed, or if it changes in size. Since these files are used primarily for virtual disks (or, for example, for copy-on-write redo logs), files are not often created or destroyed. Moreover, because virtual disks are usually fixed in size upon creation, the file size of a virtual disk does not usually change. In order to reduce the number of metadata updates on a virtual disk to zero, the system may therefore preallocate all data blocks for virtual disks when the file is created.
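A rough, illustrative calculation of why large blocks keep the metadata small enough to cache follows; the pointer size and helper program are assumptions for this example, and only the 1 GB disk size and 64 MB block size come from the text above.

    #include <stdint.h>
    #include <stdio.h>

    /* With 64 MB data blocks, a 1 GB virtual disk needs only 16 block
     * pointers; even a multi-terabyte disk needs only a few thousand,
     * so all block pointers fit trivially in main memory. Preallocating
     * every data block when the file is created means this metadata
     * never has to change afterwards.                                  */
    int main(void)
    {
        uint64_t disk_bytes  = 1ULL << 30;      /* 1 GB virtual disk      */
        uint64_t block_bytes = 64ULL << 20;     /* 64 MB file system block */
        uint64_t ptr_bytes   = 8;               /* assumed pointer size   */

        uint64_t nblocks  = (disk_bytes + block_bytes - 1) / block_bytes;
        uint64_t metadata = nblocks * ptr_bytes;

        printf("%llu block pointers, %llu bytes of block-pointer metadata\n",
               (unsigned long long)nblocks, (unsigned long long)metadata);
        return 0;
    }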
Key VM Features
For the purposes of understanding the advantages of this invention, the salient points of the discussion above are:
1) each VM 200, ..., 200n has its own state and is an entity that can operate completely independently of other VMs;
2) the user of a VM, in particular, of an application running on the VM, will usually not be able to notice that the application is running on a VM (which is implemented wholly as software) as opposed to a "real" computer;
3) assuming that different VMs have the same configuration and state, the user will not know and would have no reason to care which VM he is currently using;
4) the entire state (including memory) of any VM is available to its respective VMM, and the entire state of any VM and of any VMM is available to the kernel 600;
5) as a consequence of the above facts, a VM is "relocatable."
Except for the network 700, the entire multi-VM system shown in FIG. 1 can be implemented in a single physical machine, such as a server. This is illustrated by the single functional boundary 1000. (Of course, devices such as keyboards, monitors, etc., will also be included to allow users to access and use the system, possibly via the network 700; these are not shown merely for the sake of simplicity.)
In systems configured as in FIG. 1, the focus is on managing the resources of a single physical machine: Virtual machines are installed on a single hardware platform and the CPU(s), network, memory, and disk resources for that machine are managed by the kernel 600 or similar server software. This represents a limitation that is becoming increasingly undesirable and increasingly unnecessary. For example, if the server 1000 needs to be shut down for maintenance, then the VMs loaded in the server will become inaccessible and therefore useless to those who need them. Moreover, since the VMs must share the single physical memory space 130 and the cycles of the single (or single group of) CPU, these resources are substantially "zero-sum," such that particularly memory- or processor-intensive tasks may cause noticeably worse performance.
One way to overcome this problem would be to provide multiple servers, each with a set of VMs. Before shutting down one server, its VMs could be powered down or checkpointed and then restored on another server. The problem with this solution is that it still disrupts on-going VM use, and even a delay of ten seconds may be noticeable and irritating to users; delays on the order of minutes will normally be wholly unacceptable.
What is needed is a system that allows greater flexibility in the deployment and use of VMs, but with as little disruption to users as possible. This invention provides such a system, as well as a related method of operation.
`
`SUMMARY OF THE INVENTION
`
In a networked system of computers (preferably, servers), including a source computer and a destination computer and a source virtual machine (VM) installed on the source computer, the invention provides a virtualization method and system according to which the source VM is migrated to a destination VM while the source VM is still powered on. Execution of the source VM is suspended, and then non-memory source VM state information is transferred to the destination VM; the destination VM is then resumed from the transferred non-memory source VM state.
Different methods are provided for transferring the source VM's memory to the destination VM. In the preferred embodiment of the invention, the destination VM may be resumed before transfer of the source VM memory is completed. One way to do this is to page in the source VM memory to the destination VM on demand. Following an alternative procedure, the source VM memory is pre-copied to the destination VM before the non-memory source VM state information is transferred.
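For illustration, a demand-paging approach of the kind described above might look roughly like the following; the helper names (page_is_present, fetch_page_from_source, map_page) are assumptions for this sketch, not interfaces defined by the patent.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical destination-side fault handler: the destination VM
     * resumes before its memory has arrived, and the first touch of a
     * page faults, pulls that page from the source server over the
     * network, and maps it before the guest access is retried.        */
    extern bool  page_is_present(uint64_t gppn);         /* assumed          */
    extern void *fetch_page_from_source(uint64_t gppn);  /* network request  */
    extern void  map_page(uint64_t gppn, void *data);    /* assumed          */

    void destination_page_fault(uint64_t gppn)
    {
        if (!page_is_present(gppn)) {
            void *data = fetch_page_from_source(gppn);
            map_page(gppn, data);
        }
        /* return from the fault; the guest access is retried */
    }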
In one refinement of the invention, any units (such as pages) of the source VM memory that are modified (by the source VM or by any other component) during the interval between pre-copying and completing transfer of all pages are retransferred to the destination VM. Modification may be detected in different ways, preferably by write-protecting the source VM memory and then sensing page faults when the source VM attempts to write to any of the protected memory units. An iterative procedure for retransferring modified memory units is also disclosed.
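The following is a minimal sketch of such an iterative pre-copy loop, assuming hypothetical helpers (write_protect_all, collect_dirty_pages, send_page, and so on); it illustrates the idea of write-protecting pages and retransferring only those modified since the previous pass, not the patent's actual code.

    #include <stddef.h>
    #include <stdint.h>

    extern size_t num_guest_pages(void);               /* assumed              */
    extern void   write_protect_all(void);             /* arm fault tracking   */
    extern size_t collect_dirty_pages(uint64_t *out);  /* pages faulted on     */
    extern void   send_page(uint64_t gppn);            /* copy to destination  */
    extern void   suspend_source_vm(void);

    /* Iteratively pre-copy memory while the source VM keeps running;
     * each pass resends only the pages dirtied during the previous pass.
     * When the dirty set is small (or a pass limit is hit), the source is
     * suspended and the final residue plus non-memory state follow.      */
    void precopy_memory(uint64_t *scratch, size_t threshold, int max_passes)
    {
        write_protect_all();
        for (uint64_t p = 0; p < num_guest_pages(); p++)
            send_page(p);                         /* initial full copy */

        for (int pass = 0; pass < max_passes; pass++) {
            size_t ndirty = collect_dirty_pages(scratch);
            if (ndirty <= threshold)
                break;
            write_protect_all();                  /* re-arm tracking   */
            for (size_t i = 0; i < ndirty; i++)
                send_page(scratch[i]);
        }
        suspend_source_vm();   /* remaining dirty pages and non-memory
                                  state are transferred after suspension */
    }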
In the preferred embodiment of the invention, the source VM's non-memory state information includes the contents of a source virtual disk. The contents of the source virtual disk are preferably stored in a storage arrangement shared by both the source and destination computers. The destination VM's virtual disk is then prepared by mapping the virtual disk of the destination VM to the same physical addresses as the source virtual disk in the shared storage arrangement.
In the most commonly anticipated implementation of the invention, communication between a user and the source and destination VMs takes place over a network. Network connectivity is preferably also made transparent to the user by arranging the servers on a common subnet, with virtual network connection addresses generated from a common name space of physical addresses.
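As an illustrative sketch only, one way to derive a VM's virtual network (MAC) address deterministically from an identifier in a name space shared by all servers on the common subnet might look like this; the prefix, identifier scheme, and function names are assumptions for the example and are not specified by the patent.

    #include <stdint.h>
    #include <stdio.h>

    /* Generate a stable virtual MAC for a VM from an identifier drawn
     * from a name space shared by all servers on the common subnet, so
     * the address stays the same no matter which server hosts the VM. */
    void vm_virtual_mac(uint32_t vm_id, uint8_t mac[6])
    {
        mac[0] = 0x02;                  /* locally administered, unicast */
        mac[1] = 0x00;
        mac[2] = 0x00;                  /* assumed shared prefix         */
        mac[3] = (uint8_t)(vm_id >> 16);
        mac[4] = (uint8_t)(vm_id >> 8);
        mac[5] = (uint8_t)(vm_id);
    }

    int main(void)
    {
        uint8_t mac[6];
        vm_virtual_mac(0x000042, mac);  /* hypothetical VM identifier */
        printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
               mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
        return 0;
    }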
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
FIG. 1 illustrates the main components of a server that includes one or more virtual machines running on a system-level kernel.
FIG. 2 illustrates a farm of interconnected servers according to the invention, with each server hosting a plurality of virtual machines.
FIG. 3 illustrates the steps the invention takes to migrate a virtual machine from a source to a destination.
`
`DETAILED DESCRIPTION
`
In broadest terms, the invention provides a farm of servers, each of which may host one or more virtual machines (VMs), as well as mechanisms for migrating a VM from one server (the source server) to another (the destination server) while the VM is still running. There are many reasons why efficient, substantially transparent VM migration is beneficial. Load balancing is mentioned above, as is the possibility that a machine may need to be taken out of service for maintenance.
Another reason may be to add or remove resources from the server. This need not be related to the requirements of the hardware itself, but rather it may also be to meet the desires of a particular user/customer. For example, a particular user may request (and perhaps pay for) more memory, more CPU time, etc., all of which may necessitate migration of his VM to a different server.
The general configuration of the server farm according to the invention is illustrated in FIG. 2, in which a plurality of users 900, 902, 904, ..., 906 access a farm of servers 1000, 1002, ..., 1004 via the network 700. Each of the servers is preferably configured as the server 1000 shown in FIG. 1, and will include at least one and possibly many VMs. Thus, server 1000 is shown in both FIG. 1 and FIG. 2.
In a server farm, all of the resources of all of the machines in the farm can be aggregated into one common resource pool. From the perspective of a user, the farm will appear to be one big machine with lots of resources. As is described below, the invention provides a mechanism for managing the set of
`resources so that they are utilized as efficiently and as reliably
`as possible.
Advantages of VM Migration
The ability to quickly migrate VMs while they are running between individual nodes of the server farm has several advantages, among which are the following:
1) It allows the load to be balanced across all nodes in the cluster. If one node is out of resources while other nodes have free resources, then VMs can be moved around between nodes to balance the load.
2) It allows individual nodes of the cluster to be shut down for maintenance without requiring that the VMs that are running on the node be shut down: Instead of shutting the VMs down, the VMs can simply be migrated to other machines in the cluster.
3) It allows the immediate utilization of new nodes as they are added to the cluster. Currently running VMs can be
migrated from machines that are over ut