United States Patent
Bugnion et al.

[11] Patent Number: 6,075,938
[45] Date of Patent: Jun. 13, 2000

US006075938A
`
`[54] VIRTUAL MACHINE MONITORS FOR
`SCALABLE MULTIPROCESSORS
`
[75] Inventors: Edouard Bugnion, Menlo Park; Scott
W. Devine; Mendel Rosenblum, both
of Stanford, all of Calif.
`
`[73] Assignee: The Board of Trustees of the Leland
`Stanford Junior University, Palo Alto,
`Calif.
`
Rosenblum, M. et al., Implementing efficient fault containment
for multiprocessors, Comm. of the ACM, vol. 39, No. 9, 1996.

Chapin, J. et al., Hive: fault containment for shared-memory
multiprocessors, Stanford Computer Systems Lab Publication,
ACM Symposium, Dec. 1995.

Bugnion, E. et al., Design and implementation of DISCO,
May 1997.
`
`Primary Examiner—Kevin J. Teska
`Assistant Examiner—Thai Phan
`
[57] ABSTRACT

The problem of extending modern operating systems to run
efficiently on large-scale shared memory multiprocessors
without a large implementation effort is solved by a unique
type of virtual machine monitor. Virtual machines are used
to run multiple commodity operating systems on a scalable
multiprocessor. To reduce the memory overheads associated
with running multiple operating systems, virtual machines
transparently share major data structures such as the operating
system code and the file system buffer cache. We use
the distributed system support of modern operating systems
to export a partial single system image to the users. Two
techniques, copy-on-write disks and the use of a special
network device, enable transparent resource sharing without
requiring the cooperation of the operating systems. This
solution addresses many of the challenges facing the system
software for these machines. The overheads of the monitor
are small and the approach provides scalability as well as the
ability to deal with the non-uniform memory access time of
these systems. The overall solution achieves most of the
benefits of operating systems customized for scalable
multiprocessors, yet it can be achieved with a significantly
smaller implementation effort.
`
[21] Appl. No.: 09/095,283

[22] Filed: Jun. 10, 1998

Related U.S. Application Data
[60] Provisional application No. 60/049,244, Jun. 10, 1997.

[51] Int. Cl. ......... G06F 9/455; G06F 13/10
[52] U.S. Cl. ......... 395/500.48; 709/214; 709/301;
710/23; 711/148; 711/153
[58] Field of Search ......... 395/500.48, 700,
395/500.47, 843; 709/214, 30, 304, 301,
104, 226; 711/6, 141, 122, 130, 148, 153,
203; 710/23
`
[56] References Cited

U.S. PATENT DOCUMENTS

5,222,224   6/1993   Flynn et al. ........ 395/425
5,319,760   6/1994   Mason et al. ........ 395/400
5,511,217   4/1996   Nakajima et al. ..... 395/800
5,553,291   9/1996   Tanaka et al. ....... 395/700
5,893,144   4/1999   Wood et al. ......... 711/122

`OTHER PUBLICATIONS
`
`Goldberg, R., Survey of Virtual Machine Research, Com-
`puter, pp. 34-45, Jun. 1974.
`Creasy, R., The origin of the VM/370 time-sharing system,
`IBM J. Res. Develop., vol. 25, No. 5, pp. 483-490, 1981.
`
`
`
`
`
`
`
[Cover drawing residue. Recoverable labels: ccNUMA Multiprocessor; OS; Thin-OS; DISCO; physical address; machine address; phys. backmap; replicas.]

16 Claims, 3 Drawing Sheets
`
Google Exhibit 1030
Google v. Valtrus
`
`
`
U.S. Patent    Jun. 13, 2000    Sheet 1 of 3    6,075,938

[Drawing sheet 1. Recoverable labels, upper drawing: OS; Thin-OS; DISCO; ccNUMA Multiprocessor. Lower drawing: Virtual Pages; Physical Pages; Machine Pages.]
`
`
`
U.S. Patent    Jun. 13, 2000    Sheet 2 of 3    6,075,938

[Drawing text not recoverable from the extraction.]

FIG. 3
`
`
`
`
U.S. Patent    Jun. 13, 2000    Sheet 3 of 3    6,075,938

[Drawing. Recoverable labels: Physical Memory of VM 1; Physical Memory of VM 2; Machine Memory. Legend: private, shared, free.]

FIG. 4

[Drawing. Recoverable labels: NFS Server; NFS Client; Buffer Cache (server and client); Virtual Pages; Physical Pages; Machine Pages.]

FIG. 5
`
`
`
`VIRTUAL MACHINE MONITORS FOR
`SCALABLE MULTIPROCESSORS
`
`This application claims priority from U.S. Provisional
`Patent Application 60/049,244 filed Jun. 10, 1997, which is
`incorporated herein by reference.
`This invention was made with Government support
`under Contract No. DABT63-94-C-0054 awarded by ARPA.
The Government has certain rights in this invention.
`
`BACKGROUND ART
`
Scalable computers have moved from the research lab to
the marketplace. Multiple vendors are now shipping scalable
systems with configurations in the tens or even hundreds of
processors. Unfortunately, the operating system (OS) software
for these machines has often trailed hardware in
reaching the functionality and reliability expected by modern
computer users. A major reason for the inability of OS
developers to deliver on the promises of these machines is
that extensive modifications to the operating system are
required to efficiently support scalable shared memory
multiprocessor machines, such as cache coherent non-uniform
memory access (CC-NUMA) machines. With the size of the
system software for modern computers in the millions of
lines of code, the OS changes required to adapt them for
CC-NUMA machines represent a significant development
cost. These changes have an impact on many of the standard
modules that make up a modern operating system, such as
virtual memory management and the scheduler. As a result,
the system software for these machines is generally delivered
significantly later than the hardware. Even when the
changes are functionally complete, they are likely to introduce
instabilities for a certain period of time.
Late, incompatible, and possibly even buggy system
software can significantly impact the success of such
machines, regardless of the innovations in the hardware. As
the computer industry matures, users expect to carry forward
their large base of existing application programs.
Furthermore, with the increasing role that computers play in
today’s society, users are demanding highly reliable and
available computing systems. The cost of achieving reliability
in computers may even dwarf the benefits of the innovation
in hardware for many application areas.
In addition, computer hardware vendors that use commodity
operating systems such as Microsoft’s Windows NT
(Custer, 1993) face an even greater problem in obtaining
operating system support for their CC-NUMA multiprocessors.
These vendors need to persuade an independent company
to make changes to the operating system to support the
new hardware. Not only must these vendors deliver on the
promises of the innovative hardware, they must also convince
powerful software companies to port the OS (Perez,
1995). Given this situation, it is not surprising that computer
architects frequently complain about the constraints and
inflexibility of system software. From their perspective,
these software constraints are an impediment to innovation.
Two opposite approaches are currently being taken to deal
with the system software challenges of scalable shared-memory
multiprocessors. The first one is to throw a large OS
development effort at the problem and effectively address
these challenges in the operating system. Examples of this
approach are the Hive (Rosenblum, 1996) and Hurricane
(Unrau, 1995) research prototypes and the Cellular-IRIX
operating system recently announced by Silicon Graphics to
support its shared memory machine, the Origin2000
(Laudon, 1997). These multi-kernel operating systems
handle the scalability of the machine by partitioning
resources into “cells” that communicate to manage the
hardware resources efficiently and export a single system
image, effectively hiding the distributed system from the
user. In Hive, the cells are also used to contain faults within
cell boundaries. In addition, these systems incorporate
resource allocators and schedulers for processors and
memory that can handle the scalability and the NUMA
aspects of the machine. These designs, however, require
significant OS changes, including partitioning the system
into scalable units, building a single system image across the
units, as well as other features such as fault containment and
CC-NUMA management (Verghese, 1996). This approach
also does not enable commodity operating systems to run on
the new hardware.
`
The second approach to dealing with the system software
challenges of scalable shared-memory multiprocessors is to
statically partition the machine and run multiple, independent
operating systems that use distributed system protocols
to export a partial single system image to the users. An
example of this approach is the Sun Enterprise 10000
machine that handles software scalability and hardware
reliability by allowing users to hard partition the machine
into independent failure units each running a copy of the
Solaris operating system. Users still benefit from the tight
coupling of the machine, but cannot dynamically adapt the
partitioning to the load of the different units. This approach
favors low implementation cost and compatibility over
innovation. Digital’s announced Galaxies operating system,
a multi-kernel version of VMS, also partitions the machine
relatively statically like the Sun machine, with the additional
support for segment drivers that allow applications to share
memory across partitions. Galaxies reserves a portion of the
physical memory of the machine for this purpose.
`Virtual Machine Monitors
`
Virtual machine monitors (VMMs) implement in software
a virtual machine identical to the underlying hardware.
IBM’s VM/370 (IBM, 1972) system, for example, allows
the simultaneous execution of independent operating systems
by virtualizing all the hardware resources. It can attach
I/O devices to single virtual machines in an exclusive mode.
VM/370 maps virtual disks to distinct volumes (partitions),
and supports a combination of persistent disks and temporary
disks. Unfortunately, the advantages of using virtual
machine monitors come with certain disadvantages as well.
Among the well-documented problems with virtual
machines are the overheads due to the virtualization of the
hardware resources, resource management, sharing and
communication.
Overheads.
The overheads present in traditional virtual machine
monitors come from many sources, including the additional
exception processing, instruction execution and memory
needed for virtualizing the hardware. Operations such as the
execution of privileged instructions cannot be safely
exported directly to the operating system and must be
emulated in software by the monitor. Similarly, the access to
I/O devices is virtualized, so requests must be intercepted
and remapped by the monitor. In addition to execution time
overheads, running multiple independent virtual machines
has a cost in additional memory. The code and data of each
operating system and application are replicated in the memory
of each virtual machine. Furthermore, large memory structures
such as the file system buffer cache are also replicated,
resulting in a significant increase in memory usage. A similar
waste occurs with the replication on disk of file systems for
the different virtual machines.
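
For illustration only, the emulation of privileged instructions described above may be sketched in Python as follows. The opcode names, the VirtualCPU class, and the run and emulate functions are hypothetical and are not taken from this specification; the sketch merely shows unprivileged operations proceeding directly while privileged ones are diverted to the monitor, which updates a software copy of the privileged state.

```python
from dataclasses import dataclass, field

# Hypothetical privileged opcodes; a real instruction set defines its own.
PRIVILEGED = {"read_status", "write_status"}


@dataclass
class VirtualCPU:
    """Per-virtual-CPU state that the monitor keeps in software."""
    status: int = 0                       # virtualized privileged register
    regs: dict = field(default_factory=dict)
    traps: int = 0                        # number of emulated instructions


def emulate(vcpu, op, arg):
    """Monitor-side emulation: touches only the virtual register copy."""
    vcpu.traps += 1
    if op == "write_status":
        vcpu.status = arg
    elif op == "read_status":
        vcpu.regs["acc"] = vcpu.status


def run(vcpu, program):
    """Execute (opcode, operand) pairs.

    The "add" branch stands in for direct execution on the processor;
    privileged opcodes are diverted to emulate(), modeling a trap.
    """
    for op, arg in program:
        if op in PRIVILEGED:
            emulate(vcpu, op, arg)
        elif op == "add":
            vcpu.regs["acc"] = vcpu.regs.get("acc", 0) + arg
        else:
            raise ValueError(f"unknown opcode {op!r}")
```

Each diverted instruction costs extra work in the monitor, which is the execution-time overhead the paragraph above refers to; the traps counter in the sketch makes that cost visible.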
`
`
`Resource Management.
Virtual machine monitors frequently experience resource
management problems due to the lack of information available
to the monitor to make good policy decisions. For
example, the instruction execution stream of an operating
system’s idle loop or the code for lock busy-waiting is
indistinguishable at the monitor’s level from some important
calculation. The result is that the monitor may schedule
resources for useless computation while useful computation
may be waiting. Similarly, the monitor does not know when
a page is no longer being actively used by a virtual machine,
so it cannot reallocate it to another virtual machine. In
general, the monitor must make resource management decisions
without the high-level knowledge that an operating
system would have.
`Communication and Sharing.
Finally, running multiple independent operating systems
makes sharing and communication difficult. For example,
under CMS on VM/370, if a virtual disk containing a user’s
files was in use by one virtual machine it could not be
accessed by another virtual machine. The same user could
not start two virtual machines, and different users could not
easily share files. The virtual machines looked like a set of
independent stand-alone systems that simply happened to be
sharing the same hardware.
For the above reasons, the idea of virtual machines has
been largely disregarded. Nevertheless, rudimentary VMMs
remain popular to provide backward compatibility for
legacy applications or architectures. Microsoft’s Windows
95 operating system (King, 1995) uses a virtual machine to
run older Windows 3.1 and DOS applications. DAISY
(Ebcioglu, 1997) uses dynamic compilation techniques to
run a single virtual machine with a different instruction set
architecture than the host processor.
Virtual machine monitors have been recently used to
provide fault-tolerance to sensitive applications. The Hypervisor
system (Bressoud, 1996) virtualizes only certain
resources of the machine, specifically the interrupt architecture.
By running the OS in supervisor mode, it disables
direct access to I/O resources and physical memory, without
having to virtualize them. While this is sufficient to provide
fault-tolerance, it does not allow concurrent virtual
machines to coexist.
`Microkernels
`
Other system structuring techniques, such as
microkernels, are known in the art. Microkernels are an
operating system structuring technique with a clean and
elegant interface able to support multiple operating system
personalities (Accetta, 1986). Exokernel (Engler, 1995;
Kaashoek, 1997) is a software architecture that allows
application-level resource management. The Exokernel
safely multiplexes resources between user-level library
operating systems. Exokernel supports specialized operating
systems such as ExOS for the Aegis exokernel. These
specialized operating systems enable superior performance
since they are freed from the general overheads of commodity
operating systems. Exokernel multiplexes resources
rather than virtualizing them, and cannot, therefore, run
commodity operating systems without significant modifications.
The Fluke system (Ford, 1996) uses the virtual machine
approach to build modular and extensible operating systems.
Recursive virtual machines are implemented by their nested
process model, and efficiency is preserved by allowing inner
virtual machines to directly access the underlying microkernel
of the machine. Ford et al. show that specialized system
functions such as checkpointing and migration require complete
state encapsulation. Fluke totally encapsulates the state
of virtual machines, and can therefore trivially implement
these functions.
`
SUMMARY OF THE INVENTION
`
It is a primary object of the present invention to overcome
the limitations and disadvantages associated with the known
operating systems for scalable multiprocessor machines.
The present invention provides an alternative approach for
constructing the system software for these large computers.
Rather than making extensive changes to existing operating
systems, an additional layer of software is inserted between
the hardware and operating system. This layer acts like a
virtual machine monitor in that multiple copies of “commodity”
operating systems can be run on a single scalable
computer. The monitor also allows these commodity operating
systems to efficiently cooperate and share resources
with each other. The resulting system contains most of the
features of custom scalable operating systems developed
specifically for these machines at only a fraction of their
complexity and implementation cost. The use of commodity
operating systems leads to systems that are both reliable and
compatible with the existing computing base.
The unique virtual machine monitor of the present invention
virtualizes all the resources of the machine, exporting a
more conventional hardware interface to the operating system.
The monitor manages all the resources so that multiple
virtual machines can coexist on the same multiprocessor.
The virtual machine monitor allows multiple copies of
potentially different operating systems to coexist on the
multiprocessor. Some virtual machines can run commodity
uniprocessor or multiprocessor operating systems, and others
can run specialized operating systems fine-tuned for
specific workloads. The virtual machine monitor schedules
the virtual resources (processor and memory) or the virtual
machines on the physical resources of the scalable
multiprocessor.
`
`The unique virtual machine monitors of the present
`invention, in combination with commodity and specialized
`operating systems, form a flexible system software solution
`for multiprocessor machines. A large CC-NUMA
`multiprocessor, for example, can be configured with mul-
`tiple virtual machines each running a commodity operating
`system such as Microsoft’s Windows NT or some variant of
`UNIX. Each virtual machine is configured with the proces-
`sor and memory resources that the operating system can
effectively handle. The virtual machines communicate using
standard distributed protocols to export the image of a
cluster of machines.
`
Although the system looks like a cluster of loosely-coupled
machines, the virtual machine monitor uses global
policies to manage all the resources of the machine, allowing
workloads to exploit the fine-grain resource sharing potential
of the hardware. For example, the monitor can move
memory between virtual machines to keep applications from
paging to disk when free memory is available in the
machine. Similarly, the monitor dynamically schedules virtual
processors on the physical processors to balance the
load across the machine. The use of commodity software
leverages the significant engineering effort invested in these
operating systems and allows CC-NUMA machines to support
their large application base. Since the monitor is a
relatively simple piece of code compared to large operating
systems, this can be done with a small implementation effort
as well as with a low risk of introducing software bugs and
incompatibilities.
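
By way of a hedged example, the global scheduling of virtual processors onto physical processors may be sketched in Python as follows. The greedy least-loaded placement shown here is an assumption chosen for brevity, not a policy stated in this specification, and the function name balance and the per-virtual-processor load estimates are hypothetical.

```python
import heapq


def balance(vcpu_loads, num_cpus):
    """Place each virtual CPU on the currently least-loaded physical CPU.

    vcpu_loads: dict mapping a virtual CPU name to an estimated load.
    Returns a dict mapping physical CPU index to a list of virtual CPU names.
    """
    # Min-heap of (accumulated load, physical CPU index).
    heap = [(0.0, cpu) for cpu in range(num_cpus)]
    heapq.heapify(heap)
    placement = {cpu: [] for cpu in range(num_cpus)}
    # Heaviest virtual CPUs first; ties broken by name for determinism.
    for name, load in sorted(vcpu_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, cpu = heapq.heappop(heap)
        placement[cpu].append(name)
        heapq.heappush(heap, (total + load, cpu))
    return placement
```

Because the monitor sees every virtual machine at once, a placement of this kind can use an idle physical processor for any virtual machine, which a statically partitioned machine cannot do.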
`
`
`
The approach of the present invention offers two different
possible solutions to handle applications whose resource
needs exceed the scalability of commodity operating systems.
First, a relatively simple change to the commodity
operating system can allow applications to explicitly share
memory regions across virtual machine boundaries. The
monitor contains a simple interface to set up these shared
regions. The operating system is extended with a special
virtual memory segment driver to allow processes running
on multiple virtual machines to share memory. For example,
a parallel database server could put its buffer cache in such
a shared memory region and have query engines running on
multiple virtual machines.
Second, the flexibility of the approach supports specialized
operating systems for resource-intensive applications
that do not need the full functionality of the commodity
operating systems. These simpler, specialized operating systems
better support the needs of the applications and can
easily scale to the size of the machine. For example, a virtual
machine running a highly-scalable lightweight operating
system such as Puma (Shuler, 1995) allows large scientific
applications to scale to the size of the machine. Since the
specialized operating system runs in a virtual machine, it can
run alongside commodity operating systems running standard
application programs. Similarly, other important applications
such as database and web servers could be run in
highly-customized operating systems such as database
accelerators.
`
Besides the flexibility to support a wide variety of workloads
efficiently, the approach of the present invention has
a number of additional advantages over other system software
designs targeted for CC-NUMA machines. Running
multiple copies of an operating system handles the challenges
presented by CC-NUMA machines such as scalability
and fault-containment. The virtual machine becomes the
unit of scalability. With this approach, only the monitor itself
and the distributed systems protocols need to scale to the
size of the machine. The simplicity of the monitor makes this
task easier than building a scalable operating system.
The virtual machine also becomes the unit of fault containment
where failures in the system software can be
contained in the virtual machine without spreading over the
entire machine. To provide hardware fault-containment, the
monitor itself must be structured into cells. Again, the
simplicity of the monitor makes this easier than protecting a
full-blown operating system against hardware faults.
NUMA memory management issues can also be handled
by the monitor, effectively hiding the entire problem from
the operating systems. With the careful placement of the
pages of a virtual machine’s memory and the use of dynamic
page migration and page replication, the monitor can export
a more conventional view of memory as a uniform memory
access (UMA) machine. This allows the non-NUMA-aware
memory management policies of commodity operating systems
to work well, even on a NUMA machine.
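
As a non-limiting sketch, a monitor-level choice between page migration and page replication might look like the following Python function. The per-node miss counters, the hot_threshold parameter, and the specific decision rules are assumptions made for illustration; this portion of the specification states only that migration and replication are used, not how the choice is made.

```python
def choose_action(read_misses, write_misses, home_node, hot_threshold=64):
    """Pick a NUMA action for one page from hypothetical miss counters.

    read_misses: dict mapping node id to read-miss count for the page.
    write_misses: total write-miss count for the page.
    home_node: node whose memory currently holds the page.
    Returns ("leave", None), ("migrate", node) or ("replicate", [nodes]).
    """
    total = sum(read_misses.values()) + write_misses
    if total < hot_threshold:
        return ("leave", None)            # page is not hot enough to act on
    remote = sorted(n for n, c in read_misses.items()
                    if n != home_node and c > 0)
    if write_misses == 0 and len(remote) > 1:
        return ("replicate", remote)      # read-shared: one copy per reader
    if len(remote) == 1 and read_misses.get(home_node, 0) == 0:
        return ("migrate", remote[0])     # used by a single remote node
    return ("leave", None)                # write-shared or mixed use
```

Whatever rule is used, the operating system in the virtual machine sees an unchanged physical address; only the monitor's choice of backing machine page changes.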
Besides handling CC-NUMA multiprocessors, the
approach of the present invention also inherits all the
advantages of traditional virtual machine monitors. Many of
these benefits are still appropriate today and some have
grown in importance. By exporting multiple virtual
machines, a single CC-NUMA multiprocessor can have
multiple different operating systems simultaneously running
on it. Older versions of the system software can be kept
around to provide a stable platform for keeping legacy
applications running. Newer versions can be staged in
carefully with critical applications residing on the older
operating systems until the newer versions have proven
themselves. This approach provides an excellent way of
introducing new and innovative system software while still
providing a stable computing base for applications that favor
stability over innovation.
In one aspect of the invention, a computational system is
provided that comprises a multiprocessor hardware layer, a
virtual machine monitor layer, and a plurality of operating
systems. The multiprocessor hardware layer comprises a
plurality of computer processors, a plurality of physical
resources associated with the processors, and an interconnect
providing mutual communication between the processors
and resources. The virtual machine monitor (VMM)
layer executes directly on the hardware layer and comprises
a resource manager that manages the physical resources of
the multiprocessor, a processor manager that manages the
computer processors, and a hardware emulator that creates
and manages a plurality of virtual machines. The operating
systems execute on the plurality of virtual machines and
transparently share the plurality of computer processors and
physical resources through the VMM layer. In a preferred
embodiment, the VMM layer further comprises a virtual
network device providing communication between the operating
systems executing on the virtual machines, and allowing
for transparent sharing optimizations between a sender
operating system and a receiver operating system. In
addition, the resource manager maintains a global buffer
cache that is transparently shared among the virtual
machines using read-only mappings in portions of an
address space of the virtual machines. The VMM layer also
maintains copy-on-write disks that allow virtual machines to
transparently share main memory resources and disk storage
resources, and performs dynamic page migration/replication
that hides distributed characteristics of the physical memory
resources from the operating systems. The VMM layer may
also comprise a virtual memory resource interface to allow
processes running on multiple virtual machines to share
memory.
`
`Comparison with System Software for Scalable
`Shared Memory Machines
The present invention is a unique combination of the
advantages of both the OS-intensive and the OS-light
approaches, without the accompanying disadvantages. In
particular, the present invention allows commodity and other
operating systems to be run efficiently on multiprocessors
with little or no modification. Thus, the present invention
does not require a major OS development effort that is
required by the known OS-intensive approaches. Yet,
because it can share resources between the virtual machines
and supports highly dynamic reconfiguration of the
machine, the present invention enjoys all the performance
advantages of OS-intensive approaches that have been
adapted to the multiprocessor hardware. The present
invention, therefore, does not suffer from the performance
disadvantages of the OS-light approaches, such as Hive and
Cellular-IRIX, that are hard-partitioned and do not take full
advantage of the hardware resources of the scalable
multiprocessor. Yet, the present invention enjoys the advantages
of the OS-light approaches because it is independent of any
particular OS, and can even support different operating
systems concurrently. In addition, the present invention is
capable of gradually getting out of the way as the OS
improves. Operating systems with improved scalability can
just request larger virtual machines that manage more of the
machine’s resources. The present invention, therefore, provides
a low-cost solution that enables a smooth transition
and maintains compatibility with commodity operating systems.
`
`Comparison with Virtual Machine Monitors
`
The present invention is implemented as a unique type of
virtual machine monitor specially designed for scalable
multiprocessors and their particular issues. The present
invention differs from VM/370 and other virtual machines in
several respects. Among others, it supports scalable shared-memory
multiprocessors, handles modern operating
systems, and transparently shares capabilities of copy-on-write
disks and the global buffer cache. Whereas VM/370
mapped virtual disks to distinct volumes (partitions), the
present invention has the notion of shared copy-on-write
disks.
`
In contrast with DAISY, which uses dynamic compilation
techniques to run a single virtual machine with a different
instruction set architecture than the host processor, the
present invention exports the same instruction set as the
underlying hardware and can therefore use direct execution
rather than dynamic compilation.
The Hypervisor system virtualizes only the interrupt
architecture of the machine. While this is sufficient to
provide fault-tolerance, it does not allow concurrent virtual
machines to coexist, as the present invention does.
`
`Comparison with Other System Software
`Structuring Techniques
`
As an operating system structuring technique, the present
invention is similar in some respects to a microkernel with
an unimaginative interface. Rather than developing the clean
and elegant interface used by microkernels, the present
invention simply mirrors the interface of the raw hardware.
The present invention differs from Exokernel in that it
virtualizes resources rather than multiplexing them, and can
therefore run commodity operating systems without significant
modifications.
`
`Conclusion
`
The present invention has overcome many of the problems
associated with traditional virtual machines. In the
present invention, the overheads imposed by the virtualization
are modest both in terms of processing time and
memory footprint. The present invention uses a combination
of innovative emulation of the DMA engine and standard
distributed file system protocols to support a global buffer
cache that is transparently shared across all virtual machines.
The approach provides a simple solution to the scalability,
reliability and NUMA management problems otherwise
faced by the system software of large-scale machines.
`BRIEF DESCRIPTION OF THE DRAWING
`FIGURES
`
FIG. 1 is a schematic diagram illustrating the architecture
of a computer system according to the invention. Disco, a
virtual machine monitor, is a software layer between a
multiprocessor hardware layer and multiple virtual
machines that run independent operating systems and
application programs.
FIG. 2 is a schematic diagram illustrating transparent
page replication according to the present invention. Disco
uses the physical-to-machine mapping to replicate user and
kernel pages. Virtual pages from VCPUs 0 and 1 both map
the same physical page of their virtual machine. However,
Disco transparently maps each virtual page to a machine
page replica that is located on the local node.
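
For illustration, the extra level of translation described for FIG. 2 may be sketched in Python as follows. The class name PhysToMachine, its methods, the page numbers, and the fallback rule for nodes without a local replica are all hypothetical; the sketch shows only that one physical page of a virtual machine can be backed by a different machine page replica on each node.

```python
class PhysToMachine:
    """Hypothetical physical-to-machine map with per-node replicas."""

    def __init__(self):
        # physical page number -> {node id: machine page number}
        self._replicas = {}

    def add_copy(self, ppage, node, mpage):
        """Record that `mpage` on `node` holds a copy of physical page `ppage`."""
        self._replicas.setdefault(ppage, {})[node] = mpage

    def translate(self, ppage, node):
        """Return the machine page a VCPU running on `node` should use."""
        copies = self._replicas[ppage]
        if node in copies:
            return copies[node]           # local replica available
        # Assumed fallback: use the copy on the lowest-numbered node.
        return copies[min(copies)]


# Two VCPUs map different virtual pages to the same physical page 5.
virt_to_phys = {("vcpu0", 0x10): 5, ("vcpu1", 0x20): 5}
pmap = PhysToMachine()
pmap.add_copy(5, node=0, mpage=100)       # replica local to node 0
pmap.add_copy(5, node=1, mpage=200)       # replica local to node 1
```

With VCPU 0 on node 0 and VCPU 1 on node 1, both virtual pages resolve to physical page 5, yet each access lands on the machine page in local memory.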
`FIG. 3 is a schematic diagram illustrating major data
`structures according to the present invention.
`
FIG. 4 is a schematic diagram illustrating memory sharing
according to the present invention. Read-only pages brought
in from disk, such as the kernel text and the buffer cache, can
be transparently shared between virtual machines. This
creates in machine memory a global buffer cache shared
across virtual machines and helps reduce the memory footprint
of the system.
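
The sharing illustrated in FIG. 4 may be sketched, under stated assumptions, as the following Python class. The class name Monitor, its fields, and the reference-counting scheme are hypothetical and simplified; the sketch shows a second read of the same disk block being satisfied by the machine page already in memory, and a later write receiving a private copy.

```python
class Monitor:
    """Hypothetical sketch of copy-on-write sharing of disk pages."""

    def __init__(self):
        self.block_to_page = {}   # disk block -> machine page caching it
        self.memory = {}          # machine page -> contents
        self.refs = {}            # machine page -> number of VMs mapping it
        self._next = 0

    def _alloc(self, data):
        page = self._next
        self._next += 1
        self.memory[page] = data
        self.refs[page] = 1
        return page

    def read_block(self, disk, block):
        """Return a machine page for `block`, sharing it if already cached."""
        page = self.block_to_page.get(block)
        if page is None:
            page = self._alloc(disk[block])
            self.block_to_page[block] = page
        else:
            self.refs[page] += 1          # mapped read-only into another VM
        return page

    def write_fault(self, page, new_data):
        """A store to a read-only page: give the writer a private page."""
        if self.refs[page] > 1:
            self.refs[page] -= 1
            return self._alloc(new_data)  # other VMs keep the original
        # Sole user: the page no longer mirrors the disk block.
        self.block_to_page = {b: p for b, p in self.block_to_page.items()
                              if p != page}
        self.memory[page] = new_data
        return page
```

In this sketch two virtual machines that read the same kernel text block consume one machine page rather than two, which is the memory footprint reduction described above.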
FIG. 5 is a schematic diagram illustrating an example of
transparent sharing of pages over NFS according to the
present invention. The diagram shows the case when the
NFS reply, to a read request, includes a data page. (1) The
monitor’s networking device remaps the data page from the
source’s machine address space to the destination’s. (2) The
monitor remaps the data page from the driver’s mbuf to the
client’s buffer cache. This remap is initiated by the operating
system through a monitor call.
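
The two remapping steps of FIG. 5 may be sketched in Python as follows. The class name VirtualNet, its method names, and the page numbers are hypothetical; the per-virtual-machine dictionaries stand in for physical-to-machine mappings, and the sketch assumes the data page is page-aligned so that it can be remapped rather than copied.

```python
class VirtualNet:
    """Hypothetical sketch of page remapping by a virtual network device."""

    def __init__(self, pmaps):
        # pmaps: VM name -> {physical page: machine page}
        self.pmaps = pmaps

    def send_page(self, src_vm, src_ppage, dst_vm, dst_ppage):
        """Step (1): deliver the page by mapping, without copying data.

        The destination's physical page (for example, one holding a
        receive mbuf) is pointed at the sender's machine page.
        """
        self.pmaps[dst_vm][dst_ppage] = self.pmaps[src_vm][src_ppage]

    def remap(self, vm, from_ppage, to_ppage):
        """Step (2): monitor call moving the mapping within one VM,
        for example from the driver's mbuf page to a buffer cache page."""
        self.pmaps[vm][to_ppage] = self.pmaps[vm].pop(from_ppage)
```

After both steps, the server's and the client's buffer cache pages refer to the same machine page, so the data exists once in machine memory even though it crossed a standard distributed file system protocol.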
`DETAILED DESCRIPTION
`
To demonstrate the approach of the present invention, we
discuss for illustrative purposes an embodiment of the
invention that combines commodity operating systems, not
originally designed for large-scale multiprocessors, to form
a high performance system software base. This embodiment,
called Disco, will be described as implemented on the
Stanford FLASH shared memory multiprocessor (Kuskin,
1994), an experimental cache coherent non-uniform
memory architecture (CC-NUMA) machine. The FLASH
multiprocessor consists of a collection of nodes each containing
a processor, main memory, and I/O devices. The
nodes are connected together with a high-performance scalable
interconnect. The machines use a directory to maintain
cache coherency, providing to the software the view of a
shared-memory multiprocessor with non-uniform memory
access times. Although written for the FLASH machine, the
hardware model assumed by Disco is also available on a
number of commercial machines including the Convex
Exemplar (Brewer, 1997), Silicon Graphics Origin2000
(Laudon, 1997), Sequent NUMAQ (Lovett, 1996), and
Data General NUMALiiNE. Accordingly, Disco illustrates
the fundamental principles of the invention which may be
adapted by those skilled in the art to implement the invention
on other similar machines.
`
Disco contains many features that reduce or eliminate the
problems associated with traditional virtual machine monitors.
Specifically, it minimizes the overhead of virtual
machines and enhances the resource sharing between virtual
machines running on the same system. Disco allows operating
systems running on different virtual machines to be
coupled using standard distributed systems protocols such as
TCP/IP and NFS. It also allows for efficient sharing of
memory and disk resources between virtual machines. The
sharing support allows Disco to maintain a global buffer
cache which is transparently shared by all the virtual
machines, even when the virtual machines communicate
through standard distributed protocols.
FIG. 1 shows how the virtual machine monitor allows
multiple copies of potentially different operating systems to
coexist. In this figure, five virtual machines coexist on the
multiprocessor. Some virtual machines run commodity
uniprocessor or multiprocessor operating systems, and others
run specialized operating systems fine-tuned for specific
`workloads. The virtual machine monitor schedules the vir-
`tual
`resources (processor and memory) or
`th