`United States Patent
`Bugnion et al.
`
`[11] Patent Number: 6,075,938
`[45] Date of Patent: Jun. 13, 2000
`
`US006075938A
`
`[54] VIRTUAL MACHINE MONITORS FOR
`SCALABLE MULTIPROCESSORS
`
`[75] Inventors: Edouard Bugnion, Menlo Park; Scott
`W. Devine; Mendel Rosenblum, both
`of Stanford, all of Calif.
`
`[73] Assignee: The Board of Trustees of the Leland
`Stanford Junior University, Palo Alto,
`Calif.
`
`Rosenblum, M. et al., Implementing efficient fault containment
`for multiprocessors, Comm. of the ACM, vol. 39, No. 9, 1996.
`
`Chapin, J. et al., Hive: fault containment for shared-memory
`multiprocessors, Stanford Computer Systems Lab Publication,
`ACM Symposium, Dec. 1995.
`
`Bugnion, E. et al., Design and implementation of DISCO,
`May 1997.
`
`Primary Examiner—Kevin J. Teska
`Assistant Examiner—Thai Phan
`
`[57] ABSTRACT
`
`The problem of extending modern operating systems to run
`efficiently on large-scale shared memory multiprocessors
`without a large implementation effort is solved by a unique
`type of virtual machine monitor. Virtual machines are used
`to run multiple commodity operating systems on a scalable
`multiprocessor. To reduce the memory overheads associated
`with running multiple operating systems, virtual machines
`transparently share major data structures such as the operating
`system code and the file system buffer cache. We use
`the distributed system support of modern operating systems
`to export a partial single system image to the users. Two
`techniques, copy-on-write disks and the use of a special
`network device, enable transparent resource sharing without
`requiring the cooperation of the operating systems. This
`solution addresses many of the challenges facing the system
`software for these machines. The overheads of the monitor
`are small and the approach provides scalability as well as the
`ability to deal with the non-uniform memory access time of
`these systems. The overall solution achieves most of the
`benefits of operating systems customized for scalable
`multiprocessors yet it can be achieved with a significantly
`smaller implementation effort.
`
`[21] Appl. No.: 09/095,283
`
`[22] Filed: Jun. 10, 1998
`
`Related U.S. Application Data
`[60] Provisional application No. 60/049,244, Jun. 10, 1997.
`
`[51] Int. Cl.7 ........ G06F 9/455; G06F 13/10
`[52] U.S. Cl. ........ 395/500.48; 709/214; 709/301;
`710/23; 711/148; 711/153
`[58] Field of Search ........ 395/500.48, 700,
`395/500.47, 843; 709/214, 30, 304, 301,
`104, 226; 711/6, 141, 122, 130, 148, 153,
`203; 710/23
`
`[56] References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,222,224   6/1993   Flynn et al. ........ 395/425
`5,319,760   6/1994   Mason et al. ........ 395/400
`5,511,217   4/1996   Nakajima et al. ..... 395/800
`5,553,291   9/1996   Tanaka et al. ....... 395/700
`5,893,144   4/1999   Wood et al. ......... 711/122
`
`OTHER PUBLICATIONS
`
`Goldberg, R., Survey of Virtual Machine Research, Computer,
`pp. 34-45, Jun. 1974.
`Creasy, R., The origin of the VM/370 time-sharing system,
`IBM J. Res. Develop., vol. 25, No. 5, pp. 483-490, 1981.
`
`
`
`
`
`
`
`[Front-page drawing. Legible labels: OS, SMP-OS, Thin-OS, DISCO, ccNUMA Multiprocessor, physical address, machine address.]
`
`16 Claims, 3 Drawing Sheets
`
`
`
`
`
`
`
`Google Exhibit 1030
`Google v. Valtrus
`
`

`

`U.S. Patent
`
`Jun. 13, 2000
`
`Sheet 1 of 3
`
`6,075,938
`
`
`
`[FIG. 1 drawing. Legible labels: OS, SMP-OS, Thin-OS, DISCO, ccNUMA Multiprocessor.]
`
`[FIG. 2 drawing. Legible labels: Virtual Pages, Physical Pages, Machine Pages.]
`
`

`

`U.S. Patent
`
`Jun. 13, 2000
`
`Sheet 2 of 3
`
`6,075,938
`
`[Drawing of major data structures. Only the axis label "machine address" is legible in the extracted text.]
`
`FIG. 3
`
`
`

`

`U.S. Patent
`
`Jun. 13, 2000
`
`Sheet 3 of 3
`
`6,075,938
`
`[FIG. 4 drawing. Legible labels: Physical Memory of VM 1, Physical Memory of VM 2, Machine Memory; legend: private, shared, free.]
`
`[FIG. 5 drawing. Legible labels: NFS Server, NFS Client, Buffer Cache, Virtual Pages, Physical Pages, Machine Pages.]
`
`

`

`VIRTUAL MACHINE MONITORS FOR
`SCALABLE MULTIPROCESSORS
`
`This application claims priority from U.S. Provisional
`Patent Application 60/049,244 filed Jun. 10, 1997, which is
`incorporated herein by reference.
`This invention was made with Government support
`under Contract No. DABT63-94-C-0054 awarded by ARPA.
`The Government has certain rights in this invention.
`
`BACKGROUND ART
`
`Scalable computers have moved from the research lab to
`the marketplace. Multiple vendors are now shipping scalable
`systems with configurations in the tens or even hundreds of
`processors. Unfortunately, the operating system (OS) software
`for these machines has often trailed hardware in
`reaching the functionality and reliability expected by modern
`computer users. A major reason for the inability of OS
`developers to deliver on the promises of these machines is
`that extensive modifications to the operating system are
`required to efficiently support scalable shared memory
`multiprocessor machines, such as cache coherent non-uniform
`memory access (CC-NUMA) machines. With the size of the
`system software for modern computers in the millions of
`lines of code, the OS changes required to adapt them for
`CC-NUMA machines represent a significant development
`cost. These changes have an impact on many of the standard
`modules that make up a modern operating system, such as
`virtual memory management and the scheduler. As a result,
`the system software for these machines is generally delivered
`significantly later than the hardware. Even when the
`changes are functionally complete, they are likely to introduce
`instabilities for a certain period of time.
`Late, incompatible, and possibly even buggy system
`software can significantly impact the success of such
`machines, regardless of the innovations in the hardware. As
`the computer industry matures, users expect to carry forward
`their large base of existing application programs.
`Furthermore, with the increasing role that computers play in
`today’s society, users are demanding highly reliable and
`available computing systems. The cost of achieving reliability
`in computers may even dwarf the benefits of the innovation
`in hardware for many application areas.
`In addition, computer hardware vendors that use commodity
`operating systems such as Microsoft’s Windows NT
`(Custer, 1993) face an even greater problem in obtaining
`operating system support for their CC-NUMA multiprocessors.
`These vendors need to persuade an independent company
`to make changes to the operating system to support the
`new hardware. Not only must these vendors deliver on the
`promises of the innovative hardware, they must also convince
`powerful software companies to port the OS (Perez,
`1995). Given this situation, it is not surprising that computer
`architects frequently complain about the constraints and
`inflexibility of system software. From their perspective,
`these software constraints are an impediment to innovation.
`Two opposite approaches are currently being taken to deal
`with the system software challenges of scalable shared-memory
`multiprocessors. The first one is to throw a large OS
`development effort at the problem and effectively address
`these challenges in the operating system. Examples of this
`approach are the Hive (Rosenblum, 1996) and Hurricane
`(Unrau, 1995) research prototypes and the Cellular-IRIX
`operating system recently announced by Silicon Graphics to
`support its shared memory machine, the Origin2000
`(Laudon, 1997). These multi-kernel operating systems
`handle the scalability of the machine by partitioning
`resources into “cells” that communicate to manage the
`hardware resources efficiently and export a single system
`image, effectively hiding the distributed system from the
`user. In Hive, the cells are also used to contain faults within
`cell boundaries. In addition, these systems incorporate
`resource allocators and schedulers for processors and
`memory that can handle the scalability and the NUMA
`aspects of the machine. These designs, however, require
`significant OS changes, including partitioning the system
`into scalable units, building a single system image across the
`units, as well as other features such as fault containment and
`CC-NUMA management (Verghese, 1996). This approach
`also does not enable commodity operating systems to run on
`the new hardware.
`
`The second approach to dealing with the system software
`challenges of scalable shared-memory multiprocessors is to
`statically partition the machine and run multiple, independent
`operating systems that use distributed system protocols
`to export a partial single system image to the users. An
`example of this approach is the Sun Enterprise 10000
`machine that handles software scalability and hardware
`reliability by allowing users to hard partition the machine
`into independent failure units each running a copy of the
`Solaris operating system. Users still benefit from the tight
`coupling of the machine, but cannot dynamically adapt the
`partitioning to the load of the different units. This approach
`favors low implementation cost and compatibility over
`innovation. Digital’s announced Galaxies operating system,
`a multi-kernel version of VMS, also partitions the machine
`relatively statically like the Sun machine, with the additional
`support for segment drivers that allow applications to share
`memory across partitions. Galaxies reserves a portion of the
`physical memory of the machine for this purpose.
`Virtual Machine Monitors
`
`Virtual machine monitors (VMMs) implement in software
`a virtual machine identical to the underlying hardware.
`IBM’s VM/370 (IBM, 1972) system, for example, allows
`the simultaneous execution of independent operating systems
`by virtualizing all the hardware resources. It can attach
`I/O devices to single virtual machines in an exclusive mode.
`VM/370 maps virtual disks to distinct volumes (partitions),
`and supports a combination of persistent disks and temporary
`disks. Unfortunately, the advantages of using virtual
`machine monitors come with certain disadvantages as well.
`Among the well-documented problems with virtual
`machines are the overheads due to the virtualization of the
`hardware resources, resource management, sharing and
`communication.
`Overheads.
`The overheads present in traditional virtual machine
`monitors come from many sources, including the additional
`exception processing, instruction execution and memory
`needed for virtualizing the hardware. Operations such as the
`execution of privileged instructions cannot be safely
`exported directly to the operating system and must be
`emulated in software by the monitor. Similarly, the access to
`I/O devices is virtualized, so requests must be intercepted
`and remapped by the monitor. In addition to execution time
`overheads, running multiple independent virtual machines
`has a cost in additional memory. The code and data of each
`operating system and application is replicated in the memory
`of each virtual machine. Furthermore, large memory structures
`such as the file system buffer cache are also replicated
`resulting in a significant increase in memory usage. A similar
`waste occurs with the replication on disk of file systems for
`the different virtual machines.
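The emulation of privileged operations described above can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the behavior of any particular monitor: the instruction names (`ADD`, `SET_STATUS`, `READ_STATUS`), the `VirtualCPU` record, and the `Monitor.run` loop are all hypothetical. Ordinary instructions are applied directly, while privileged ones are routed through the monitor, which operates on a per-virtual-CPU copy of the privileged state and counts the extra work.

```python
from dataclasses import dataclass, field

# Hypothetical instruction set: two privileged operations and one ordinary one.
PRIVILEGED = {"SET_STATUS", "READ_STATUS"}


@dataclass
class VirtualCPU:
    regs: dict = field(default_factory=lambda: {"r0": 0})
    status: int = 0   # virtual copy of a privileged status register
    traps: int = 0    # number of instructions the monitor had to emulate


class Monitor:
    def emulate(self, vcpu, op, arg):
        """Emulate one privileged instruction against the virtual state."""
        vcpu.traps += 1
        if op == "SET_STATUS":
            vcpu.status = arg
        elif op == "READ_STATUS":
            vcpu.regs["r0"] = vcpu.status

    def run(self, vcpu, program):
        """Apply ordinary instructions directly; route privileged ones to emulate()."""
        for op, arg in program:
            if op in PRIVILEGED:
                self.emulate(vcpu, op, arg)
            elif op == "ADD":
                vcpu.regs["r0"] += arg
            else:
                raise ValueError(f"unknown instruction: {op}")
```

The `traps` counter stands in for the additional exception processing the paragraph mentions: every privileged instruction costs a detour through the monitor that a native system would not pay.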
`
`
`Resource Management.
`Virtual machine monitors frequently experience resource
`management problems due to the lack of information available
`to the monitor to make good policy decisions. For
`example, the instruction execution stream of an operating
`system’s idle loop or the code for lock busy-waiting is
`indistinguishable at the monitor’s level from some important
`calculation. The result is that the monitor may schedule
`resources for useless computation while useful computation
`may be waiting. Similarly, the monitor does not know when
`a page is no longer being actively used by a virtual machine,
`so it cannot reallocate it to another virtual machine. In
`general, the monitor must make resource management decisions
`without the high-level knowledge that an operating
`system would have.
`Communication and Sharing.
`Finally, running multiple independent operating systems
`makes sharing and communication difficult. For example,
`under CMS on VM/370, if a virtual disk containing a user’s
`files was in use by one virtual machine it could not be
`accessed by another virtual machine. The same user could
`not start two virtual machines, and different users could not
`easily share files. The virtual machines looked like a set of
`independent stand-alone systems that simply happened to be
`sharing the same hardware.
`For the above reasons, the idea of virtual machines has
`been largely disregarded. Nevertheless, rudimentary VMMs
`remain popular to provide backward compatibility for
`legacy applications or architectures. Microsoft’s Windows
`95 operating system (King, 1995) uses a virtual machine to
`run older Windows 3.1 and DOS applications. DAISY
`(Ebcioglu, 1997) uses dynamic compilation techniques to
`run a single virtual machine with a different instruction set
`architecture than the host processor.
`Virtual machine monitors have been recently used to
`provide fault-tolerance to sensitive applications. The Hyper-
`visor system (Bressoud, 1996) virtualizes only certain
`resources of the machine, specifically the interrupt architec-
`ture. By running the OS in supervisor mode, it disables
`direct access to I/O resources and physical memory, without
`having to virtualize them. While this is sufficient to provide
`fault-tolerance,
`it does not allow concurrent virtual
`machines to coexist.
`Microkernels
`
`Other system structuring techniques, such as
`microkernels, are known in the art. Microkernels are an
`operating system structuring technique with a clean and
`elegant interface able to support multiple operating system
`personalities (Accetta, 1986). Exokernel (Engler, 1995;
`Kaashoek, 1997) is a software architecture that allows
`application-level resource management. The Exokernel
`safely multiplexes resources between user-level library
`operating systems. Exokernel supports specialized operating
`systems such as ExOS for the Aegis exokernel. These
`specialized operating systems enable superior performance
`since they are freed from the general overheads of commodity
`operating systems. Exokernel multiplexes resources
`rather than virtualizing them, and cannot, therefore, run
`commodity operating systems without significant modifications.
`The Fluke system (Ford, 1996) uses the virtual machine
`approach to build modular and extensible operating systems.
`Recursive virtual machines are implemented by their nested
`process model, and efficiency is preserved by allowing inner
`virtual machines to directly access the underlying microkernel
`of the machine. Ford et al. show that specialized system
`functions such as checkpointing and migration require complete
`state encapsulation. Fluke totally encapsulates the state
`of virtual machines, and can therefore trivially implement
`these functions.
`
`SUMMARY OF THE INVENTION
`
`It is a primary object of the present invention to overcome
`the limitations and disadvantages associated with the known
`operating systems for scalable multiprocessor machines.
`The present invention provides an alternative approach for
`constructing the system software for these large computers.
`Rather than making extensive changes to existing operating
`systems, an additional layer of software is inserted between
`the hardware and operating system. This layer acts like a
`virtual machine monitor in that multiple copies of “commodity”
`operating systems can be run on a single scalable
`computer. The monitor also allows these commodity operating
`systems to efficiently cooperate and share resources
`with each other. The resulting system contains most of the
`features of custom scalable operating systems developed
`specifically for these machines at only a fraction of their
`complexity and implementation cost. The use of commodity
`operating systems leads to systems that are both reliable and
`compatible with the existing computing base.
`The unique virtual machine monitor of the present invention
`virtualizes all the resources of the machine, exporting a
`more conventional hardware interface to the operating system.
`The monitor manages all the resources so that multiple
`virtual machines can coexist on the same multiprocessor.
`The virtual machine monitor allows multiple copies of
`potentially different operating systems to coexist on the
`multiprocessor. Some virtual machines can run commodity
`uniprocessor or multiprocessor operating systems, and others
`can run specialized operating systems fine-tuned for
`specific workloads. The virtual machine monitor schedules
`the virtual resources (processor and memory) or the virtual
`machines on the physical resources of the scalable
`multiprocessor.
`
`The unique virtual machine monitors of the present
`invention, in combination with commodity and specialized
`operating systems, form a flexible system software solution
`for multiprocessor machines. A large CC-NUMA
`multiprocessor, for example, can be configured with mul-
`tiple virtual machines each running a commodity operating
`system such as Microsoft’s Windows NT or some variant of
`UNIX. Each virtual machine is configured with the proces-
`sor and memory resources that the operating system can
`effectively handle. The virtual machines communicate using
`standard distributed protocols to export
`the image of a
`cluster of machines.
`
`Although the system looks like a cluster of loosely-coupled
`machines, the virtual machine monitor uses global
`policies to manage all the resources of the machine, allowing
`workloads to exploit the fine-grain resource sharing potential
`of the hardware. For example, the monitor can move
`memory between virtual machines to keep applications from
`paging to disk when free memory is available in the
`machine. Similarly, the monitor dynamically schedules virtual
`processors on the physical processors to balance the
`load across the machine. The use of commodity software
`leverages the significant engineering effort invested in these
`operating systems and allows CC-NUMA machines to support
`their large application base. Since the monitor is a
`relatively simple piece of code compared to large operating
`systems, this can be done with a small implementation effort
`as well as with a low risk of introducing software bugs and
`incompatibilities.
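The idea of a global policy that moves memory between virtual machines can be illustrated with a small sketch. It is only one possible policy, written in Python under stated assumptions: the function name `rebalance`, the per-VM page counts, and the notion of a known per-VM `demand` are hypothetical and are not taken from the text, which does not say how the monitor decides.

```python
def rebalance(allocated, demand):
    """Return a new pages-per-VM allocation.

    Pages that a VM holds beyond its demand are moved to VMs whose demand
    exceeds their allocation. The total number of pages is unchanged.
    Both arguments map a VM name to a page count (hypothetical inputs).
    """
    alloc = dict(allocated)
    # VMs holding more pages than they currently need can donate the excess.
    surplus = {vm: alloc[vm] - demand[vm] for vm in alloc if alloc[vm] > demand[vm]}
    for vm in alloc:
        need = demand[vm] - alloc[vm]
        for donor in list(surplus):
            if need <= 0:
                break
            give = min(need, surplus[donor])
            alloc[donor] -= give
            alloc[vm] += give
            surplus[donor] -= give
            need -= give
            if surplus[donor] == 0:
                del surplus[donor]
    return alloc
```

With `allocated = {"vm1": 100, "vm2": 100}` and `demand = {"vm1": 150, "vm2": 60}`, the 40 pages that `vm2` is not using move to `vm1`, which reduces how much `vm1` would otherwise have to page to disk.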
`
`

`

`The approach of the present invention offers two different
`possible solutions to handle applications whose resource
`needs exceed the scalability of commodity operating systems.
`First, a relatively simple change to the commodity
`operating system can allow applications to explicitly share
`memory regions across virtual machine boundaries. The
`monitor contains a simple interface to set up these shared
`regions. The operating system is extended with a special
`virtual memory segment driver to allow processes running
`on multiple virtual machines to share memory. For example,
`a parallel database server could put its buffer cache in such
`a shared memory region and have query engines running on
`multiple virtual machines.
`Second, the flexibility of the approach supports specialized
`operating systems for resource-intensive applications
`that do not need the full functionality of the commodity
`operating systems. These simpler, specialized operating systems
`better support the needs of the applications and can
`easily scale to the size of the machine. For example, a virtual
`machine running a highly-scalable lightweight operating
`system such as Puma (Shuler, 1995) allows large scientific
`applications to scale to the size of the machine. Since the
`specialized operating system runs in a virtual machine, it can
`run alongside commodity operating systems running standard
`application programs. Similarly, other important applications
`such as database and web servers could be run in
`highly-customized operating systems such as database
`accelerators.
`
`Besides the flexibility to support a wide variety of workloads
`efficiently, the approach of the present invention has
`a number of additional advantages over other system software
`designs targeted for CC-NUMA machines. Running
`multiple copies of an operating system handles the challenges
`presented by CC-NUMA machines such as scalability
`and fault-containment. The virtual machine becomes the
`unit of scalability. With this approach, only the monitor itself
`and the distributed systems protocols need to scale to the
`size of the machine. The simplicity of the monitor makes this
`task easier than building a scalable operating system.
`The virtual machine also becomes the unit of fault containment
`where failures in the system software can be
`contained in the virtual machine without spreading over the
`entire machine. To provide hardware fault-containment, the
`monitor itself must be structured into cells. Again, the
`simplicity of the monitor makes this easier than to protect a
`full-blown operating system against hardware faults.
`NUMA memory management issues can also be handled
`by the monitor, effectively hiding the entire problem from
`the operating systems. With the careful placement of the
`pages of a virtual machine’s memory and the use of dynamic
`page migration and page replication, the monitor can export
`a more conventional view of memory as a uniform memory
`access (UMA) machine. This allows the non-NUMA-aware
`memory management policies of commodity operating systems
`to work well, even on a NUMA machine.
`Besides handling CC-NUMA multiprocessors, the
`approach of the present invention also inherits all the
`advantages of traditional virtual machine monitors. Many of
`these benefits are still appropriate today and some have
`grown in importance. By exporting multiple virtual
`machines, a single CC-NUMA multiprocessor can have
`multiple different operating systems simultaneously running
`on it. Older versions of the system software can be kept
`around to provide a stable platform for keeping legacy
`applications running. Newer versions can be staged in
`carefully with critical applications residing on the older
`operating systems until the newer versions have proven
`themselves. This approach provides an excellent way of
`introducing new and innovative system software while still
`providing a stable computing base for applications that favor
`stability over innovation.
`In one aspect of the invention, a computational system is
`provided that comprises a multiprocessor hardware layer, a
`virtual machine monitor layer, and a plurality of operating
`systems. The multiprocessor hardware layer comprises a
`plurality of computer processors, a plurality of physical
`resources associated with the processors, and an interconnect
`providing mutual communication between the processors
`and resources. The virtual machine monitor (VMM)
`layer executes directly on the hardware layer and comprises
`a resource manager that manages the physical resources of
`the multiprocessor, a processor manager that manages the
`computer processors, and a hardware emulator that creates
`and manages a plurality of virtual machines. The operating
`systems execute on the plurality of virtual machines and
`transparently share the plurality of computer processors and
`physical resources through the VMM layer. In a preferred
`embodiment, the VMM layer further comprises a virtual
`network device providing communication between the operating
`systems executing on the virtual machines, and allowing
`for transparent sharing optimizations between a sender
`operating system and a receiver operating system. In
`addition, the resource manager maintains a global buffer
`cache that is transparently shared among the virtual
`machines using read-only mappings in portions of an
`address space of the virtual machines. The VMM layer also
`maintains copy-on-write disks that allow virtual machines to
`transparently share main memory resources and disk storage
`resources, and performs dynamic page migration/replication
`that hides distributed characteristics of the physical memory
`resources from the operating systems. The VMM layer may
`also comprise a virtual memory resource interface to allow
`processes running on multiple virtual machines to share
`memory.
`
`Comparison with System Software for Scalable
`Shared Memory Machines
`The present invention is a unique combination of the
`advantages of both the OS-intensive and the OS-light
`approaches, without the accompanying disadvantages. In
`particular, the present invention allows commodity and other
`operating systems to be run efficiently on multiprocessors
`with little or no modification. Thus, the present invention
`does not require a major OS development effort that is
`required by the known OS-intensive approaches. Yet,
`because it can share resources between the virtual machines
`and supports highly dynamic reconfiguration of the
`machine, the present invention enjoys all the performance
`advantages of OS-intensive approaches that have been
`adapted to the multiprocessor hardware. The present
`invention, therefore, does not suffer from the performance
`disadvantages of the OS-light approaches, such as Hive and
`Cellular-IRIX, that are hard-partitioned and do not take full
`advantage of the hardware resources of the scalable
`multiprocessor. Yet, the present invention enjoys the advantages
`of the OS-light approaches because it is independent of any
`particular OS, and can even support different operating
`systems concurrently. In addition, the present invention is
`capable of gradually getting out of the way as the OS
`improves. Operating systems with improved scalability can
`just request larger virtual machines that manage more of the
`machine’s resources. The present invention, therefore, provides
`a low-cost solution that enables a smooth transition
`and maintains compatibility with commodity operating systems.
`
`Comparison with Virtual Machine Monitors
`
`The present invention is implemented as a unique type of
`virtual machine monitor specially designed for scalable
`multiprocessors and their particular issues. The present
`invention differs from VM/370 and other virtual machines in
`several respects. Among others, it supports scalable shared-memory
`multiprocessors, handles modern operating
`systems, and transparently shares capabilities of copy-on-write
`disks and the global buffer cache. Whereas VM/370
`mapped virtual disks to distinct volumes (partitions), the
`present invention has the notion of shared copy-on-write
`disks.
`
`In contrast with DAISY, which uses dynamic compilation
`techniques to run a single virtual machine with a different
`instruction set architecture than the host processor, the
`present invention exports the same instruction set as the
`underlying hardware and can therefore use direct execution
`rather than dynamic compilation.
`The Hypervisor system virtualizes only the interrupt
`architecture of the machine. While this is sufficient to
`provide fault-tolerance, it does not allow concurrent virtual
`machines to coexist, as the present invention does.
`
`Comparison with Other System Software
`Structuring Techniques
`
`As an operating system structuring technique, the present
`invention is similar in some respects to a microkernel with
`an unimaginative interface. Rather than developing the clean
`and elegant interface used by microkernels, the present
`invention simply mirrors the interface of the raw hardware.
`The present invention differs from Exokernel in that it
`virtualizes resources rather than multiplexing them, and can
`therefore run commodity operating systems without significant
`modifications.
`
`Conclusion
`
`The present invention has overcome many of the problems
`associated with traditional virtual machines. In the
`present invention, the overheads imposed by the virtualization
`are modest both in terms of processing time and
`memory footprint. The present invention uses a combination
`of innovative emulation of the DMA engine and standard
`distributed file system protocols to support a global buffer
`cache that is transparently shared across all virtual machines.
`The approach provides a simple solution to the scalability,
`reliability and NUMA management problems otherwise
`faced by the system software of large-scale machines.
`BRIEF DESCRIPTION OF THE DRAWING
`FIGURES
`
`FIG. 1 is a schematic diagram illustrating the architecture
`of a computer system according to the invention. Disco, a
`virtual machine monitor, is a software layer between a
`multiprocessor hardware layer and multiple virtual
`machines that run independent operating systems and application
`programs.
`FIG. 2 is a schematic diagram illustrating transparent
`page replication according to the present invention. Disco
`uses the physical-to-machine mapping to replicate user and
`kernel pages. Virtual pages from VCPUs 0 and 1 both map
`the same physical page of their virtual machine. However,
`Disco transparently maps each virtual page to a machine
`page replica that is located on the local node.
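The replication described for FIG. 2 can be sketched as a physical-to-machine mapping that keeps one replica per node. The Python below is a minimal sketch under stated assumptions: the names `ReplicatedPage`, `machine_page_for`, and `allocate_on` are hypothetical, machine pages are modeled as `(node, index)` tuples, and the sketch neither copies page contents nor handles writes to a replicated page.

```python
from itertools import count


class ReplicatedPage:
    """One physical page of a virtual machine, backed by per-node replicas."""

    def __init__(self):
        self.replicas = {}  # node id -> machine page located on that node

    def machine_page_for(self, node, allocate):
        """Return the replica local to `node`, creating it on first use."""
        if node not in self.replicas:
            self.replicas[node] = allocate(node)
        return self.replicas[node]


_next_index = count()


def allocate_on(node):
    """Hypothetical allocator: a fresh machine page identified as (node, index)."""
    return (node, next(_next_index))
```

Two virtual CPUs on nodes 0 and 1 that touch the same physical page each call `machine_page_for` with their own node and receive different machine pages, so both accesses stay local while the operating system still sees a single physical page.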
`FIG. 3 is a schematic diagram illustrating major data
`structures according to the present invention.
`
`FIG. 4 is a schematic diagram illustrating memory sharing
`according to the present invention. Read-only pages brought
`in from disk, such as the kernel text and the buffer cache, can
`be transparently shared between virtual machines. This
`creates in machine memory a global buffer cache shared
`across virtual machines and helps reduce the memory footprint
`of the system.
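The sharing shown in FIG. 4 can be approximated by a cache keyed on disk block number, combined with copy-on-write for modifications. The Python below is a minimal sketch under stated assumptions: `SharedPageCache`, `read_block`, and `write_fault` are hypothetical names, the disk is modeled as a dictionary from block number to bytes, and machine pages are plain integers.

```python
class SharedPageCache:
    """Global cache of disk blocks, shared read-only across virtual machines."""

    def __init__(self):
        self.by_block = {}   # disk block number -> machine page id
        self.data = {}       # machine page id -> page contents
        self.next_page = 0

    def _new_page(self, content):
        page = self.next_page
        self.next_page += 1
        self.data[page] = content
        return page

    def read_block(self, disk, block):
        """Return the machine page holding `block`; load it only on first read."""
        if block not in self.by_block:
            self.by_block[block] = self._new_page(disk[block])
        return self.by_block[block]

    def write_fault(self, shared_page, new_content):
        """Copy-on-write: give the writer a private page, leave the shared one intact."""
        assert shared_page in self.data
        return self._new_page(new_content)
```

When two virtual machines read the same block (for example, kernel text), both receive the same machine page, so only one copy exists in machine memory. A write by either one yields a private page and does not disturb the shared copy.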
`FIG. 5 is a schematic diagram illustrating an example of
`transparent sharing of pages over NFS according to the
`present invention. The diagram shows the case when the
`NFS reply, to a read request, includes a data page. (1) The
`monitor’s networking device remaps the data page from the
`source’s machine address space to the destination’s. (2) The
`monitor remaps the data page from the driver’s mbuf to the
`client’s buffer cache. This remap is initiated by the operating
`system through a monitor call.
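The two remapping steps of FIG. 5 can be sketched with per-VM physical-to-machine maps. This Python sketch is illustrative only and rests on stated assumptions: the function names `send_page` and `remap_within_vm`, the dictionary representation of the maps, and the page numbers used in the usage note are hypothetical, and the sketch assumes the shared page stays read-only.

```python
def send_page(pmaps, src_vm, src_ppage, dst_vm, dst_ppage):
    """Step (1): map the receiver's physical page onto the sender's machine page.

    `pmaps` maps a VM name to that VM's physical-to-machine page table.
    No data is copied; both VMs end up referencing the same machine page.
    """
    machine_page = pmaps[src_vm][src_ppage]
    pmaps[dst_vm][dst_ppage] = machine_page
    return machine_page


def remap_within_vm(pmap, from_ppage, to_ppage):
    """Step (2): point a second physical page (e.g. a buffer cache page)
    at the machine page already backing `from_ppage` (e.g. the mbuf page)."""
    pmap[to_ppage] = pmap[from_ppage]
```

For example, with `pmaps = {"server": {10: 500}, "client": {}}`, calling `send_page(pmaps, "server", 10, "client", 3)` followed by `remap_within_vm(pmaps["client"], 3, 8)` leaves the server's page 10 and the client's pages 3 and 8 all backed by machine page 500, so the NFS data page is never copied.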
`DETAILED DESCRIPTION
`
`To demonstrate the approach of the present invention, we
`discuss for illustrative purposes an embodiment of the
`invention that combines commodity operating systems, not
`originally designed for large-scale multiprocessors, to form
`a high performance system software base. This embodiment,
`called Disco, will be described as implemented on the
`Stanford FLASH shared memory multiprocessor (Kuskin,
`1994), an experimental cache coherent non-uniform
`memory architecture (CC-NUMA) machine. The FLASH
`multiprocessor consists of a collection of nodes each containing
`a processor, main memory, and I/O devices. The
`nodes are connected together with a high-performance scalable
`interconnect. The machines use a directory to maintain
`cache coherency, providing to the software the view of a
`shared-memory multiprocessor with non-uniform memory
`access times. Although written for the FLASH machine, the
`hardware model assumed by Disco is also available on a
`number of commercial machines including the Convex
`Exemplar (Brewer, 1997), Silicon Graphics Origin2000
`(Laudon, 1997), Sequent NUMA-Q (Lovett, 1996), and
`Data General NUMALiiNE. Accordingly, Disco illustrates
`the fundamental principles of the invention which may be
`adapted by those skilled in the art to implement the invention
`on other similar machines.
`
`Disco contains many features that reduce or eliminate the
`problems associated with traditional virtual machine monitors.
`Specifically, it minimizes the overhead of virtual
`machines and enhances the resource sharing between virtual
`machines running on the same system. Disco allows operating
`systems running on different virtual machines to be
`coupled using standard distributed systems protocols such as
`TCP/IP and NFS. It also allows for efficient sharing of
`memory and disk resources between virtual machines. The
`sharing support allows Disco to maintain a global buffer
`cache which is transparently shared by all the virtual
`machines, even when the virtual machines communicate
`through standard distributed protocols.
`FIG. 1 shows how the virtual machine monitor allows
`multiple copies of potentially different operating systems to
`coexist. In this figure, five virtual machines coexist on the
`multiprocessor. Some virtual machines run commodity
`uniprocessor or multiprocessor operating systems, and others
`run specialized operating systems fine-tuned for specific
`workloads. The virtual machine monitor schedules the virtual
`resources (processor and memory) or th
