An operating system hypervisor provides the ability to run multiple operating systems on the same computer hardware. This allows users to run applications that require different operating systems.
A useful computer system requires both computer hardware and an operating system. Unlike early computer systems, many hardware systems can run any of several operating systems. For the most common hardware systems, those based on the Intel x86 architecture, users can choose among Windows, Unix, and many other operating systems. They may even choose between variants of these: Windows 95, Windows 98, Windows NT, Windows Me, Windows 2000, or Windows XP; Linux systems from Red Hat, Mandrake, or SuSE; or the BSD-based systems such as FreeBSD, NetBSD, and OpenBSD.
These differing operating systems provide differing computing services. While some application programs have been ported to run on several OSes, many applications run on only one OS, or one class of OSes. The desire to run a particular application often drives the need to use a particular operating system: if a desired application runs on only one operating system, that operating system must be run to provide the environment the application needs.
But what if multiple applications are desired, and no single operating system runs all of them? In that case, we may need to have multiple computer systems, each with a different operating system.
There are other reasons to run multiple operating systems on the same hardware system. We may want to test a new version of an OS before it becomes the production system. If problems arise with the new OS, the previous OS is still available. This can be of particular value when the new OS is being developed and debugged.
A related use is running a separate OS in a sandbox, isolating the new system from the normal production environment in order to run an untrusted application. Running the untrusted application under a separate OS can limit its potential negative impact on the production OS and its applications.
But using different hardware computer systems to run different operating systems can be very expensive, in hardware costs, administrative time and costs, space, and power. Particularly when many systems are involved, it may be much easier to have just one system, or one large system, rather than a number of smaller ones. Server consolidation suggests that one large hardware system, with multiple processors, may be a more cost-effective way to provide computing services. Effectively managing this large system may be easier if it is able to run several different OSes.
The simplest mechanism to allow multiple OSes to run on the same hardware is provided by a dual-boot capability. When the system is powered on, control is transferred, not to an OS, but to a boot program which allows one of several OSes to be run. For example, the Linux Loader (lilo) allows the user to select which of several OSes to boot.
While conceptually easy, implementing this concept can be quite difficult ["The Multi-Boot Configuration Handbook" by Roderick W. Smith, QUE, 2000]. The Multi-Boot specification was created to define the boot sequence and allow multiple OSes to be booted on the same hardware.
A particular problem is the use and interpretation of disk images. An OS typically supports a specific file system format, and as it boots, it expects to find the files that it needs on the disk in its file system. It is necessary to manage the disk so that each OS does not interfere with the file systems of other OSes.
Trying to run two OSes on the same hardware exposes two classes of OSes, and different solutions may be necessary for the two classes. One class is the shrink-wrapped OS (SWOS). A SWOS is unaware of other OSes and is designed to be in sole control of the computer system. It is provided in binary form, and so cannot be changed. The other class is the cooperative OS (COS). A cooperative OS can be modified to work in a shared environment. For most people, Windows would be a SWOS, while the open-source OSes (such as Linux or BSD) would be COSes.
Notice, however, that even a SWOS whose source is not available and which is provided only in binary may allow some customization, possibly through the dynamic loading and unloading of device drivers or a hardware abstraction layer (HAL).
While dual-booting may allow different applications to be run under different OSes, it is a time-consuming and mostly manual solution. It does not allow multiple OSes to run at the same time, as is needed for server consolidation.
An alternative approach to running multiple OSes on the same hardware at the same time is the use of simulators. SimOS ["Complete Computer System Simulation: The SimOS Approach", by Mendel Rosenblum, Stephen A. Herrod, Emmett Witchel, and Anoop Gupta, IEEE Parallel and Distributed Technology: Systems and Applications, volume 3, number 4, pages 34-43, Winter 1995] and other programs simulate a computer system to such a degree that they can run an operating system and its applications.
Running multiple OSes then involves booting a host OS on the hardware and running the simulator to support other OSes (or additional instances of the host OS).
Simulation provides many advantages. Many resources used by a simulated OS can be provided by mapping or virtualizing the resources of the host OS. The file systems for the simulated OSes can be remapped to files or partitions on the disk. The consoles of the simulated OSes can be virtualized into windows on the screen of the host OS.
Virtualization is a powerful technique used in the design of systems. A resource in a computer system can be physical or virtual. A virtual resource uses software to support the important semantics of a resource, typically using some other resource. A windowing system, such as X11, can be used to create virtual graphics terminals. Files in a file system can be used to create virtual disks. Simulators, in particular, may run their application programs with virtual resources.
In the most general case, a simulator does in software what a computer system does in hardware. Each instruction of the simulated program is fetched, decoded, and executed. This may require thousands of instructions, so that the simulated program runs thousands of times slower than it would if executed directly on the hardware.
One approach to speeding up this process is block translation : blocks of simulated instructions are translated into blocks of native instructions which accomplish the same thing. While block translation may take a little longer to get started (since each block must first be translated and then executed), the translated blocks can be saved, and subsequent execution can reuse a previously translated block, saving the translation time, and execute faster.
If the simulated instruction set is the same as the host instruction set, it is possible to use a technique called direct execution to gain even more speed.
OSes cannot be run as user programs because they execute privileged instructions, which user programs are not permitted to execute. When a user program executes a privileged instruction, a trap occurs. On many systems (such as Unix), it is possible to register a signal handler to catch and handle such events. Direct execution simulation takes the simulated system and branches to its start address, using the hardware to execute instructions directly. If the simulated program executes a privileged instruction, the hardware traps to the host OS, which relays the event to the signal handler of the simulator. The simulator then simulates the effect of the privileged instruction and resumes the simulated program. (System calls can be intercepted by using ptrace on a Unix system: ptrace will call a handler when a system call is made. This can be used to determine the system call and its arguments, bypass the system call, and execute user-space kernel code in its place.)
With direct execution, a simulated program runs at hardware speed except when it is executing privileged instructions.
The primary disadvantage of a simulator is normally its speed. However, using direct execution and block translation techniques can minimize the slowdown. This allows simulation to be very effective for many purposes, especially for running infrequent and interactive applications and for developing and debugging OSes. VMWare ["Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor", by Jeremy Sugerman, Ganesh Venkitachalam, and Beng-Hong Lim, Proceedings 2001 USENIX, Boston, MA, June 2001, pages 1-14], for example, can be used to run a Windows OS and its applications on a host Linux OS.
Modern processors provide at least two modes of operation: kernel (or supervisor) mode and user mode. Privileged instructions -- those that control the memory-mapping hardware, the interrupt system, and the I/O devices -- may be executed only in kernel mode. The OS kernel runs in kernel mode, applications run in user mode, and any attempt by a user-mode program to execute a privileged instruction causes a trap to the kernel.
To run a program correctly using direct execution, any instruction whose result can vary between direct execution under the simulator and native execution on the hardware must cause a trap to the simulator. A machine with this property is "virtualizable" and allows a virtual machine monitor to be written. A virtual machine monitor, such as CP/67 or VM/370 ["The origin of the VM/370 time-sharing system", by R.J. Creasy, IBM Journal of Research and Development, volume 25, number 5, September 1981, pages 483-490] is designed to allow multiple OSes to run directly on the hardware.
Most hardware architectures, however, are not virtualizable. On the x86, for example, some instructions (such as popf) behave differently in user mode rather than trapping, so a monitor cannot regain control when they are executed.
All of the techniques which we have discussed to run multiple OSes on the same hardware were developed mainly for uniprocessor systems. For large multiprocessor systems, it seems that partitioning is a simpler approach. For a large multiprocessor (SMP or NUMA system), partitioning separates the set of hardware resources into two (or more) disjoint subsets. Each set of resources, with its processors, memory, and I/O devices, is used to run a separate OS. The hypervisor sets up the partitions and starts the OSes. Once the OSes are started, they execute on their own. Each OS has its own set of processors, memory, and I/O devices.
As with the dual-boot case, the system boots up, not to an OS, but to the hypervisor. The hypervisor determines the set of available resources, partitions them, and then boots each OS with its set of resources.
Processors are the units which execute instructions. In most systems, processors are homogeneous: they all execute the same instruction set and are equivalent from an architectural point of view. In a multiprocessor system, each processor may have a unique processor ID, so that one processor can refer to another. Processors are generally fungible: when a processor is needed, any one will do.
Memory is generally homogeneous in SMP systems, although there may be processor affinity in NUMA systems. In addition, there are often some small amounts of special kinds of memory: FLASH or non-volatile RAM. In memory-mapped I/O systems, what may look like memory may actually be I/O register space. In addition, various special memory components, such as caches and TLBs, must be considered. For general memory, again, memory is fungible.
In addition, almost all systems now provide virtual memory. Virtual memory creates a mapping layer between the addresses seen by the program and the actual physical address of the memory. The mapping of virtual addresses to physical addresses allows memory to be allocated to a partition relatively freely, at least to the level of the unit of mapping -- the page.
Memory is thus best treated as an address space: different types of memory may lie behind a memory address -- ordinary RAM, FLASH, non-volatile RAM, or memory-mapped I/O registers. On the PowerPC, for example, a translation control entry (TCE) is used to control DMA addressing by the I/O devices, adding yet another mapping that a partitioning scheme must take into account.
I/O devices are not generally fungible -- each one is unique and must be treated separately. In addition, some I/O devices are actually controllers and/or bus systems which may be needed for other I/O devices. For example, a disk may be attached to a controller, which is on a bus. To drive the disk, it would be necessary to use the bus, and the controller in addition to the disk. While the use of the bus may be transparent to the program, the use of the disk controller may not be. So the I/O system may be a tree of busses, controllers, and devices. Allocating a particular device to a partition may require that its entire subtree be allocated to that partition.
Should such shared busses and controllers be treated as part of the Controlling OS? The common parts can be passed around as partitioned I/O devices until two different OSes need to share one; at that point the device must be virtualized -- owned by the Controlling OS, but virtualized by the hypervisor on its behalf.
This is made even more complex by multi-port devices.
If I/O devices are partitioned by the hypervisor, they can be used only by the OS to which they are assigned. For some devices that may be reasonable. A printer, for example, might be assigned to a particular OS. But how would other OSes running on the same system be able to print? There are several options: the device could be shared, with the hypervisor mediating access; it could be virtualized; or the OS that owns the device could offer its services (such as printing) to the other OSes over a communication channel.
A combination of these approaches may be best. Network connectivity in particular might be provided by virtualization. This could provide the communication between the client OSes to allow remote use of the remaining devices. Communication between the OSes allows devices to be shared.
Communication between the various systems (OSes and hypervisor) will be necessary. There are two main types of communication: between an OS and the hypervisor, and between one OS and another.
Of particular value, however, would be the ability of the hypervisor to carry messages from one OS to another. These messages can be used to provide network access to all OSes, independent of the allocation of the hardware network card.
Messages can be of two types: synchronous and asynchronous. Synchronous messages would be sent from one OS to another, waiting for a reply. Asynchronous messages would be sent without waiting for a reply. Since both OSes are running on the same hardware, shared memory buffers can be used to pass information from one OS to the other quickly, without the cost of copying.
In general, since memory is mapped by the paging hardware of the system, it is possible to map the same pages into the memory allocated to different OSes. These shared memory areas would allow messages to be passed quickly from one OS to another, or for cooperating applications to be run on different OSes which share memory to allow very fast communication.
Once the structure of the hypervisor and its client OSes is understood, we can move from simple static partitioning to dynamic partitioning. In the simplest case this allows OSes to be brought up and shutdown independently. Systems can be shutdown, their resources repartitioned, and new OSes booted with those resources, allowing systems to change to meet dynamic workloads and applications.
A more drastic change would require that the OSes themselves be prepared to add or remove resources dynamically. Some such changes might be relatively easy. Plug and Play technology, for example, might allow I/O devices to be added to or removed from a running OS.
Processors too might be added or removed by relatively minor changes to OS data structures.
Memory might be more difficult. While it may be relatively easy to change an OS to add new memory, most systems would find it difficult to lose memory. The data structures and algorithms to allow large blocks of physical memory to be removed from an OS are probably not in most current OSes.
There are different levels of dynamic partitioning, depending upon the degree of sophistication of the client OSes.
In the classical hypervisor structure, all partitioning and resource allocation decisions are made by the hypervisor. The hypervisor is responsible for defining the initial partitions and loading and booting the OSes in their partitions. In addition, if the partitions can be modified, for example under operator control, the hypervisor must accept the operator commands, parse them, determine what changes are necessary, and cause those changes to occur.
The need to load and boot the OSes means that the hypervisor must understand the file system structure in order to find the OS images, and must understand the format of the boot images in order to load the OSes into memory. Accepting operator commands requires an I/O capability to, for example, a serial line or display terminal. The result is a relatively "fat" hypervisor, with substantial code for device drivers, file systems, image loading, and the user interface.
An alternative hypervisor structure is possible which contains only limited mechanism and very little code that is not directly needed for a hypervisor. The idea is to design a hypervisor interface which allows an external program to make all policy decisions about how the system is to be partitioned -- a separation of mechanism (which remains in the hypervisor) and policy (which is determined external to the hypervisor). Thus the hypervisor is more like a micro-kernel.
The key point is that the hypervisor can be made to have very little policy content, and only mechanism. All of the interesting work is done in a "controlling OS" which has a number of "client" (or "child") OSes.
When the system is powered on, the hypervisor boots, and in turn boots and transfers control to the "Controlling OS". The Controlling OS is given the use of the entire system -- all memory, processors, devices. If the Controlling OS does nothing more, we have a system just like all the existing systems -- one OS running with all resources.
In the new design, however, the Controlling OS also has the programming to create new client OS systems. These client OSes may be the same or different systems. The Controlling OS decides how much memory, how many processors, and which devices are to be allocated to the client OS. It loads the OS code into the allocated memory and sets up the boot information (a device tree or other information, as necessary for the client OS).
Once everything is set up, the Controlling OS informs the hypervisor of the new OS -- assigns it an ID, defines its memory, processor, and device information, and tells the hypervisor to run it. The hypervisor transfers the resources from the Controlling OS to the client OS. The hypervisor now has two running OSes: the original Controlling OS and the new client OS.
The Controlling OS can continue partitioning its resources and create new Client OSes as needed. The Controlling OS is the source of all deliberate actions which change the partitioning of the system.
This design makes it clear that the hypervisor is only mechanism. It is in charge of starting the Controlling OS, and then simply implements what it is told by the Controlling OS. All policy decisions are made by the Controlling OS. The decisions as to how many client OSes to run and how the system is partitioned can be driven by a table in a file system of the Controlling OS, or by interactive commands from a user.
This creates two APIs to the hypervisor. One API is the client OS API. This is the flow of information between the hypervisor and the client OS which allows the client OS to run in the hypervisor based system: page faults, I/O interrupts, and so on.
A second API to the hypervisor is the interface between the Controlling OS and the hypervisor. This allows the Controlling OS to define new client OS, to allocate (and deallocate) memory, processors, and devices to the client OS.
This is a different design from a more classical hypervisor design of a collection of Peer OSes running on the hypervisor. In that approach, the hypervisor itself must bring up a group of Peer OSes which then have to organize themselves. The new design makes it very clear which part of the system is responsible for what.
Strict partitioning with a hypervisor allows a large system to run multiple operating systems at the same time, by partitioning resources and allocating processors, memory, and I/O devices to the various OSes. Even for a large system, there may not be as many I/O devices as there are OSes, however, and so those devices may be either shared or virtualized.
Memory and processors can also be virtualized by the hypervisor. Memory can be virtualized by use of the paging hardware; processors by cpu scheduling techniques. Such techniques would allow a hypervisor to run more OSes than would otherwise be possible. Without processor virtualization, a hypervisor could not run more OSes than there are processors; each OS needs at least one processor.
Notice that while it is possible to virtualize memory, by using paging or swapping in the hypervisor, these same techniques may be in use by the OSes themselves. Thus, we may have problems with "double paging" -- the hypervisor bringing in a page only for the OS to page it out (or vice versa). Thus it is probably unwise to virtualize memory at the hypervisor.
Virtualizing processors is also fraught with difficulty. A timer is needed to allow the hypervisor to interrupt and schedule the processor between multiple OSes, since the client OS is almost certainly also scheduling the CPU. In addition, while using CPU scheduling to virtualize the processor allows it to be shared, there are important assumptions about the passage of time that may be made by the client OS which are not true when the processor is virtualized. This may particularly be true for a real-time OS.
OSes and applications may attempt to prevent preemption to provide safe access to their data structures by turning off interrupts, at least for short periods of time. If the processor is being shared among multiple OSes, it is not possible to turn off the interrupt system for preemption control. Other techniques will be needed that do not interfere with CPU scheduling by the hypervisor.
A similar issue arises for pinned pages in memory. Pinning a page means only that the OS will not attempt to page it out; an OS may pin pages for performance, or because they are the target of DMA from an I/O device (DMA uses physical addresses). The hypervisor, however, is unable to tell pinned pages from unpinned pages. This matters for dynamic partitioning of memory.
The hypervisor will also need to interact with interrupts from I/O devices. If each OS has its processor (or set of processors), interrupts may be able to be directed directly to the processors for each OS. However if a processor is being shared amongst multiple OSes, an interrupt cannot be directed by the hardware to an OS that is not running. Rather the interrupt will need to be intercepted by the hypervisor and redirected to the appropriate OS, even if that OS may not be currently running. This may require that interrupts be queued by the hypervisor until the appropriate OS can be scheduled.
In addition, we must consider the effects of switching the CPU away from one OS to another. For example, performance counter registers, such as those on the x86, must be saved and restored, or they may give misleading information. Any information about timing may also be wrong: reading the time-of-day clock or an interval timer to determine how long a sequence of code takes may give an incorrect answer, even in the kernel with interrupts disabled.
Actions taken to prepare caches or TLBs may be less than useful. Note that multiple systems running under a hypervisor are not exactly like multiple physical systems: there is still one memory system, with coherence between processors that are only "logically" distinct. With processor sharing this is even more of a problem.
Consider a system that runs thousands of OSes (for example, the Virtual Image Facility, which allows thousands of instances of Linux to run). Processors must be shared, and similarly memory must be shared. If each of 1000 OSes needs 100 Mbytes of memory, strictly partitioning the memory would require 100 Gbytes. It is therefore likely that we will need to swap (or page) entire OSes in and out. The same holds for I/O devices -- they will need to be virtualized.
One approach is to create a new mode of operation: hypervisor mode, in addition to the system and user modes which are normally part of the system architecture. This allows the processor to distinguish the sets of instructions and registers which are available in hypervisor, system, or user mode. The alternative is running the OS in user mode, which then becomes the same as a virtual machine or direct-execution simulation.
The difficulty with partitioning is the enforcement of the partitions. For example, assume a system has 3 Gbytes of physical memory and this is partitioned into a 2-Gbyte chunk and a 1-Gbyte chunk for two OSes: one OS gets addresses from 0 to 2G, the other from 2G to 3G. One issue is how the OSes determine how much memory they have. A typical SWOS will perform discovery as it boots and initializes its operations. For memory this may mean stepping through memory, accessing one address after another to determine which addresses are valid and which are invalid. A COS, on the other hand, can be told how much memory it has and will configure its memory usage to the addresses it is told to use.
Pulling discovery out of the OS allows it to be run separately. Some systems run discovery in BIOS or Open Firmware.
Resources are presented to the OS via a boot-time data structure. A hypervisor can interpose between the BIOS/firmware and the OS, transforming the resource description data structure to present only the resources of the partition.
Notice, however, that with a COS and separate discovery, while it is possible to design and run a system with partitioned resources, this is not sufficient for all purposes. Since all memory remains accessible to all OSes, a fault in one OS can cause it to modify a memory location outside its partition. The only thing that prevents incorrect memory modification is the correct operation of the OS; in the event of a failure, this cannot be relied upon. Thus, for server consolidation and fault containment, it is necessary to enforce the separation of memory (and other resource) access in hardware.
Hardware Support:
- Processor modes: user, kernel, hypervisor
- Memory management, paging
- I/O devices -- I/O commands, memory mapping
- API to request and release resources
- Interrupts: redirection; special addresses (interrupt vectors, DMA)
- Re-mapping of resources
- Plug and Play
- Multi-port devices
- Bandwidth
- DMA, buffer management, real addresses
- Power management
- IP addresses

Design Goals:
- Isolation of operating systems
- Protection of access
- Dynamic partitions
- Virtualizing vs. partitioning processors -- partition or share (allocate processor-time, not a processor)

Hypervisor design and code:
- Linux boot
- Trusted systems, non-trusted systems

Related work:
- Relationship to virtual machines
- VM/370 -- CP (control program, the hypervisor) and CMS (the user operating system)

(from http://www.linuxplanet.com/linuxplanet/reports/2127/3/)

Virtual Image Facility: A Cheaper VM

The second new product from IBM is called Virtual Image Facility, or VIF. Readers of my first S/390 article (http://linuxplanet.com/linuxplanet/reports/1532/1/) will recall that IBM's Virtual Machine (VM) hypervisor allows many hundreds--or even a few thousand--instances of Linux to run on a single physical CPU or LPAR. This is great for sites that already have a VM license for other applications, but VM is expensive, and Linux customers had trouble cost-justifying the VM purchase. In response, IBM will announce the Virtual Image Facility as a low-end alternative to VM. VM is more than just a hypervisor to allocate virtualized resources; it is a full-blown operating system that runs large-scale applications. With VIF, IBM has stripped off the general-use parts of VM, leaving only the hypervisor core and some simple management tools, and reducing the price accordingly. VIF isn't nearly as versatile as VM: you can't run OS390 inside VIF, for example, nor can VIF run inside VIF. On the other hand, VIF is a one-time license priced at around US$20,000, a fraction of the ongoing software lease price for VM itself.
VMWare

The third and most spectacular option to run Windows applications under Linux is most definitely VMWare! Old-timers like Doctor Unix remember an IBM operating system way back in the 80's called VM (Virtual Machine). One of VM's smashing features (and more precisely, its main function) was the possibility to run multiple virtual machines, each of which could run its own operating system. So owners of an IBM mainframe could run multiple instances of OS's like VSE and MVS, each of which thought that it had the entire machine to itself. Whenever the guest OS did something like accessing the hardware, the VM hypervisor intervened and executed the instruction on behalf of the guest OS. This came with such obscure technologies as "double paging", simulated supervisor mode, and other cool operating system internals. Later, IBM more or less dumped VM in favor of a hardware feature called PR/SM (pronounced "Prism") which did more or less the same, but buried in hardware and microcode.

One of VM's most spectacular features was its crash handling: if VM for one reason or another panicked head to toe, it did a "control block swap", and one of the guest operating systems (the so-called Preferred Guest) was then unleashed onto the real hardware (without it even noticing), and could therefore survive the hypervisor crash.

Fortunately, somebody paid attention to IBM's work here, and a couple of guys created a VM for PCs (named VMWare), allowing multiple operating systems to run in a virtual machine that is entirely under VMWare's control. Now, this has to be seen to be believed! VMWare allows multiple operating systems to run under control of VMWare's hypervisor. So, you can for instance run Microsoft Windows 95, 98, or NT under VMWare on Linux. It works truly beautifully, emulating things at the lowest level, Phoenix BIOS and all! In my modest opinion, VMWare is the best option for true and complete Windows capabilities under Linux.
But there are disadvantages as well. VMWare is not free, it is not very fast (there are performance losses across the board), and you need a separate Microsoft Windows license to legally run Windows under VMWare. But, all in all, surely an option worth considering! (Sidestep: along the same lines, a group of people is trying to write a free version of this technology, called FreeMWare.) My only argument with VMWare is that perhaps it works too well! I mean, if VMWare runs really, really well, and most people that run Linux run VMWare for Windows compatibility, what is the incentive for software developers to bring out Unix/Linux versions of their applications? VMWare's developers might be obstructing the revolution, and as it goes with anti-revolutionary types, when we finally take over they unfortunately will have to be put against the wall and shot. My personal optimum would be if VMWare ran just well enough to convert people to Linux, but badly enough to make them want to install a Linux-native version of their applications. Oh, and by the way, VMWare is not a completely new idea even in the Unix space. A company named Insignia specialised in creating virtual PCs some time ago with products like SoftWindows95, which even ran under Solaris and HP-UX. But since these applications had to emulate the entire Pentium instruction set on a RISC chip, they ran so horribly slowly that it almost defied belief. VMWare's advantage here is that it runs on the same chip as the guest operating system.

Hypervisors on IBM systems:
- z-series (S/390): Linux Virtual Image Facility -- Linux on VM
- i-series (AS/400)
- p-series (RS/6000): http://w3.austin.ibm.com/:/projects/firmware/doc/lpar/designdocs.html

Related work:
- Bressoud and Schneider, "Hypervisor-Based Fault Tolerance", ACM Transactions on Computer Systems, volume 14, issue 1, February 1996
- "Disco: Running Commodity Operating Systems on Scalable Multiprocessors", ACM Transactions on Computer Systems, volume 15, number 4, November 1997, pages 412-447
- "Fluke -- Microkernels Meet Recursive Virtual Machines", Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 28-31, 1996
- T. Mitchem, R. Lu, and R. O'Brien, "Using Kernel Hypervisors to Secure Applications", Annual Computer Security Applications Conference, December 1997
- http://denali.cs.washington.edu/relwork/relwork.html
- http://www.eros-os.org/design-notes/IA32-Emulation.html