Friday, October 9, 2020

Requirements for out-of-process device emulation

Over the past months I have participated in discussions about out-of-process device emulation. This post describes the requirements that have become apparent. I hope this will be a useful guide to understanding the big picture about out-of-process device emulation.

What is out-of-process device emulation?

Device emulation is traditionally implemented in the program that executes guest code. This approach is natural because accesses to device registers are trapped as part of the CPU run loop that sits at the core of an emulator or virtual machine monitor (VMM).

In some use cases it is advantageous to perform device emulation in separate processes. For example, software-defined network switches can minimize data copies by emulating network cards directly in the switch process. Out-of-process device emulation also enables privilege separation and tighter sandboxing for security.

Why are these requirements important?

When emulated devices are implemented in the VMM they use common VMM APIs. Adding new devices is relatively easy because the APIs are already there and the developer can focus on the device specifics. Out-of-process device emulation potentially leaves developers without APIs since the device emulation program is a separate program that literally starts from main(). Developers want to focus on implementing their specific device, not on solving general problems related to out-of-process device emulation infrastructure.

It is not only a lot of work to implement an out-of-process device completely from scratch, but there is also a risk of developing the wrong solution because some subtleties of device emulation are not obvious at first glance.

I hope sharing these requirements will help in the creation of common infrastructure so it's easy to implement high-quality out-of-process devices.

Not all use cases have the full set of requirements. Therefore it's best if requirements are addressed in separate, reusable libraries so that device implementors can pick the ones that are relevant to them.

Device emulation

Device resources

Devices provide resources that drivers interact with such as hardware registers, memory, or interrupts. The fundamental requirement of out-of-process device emulation is exposing device resources.

The following types of device resources are needed:

Synchronous MMIO/PIO accesses

The most basic device emulation operation is the hardware register access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO) access to the device. A read loads a value from a device register. A write stores a value to a device register. These operations are synchronous because the vCPU is paused until completion.
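
As a rough sketch of what a synchronous register access looks like on the device emulation side, the following Python models a small register bank. The RegisterBank class and its method names are hypothetical and not taken from any existing library. In a real setup the accesses would arrive from the VMM over the out-of-process protocol and the reply would have to be sent before the vCPU resumes.

```python
class RegisterBank:
    """Hypothetical little-endian register file for a toy device."""

    def __init__(self, size):
        self.regs = bytearray(size)

    def _check(self, offset, length):
        # Reject access sizes and offsets the device does not support.
        if length not in (1, 2, 4, 8):
            raise ValueError("unsupported access size")
        if offset < 0 or offset + length > len(self.regs):
            raise ValueError("access outside register bank")

    def read(self, offset, length):
        # Synchronous: the caller waits for the returned value.
        self._check(offset, length)
        return int.from_bytes(self.regs[offset:offset + length], "little")

    def write(self, offset, length, value):
        # Synchronous: the caller waits until the store has taken effect.
        self._check(offset, length)
        self.regs[offset:offset + length] = value.to_bytes(length, "little")


bank = RegisterBank(16)
bank.write(0x4, 4, 0xDEADBEEF)
value = bank.read(0x4, 4)
low_half = bank.read(0x4, 2)
```

The bounds and size checks matter because the access parameters come from the other side of a process boundary.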

Asynchronous doorbells

Devices often have doorbell registers, allowing the driver to inform the device that new requests are ready for processing. The vCPU does not need to wait since the access is a posted write.

The kvm.ko ioeventfd mechanism can be used to implement asynchronous doorbells.

Shared device memory

Devices may have memory-like regions that the CPU can access (such as PCI Memory BARs). The device emulation process therefore needs to share a region of its memory space with the VMM so the guest can access it. This mechanism also allows device emulation to busy wait (poll) instead of using synchronous MMIO/PIO accesses or asynchronous doorbells for notifications.
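
The sharing itself can be sketched with two mappings of the same file. This is only an illustration within a single process: in practice the backing file descriptor would be passed between the device emulation process and the VMM, and each side would map it separately. The variable names are made up for this example.

```python
import mmap
import tempfile

REGION_SIZE = mmap.PAGESIZE

with tempfile.TemporaryFile() as backing:
    backing.truncate(REGION_SIZE)

    # Two shared mappings of the same file stand in for the device
    # emulation process and the VMM.
    device_view = mmap.mmap(backing.fileno(), REGION_SIZE)
    vmm_view = mmap.mmap(backing.fileno(), REGION_SIZE)

    # A store through one mapping is visible through the other.
    device_view[0:4] = b"\x01\x00\x00\x00"
    seen_by_vmm = bytes(vmm_view[0:4])

    device_view.close()
    vmm_view.close()
```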

Direct Memory Access (DMA)

Devices often require read and write access to a memory address space belonging to the CPU. This allows network cards to transmit packet payloads that are located in guest RAM, for example.

Early out-of-process device emulation interfaces simply shared guest RAM. This allowed DMA to any guest physical memory address. More advanced IOMMU and address space identifier mechanisms are now becoming ubiquitous. Therefore, new out-of-process device emulation interfaces should incorporate IOMMU functionality.

The key requirement for IOMMU mechanisms is allowing the VMM to grant access to a region of memory so the device emulation process can read from and/or write to it.
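
One way to picture this requirement is a table of granted regions that the device emulation process consults before every DMA. The sketch below is hypothetical (Grant, DmaMap, and the field names are invented here) and ignores the actual translation to a mapped address; it only shows the grant and permission check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    iova: int        # start address as seen by the device
    size: int
    readable: bool
    writable: bool


class DmaMap:
    """Hypothetical table of regions the VMM has granted to the device."""

    def __init__(self):
        self.grants = []

    def map(self, grant):
        self.grants.append(grant)

    def unmap(self, iova):
        self.grants = [g for g in self.grants if g.iova != iova]

    def allowed(self, iova, length, write):
        # The access must fall entirely inside one granted region and
        # match that region's permissions.
        if length <= 0:
            return False
        for g in self.grants:
            if g.iova <= iova and iova + length <= g.iova + g.size:
                return g.writable if write else g.readable
        return False


dma = DmaMap()
dma.map(Grant(iova=0x1000, size=0x1000, readable=True, writable=False))
can_read = dma.allowed(0x1800, 0x100, write=False)
can_write = dma.allowed(0x1800, 0x100, write=True)
crosses_end = dma.allowed(0x1F00, 0x200, write=False)
```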

Interrupts

Devices notify the CPU using interrupts. An interrupt is simply a message sent by the device emulation process to the VMM. Interrupt configuration is flexible on modern devices, meaning the driver may be able to select the number of interrupts and a mapping between event sources and interrupts (for example, sharing one interrupt among multiple event sources). This can be implemented using the Linux eventfd mechanism or via in-band device emulation protocol messages, for example.
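
The flexible mapping can be sketched as a small routing table. InterruptRouter and the event source names are hypothetical. Delivery to the VMM is reduced to recording a pending vector here, whereas a real implementation would signal an eventfd or send a protocol message at that point.

```python
class InterruptRouter:
    """Hypothetical mapping from device event sources to interrupts."""

    def __init__(self, num_vectors):
        self.num_vectors = num_vectors
        self.routes = {}
        self.pending = set()

    def route(self, source, vector):
        # Called when the driver programs the interrupt configuration.
        if not 0 <= vector < self.num_vectors:
            raise ValueError("no such interrupt vector")
        self.routes[source] = vector

    def raise_event(self, source):
        # Returns the vector that was raised, or None if unrouted.
        vector = self.routes.get(source)
        if vector is not None:
            self.pending.add(vector)
        return vector


router = InterruptRouter(num_vectors=2)
router.route("rx", 0)
router.route("tx", 0)        # two event sources share one interrupt
router.route("config", 1)
rx_vector = router.raise_event("rx")
tx_vector = router.raise_event("tx")
```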

Extensibility for new bus types

It should be possible to support multiple bus types. vhost-user only supports vhost devices. VFIO is more extensible but currently focussed on PCI devices. It is likely that QEMU SysBus devices will be desirable for implementing ad-hoc out-of-process devices (especially for System-on-Chip target platforms).

Bus-level APIs, not protocol bindings

Developers should not need to learn the out-of-process device emulation protocol (vfio-user, etc). APIs should focus on bus-level concepts such as defining VIRTIO or PCI devices rather than protocol bindings for dealing with protocol messages, file descriptor passing, and shared memory.

In other words, developers should be thinking in terms of the problem domain, not worrying about how out-of-process device emulation is implemented. The protocol should be hidden behind bus-level APIs.

Multi-threading support from the beginning

Threading issues arise often in device emulation because asynchronous requests or multi-queue devices can be implemented using threads. Therefore it is necessary to clearly document what threading models are supported and how device lifecycle operations like reset interact with in-flight requests.

Live migration, live upgrade, and crash recovery

There are several related issues around device state and restarting the device emulation program without disrupting the guest.

Live migration

Live migration transfers the state of a device from one device emulation process to another (typically running on another host). This requires the following functionality:

Quiescing the device

Some devices can be live migrated at any point in time without any preparation, while others must be put into a quiescent state to avoid issues. An example is a storage controller that has a write request in flight. It is not safe to live migrate until the write request has completed or been canceled. Failure to wait might result in data corruption if the write takes effect after the destination has resumed execution.

Therefore it is necessary to quiesce a device. After this point there is no further device activity and no guest-visible changes will be made by the device.
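
A minimal sketch of the storage controller example, assuming a single-threaded device with hypothetical method names: quiescing stops new requests from being accepted, and the device only counts as quiescent once every in-flight request has completed.

```python
class ToyController:
    """Hypothetical storage controller that tracks in-flight requests."""

    def __init__(self):
        self.accepting = True
        self.inflight = set()
        self.next_id = 0

    def submit(self):
        if not self.accepting:
            raise RuntimeError("device is quiescing")
        req_id = self.next_id
        self.next_id += 1
        self.inflight.add(req_id)
        return req_id

    def complete(self, req_id):
        self.inflight.discard(req_id)

    def begin_quiesce(self):
        # From now on no new requests are started.
        self.accepting = False

    def is_quiescent(self):
        # Safe to migrate only when nothing can still touch guest state.
        return not self.accepting and not self.inflight


ctrl = ToyController()
write_req = ctrl.submit()
ctrl.begin_quiesce()
before = ctrl.is_quiescent()   # the write is still in flight
ctrl.complete(write_req)
after = ctrl.is_quiescent()
```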

Saving/loading device state

It must be possible to save and load device state. Device state includes the contents of hardware registers as well as device-internal state necessary for resuming operation.

It is typically necessary to determine whether the device emulation processes on the migration source and destination are compatible before attempting migration. This avoids migration failure when the destination tries to load the device state and discovers it doesn't support it. It may be desirable to support loading device state that was generated by a different implementation of the same device type (for example, two virtio-net implementations).
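
The save/load and compatibility steps might look roughly like this. The JSON layout, the "toy-nic" device type, and the version numbers are assumptions made for the example and do not describe any existing migration format.

```python
import json

DEVICE_TYPE = "toy-nic"        # hypothetical device type name
STATE_VERSION = 2              # state version written by this program
LOADABLE_VERSIONS = {1, 2}     # state versions this program can load


def save_state(registers, internal):
    # Serialize hardware registers plus device-internal state.
    return json.dumps({
        "device_type": DEVICE_TYPE,
        "version": STATE_VERSION,
        "registers": registers,
        "internal": internal,
    }).encode()


def is_compatible(device_type, version):
    # Can be checked before migration starts, not only at load time.
    return device_type == DEVICE_TYPE and version in LOADABLE_VERSIONS


def load_state(blob):
    state = json.loads(blob)
    if not is_compatible(state.get("device_type"), state.get("version")):
        raise ValueError("incompatible device state")
    return state["registers"], state["internal"]


blob = save_state({"ctrl": 1, "mac": "52:54:00:12:34:56"}, {"rx_index": 7})
registers, internal = load_state(blob)
```

Tagging the state with a device type and version is what lets the source and destination be compared up front instead of failing halfway through the load.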

Dirty memory logging

Pre-copy live migration starts with an iterative phase where dirty memory pages are copied from the migration source to the destination host. Devices need to participate in dirty memory logging so that all written pages are transferred to the destination and no pages are "missed".
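
Device participation can be sketched as a per-page bitmap that the device updates on every DMA write and that the migration code periodically fetches and clears. DirtyLog and the 4 KiB page size are assumptions for this example.

```python
PAGE_SIZE = 4096   # assumed page size for this sketch


class DirtyLog:
    """Hypothetical dirty page bitmap, one bit per guest page."""

    def __init__(self, ram_size):
        self.num_pages = (ram_size + PAGE_SIZE - 1) // PAGE_SIZE
        self.bitmap = bytearray((self.num_pages + 7) // 8)

    def mark(self, addr, length):
        # Called by the device for every DMA write it performs.
        if length <= 0:
            return
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        if first < 0 or last >= self.num_pages:
            raise ValueError("write outside guest RAM")
        for page in range(first, last + 1):
            self.bitmap[page // 8] |= 1 << (page % 8)

    def fetch_and_clear(self):
        # Called once per pre-copy iteration by the migration code.
        dirty = [p for p in range(self.num_pages)
                 if self.bitmap[p // 8] & (1 << (p % 8))]
        self.bitmap = bytearray(len(self.bitmap))
        return dirty


log = DirtyLog(ram_size=16 * PAGE_SIZE)
log.mark(4090, 10)                # a write that straddles pages 0 and 1
first_pass = log.fetch_and_clear()
second_pass = log.fetch_and_clear()
```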

Crash recovery

If the device emulation process crashes it should be possible to restart it and resume device emulation without disrupting the guest (aside from a possible pause during reconnection).

Doing this requires maintaining device state (contents of hardware registers, etc) outside the device emulation process. This way the state remains even if the process crashes and it can be resumed when a new process starts.
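
As one possible illustration, the sketch below keeps register contents in a file that is rewritten atomically on every update. Using a file is an assumption of this example; state could equally live in memory owned by the VMM or another process. The DurableRegisters class is hypothetical, and the crash is simulated by discarding the object.

```python
import json
import os
import tempfile


class DurableRegisters:
    """Hypothetical register store that survives a process restart."""

    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            with open(path) as f:
                self.regs = json.load(f)   # resume from previous state
        else:
            self.regs = {}

    def write(self, name, value):
        self.regs[name] = value
        # Write to a temporary file and rename so that a crash never
        # leaves a half-written state file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.regs, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)


with tempfile.TemporaryDirectory() as state_dir:
    path = os.path.join(state_dir, "device-state.json")
    first = DurableRegisters(path)
    first.write("ctrl", 0x3)
    del first                          # stand-in for the process crashing
    second = DurableRegisters(path)    # the restarted process
    recovered = second.regs["ctrl"]
```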

Live upgrade

It must be possible to upgrade the device emulation process and the VMM without disrupting the guest. Upgrading the device emulation process is similar to crash recovery in that the process terminates and a new one resumes with the previous state.

Device versioning

The guest-visible aspects of the device must be versioned. In the simplest case the device emulation program would have a --compat-version=N command-line option that controls which version of the device the guest sees. When guest-visible changes are made to the program the version number must be increased.

Giving the user explicit control over the guest-visible device behavior makes it possible to save/load and live migrate reliably. Otherwise loading device state in a newer device emulation program could affect the running guest. Guest drivers typically are not prepared for the device to change underneath them and doing so could result in guest crashes or data corruption.
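
A --compat-version option could be wired up along these lines. The version numbers and feature names are invented for the example; the point is only that the guest-visible feature set is selected by the version rather than by whichever program binary happens to be installed.

```python
import argparse

LATEST_VERSION = 3

# Hypothetical guest-visible feature sets, one per device version.
FEATURES = {
    1: {"csum"},
    2: {"csum", "tso"},
    3: {"csum", "tso", "mq"},
}


def parse_args(argv):
    parser = argparse.ArgumentParser(description="toy device backend")
    parser.add_argument("--compat-version", type=int,
                        choices=sorted(FEATURES), default=LATEST_VERSION,
                        help="device version the guest should see")
    return parser.parse_args(argv)


def guest_visible_features(version):
    return FEATURES[version]


args = parse_args(["--compat-version=2"])
features = guest_visible_features(args.compat_version)
```

A newer program launched with --compat-version=2 then presents the same device as the older program did, so state saved by one can be loaded by the other without the guest noticing.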

Security

The trust model

The VMM must not trust the device emulation program. This is key to implementing privilege separation and the principle of least privilege. If a compromised device emulation program is able to gain control of the VMM then out-of-process device emulation has failed to provide isolation between devices.

The device emulation program must not trust the VMM to the extent that this is possible. For example, it must validate inputs so that the VMM cannot gain control of the device emulation process through memory corruptions or other bugs. This makes it so that even if the VMM has been compromised, access to device resources and associated system calls still requires further compromising the device emulation process.

Unprivileged operation

The device emulation program should run unprivileged to the extent that this is possible. If special permissions are required to access hardware resources then these resources can sometimes be provided via file descriptor passing by a more privileged parent process.
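
File descriptor passing over a UNIX domain socket can be sketched with Python's socket.send_fds and socket.recv_fds (Python 3.9 or later, UNIX only). Both ends live in one process here purely for illustration, and a temporary file stands in for the privileged resource. In practice the sender would be the privileged parent and the receiver the unprivileged device emulation program.

```python
import os
import socket
import tempfile

parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The "privileged" side opens the resource and sends only the descriptor.
with tempfile.TemporaryFile() as resource:
    resource.write(b"device resource")
    resource.flush()
    socket.send_fds(parent, [b"fd"], [resource.fileno()])

# The "unprivileged" side receives a descriptor it could not have
# opened itself and uses it directly.
msg, fds, _flags, _addr = socket.recv_fds(child, 16, 1)
received = os.pread(fds[0], 15, 0)

os.close(fds[0])
parent.close()
child.close()
```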

Sandboxing

Operating system sandboxing mechanisms can be applied more effectively to device emulation processes than to monolithic VMMs. Seccomp can limit the Linux system calls that may be invoked. SELinux can restrict access to system resources.

Sandboxing is a common task that most device emulation programs need. Therefore it is a good candidate for a library or launcher tool that is shared by device emulation programs.

Management

Command-line interface

A common command-line interface should be defined where possible. For example, vhost-user's standard --socket-path=PATH argument makes it easy to launch any vhost-user device backend. Protocol-specific options (e.g. socket path) and device type-specific options (e.g. virtio-net) can be standardized.

Some options are necessarily specific to the device emulation program and therefore cannot be standardized.

The advantage of standard options is that management tools like libvirt can launch the device emulation programs without further user configuration.
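
On the device emulation program side, supporting a standard option can be as small as the following sketch. Only --socket-path comes from the text above; the --queue-size option is a made-up example of a program-specific option, and the socket is created in a temporary directory just so the example is self-contained.

```python
import argparse
import os
import socket
import tempfile


def parse_args(argv):
    parser = argparse.ArgumentParser(description="toy device backend")
    # Standard option that a management tool can rely on.
    parser.add_argument("--socket-path", required=True, metavar="PATH")
    # Hypothetical program-specific option that cannot be standardized.
    parser.add_argument("--queue-size", type=int, default=256)
    return parser.parse_args(argv)


def listen(path):
    # Wait for the VMM to connect on a UNIX domain socket.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(1)
    return sock


with tempfile.TemporaryDirectory() as run_dir:
    args = parse_args(["--socket-path=" + os.path.join(run_dir, "dev.sock")])
    server = listen(args.socket_path)
    listening = os.path.exists(args.socket_path)
    server.close()
```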

RPC interface

It may be necessary to issue commands at runtime. Examples include adjusting throttling limits, enabling/disabling logging, etc. These operations can be performed over an RPC interface.

Various RPC interfaces are used throughout open source virtualization software. Adopting a widely-used RPC protocol and standardizing commands is beneficial because it makes the software easy to communicate with and lets management tools support it with relatively little effort.
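
To make the idea of runtime commands concrete, here is a minimal sketch of a JSON command handler. The message layout ("execute", "arguments", "return", "error") and the command names are assumptions chosen for illustration and are not a proposal for a specific protocol.

```python
import json


class ToyDevice:
    def __init__(self):
        self.iops_limit = 0      # 0 means unlimited
        self.logging = False


def _error(reason):
    return json.dumps({"error": reason})


def handle_command(device, line):
    # Parse and validate the request before touching device state.
    try:
        request = json.loads(line)
        name = request["execute"]
        args = request.get("arguments", {})
    except (ValueError, KeyError, TypeError, AttributeError):
        return _error("malformed request")
    if not isinstance(args, dict):
        return _error("malformed request")

    if name == "set-iops-limit":
        limit = args.get("limit")
        if type(limit) is not int or limit < 0:
            return _error("invalid limit")
        device.iops_limit = limit
    elif name == "set-logging":
        enabled = args.get("enabled")
        if type(enabled) is not bool:
            return _error("invalid flag")
        device.logging = enabled
    else:
        return _error("unknown command")
    return json.dumps({"return": {}})


dev = ToyDevice()
reply = handle_command(
    dev, '{"execute": "set-iops-limit", "arguments": {"limit": 500}}')
bad_reply = handle_command(dev, "not json")
```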

Conclusion

This was largely a brain dump but I hope it is useful food for thought as out-of-process device emulation interfaces are designed and developed. There is a lot more to it than simply implementing a protocol for device register accesses and guest RAM DMA. Developing open source libraries in Rust and C that can be used as needed will ensure that out-of-process devices are high-quality and easy for users to deploy.