Friday, 4 March 2016

Slides available for "NFS over virtio-vsock: Host/guest file sharing for virtual machines"

I have been working on the virtio-vsock host/guest communications mechanism and the first application is file sharing with NFS. NFS is a mature and stable network file system which can be reused to provide host/guest file sharing in KVM.

If you are interested in learning more, check out the slides from my Connectathon 2016 presentation:

NFS over virtio-vsock: Host/guest file sharing for virtual machines

Wednesday, 2 March 2016

QEMU accepted into Google Summer of Code & Outreachy 2016!

I'm delighted to announce that QEMU has been accepted into Google Summer of Code 2016 and Outreachy May-August 2016.

Both GSoC and Outreachy are 12-week full-time remote work internships. Interns work on a QEMU-related project with the support of a mentor from the QEMU community. Interns are paid for their work thanks to generous funding from Google (GSoC), Red Hat (Outreachy), and/or IBM (Outreachy).

Find out more about the project ideas that have been proposed:
http://qemu-project.org/Google_Summer_of_Code_2016
http://qemu-project.org/Outreachy_2016_MayAugust

There are two dedicated IRC channels, #qemu-gsoc and #qemu-outreachy, that candidates can use to discuss project ideas with mentors and ask questions.

Good luck to all applicants!

Friday, 8 January 2016

QEMU Internals: How guest physical RAM works

Memory is one of the key aspects of emulating computer systems. Inside QEMU the guest RAM is modelled by several components that work together. This post gives an overview of the design of guest physical RAM in QEMU 2.5 by explaining the most important components without going into all the details. After reading this post you will know enough to dig into the QEMU source code yourself.

Note that guest virtual memory is not covered here since it deserves its own post. KVM virtualization relies on hardware memory translation support and does not use QEMU's software MMU.

Guest RAM configuration

The QEMU command-line option -m [size=]megs[,slots=n,maxmem=size] specifies the initial guest RAM size as well as the maximum guest RAM size and number of slots for memory chips (DIMMs).
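
For example, the following command line (the sizes and slot count are illustrative) starts a guest with 1 GB of RAM and leaves room to hot-add memory later, up to a total of 4 GB spread across 3 additional DIMM slots:

qemu-system-x86_64 -m size=1G,slots=3,maxmem=4G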

The reason for the maximum size and slots is that QEMU emulates DIMM hotplug so the guest operating system can detect when new memory is added or removed using the same mechanism as on real hardware. This involves plugging a DIMM into a slot or unplugging it, just like on a physical machine. In other words, the amount of available memory isn't changed in byte units; it is changed by altering the set of DIMMs plugged into the emulated machine.

Hotpluggable guest memory

The "pc-dimm" device (hw/mem/pc-dimm.c) models a DIMM. Memory is hotplugged by creating a new "pc-dimm" device. Although the name includes "pc" this device is also used with ppc and s390 machine types.

As a side-note, the initial RAM that the guest started with might not be modelled with a "pc-dimm" device and it can't be unplugged.

The guest RAM itself isn't contained inside the "pc-dimm" object. Instead the "pc-dimm" must be associated with a "memory-backend" object.
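
As a rough sketch (the id values mem1 and dimm1 are arbitrary names picked for this example), hotplugging memory from the QEMU monitor means creating the memory backend first and then the "pc-dimm" device that uses it:

(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

This only works if the machine was started with spare slots and maxmem headroom as described in the previous section.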

Memory backends

The "memory-backend" device (backends/hostmem.c) contains the actual host memory that backs guest RAM. This can either be anonymous mmapped memory or file-backed mmapped memory. File-backed guest RAM allows Linux hugetlbfs usage for huge pages on the host and also shared-memory so other host applications can access to guest RAM.

The "pc-dimm" and "memory-backend" objects are the user-visible parts of guest RAM in QEMU. They can be managed using the QEMU command-line and QMP monitor interface. This is just the tip of the iceberg though because there are still several aspects of guest RAM internal to QEMU that will be covered next.

RAM blocks and the ram_addr_t address space

Memory inside a "memory-backend" is actually mmapped by RAMBlock through qemu_ram_alloc() (exec.c). Each RAMBlock has a pointer to the mmap memory and also a ram_addr_t offset. This ram_addr_t offset is interesting because it is in a global namespace so the RAMBlock can be looked up by the offset.

The ram_addr_t namespace is different from the guest physical memory space. The ram_addr_t namespace is a tightly packed address space of all RAMBlocks. Guest physical address 0x100001000 might not be ram_addr_t 0x100001000 since ram_addr_t does not include guest physical memory regions that are reserved, memory-mapped I/O, etc. Furthermore, the ram_addr_t offset is dependent on the order in which RAMBlocks were created, unlike the guest physical memory space where everything has a fixed location.

All RAMBlocks are kept in a global RAMList object called ram_list, which holds both the list of RAMBlocks and the dirty memory bitmaps.

Dirty memory tracking

When the guest CPU or device DMA stores to guest RAM, this needs to be noticed by several users:

  1. The live migration feature relies on tracking dirty memory pages so they can be resent if they change during live migration.
  2. TCG relies on tracking self-modifying code so it can recompile changed instructions.
  3. Graphics card emulation relies on tracking dirty video memory to redraw only scanlines that have changed.

There are dirty memory bitmaps for each of these users in ram_list because dirty memory tracking can be enabled or disabled independently for each of these users.
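
One place where this dirty tracking becomes visible outside QEMU's internals is live migration: while a migration is in progress, the monitor reports how quickly the guest is dirtying pages, a figure derived from the migration dirty bitmap:

(qemu) info migrate

Among other statistics, the output includes the rate at which pages are being dirtied while RAM is transferred.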

Address spaces

All CPU architectures have a memory space and some also have an I/O address space. This is represented by AddressSpace, which contains a tree of MemoryRegions (include/exec/memory.h).

The MemoryRegion is the link between guest physical address space and the RAMBlocks containing the memory. Each MemoryRegion has the ram_addr_t offset of the RAMBlock and each RAMBlock has a MemoryRegion pointer.

Note that MemoryRegion is more general than just RAM. It can also represent I/O memory where read/write callback functions are invoked on access. This is how hardware register accesses from a guest CPU are dispatched to emulated devices.

The address_space_rw() function dispatches load/store accesses to the appropriate MemoryRegions. If a MemoryRegion is a RAM region then the data will be accessed from the RAMBlock's mmapped guest RAM. The address_space_memory global variable is the guest physical memory space.
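
The resulting MemoryRegion layout can be inspected from the QEMU monitor, which is a convenient way to see how guest physical addresses map onto RAM and emulated device regions:

(qemu) info mtree

The output lists each AddressSpace along with the guest physical address ranges covered by its MemoryRegions.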

Conclusion

There are a few different layers involved in managing guest physical memory. The "pc-dimm" and "memory-backend" objects are the user-visible configuration objects for DIMMs and memory. The RAMBlock is the mmapped memory chunk. The AddressSpace and its MemoryRegion elements place guest RAM into the memory map.

Tuesday, 1 September 2015

KVM Forum 2015 slides are available!

KVM Forum 2015 was co-located with LinuxCon North America and Linux Plumbers Conference in Seattle, Washington.

The slides and videos for talks are being posted here.

Some of my favorite talks included:

  • Towards multi-threaded TCG by Alex Bennée and Frederic Konrad. Great overview of the TCG just-in-time compiler and how it needs to be extended to support SMP guests.
  • KVM Message Passing Performance by David Matlack. Performance analysis of message-passing workloads (the findings affect other workloads too). The latency diagrams were particularly useful in showing where the overhead is.
  • Using IPMI in QEMU by Corey Minyard. Who would have thought that IPMI would attract this audience and get so much interest? Corey gave a great overview of what IPMI is and how QEMU can support it. Hopefully this work will be upstream soon.

Wednesday, 19 August 2015

virtio-vsock: Zero-configuration host/guest communication

Slides are available for my talk at KVM Forum 2015 about virtio-vsock: Zero-configuration host/guest communication.

virtio-vsock is a new host/guest communications mechanism that allows applications to use the Sockets API to communicate between the hypervisor and virtual machines. It uses the AF_VSOCK address family which was introduced in Linux in 2013.

virtio-vsock has several advantages over virtio-serial. The main advantage is the familiar Sockets API semantics, which is more convenient than serial ports. See the slides for full details on what virtio-vsock offers.

Friday, 14 August 2015

Asynchronous file I/O on Linux: Plus ça change

In 2009 Anthony Liguori gave a presentation at Linux Plumbers Conference about the state of asynchronous file I/O on Linux. He talked about what was missing from POSIX AIO and Linux AIO APIs. I recently got thinking about this again after reading the source code for the io_submit(2) system call.

Over half a decade has passed and, plus ça change, plus c'est la même chose (the more things change, the more they stay the same). Sure, there are new file systems, device-mapper targets, the multiqueue block layer, and high IOPS PCI SSDs. There's DAX for storage devices accessible via memory load/store instructions - radically different from the block device model.

However, the io_submit(2) system call remains a treacherous ally in the quest for asynchronous file I/O. I don't think much has changed since 2009 in making Linux AIO the best asynchronous file I/O mechanism.

The main problem is that io_submit(2) waits for I/O in some cases. It can block! This defeats the purpose of asynchronous file I/O because the caller is stuck until the system call completes. If called from a program's event loop, the program becomes unresponsive until the system call returns. But even if io_submit(2) is invoked from a dedicated thread where blocking doesn't matter, latency is introduced to any further I/O requests submitted in the same io_submit(2) call.

Sources of blocking in io_submit(2) depend on the file system and block devices being used. There are many different cases but in general they occur because file I/O code paths contain synchronous I/O (for metadata I/O or page cache write-out) as well as locks/waiting (for serializing operations). This is why the io_submit(2) system call can be held up while submitting a request.
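
One quick way to check whether io_submit(2) blocks for a particular workload is to trace the system call and look at how long each invocation takes ($PID stands for the process under investigation):

strace -f -T -e trace=io_submit -p "$PID"

The -T flag prints the time spent in each system call, so submissions taking milliseconds rather than microseconds point to blocking in the submission path.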

This means io_submit(2) works best on fully-allocated files, volumes, or block devices. Anything else is likely to result in blocking behavior and cause poor performance.

Since these conditions don't apply in many cases, QEMU has its own userspace thread-pool with worker threads that call preadv(2)/pwritev(2). It would be nice to default to Linux AIO but the limitations are too serious.

Have there been new developments or did I get something wrong? Let me know in the comments.

Wednesday, 1 April 2015

Tracing Linux kernel function entries/returns

Here is a neat ftrace recipe for tracing execution while the Linux kernel is inside a particular function.  This helps when a kernel function or its children are failing but you don't know where or why.

ftrace will trigger on particular functions if you give it set_graph_function values. That way you only see traces from the functions you are interested in. This eliminates the noise you get when tracing all function entries/returns without a filter.

Let's trace virtio_dev_probe() and all its children:

echo virtio_dev_probe >/sys/kernel/debug/tracing/set_graph_function
echo function_graph >/sys/kernel/debug/tracing/current_tracer
echo 1 >/sys/kernel/debug/tracing/tracing_on

modprobe transport_virtio

echo 0 >/sys/kernel/debug/tracing/tracing_on
echo >/sys/kernel/debug/tracing/current_tracer
echo >/sys/kernel/debug/tracing/set_graph_function
cat /sys/kernel/debug/tracing/trace

Here is some example output:

...
 0)               |        virtqueue_kick [virtio_ring]() {
 0) + 30.207 us   |          virtqueue_kick_prepare [virtio_ring]();
 0) + 13.342 us   |          vp_notify [virtio_pci]();
 0) + 90.315 us   |        }
 0) # 61946.45 us |      }
 0)   1.046 us    |      mutex_unlock();
 0) # 102833.9 us |    }
 0)   2.411 us    |    vp_get_status [virtio_pci]();
 0)   0.826 us    |    vp_get_status [virtio_pci]();
 0) ! 130.773 us  |    vp_set_status [virtio_pci]();
 0)               |    virtio_config_enable [virtio]() {
 0)   0.689 us    |      _raw_spin_lock_irq();
 0) + 33.796 us   |    }
 0) # 105349.9 us |  }

I haven't figured out whether set_graph_function can be used on functions whose kernel module has not been loaded yet.  I think the answer is no, but please let me know in the comments if there is a way to do it.