Thursday, 22 December 2011

QEMU 2011 Year in Review

As 2011 comes to an end I want to look back at the highlights from the QEMU community this year. Development progress feels good, the mailing list is very active, and QEMU's future looks bright. I only started contributing in 2010 but the growth since QEMU's early days must be enormous. Perhaps someone will make a source history visualization that shows the commit history and clusters of activity.

Here is the recap of the milestones that QEMU reached in 2011.

QEMU 0.14

In February the 0.14 release came out with a bunch of exciting new features:

For full details see the changelog.

QEMU 0.15

In August the 0.15 release brought yet more cool improvements:

For full details see the changelog.

Google Summer of Code

QEMU participated in Google Summer of Code 2011 and received funding for students to contribute to QEMU during the summer. Behind the scenes this takes an awful lot of work, not only from the students themselves but also from the mentors. These four projects were successfully completed:

  • Boot Mac OS >=8.5 on PowerPC system emulation
  • QED <-> QCOW2 image conversion utility
  • Improved VMDK image format compatibility
  • Adding NeXT emulation support

Hopefully we can continue to participate in GSoC and give students an opportunity to get involved with open source emulation and virtualization.

QEMU 1.0

The final QEMU release for 2011 was in December. The release announcement was picked up quite widely, and after it hit Hacker News and Reddit it took some effort to keep the QEMU website up. I think that's a good sign, although QEMU 1.0 is kind of like Linux 3.0 in that the version number change does not represent a fundamentally new codebase or architecture. Here are some of the changes:

  • Xtensa target architecture
  • TCG Interpreter interprets portable bytecode instead of translating to native machine code

For full details see the changelog.

Ongoing engineering efforts

There is a lot of change in motion as the year ends. Here are long-term efforts that are unfolding right now:

  • Jan Kiszka has made a lot of progress in the quest to merge qemu-kvm back into QEMU. In a way this is similar to Xen's QEMU fork which was merged back earlier this year. This is a great effort because some day soon there will be no more confusion over qemu-kvm vs qemu when they have been unified.
  • Avi Kivity took on the interfaces for guest memory management and is in the process of revamping them. This touches not only the core concept of how QEMU registers and tracks guest memory but also every single emulated device.
  • Anthony Liguori is working on an object model that will make emulated devices and all resources managed by QEMU consistent and accessible via APIs. I think of this like introducing sysfs so that there is one hierarchy and ways to explore and manipulate everything QEMU knows about.

Looking forward to 2012

It is hard to pick the highlights to mention but I hope this summary has given you a few links to click and brought back cool features you forgot about :). Have a great new year!

Thursday, 17 November 2011

Pictures from the Computer History Museum

I visited the Computer History Museum in Mountain View, CA when attending Google Summer of Code Mentor Summit 2011. The museum is fantastic for anyone with an interest in computers because they have the actual machines there - PDP-11, VAX, System 360, and much more. I'd like to go back again because I didn't finish seeing everything before closing time :).

Here are some highlights from the museum, selected more for fun than historic value:

IBM flowcharting tools: This is how I design all my QEMU patches.

Fortran, the next Ruby on Rails?

The neat thing is that I was actually on the same flight back home as the GNU Fortran hackers who attended the Mentor Summit. I think they liked the badges :).

System 360 is not as big as you imagine

Alexander Graf probably has one in his bedroom :).

Go and visit if you get a chance. They have the actual teapot that the famous OpenGL teapot is modelled after!

Wednesday, 28 September 2011

Preparing and storing patch revisions as git tags using git-publish

This weekend I got down to solving a workflow problem that has been bugging me for some time: preparing and storing patch revisions. Manually managing patch revisions is painful; I often find myself switching between git and my inbox several times to put together a consistent patch series.

git-publish is a script that numbers patch revisions, optionally stores a cover letter, and submits the patches via git-send-email(1). When your tree is in a state that you wish to publish you say:

$ git publish --to=qemu-devel@nongnu.org

It creates a git tag that you can refer back to in the future and sends out the patch series emails.

No more numbering revisions, copy & pasting cover letters, or running several steps to format and send patch series.
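To illustrate the bookkeeping that gets automated here, this is a minimal sketch of picking the next revision number from existing tags. The function name and the '<branch>-v<N>' tag naming are assumptions made for illustration; check the git-publish README for the scheme it actually uses.

```python
import re


def next_revision_tag(branch, existing_tags):
    """Return the tag name for the next patch revision of `branch`.

    Assumes a hypothetical '<branch>-v<N>' naming scheme.
    """
    pattern = re.compile(re.escape(branch) + r"-v(\d+)")
    revisions = []
    for tag in existing_tags:
        match = pattern.fullmatch(tag)
        if match:
            revisions.append(int(match.group(1)))
    # The first revision is v1; otherwise continue after the highest one seen.
    return f"{branch}-v{max(revisions, default=0) + 1}"
```

In practice the list of existing tags would come from something like `git tag --list`.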

Give it a try if you are tired of manually managing patch revisions with git. git-publish is released under the MIT License at http://github.com/stefanha/git-publish. I have provided documentation but you can set it up in just two lines:

$ git clone git://github.com/stefanha/git-publish.git
$ git-publish/git-publish --setup # make available via git alias

Be sure to check out the README - it explains how to install and run it in more detail.

Happy git-publishing!

Monday, 19 September 2011

Enhanced VMDK support now in QEMU

QEMU now has greatly enhanced VMDK VMware disk image file support, thanks to Fam Zheng's hard work during Google Summer of Code 2011. Previously QEMU was only able to handle older VMDK files because it did not support the entire VMDK file format specification. This resulted in qemu-img convert and other tools being unable to open certain VMDK files. As of now, qemu.git has merged code to handle the VMDK specification and work well with modern image files.

If you had trouble in the past manipulating VMDK files with qemu-img, it may be worth another look soon. You can already build the latest and greatest qemu-img from the qemu.git repository and distros will provide packages with full VMDK support in the future:

$ git clone git://git.qemu.org/qemu.git
$ cd qemu
$ ./configure
$ make qemu-img
$ ./qemu-img convert ...

It is still recommended to convert VMDK files to QEMU's native formats (raw, qcow2, or qed) in order to get optimal performance for running VMs.

At the end of Fam's Summer of Code project, he put together an article on gotchas and undocumented behavior in the VMDK specification. This will be of great interest to developers writing their own code to manipulate VMDK image files. His experience this summer involved testing a wide range of VMware software, real-world image files, as well as studying existing open-source VMDK code. The VMDK specification is ambiguous in places and does not cover several essential details, so Fam had to figure them out himself and he then documented them.

So with that, happy VMDK-ing...

Sunday, 11 September 2011

How to share files instantly between virtual machines and host

It's pretty common to need to copy files between a virtual machine and the host. This might mean copying drivers or installers from the host into the virtual machine, or getting some data out of the virtual machine and onto the host.

There are several solutions to sharing files, like network file systems, third-party file hosting, or even the virtfs paravirtualized file system supported by KVM. But my favorite ad-hoc file sharing tool is Python's built-in webserver.

The one-liner that shares files over HTTP

To share some files, simply change into the directory subtree you want to export and start the web server:

$ cd path/to/shared/files && python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

The directory hierarchy at path/to/shared/files is now available over HTTP on all IPs for the machine on port 8000. The web server generates index listings for directories so it is easy to browse.
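The command above is the Python 2 spelling. On Python 3 the module is called http.server, so the equivalent one-liner is python3 -m http.server. If you want the same thing from a script, here is a minimal sketch using the standard library; the serve_directory name and its defaults are my own choices for illustration, and it needs Python 3.7 or later.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8000, bind="0.0.0.0"):
    """Serve `directory` over HTTP in a background thread and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((bind, port), handler)
    # daemon=True so the serving thread does not keep the process alive.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


# Usage:
#   server = serve_directory("path/to/shared/files")
#   ...
#   server.shutdown(); server.server_close()
```

Passing port=0 lets the operating system pick a free port, which can then be read from server.server_address.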

To access the host from a virtual machine:

  • User networking (slirp): http://10.0.2.2:8000/
  • NAT networking: the default gateway from ip route show | grep ^default, for example http://192.168.122.1:8000/
  • Bridged or routed networking: the IP address of the host
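If the guest has Python but no ip tool, the default gateway can also be read from /proc/net/route. This is a small sketch under the assumption of a little-endian Linux guest (such as x86), where the kernel prints addresses as little-endian hex; the function name is made up for this example.

```python
import socket
import struct


def default_gateway(route_table_text):
    """Return the default gateway from /proc/net/route contents, or None."""
    for line in route_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        destination, gateway = fields[1], fields[2]
        if destination == "00000000":  # 0.0.0.0 means the default route
            # Assumes a little-endian host: the hex value is byte-swapped.
            return socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
    return None


# Usage inside the guest:
#   with open("/proc/net/route") as f:
#       print("http://%s:8000/" % default_gateway(f.read()))
```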

To access the virtual machine from the host:

  • NAT, bridged, or routed networking: the virtual machine's IP address
  • User networking (slirp): forward a port with the hostfwd_add QEMU monitor command or -net user,hostfwd= option documented in the QEMU man page

Advantages

There are a couple of reasons why I like Python's built-in web server:

  • No need to install software since Linux and Mac typically already have Python installed.
  • No privileges are required to run the web server.
  • Works for both physical and virtual machines.
  • HTTP is supported by desktop file managers, browsers, and the command-line with wget or curl.
  • Network booting and installation are possible straight from the web server. Be warned that RHEL installs over HTTP have been known to fail with SimpleHTTPServer but I have not encountered problems with other software.

Security tips

This handy HTTP server is only suitable for trusted networks for a couple of reasons:

  • Encryption is not supported so data travels in plain text and can be tampered with in flight.
  • The directory listing feature means you should only export directory subtrees that contain no sensitive data.
  • SimpleHTTPServer is not a production web server and there could be obvious bugs.

Conclusion

Python's built-in web server is one of my favorite tricks that not many people seem to know. I hope this handy command comes in useful to you!

Wednesday, 7 September 2011

QEMU Internals: vhost architecture

This post explains how vhost provides in-kernel virtio devices for KVM. I have been hacking on vhost-scsi and have answered questions about ioeventfd, irqfd, and vhost recently, so I thought this would be a useful QEMU Internals post.

Vhost overview

The vhost drivers in Linux provide in-kernel virtio device emulation. Normally the QEMU userspace process emulates I/O accesses from the guest. Vhost puts virtio emulation code into the kernel, taking QEMU userspace out of the picture. This allows device emulation code to directly call into kernel subsystems instead of performing system calls from userspace.

The vhost-net driver emulates the virtio-net network card in the host kernel. Vhost-net is the oldest vhost device and the only one which is available in mainline Linux. Experimental vhost-blk and vhost-scsi devices have also been developed.

In Linux 3.0 the vhost code lives in drivers/vhost/. Common code that is used by all devices is in drivers/vhost/vhost.c. This includes the virtio vring access functions which all virtio devices need in order to communicate with the guest. The vhost-net code lives in drivers/vhost/net.c.

The vhost driver model

The vhost-net driver creates a /dev/vhost-net character device on the host. This character device serves as the interface for configuring the vhost-net instance.

When QEMU is launched with -netdev tap,vhost=on it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl(2) calls. These are necessary to associate the QEMU process with the vhost-net instance, prepare for virtio feature negotiation, and pass the guest physical memory mapping to the vhost-net driver.

During initialization the vhost driver creates a kernel thread called vhost-$pid, where $pid is the QEMU process pid. This thread is called the "vhost worker thread". The job of the worker thread is to handle I/O events and perform the device emulation.
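To make the ioctl(2) setup a little more concrete, here is a rough sketch that claims ownership of a vhost-net instance and reads its feature bits. The two request numbers are rebuilt from the _IO/_IOR definitions in linux/vhost.h using the generic Linux ioctl encoding, which is an assumption that holds on x86 and ARM but not on every architecture. The probe_vhost_net helper is hypothetical and is not how QEMU structures its code.

```python
import fcntl
import os
import struct

VHOST_VIRTIO = 0xAF  # ioctl "type" byte used by <linux/vhost.h>

# Generic Linux ioctl encoding (assumed): dir:2 | size:14 | type:8 | nr:8
_IOC_NONE, _IOC_WRITE, _IOC_READ = 0, 1, 2


def _ioc(direction, nr, size):
    return (direction << 30) | (size << 16) | (VHOST_VIRTIO << 8) | nr


VHOST_GET_FEATURES = _ioc(_IOC_READ, 0x00, 8)  # _IOR(VHOST_VIRTIO, 0x00, __u64)
VHOST_SET_OWNER = _ioc(_IOC_NONE, 0x01, 0)     # _IO(VHOST_VIRTIO, 0x01)


def probe_vhost_net(path="/dev/vhost-net"):
    """Return the vhost-net feature bitmask, or None if it is unavailable."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        return None
    try:
        # Associate the calling process with this vhost instance.
        fcntl.ioctl(fd, VHOST_SET_OWNER)
        # Ask which virtio features the in-kernel device supports.
        buf = bytearray(8)
        fcntl.ioctl(fd, VHOST_GET_FEATURES, buf)
        return struct.unpack("=Q", buf)[0]
    except OSError:
        return None
    finally:
        os.close(fd)
```

A real setup goes on to pass the guest memory table and per-virtqueue state, which this sketch leaves out.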

In-kernel virtio emulation

Vhost does not emulate a complete virtio PCI adapter. Instead it restricts itself to virtqueue operations only. QEMU is still used to perform virtio feature negotiation and live migration, for example. This means a vhost driver is not a self-contained virtio device implementation; it depends on userspace to handle the control plane while the data plane is done in-kernel.

The vhost worker thread waits for virtqueue kicks and then handles buffers that have been placed on the virtqueue. In vhost-net this means taking packets from the tx virtqueue and transmitting them over the tap file descriptor.

File descriptor polling is also done by the vhost worker thread. In vhost-net the worker thread wakes up when packets come in over the tap file descriptor and it places them into the rx virtqueue so the guest can receive them.

Vhost as a userspace interface

One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if they find them convenient high-performance I/O interfaces.

When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.

On the return trip from the vhost worker thread to interrupting the guest a similar approach is used. Vhost takes a "call" file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.

In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.

Where to find out more

Here are the main points to begin exploring the code:
  • drivers/vhost/vhost.c - common vhost driver code
  • drivers/vhost/net.c - vhost-net driver
  • virt/kvm/eventfd.c - ioeventfd and irqfd
The QEMU userspace code shows how to initialize the vhost instance:
  • hw/vhost.c - common vhost initialization code
  • hw/vhost_net.c - vhost-net initialization

Sunday, 21 August 2011

KVM Forum 2011 Highlights

KVM Forum 2011 was co-located with LinuxCon North America in Vancouver, Canada. KVM Forum ran Monday and Tuesday, 15 & 16 of August, and featured two tracks packed with developer-oriented talks on KVM and related open source projects.

Here is a summary with links to some of the most interesting talks:

Big picture and roadmaps


Daniel Berrange gave an excellent overview of libvirt and libguestfs. These two tools, along with other virt-tools like virt-v2v, form the user-visible toolkit and APIs around KVM. Anyone who needs to automate KVM or develop custom applications that integrate with KVM should learn about the work being done by the libvirt community to provide both APIs and tools for managing virtualized guests.

Alon Levy's SPICE Roadmap talk explains how remote graphics, input, sound, and USB are being added to KVM. SPICE goes far beyond today's VNC server, which can only scrape screen updates and send them to the client. SPICE has a deeper insight into the graphics pipeline and is able to provide efficient remote displays. In addition, channels for input, sound, and USB pass-through promise to bring good desktop integration to KVM.

Subsystem status and plumbing


Fixing the USB Disaster explains the state of USB, where Gerd Hoffmann has been working to remove limitations and add support for the USB 2.0 standard. These much-needed improvements will allow USB pass-through to actually work across devices. In the past I've had mixed results when passing through VoIP handsets and consumer electronics devices. Thanks to Gerd's work, I'm hoping that USB pass-through will work more consistently in future releases.

Migration: One year later tells the story of live migration in KVM. Juan Quintela has been working on this area of QEMU. In part due to today's live migration support in the device model, it has been a struggle to provide working live migration as device emulation is extended or fixed. In particular it is a challenging problem to migrate from an old qemu-kvm to a new one, and vice versa.

New features


AMD IOMMU Version 2 covers the enhanced I/O Memory Management Unit from AMD. Joerg Roedel, who has been working on nested virtualization on AMD CPUs, presents the key features of this new IOMMU. It adds PCI ATS-based support for demand paging. This eliminates the need to lock guest memory when doing PCI pass-through, since it's now possible to swap in a page and resume the I/O when a fault occurs.

Along the lines of Kemari, Kei Ohmura presents Rapid VM Synchronization with I/O Emulation Logging-Replay. The new trick is that I/O logging enables shared-nothing configurations where the primary and secondary host do not have access to shared storage. Instead the primary sends I/O logs to the secondary, where they are replayed to bring the secondary disk image up to date.

Performance


For a tour of performance tweaks across memory, networking, and storage, check out Mark Wagner's talk on KVM Performance Improvements and Optimizations. He covers Transparent Huge Pages, vhost-net, SR-IOV, block I/O schedulers, Linux AIO, NUMA tuning, and more.

...and more


Each talk was only 30 minutes long, and with two tracks that meant lots of talks. To see all presentations, go to the KVM Forum 2011 website.

This post only covered non-IBM presentations. There were so many IBMers working on KVM around that I'd like to bring together those talks and show all the areas they touch on. In my next post I will give an overview of the KVM presentations given by IBM Linux Technology Center people at KVM Forum and LinuxCon North America.

Wednesday, 3 August 2011

My KVM Architecture guest post is up at Virtualization@IBM

Earlier this year IBM launched the Virtualization@IBM blog and I'm pleased to have contributed a guest post on KVM Architecture!

KVM Architecture: The Key Components of Open Virtualization with KVM explains the open virtualization stack built around KVM. It highlights the performance, security, and management characteristics of the architecture. I hope it is a good overview if you want a quick idea of how KVM works.

You can follow @Linux_at_IBM and @OpenKVM on Twitter for more official updates, for example on the Open Virtualization Alliance that HP, IBM, Intel, Red Hat, and many others are coming together around.

Wednesday, 8 June 2011

LinuxCon Japan 2011 KVM highlights

I had the opportunity to attend LinuxCon Japan 2011 in Yokohama. The conference ran many interesting talks, with a virtualization mini-summit taking place through the first two days. Here are highlights from the conference and some of the latest KVM-related presentation slides.

KVM Ecosystem

Jes Sorensen from Red Hat presented the KVM Weather Report, a status update and look at recent KVM work. Check out his slides if you want an overview of where KVM is today. The main points that I identify are:
  • KVM performance is excellent and continues to improve.
  • In the future, expect advanced features that round out KVM's strengths.
Jes' presentation ends on a nice note with a weather forecast that reads "Cloudy with a chance of total world domination" :).

Virtualization End User Panel

The end user panel brought together three commercial users of open virtualization software to discuss their deployments and answer questions from the audience. I hope a video of this panel will be made available because it is interesting to understand how virtualization is put to use.

All three users were in the hosting or cloud market. One user ran KVM, one ran Xen, and the third ran KVM for standard offerings and VMware for premium offerings. It is clear that KVM is becoming the hypervisor of choice for hosting due to its low cost and good Linux integration - the Xen user had started several years ago but is now evaluating KVM. My personal experience with USA and UK hosting is that Xen is still widely deployed although KVM is growing quickly.

All three users rely on custom management tools and web interfaces. Although libvirt is being used the management layers above it are seen as an opportunity to differentiate. The current breed of open cloud and virtualization management tools weren't seen as mature or scalable enough. I expect this area of the virtualization stack to solidify with several of the open source efforts consolidating in order to reach critical mass.

Storage and Networking

Jes Sorensen from Red Hat covered the KVM Live Snapshot Support work that will allow disk snapshots to be taken for backup and other purposes without stopping the VM. My own talk gave An Updated Overview of the QEMU Storage Stack and covered other current work in the storage area, as well as explaining the most important storage configuration settings when using QEMU.

Stephen Hemminger from Vyatta presented an overview of Virtual Networking Performance. KVM is looking pretty good relative to other hypervisors, no doubt thanks to all the work that has gone into network performance under KVM. Michael Tsirkin and many others are still optimizing virtio-net and vhost_net to reduce latency, improve throughput, and reduce CPU consumption and I think the results justify virtio and paravirtualized I/O. KVM is able to continue improving network performance by extending its virtio host<->guest interface in order to work more efficiently - something that is impossible when emulating existing real-world hardware.

Other KVM-related talks

Isaku Yamahata from VA Linux gave his Status Update on QEMU PCI Express Support. His goal is PCI device assignment of PCI Express adapters. I think his work is important for QEMU in the long term since eventually hardware emulation needs to be able to present modern devices in order for guest operating systems to function.

Guangrong Xiao from Fujitsu presented KVM Memory Virtualization Progress, which describes how guest memory access works. It covers both hardware MMU virtualization in modern processors as well as the software solution used on older hardware. Kernel Samepage Merging (KSM) and Transparent Hugepages (THP) are also touched upon. This is a good technical overview if you want to understand how memory management virtualization works.

More slides and videos

The full slides and videos for talks should be available within the next few weeks. Here are the links to the LinuxCon Japan 2011 main schedule and virtualization mini-summit schedule.

Saturday, 7 May 2011

KVM Slides from Red Hat Summit 2011

Update: Added link to Mark Wagner's KVM Performance Improvements and Optimizations slides that Andrew Cathrow posted on IRC.

This year at Red Hat Summit 2011 many presentations touched on KVM and virtualization. Slide decks are mostly online now so I took a look and highlighted those that I found most interesting. It's also worth checking again in a few days; hopefully the remaining slide decks will come online.

Converting, Inspecting, & Modifying Virtual Machines with Red Hat Enterprise Linux 6.1

Slides: libguestfs material, virt-v2v/virt-p2v material

Richard Jones (libguestfs) and Matthew Booth (virt-v2v/virt-p2v) cover the tools they have developed for manipulating virtual machines and their disk images. This looks really, really cool. KVM needs great tools for working with VM data rather than requiring the user to manually stack up disk partitioning, volume management, file system, and other functionality from scratch every time.

libguestfs has a small Linux-based appliance VM containing disk, volume, and file system tools. There are a bunch of command-line tools and a shell for interacting with the appliance VM, which can access guest file systems without requiring root privileges on the host. Files can be downloaded/uploaded, partitions can be inspected, guest operating systems can be detected, and even the Windows registry can be edited using libguestfs.

KVM Performance Optimizations

Slides: KVM Performance Optimizations

Rik van Riel gave an overview of recent and future KVM performance optimizations:
  • vhost-net: in-kernel virtio-net host accelerator.
  • kernel samepage merging (ksm): memory deduplication.
  • transparent hugepages: automatic hugepages without administrator management.
  • pause loop exiting as a solution to the lockholder preemption problem.
  • free page hinting and dynamic memory resizing to do intelligent swapping and waste fewer resources.

This is definitely worth reading if you're interested in virtualization internals. The presentation also answers the practical question of how ksm and transparent hugepages interact (both are memory management features that are not trivially compatible with each other). Asynchronous page faults weren't mentioned but I've linked to Gleb Natapov's KVM Forum 2010 slides on this feature since it fits in the same category.

System Resource Management Using Red Hat Enterprise Linux 6 cGroups

Slides: System Resource Management Using Red Hat Enterprise Linux 6 cGroups

Linda Wang and Bob Kozdemba explain the cgroups resource control features that have been added to Linux. Processes can be assigned to control groups which kernel subsystems like the scheduler or block layer take into account when arbitrating resources. Cgroups can be used to divide CPU, memory, block, and network resources. This looks much better than nice(1), ionice(1), and friends although cgroups are a complementary feature and don't replace them. Next time a compile or a download is affecting interactive foreground processes I'll be sure to try out cgroups.

What do cgroups have to do with KVM? Since KVM is based on Linux and VMs are in fact userspace processes, the cgroups features can be used to apply resource controls to VMs. Libvirt will play a role here and take care of setting up the right cgroups behind the scenes, but it is interesting to learn about the underlying mechanism and what it can do.

KVM Performance Improvements and Optimizations

Slides: KVM Performance Improvements and Optimizations

Mark Wagner gives an overview of performance tuning across CPU, memory (NUMA), disk, and network I/O. Lots of keywords and tweaks to dig into for anyone tuning KVM installations.

Saturday, 23 April 2011

How to capture VM network traffic using qemu -net dump

This post describes how to save a packet capture of the network traffic a QEMU
virtual machine sees. This feature is built into QEMU and works with any
emulated network card and any host network device except vhost-net.

It's relatively easy to use tcpdump(8) with tap networking. First the
tap device for the particular VM needs to be identified and then packets can be
captured:
# tcpdump -i vnet0 -s0 -w /tmp/vm0.pcap

The tcpdump(8) approach cannot be easily used with non-tap host network devices, including slirp and socket.

Using the dump net client

Packet capture is built into QEMU and can be done without tcpdump(8). There are some restrictions:
  1. The vhost-net host network device is not supported because traffic does not pass through QEMU so interception is not possible.
  2. The old-style -net command-line option must be used instead of -netdev because the dump net client depends on the mis-named "vlan" feature (essentially a virtual network hub).

Without further ado, here is an example invocation:
$ qemu -net nic,model=e1000 -net dump,file=/tmp/vm0.pcap -net user
This presents the VM with an Intel e1000 network card using QEMU's userspace network stack (slirp). The packet capture will be written to /tmp/vm0.pcap. After shutting down the VM, either inspect the packet capture on the command-line:
$ /usr/sbin/tcpdump -nr /tmp/vm0.pcap

Or open the pcap file with Wireshark.
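For quick scripted checks, the capture can also be read with a few lines of Python. This is a minimal sketch that assumes the file uses the classic pcap layout (a 24-byte file header followed by 16-byte per-packet headers), which is the format tcpdump -r reads; the summarize_pcap name is made up for this example.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4


def summarize_pcap(data):
    """Return (link_type, [captured_length, ...]) for classic pcap bytes."""
    if len(data) < 24:
        raise ValueError("too short to be a pcap file")
    # The magic number is written in the writer's byte order, so try both.
    for endian in ("<", ">"):
        if struct.unpack(endian + "I", data[:4])[0] == PCAP_MAGIC:
            break
    else:
        raise ValueError("not a classic pcap file")
    # magic, version major/minor, thiszone, sigfigs, snaplen, link type
    header = struct.unpack(endian + "IHHiIII", data[:24])
    link_type = header[6]
    lengths = []
    offset = 24
    while offset + 16 <= len(data):
        # ts_sec, ts_usec, captured length, original length
        _, _, incl_len, _ = struct.unpack(endian + "IIII", data[offset:offset + 16])
        lengths.append(incl_len)
        offset += 16 + incl_len
    return link_type, lengths


# Usage:
#   with open("/tmp/vm0.pcap", "rb") as f:
#       link_type, lengths = summarize_pcap(f.read())
```

A link type of 1 means Ethernet frames.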

Wednesday, 13 April 2011

KVM-Autotest Install Fest on April 14

Mike Roth has just posted a nice guide to getting started with KVM-Autotest, the suite of acceptance tests that can be run against KVM. KVM-Autotest is able to automate guest installs and prevent regressions being introduced into KVM.

I'm looking forward to participating in the KVM-Autotest Install Fest tomorrow and encourage all QEMU and KVM developers to do the same. I only dabbled with KVM-Autotest once in the past and this is an opportunity to begin using it more regularly and look at contributing tests.

Adam Litke has helped organize the event and set up a wiki page here.

I look forward to seeing fellow KVM-Autotesters on #qemu IRC tomorrow :).

Saturday, 9 April 2011

How to pass QEMU command-line options through libvirt

An entire virtual machine configuration can be passed on QEMU's extensive
command-line, including everything from PCI slots to CPU features to serial
port settings. While defining a virtual machine from a monster
command-line may seem insane, there are times when QEMU's rich command-line
options come in handy.

And at those times one wishes to side-step libvirt's domain XML and specify
QEMU command-line options directly. Luckily libvirt makes this possible and I
learnt about it from Daniel Berrange and Anthony Liguori on IRC. This libvirt
feature will probably come in handy to others and so I want to share it.

The <qemu:commandline> domain XML tag

There is a special namespace for QEMU-specific tags in libvirt domain XML. You
cannot use QEMU-specific tags without first declaring the namespace. To enable
it use the following:
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

Now you can add command-line arguments to the QEMU invocation. For example, to load an option ROM with -option-rom:
<qemu:commandline>
   <qemu:arg value='-option-rom'/>
   <qemu:arg value='path/to/my.rom'/>
</qemu:commandline>

It is also possible to add environment variables to the QEMU invocation:
<qemu:commandline>
   <qemu:env name='MY_VAR' value='my_value'/>
</qemu:commandline>

Setting qdev properties through libvirt

Taking this a step further we can set qdev properties through libvirt. There is no domain XML for setting the virtio-blk-pci ioeventfd qdev property. Here is how to set it using <qemu:arg> and the -set QEMU option:
<qemu:commandline>
  <qemu:arg value='-set'/>
  <qemu:arg value='device.virtio-disk0.ioeventfd=off'/>
</qemu:commandline>

The result is that libvirt generates a QEMU command-line that ends with -set device.virtio-disk0.ioeventfd=off. This causes QEMU to go back and set the ioeventfd property of device virtio-disk0 to off.
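If you generate domain XML from a script, the namespace handling is the part that is easy to get wrong. Here is a small sketch using Python's xml.etree.ElementTree that produces the same elements as the snippets above. The add_qemu_args helper is a name I made up, and the bare domain element is only a stand-in for a complete domain definition.

```python
import xml.etree.ElementTree as ET

QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"
# Make the serializer emit the 'qemu:' prefix instead of a generated one.
ET.register_namespace("qemu", QEMU_NS)


def add_qemu_args(domain, args):
    """Append <qemu:arg> elements for `args` under <qemu:commandline>."""
    cmdline = domain.find(f"{{{QEMU_NS}}}commandline")
    if cmdline is None:
        cmdline = ET.SubElement(domain, f"{{{QEMU_NS}}}commandline")
    for arg in args:
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg", {"value": arg})
    return domain


# Stand-in for a full domain definition.
domain = ET.Element("domain", {"type": "kvm"})
add_qemu_args(domain, ["-set", "device.virtio-disk0.ioeventfd=off"])
xml_text = ET.tostring(domain, encoding="unicode")
```

The xmlns:qemu declaration is added to the root element automatically when the tree is serialized.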

More information

The following libvirt wiki page documents mappings from QEMU command-line options to libvirt domain XML. This is extremely useful if you know which QEMU option to use but are unsure how to express that in domain XML.

That page also reveals the <qemu:commandline> tag and shows how it can be used to invoke QEMU with the GDB stub (-s).

Tuesday, 29 March 2011

How to use perf-probe

Dynamic tracepoints are an observability feature that allows functions
and even arbitrary lines of source code to be instrumented without recompiling
the program. Armed with a copy of the program's source code, tracepoints can
be placed anywhere at runtime and the value of variables can be dumped each
time execution passes the tracepoint. This is an extremely powerful technique
for instrumenting complex systems or code written by someone else.

I recently investigated issues with CD-ROM media change
inside a KVM guest. Using QEMU tracing I was able to
instrument the ATAPI CD-ROM emulation inside QEMU and observe the commands
being sent to the CD-ROM. I complemented this with perf-probe(1) to
instrument the sr and cdrom drivers in the Linux host and
guest kernels. The power of perf-probe(1) was key to understanding the
state of the CD-ROM drivers without recompiling a custom debugging kernel.

Overview of perf-probe(1)

The probe subcommand of perf(1) allows dynamic tracepoints
to be added or removed inside the Linux kernel. Both core kernel and modules
can be instrumented.

Besides instrumenting locations in the code, a tracepoint can also fetch
values from local variables, globals, registers, the stack, or memory. You can
combine this with perf-record -g callgraph recording to get stack
traces when execution passes probes. This makes perf-probe(1)
extremely powerful.

Requirements

A word of warning: use a recent kernel and perf(1) tool. I was unable
to use function return probes with Fedora 14's 2.6.35-based kernel. It works
fine with a 2.6.38 perf(1) tool. In general, I have found the
perf(1) tool to be somewhat unstable but this has improved a lot with
recent versions.

Plain function tracing can be done without installing kernel debuginfo
packages but arbitrary source line probes and variable dumping require the
kernel debuginfo packages. On Debian-based distros the kernel debuginfo
packages are called linux-image-*-dbg and on Red Hat-based distros
they are called kernel-debuginfo.

Tracing function entry

The following command-line syntax adds a new function entry probe:
sudo perf probe <function-name> [<local-variable> ...]

Tracing function return

The following command-line syntax adds a new function return probe and dumps the return value:
sudo perf probe <function-name>%return '$retval'
Notice the arg1 return value in this example output for scsi_test_unit_ready():
$ sudo perf record -e probe:scsi_test_unit_ready -aR sleep 10
[...]
$ sudo perf script
     kworker/0:0-3383  [000]  3546.003235: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa0069306) arg1=8000002
 hald-addon-stor-3025  [001]  3546.004431: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa006a143) arg1=8000002
     kworker/0:0-3383  [000]  3548.004531: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa0069306) arg1=8000002
 hald-addon-stor-3025  [001]  3548.005531: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa006a143) arg1=8000002
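When there are many events it can help to post-process the perf script output. The following Python sketch parses lines in the format shown above and counts return values per task. The regular expression is only fitted to this example output, and it assumes that arg1 is printed in hexadecimal without a 0x prefix, so check both against what your perf(1) version prints.

```python
import re
from collections import Counter

# Matches lines such as:
#   kworker/0:0-3383  [000]  3546.003235: scsi_test_unit_ready: (... <- ...) arg1=8000002
LINE_RE = re.compile(
    r'^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+'
    r'(?P<time>\d+\.\d+):\s+(?P<probe>\w+):\s+\(.*\)\s+'
    r'arg1=(?P<arg1>[0-9a-fA-F]+)$')


def parse_perf_script(text):
    """Yield (task, pid, time, probe, retval) for each return-probe line."""
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            # Assumption: arg1 is hexadecimal without a 0x prefix.
            yield (m.group('task'), int(m.group('pid')),
                   float(m.group('time')), m.group('probe'),
                   int(m.group('arg1'), 16))


sample = """\
     kworker/0:0-3383  [000]  3546.003235: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa0069306) arg1=8000002
 hald-addon-stor-3025  [001]  3546.004431: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa006a143) arg1=8000002
"""

events = list(parse_perf_script(sample))
counts = Counter((task, retval) for task, _, _, _, retval in events)
for (task, retval), n in sorted(counts.items()):
    print('%s returned %#x %d time(s)' % (task, retval, n))
```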

Listing probes

Existing probes can be listed:
sudo perf probe -l

Removing probes

A probe can be deleted with:
sudo perf probe -d <probe-name>
All probes can be deleted with:
sudo perf probe -d '*'

Further information

See the perf-probe(1) man page for details on command-line syntax and additional features.

The underlying mechanism of perf-probe(1) is kprobe-based event tracing, which is documented here.

I hope that future Linux kernels will add perf-probe(1) for
userspace processes. Nowadays GDB might already include this feature in its
tracing commands but I haven't had a chance to try them out yet.

Saturday, 26 March 2011

Best practices and tuning tips for KVM

The IBM Best practices for KVM document covers storage, networking, and memory/CPU overcommit tuning. If you are looking for known good configurations and a place to start using KVM for optimized server virtualization, check out this document!

Tuesday, 22 March 2011

How to access the QEMU monitor through libvirt

It is sometimes useful to issue QEMU monitor commands to VMs managed by libvirt. Since libvirt takes control of the monitor socket it is not possible to interact with the QEMU monitor in the same way as when running QEMU or KVM manually.

Daniel Berrange shared the following techniques on IRC a while back. It is actually pretty easy to get at the QEMU monitor even while libvirt is managing the VM:

Method 1: virsh qemu-monitor-command


There is a virsh command available in libvirt ≥0.8.8 that allows you to access the QEMU monitor through virsh:

virsh qemu-monitor-command --hmp <domain> '<command> [...]'

Method 2: Connecting directly to the monitor socket


On older libvirt versions the only option is shutting down libvirt, using the monitor socket directly, and then restarting libvirt:

sudo service libvirt-bin stop  # or "libvirtd" on Red Hat-based distros
sudo nc -U /var/lib/libvirt/qemu/<domain>.monitor
...
sudo service libvirt-bin start

Either way works fine. I hope this is useful for folks troubleshooting QEMU or KVM. In the future I will post more libvirt tips :).

Update: Daniel Berrange adds that using the QEMU monitor essentially voids your libvirt warranty :). Try to only use query commands like info qtree rather than commands that change the state of QEMU like adding/removing devices.

Saturday, 19 March 2011

QEMU.org accepted for Google Summer of Code 2011

Good news for students interested in contributing to open source this summer: QEMU.org has been accepted for Google Summer of Code 2011!

If you are interested or have a friend who is enrolled at a university and would like to get paid to work on open source software this summer, check out the QEMU GSoC 2011 ideas page. The full list of accepted organizations is here.

Take a look at my advice on applying for tips on how to succeed with your student application and get chosen.

It's going to be a fun summer and I'm looking forward to getting to know talented students who want to contribute to open source!

Saturday, 12 March 2011

How to write trace analysis scripts for QEMU

This post shows how to write a Python script that finds overlapping disk writes in a QEMU simple trace file.

Several trace backends, including SystemTap and LTTng Userspace Tracer, are supported by QEMU. The built-in "simple" trace backend often gives the best bang for the buck because it does not require installing additional software and is easy for developers to use. See my earlier post for an overview of QEMU tracing.

The simple trace backend has recently been enhanced with a Python module for analyzing trace files. This makes it easy to write scripts that post-process trace files and extract useful information. Take this example from the commit that introduced the simpletrace module:

#!/usr/bin/env python
# Print virtqueue elements that were never returned to the guest.

import simpletrace

class VirtqueueRequestTracker(simpletrace.Analyzer):
    def __init__(self):
        self.elems = set()

    def virtqueue_pop(self, vq, elem, in_num, out_num):
        self.elems.add(elem)

    def virtqueue_fill(self, vq, elem, length, idx):
        self.elems.remove(elem)

    def end(self):
        for elem in self.elems:
            print(hex(elem))

simpletrace.run(VirtqueueRequestTracker())

This script tracks virtqueue_pop and virtqueue_fill operations and prints out the elements that were popped but never filled back, which indicates elements have leaked.

The model of an analysis script is similar to awk. Trace records are processed from the input file by invoking methods on the user's simpletrace.Analyzer object. The analyzer object does not have to supply methods for all possible trace events; it can implement just those that it wants to know about. Trace events that have no dedicated method cause the catchall() method to be invoked, if provided.
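To make the awk-like model concrete, here is a minimal sketch of the dispatch idea. It is not the actual simpletrace implementation: the dispatch() helper, the stand-in Analyzer base class, the WriteCounter analyzer, and the hand-written records are all made up for illustration, and the real catchall() may take different arguments.

```python
class Analyzer(object):
    """Stand-in base class; the real one lives in scripts/simpletrace.py."""

    def catchall(self, event_name, args):
        pass  # by default, events without a dedicated method are ignored


def dispatch(analyzer, records):
    """Call the method named after each event, or catchall() if there is none."""
    for event_name, args in records:
        handler = getattr(analyzer, event_name, None)
        if handler is not None:
            handler(*args)
        else:
            analyzer.catchall(event_name, args)


class WriteCounter(Analyzer):
    """Hypothetical analyzer that counts writes and everything else."""

    def __init__(self):
        self.writes = 0
        self.other = 0

    def bdrv_aio_writev(self, bs, sector_num, nb_sectors, opaque):
        self.writes += 1

    def catchall(self, event_name, args):
        self.other += 1


# Fake trace records standing in for a parsed trace file.
records = [
    ('bdrv_aio_writev', (0x29c8180, 0xe40, 664, 0x2eb4190)),
    ('virtqueue_pop', (0x1, 0x2, 1, 1)),
    ('bdrv_aio_writev', (0x29c8180, 0x10d8, 3032, 0x2eb41d0)),
]

counter = WriteCounter()
dispatch(counter, records)
print('writes=%d other=%d' % (counter.writes, counter.other))
```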

Tracing disk write operations


Let's write a slightly fancier script that finds disk writes that overlap a given range. I've needed to perform disk write overlap queries in the past when debugging image formats, so this is a useful script to have. QEMU's block layer write function looks like this:

BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                  QEMUIOVector *iov, int nb_sectors,
                                  BlockDriverCompletionFunc *cb, void *opaque);

The block device is called bs and sectors are 512 bytes. Conveniently there is already a trace event for bdrv_aio_writev so we just need to enable it in the trace-events file:

disable bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d opaque %p"

Open the trace-events file and remove the disable keyword from the bdrv_aio_writev trace event. Then rebuild QEMU like this:

$ ./configure --enable-trace-backend=simple # ... plus your usual options
$ make

Next time you run QEMU a trace file named trace-$PID will be created in the current working directory. The file is in binary and can be parsed using the simpletrace Python module or pretty-printed like this:

$ scripts/simpletrace.py trace-events trace-12345 # replace "trace-12345" with the actual filename

Finding overlapping disk writes


Here is the usage information for a script to find disk writes that overlap a given range:

usage: find_overlapping_writes.py <trace-events> <trace-file> <bs> <sector-num> <nb-sectors>

The script only considers writes to a specific block device, bs. That means all disk I/O to other block devices is ignored.

For example, let's find writes to a 1 MB region at offset 2 MB of the BlockDriverState 0x29c8180:
$ ./find_overlapping_writes.py trace-events trace-19129 0x29c8180 4096 2048
0xe40+664 opaque=0x2eb4190
0x10d8+3032 opaque=0x2eb41d0

Here is the code:
#!/usr/bin/env python
import sys
import simpletrace

def intersects(a_start, a_len, b_start, b_len):
    return not (a_start + a_len <= b_start or \
                b_start + b_len <= a_start)

class OverlappingWritesAnalyzer(simpletrace.Analyzer):
    def __init__(self, bs, sector_num, nb_sectors):
        self.bs = bs
        self.sector_num = sector_num
        self.nb_sectors = nb_sectors

    def bdrv_aio_writev(self, bs, sector_num, nb_sectors, opaque):
        if bs != self.bs:
            return
        if intersects(self.sector_num, self.nb_sectors, sector_num, nb_sectors):
            print('%#x+%d opaque=%#x' % (sector_num, nb_sectors, opaque))

if len(sys.argv) != 6:
    sys.stderr.write('usage: %s <trace-events> <trace-file> <bs> <sector-num> <nb-sectors>\n' %
                     sys.argv[0])
    sys.exit(1)

trace_events, trace_file, bs, sector_num, nb_sectors = sys.argv[1:]
bs = int(bs, 0)
sector_num = int(sector_num, 0)
nb_sectors = int(nb_sectors, 0)

analyzer = OverlappingWritesAnalyzer(bs, sector_num, nb_sectors)
simpletrace.process(trace_events, trace_file, analyzer)

The core of the script is the OverlappingWritesAnalyzer that checks bdrv_aio_writev events for intersection with the given range.

This script is longer than the virtqueue leak detector example above because it parses command-line arguments. The simpletrace.run() function used by the leak detector handles the default trace events and trace file arguments for you. So scripts that take no special command-line arguments can use simpletrace.run(), which also prints usage information automatically. For overlapping writes we really need our own command-line arguments so the slightly lower-level simpletrace.process() function is used.
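If you want to double-check the sector arithmetic, the following standalone snippet converts the "1 MB region at offset 2 MB" query into 512-byte sectors and runs the same intersects() test on the two writes from the example output. The third write is a made-up one that lies beyond the query range.

```python
SECTOR_SIZE = 512
MB = 1024 * 1024


def intersects(a_start, a_len, b_start, b_len):
    # Two ranges overlap unless one ends before the other begins.
    return not (a_start + a_len <= b_start or b_start + b_len <= a_start)


# 1 MB region at offset 2 MB, expressed in sectors.
query_start = 2 * MB // SECTOR_SIZE
query_len = 1 * MB // SECTOR_SIZE
print('query: sector_num=%d nb_sectors=%d' % (query_start, query_len))

writes = [
    (0xe40, 664),    # from the example output
    (0x10d8, 3032),  # from the example output
    (0x2000, 8),     # hypothetical write past the end of the range
]
overlaps = [intersects(query_start, query_len, s, n) for s, n in writes]
for (s, n), hit in zip(writes, overlaps):
    print('%#x+%d overlaps=%s' % (s, n, hit))
```

The query works out to sector 4096 with a length of 2048 sectors, which matches the arguments passed on the command-line above.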

Where to find out more


There is more information about the simpletrace module in the doc comments, so the simplest way to get started is:
$ cd qemu
$ PYTHONPATH=scripts python
>>> import simpletrace
>>> help(simpletrace)

Another example of how to use the simpletrace module is the trace file pretty-printer which is included as part of the scripts/simpletrace.py source code itself!

Feel free to leave questions or comments about the simple trace backend and the simpletrace Python module.

Wednesday, 9 March 2011

QEMU Internals: Big picture overview

Last week I started the QEMU Internals series to share knowledge of how QEMU works. I dove straight in to the threading model without a high-level overview. I want to go back and provide the big picture so that the details of the threading model can be understood more easily.

The story of a guest


A guest is created by running the qemu program, also known as qemu-kvm or just kvm. On a host that is running 3 virtual machines there are 3 qemu processes:


When a guest shuts down the qemu process exits. Reboot can be performed without restarting the qemu process for convenience although it would be fine to shut down and then start qemu again.

Guest RAM


Guest RAM is simply allocated when qemu starts up. It is possible to pass in file-backed memory with -mem-path such that hugetlbfs can be used. Either way, the RAM is mapped in to the qemu process' address space and acts as the "physical" memory as seen by the guest:


QEMU supports both big-endian and little-endian target architectures so guest memory needs to be accessed with care from QEMU code. Endian conversion is performed by helper functions instead of accessing guest RAM directly. This makes it possible to run a target with a different endianness from the host.

KVM virtualization


KVM is a virtualization feature in the Linux kernel that lets a program like qemu safely execute guest code directly on the host CPU. This is only possible when the target architecture is supported by the host CPU. Today KVM is available on x86, ARMv8, ppc, s390, and MIPS CPUs.

In order to execute guest code using KVM, the qemu process opens /dev/kvm and issues the KVM_RUN ioctl. The KVM kernel module uses hardware virtualization extensions found on modern Intel and AMD CPUs to directly execute guest code. When the guest accesses a hardware device register, halts the guest CPU, or performs other special operations, KVM exits back to qemu. At that point qemu can emulate the desired outcome of the operation or simply wait for the next guest interrupt in the case of a halted guest CPU.

The basic flow of a guest CPU is as follows:
open("/dev/kvm")
ioctl(KVM_CREATE_VM)
ioctl(KVM_CREATE_VCPU)
for (;;) {
     ioctl(KVM_RUN)
     switch (exit_reason) {
     case KVM_EXIT_IO:  /* ... */
     case KVM_EXIT_HLT: /* ... */
     }
}

The host's view of a running guest


The host kernel schedules qemu like a regular process. Multiple guests run alongside each other without knowledge of one another. Applications like Firefox or Apache also compete for the same host resources as qemu although resource controls can be used to isolate and prioritize qemu.

Since qemu system emulation provides a full virtual machine inside the qemu userspace process, the details of what processes are running inside the guest are not directly visible from the host. One way of understanding this is that qemu provides a slab of guest RAM, the ability to execute guest code, and emulated hardware devices; therefore any operating system (or no operating system at all) can run inside the guest. There is no ability for the host to peek inside an arbitrary guest.

Guests have a so-called vcpu thread per virtual CPU. A dedicated iothread runs a select(2) event loop to process I/O such as network packets and disk I/O completion. For more details and possible alternate configuration, see the threading model post.

The following diagram illustrates the qemu process as seen from the host:




Further information


Hopefully this gives you an overview of QEMU and KVM architecture. Feel free to leave questions in the comments and check out other QEMU Internals posts for details on these aspects of QEMU.

Here are two presentations on KVM architecture that cover similar areas if you are interested in reading more:

Tuesday, 8 March 2011

How to automatically run checkpatch.pl when developing QEMU

The checkpatch.pl script was recently added to qemu.git as a way to scan patches for coding standard violations. You can automatically run checkpatch.pl when committing changes to git and abort if there are violations:

$ cd qemu
$ cat >.git/hooks/pre-commit
#!/bin/bash
exec git diff --cached | scripts/checkpatch.pl --no-signoff -q -
^D
$ chmod 755 .git/hooks/pre-commit

Any commit that violates the coding standard as checked by checkpatch.pl will be aborted. I am running with this git hook now and will post any tweaks I make to it.

Update: If you encounter a false positive because checkpatch.pl is complaining about code you didn't touch, use git commit --no-verify to override the pre-commit hook. Use this trick sparingly :-).

Saturday, 5 March 2011

QEMU Internals: Overall architecture and threading model

This is the first post in a series on QEMU Internals aimed at developers. It is designed to share knowledge of how QEMU works and make it easier for new contributors to learn about the QEMU codebase.

Running a guest involves executing guest code, handling timers, processing I/O, and responding to monitor commands. Doing all these things at once requires an architecture capable of mediating resources in a safe way without pausing guest execution if a disk I/O or monitor command takes a long time to complete. There are two popular architectures for programs that need to respond to events from multiple sources:
  1. Parallel architecture splits work into processes or threads that can execute simultaneously. I will call this threaded architecture.
  2. Event-driven architecture reacts to events by running a main loop that dispatches to event handlers. This is commonly implemented using the select(2) or poll(2) family of system calls to wait on multiple file descriptors.

QEMU actually uses a hybrid architecture that combines event-driven programming with threads. It makes sense to do this because an event loop cannot take advantage of multiple cores since it only has a single thread of execution. In addition, sometimes it is simpler to write a dedicated thread to offload one specific task rather than integrate it into an event-driven architecture. Nevertheless, the core of QEMU is event-driven and most code executes in that environment.

The event-driven core of QEMU


An event-driven architecture is centered around the event loop which dispatches events to handler functions. QEMU's main event loop is main_loop_wait() and it performs the following tasks:

  1. Waits for file descriptors to become readable or writable. File descriptors play a critical role because files, sockets, pipes, and various other resources are all file descriptors. File descriptors can be added using qemu_set_fd_handler().
  2. Runs expired timers. Timers can be added using qemu_mod_timer().
  3. Runs bottom-halves (BHs), which are like timers that expire immediately. BHs are used to avoid reentrancy and overflowing the call stack. BHs can be added using qemu_bh_schedule().
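The three tasks above can be sketched as a toy event loop in Python. This is only an illustration of the idea and not how main_loop_wait() is written: the EventLoop class and its method names are made up, and only readable file descriptors are handled.

```python
import heapq
import os
import select
import time


class EventLoop(object):
    """Toy event loop with fd handlers, timers, and bottom-halves."""

    def __init__(self):
        self.fd_handlers = {}  # fd -> callback invoked when fd is readable
        self.timers = []       # heap of (expiry time, sequence, callback)
        self.bhs = []          # bottom-halves to run on this iteration
        self.seq = 0           # tie-breaker so callbacks are never compared

    def set_fd_handler(self, fd, callback):
        if callback is None:
            self.fd_handlers.pop(fd, None)
        else:
            self.fd_handlers[fd] = callback

    def add_timer(self, delay, callback):
        self.seq += 1
        heapq.heappush(self.timers, (time.time() + delay, self.seq, callback))

    def schedule_bh(self, callback):
        self.bhs.append(callback)

    def run_once(self):
        # Do not sleep if a BH is pending, otherwise sleep until the next timer.
        if self.bhs:
            timeout = 0
        elif self.timers:
            timeout = max(0, self.timers[0][0] - time.time())
        else:
            timeout = None
        # 1. Wait for file descriptors to become readable.
        readable, _, _ = select.select(list(self.fd_handlers), [], [], timeout)
        for fd in readable:
            handler = self.fd_handlers.get(fd)
            if handler is not None:
                handler(fd)
        # 2. Run expired timers.
        now = time.time()
        while self.timers and self.timers[0][0] <= now:
            _, _, callback = heapq.heappop(self.timers)
            callback()
        # 3. Run bottom-halves that were scheduled before this point.
        bhs, self.bhs = self.bhs, []
        for callback in bhs:
            callback()


log = []
loop = EventLoop()
rfd, wfd = os.pipe()


def on_readable(fd):
    log.append('fd: %s' % os.read(fd, 16).decode())
    loop.set_fd_handler(fd, None)


loop.set_fd_handler(rfd, on_readable)
loop.add_timer(0.01, lambda: log.append('timer'))
loop.schedule_bh(lambda: log.append('bh'))
os.write(wfd, b'hello')

while len(log) < 3:
    loop.run_once()
os.close(rfd)
os.close(wfd)
print(log)
```

All callbacks run one after another in the same thread, so none of them needs locking.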

When a file descriptor becomes ready, a timer expires, or a BH is scheduled, the event loop invokes a callback that responds to the event. Callbacks have two simple rules about their environment:
  1. No other core code is executing at the same time so synchronization is not necessary. Callbacks execute sequentially and atomically with respect to other core code. There is only one thread of control executing core code at any given time.
  2. No blocking system calls or long-running computations should be performed. Since the event loop waits for the callback to return before continuing with other events, it is important to avoid spending an unbounded amount of time in a callback. Breaking this rule causes the guest to pause and the monitor to become unresponsive.

This second rule is sometimes hard to honor and there is code in QEMU which blocks. In fact there is even a nested event loop in qemu_aio_wait() that waits on a subset of the events that the top-level event loop handles. Hopefully these violations will be removed in the future by restructuring the code. New code almost never has a legitimate reason to block and one solution is to use dedicated worker threads to offload long-running or blocking code.

Offloading specific tasks to worker threads


Although many I/O operations can be performed in a non-blocking fashion, there are system calls which have no non-blocking equivalent. Furthermore, sometimes long-running computations simply hog the CPU and are difficult to break up into callbacks. In these cases dedicated worker threads can be used to carefully move these tasks out of core QEMU.

One example user of worker threads is posix-aio-compat.c, an asynchronous file I/O implementation. When core QEMU issues an aio request it is placed on a queue. Worker threads take requests off the queue and execute them outside of core QEMU. They may perform blocking operations since they execute in their own threads and do not block the rest of QEMU. The implementation takes care to perform necessary synchronization and communication between worker threads and core QEMU.

Another example is ui/vnc-jobs-async.c which performs compute-intensive image compression and encoding in worker threads.

Since the majority of core QEMU code is not thread-safe, worker threads cannot call into core QEMU code. Simple utilities like qemu_malloc() are thread-safe but that is the exception rather than the rule. This poses a problem for communicating worker thread events back to core QEMU.

When a worker thread needs to notify core QEMU, a pipe or a qemu_eventfd() file descriptor is added to the event loop. The worker thread can write to the file descriptor and the callback will be invoked by the event loop when the file descriptor becomes readable. In addition, a signal must be used to ensure that the event loop is able to run under all circumstances. This approach is used by posix-aio-compat.c and makes more sense (especially the use of signals) after understanding how guest code is executed.
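Here is a minimal Python sketch of the pipe notification pattern. It is not taken from posix-aio-compat.c and the names are made up; it also leaves out the signal part, which only matters once guest code execution enters the picture.

```python
import os
import select
import threading

results = []                     # completed work, shared between threads
results_lock = threading.Lock()  # protects the results list
notify_r, notify_w = os.pipe()   # worker writes, event loop reads


def worker(n):
    # Potentially slow work happens outside the event loop thread.
    total = sum(i * i for i in range(n))
    with results_lock:
        results.append(total)
    # Wake up the event loop by making the pipe readable.
    os.write(notify_w, b'x')


def on_notify(fd):
    # Runs in the event loop thread once the pipe becomes readable.
    os.read(fd, 1)
    with results_lock:
        return results.pop(0)


thread = threading.Thread(target=worker, args=(1000,))
thread.start()

# Stand-in for the event loop: wait until the notification fd is readable.
readable, _, _ = select.select([notify_r], [], [])
completed = [on_notify(fd) for fd in readable]
thread.join()
os.close(notify_r)
os.close(notify_w)
print(completed)
```

The worker never calls into the event loop's code directly. It only touches the lock-protected list and the pipe, and the callback runs in the event loop thread.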

Executing guest code


So far we have mainly looked at the event loop and its central role in QEMU. Equally important is the ability to execute guest code, without which QEMU could respond to events but would not be very useful.

There are two mechanisms for executing guest code: Tiny Code Generator (TCG) and KVM. TCG emulates the guest using dynamic binary translation, also known as Just-in-Time (JIT) compilation. KVM takes advantage of hardware virtualization extensions present in modern Intel and AMD CPUs for safely executing guest code directly on the host CPU. For the purposes of this post the actual techniques do not matter but what matters is that both TCG and KVM allow us to jump into guest code and execute it.

Jumping into guest code takes away our control of execution and gives control to the guest. While a thread is running guest code it cannot simultaneously be in the event loop because the guest has (safe) control of the CPU. Typically the amount of time spent in guest code is limited because reads and writes to emulated device registers and other exceptions cause us to leave the guest and give control back to QEMU. In extreme cases a guest can spend an unbounded amount of time without giving up control and this would make QEMU unresponsive.

In order to solve the problem of guest code hogging QEMU's thread of control, signals are used to break out of the guest. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler function. This allows QEMU to take steps to leave guest code and return to its main loop where the event loop can get a chance to process pending events.

The upshot of this is that new events may not be detected immediately if QEMU is currently in guest code. Most of the time QEMU eventually gets around to processing events but this additional latency is a performance problem in itself. For this reason timers, I/O completion, and notifications from worker threads to core QEMU use signals to ensure that the event loop will be run immediately.

You might be wondering what the overall picture between the event loop and an SMP guest with multiple vcpus looks like. Now that the threading model and guest code have been covered we can discuss the overall architecture.

iothread and non-iothread architecture


The traditional architecture is a single QEMU thread that executes guest code and the event loop. This model is also known as non-iothread or !CONFIG_IOTHREAD and is the default when QEMU is built with ./configure && make. The QEMU thread executes guest code until an exception or signal yields back control. Then it runs one iteration of the event loop without blocking in select(2). Afterwards it dives back into guest code and repeats until QEMU is shut down.

If the guest is started with multiple vcpus using -smp 2, for example, no additional QEMU threads will be created. Instead the single QEMU thread multiplexes between two vcpus executing guest code and the event loop. Therefore non-iothread fails to exploit multicore hosts and can result in poor performance for SMP guests.

Note that despite there being only one QEMU thread there may be zero or more worker threads. These threads may be temporary or permanent. Remember that they perform specialized tasks and do not execute guest code or process events. I wanted to emphasise this because it is easy to be confused by worker threads when monitoring the host and interpret them as vcpu threads. Remember that non-iothread only ever has one QEMU thread.

The newer architecture is one QEMU thread per vcpu plus a dedicated event loop thread. This model is known as iothread or CONFIG_IOTHREAD and can be enabled with ./configure --enable-io-thread at build time. Each vcpu thread can execute guest code in parallel, offering true SMP support, while the iothread runs the event loop. The rule that core QEMU code never runs simultaneously is maintained through a global mutex that synchronizes core QEMU code across the vcpus and iothread. Most of the time vcpus will be executing guest code and do not need to hold the global mutex. Most of the time the iothread is blocked in select(2) and does not need to hold the global mutex.
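The locking discipline can be modelled with a few Python threads. This is a toy model under my own assumptions, not QEMU code: the "guest code" is a dummy computation, the thread functions and the device_state dictionary are made up, and the point is only that shared state is touched exclusively while holding the global mutex.

```python
import threading

global_mutex = threading.Lock()
device_state = {'mmio_writes': 0, 'events': 0}  # stands in for core QEMU state


def vcpu_thread(n_exits):
    for _ in range(n_exits):
        # "Guest code" runs here without holding the global mutex.
        sum(range(100))
        # A device register access exits the guest; the emulation code
        # is core code and therefore runs under the global mutex.
        with global_mutex:
            device_state['mmio_writes'] += 1


def iothread(n_events):
    for _ in range(n_events):
        # A real iothread would block in select(2) here, unlocked.
        with global_mutex:
            device_state['events'] += 1


threads = [threading.Thread(target=vcpu_thread, args=(1000,))
           for _ in range(2)]
threads.append(threading.Thread(target=iothread, args=(500,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(device_state)
```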

Note that TCG is not thread-safe so even under the iothread model it multiplexes vcpus across a single QEMU thread. Only KVM can take advantage of per-vcpu threads.

Conclusion and words about the future

Hopefully this helps communicate the overall architecture of QEMU (which KVM inherits). Feel free to leave questions in the comments below.

In the future the details are likely to change and I hope we will see a move to CONFIG_IOTHREAD by default and maybe even a removal of !CONFIG_IOTHREAD.

I will try to update this post as qemu.git changes.

Thursday, 3 March 2011

Should I use QEMU or KVM?

UPDATE: The qemu-kvm.git fork has been merged back into qemu.git as of QEMU 1.3.0. Always use qemu.git for the latest code. See my full post here.

"What is the difference between QEMU and KVM?" comes up regularly because these two pieces of software share a close relationship. I am going to explain how to choose between the two and the nature of their relationship.

Should I install the qemu or qemu-kvm package?


If you want to run x86 virtual machines on x86 physical machines, install qemu-kvm. It has the fastest and most thoroughly tested support for the common x86 virtualization use case.

If you want to run anything else, install qemu. That includes running non-x86 machines and user level emulation instead of full-system emulation.

If you are still not sure which is right for you, take a look at the QEMU and KVM websites.

How do I check that qemu-kvm is using hardware support?


If qemu-kvm is unable to use hardware virtualization extensions it will fall back to emulation which is much slower. If you are worried this might be the case, run the following check:

grep 'svm\|vmx' /proc/cpuinfo

If you get output then the CPU supports virtualization extensions and KVM should work. Otherwise check that virtualization is enabled in your BIOS.
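The same check can be scripted. This Python sketch looks for the vmx (Intel) or svm (AMD) flag in the text of /proc/cpuinfo. The virt_extension() function name and the sample text are made up for illustration, and unlike the grep above it matches whole flag names rather than substrings.

```python
def virt_extension(cpuinfo_text):
    """Return 'vmx', 'svm', or None based on the flags lines of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith('flags') and ':' in line:
            flags = line.split(':', 1)[1].split()
            for ext in ('vmx', 'svm'):
                if ext in flags:
                    return ext
    return None


# Abbreviated, made-up sample; on a real host read open('/proc/cpuinfo').read().
sample = """\
processor\t: 0
model name\t: Example CPU
flags\t\t: fpu vme de pse tsc msr pae mce vmx sse sse2
"""

detected = virt_extension(sample)
print('virtualization extension: %s' % detected)
```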

See the processor support KVM wiki page for more information.

It is also a good idea to use the -enable-kvm command-line option to ensure that KVM is used. The libvirt, virt-manager, virsh stack will do this by default.

What is the difference between qemu.git and qemu-kvm.git?


The QEMU codebase is known as qemu.git. That's the git repository that holds the QEMU source code history. The KVM codebase is known as qemu-kvm.git, the git repository that holds the KVM source code history.

The relationship between qemu.git and qemu-kvm.git is as follows. qemu-kvm.git is a fork of qemu.git and periodically merges updates from qemu.git back into qemu-kvm.git. A lot of code changes are merged into qemu.git and become available in qemu-kvm.git after the next periodic merge. KVM-specific enhancements may be merged into qemu-kvm.git and may be sent back upstream to qemu.git.

Efforts are underway to completely merge qemu-kvm.git into qemu.git. This will make qemu-kvm.git obsolete and result in a single codebase. In the future there may only be a qemu package.

Tuesday, 1 March 2011

Advice for students applying to Google Summer of Code

For the past several years Google has run a program called Summer of Code (GSoC) that funds university students to work on open source projects during the summer. A large number of leading open source projects participate and provide mentorship to students.

Google have announced that organizations can start applying for GSoC 2011. Students will be able to apply for projects once the accepted organizations have been announced. See the timeline for details.

In 2008 I participated as a student and worked on remote GDB debugging for the gPXE network bootloader. I had a great time and stuck around after GSoC ended, continuing to contribute to Etherboot.org. In the following two years of GSoC I participated again, this time as a mentor.

I've seen GSoC from both sides and here is my advice for students who want to apply.

Is it worth doing?


Yes, definitely. GSoC gives you privileged access to an open source community and a mentor who has committed to supporting you. If you have ever been interested in contributing to an open source project then this is the chance!

Choosing a project


In past years there have been many participating organizations to choose from. You can either apply for a listed project idea or suggest your own project idea. If you have your own project idea then make sure to get in touch with the organization well before the student application deadline in order to pitch your idea to them and get their support.

Choose a project idea that you are comfortable with, both in terms of the amount of effort it will take and your current level of skills. You can learn a lot of new things during the summer but make sure you can deliver on what you are promising. The good news is that there are so many project ideas to choose from that you should be able to find something that matches your interests and skills.

The other common factor is the amount of time you will be expected to spend on your project. Many students work full-time from Monday to Friday much like a regular job. Sometimes organizations are happy to accept talented students who can deliver their project with less time commitment. You can probably take some vacation days off but make sure to state your availability upfront when applying.

It's worth keeping in mind that GSoC is quite decentralized. Individual organizations have a large degree of control over how they run their projects. No two organizations work the same so look at their previous years' wiki pages and project archives or ask the mentors to understand how they operate.

I suggest applying to two or three organizations. Since organizations are free to approach GSoC quite differently, it's worth diversifying your applications so you can choose the organization and mentor you are most comfortable with in the end. Remember that just because a piece of software is cool does not mean that its GSoC program or community is the right place for you, so look at multiple organizations. Also keep in mind that organizations have limited "slots", or numbers of students that Google will fund, so if you are unlucky an organization may run out of slots and be unable to take you even if they are interested. For these reasons it makes sense to apply to two or three organizations.

Making your application


The bare minimum student application involves filling out a project proposal form. To increase your chance of getting accepted you need to consider how organizations select students.

Organizations will receive more student applications than they have slots. In the past I've seen 5:1 to 10:1 ratios so understand that GSoC is competitive. There are three levels of student applications:
  1. Students who either do not have the skills or did not put in enough time to prepare a decent project proposal. These are easy to spot for the organization and they are not your competition. They do serve as a reminder to double-check the timeline and make sure you fill in appropriate information. If you have questions just ask the organization you are applying to; they'll be glad to help interested students.
  2. Students who might be good candidates but do not stand out from the competition. The majority of applications will be from students who probably have the skills to tackle the project but whose attitude, personality, and enthusiasm are unknown. These students fail to communicate their abilities and vision clearly enough to stand out. Your main goal when applying should be not to fall into this group.
  3. Students who have shown enthusiasm, ability to communicate, and clearly have the skills not just to complete their project but also to contribute to the community. If you can stand out like this then you're likely to get picked. These students will contact the mentor ahead of time and discuss the project idea. This will arm them with the information to put together a good proposal. They will join mailing lists, forums, or IRC channels to learn about the community. They will even contribute patches before being accepted for GSoC - this is a key action you can take to improve your chances.

Another way of explaining this is that a large number of students will apply. Many of them will be in the ballpark and could potentially complete the project. But due to the high ratio of applications to slots, only the best will get chosen.

Perfectly good students will not get a slot so aim to be that top level of student who is interested not just in doing a project over the summer but in diving into the community, helping users, contributing patches, and fixing bugs. If you do that then your chances of being accepted are good.

Final thoughts


I hope this helps you prepare for Google Summer of Code. For more information check out the official FAQ.

This year I'm excited about helping QEMU. A project ideas page has been published, so check that out if you are interested in emulation or virtualization.

Have GSoC questions or advice to share? Post a comment!

Friday, 25 February 2011

How to access virtual machine image files from the host

I am going to explain how to mount or access virtual machine disk images from the host using qemu-nbd.

Often you want to access an image file from the host to:
  • Copy files in before starting a new virtual machine.
  • Customize configuration like setting a hostname or networking details.
  • Troubleshoot a virtual machine that fails to boot.
  • Retrieve files after you've decided to stop using a virtual machine.

There is actually a toolkit for accessing image files called libguestfs. Take a look at that first but what follows is the poor man's version using tools that come with QEMU.

Required packages


The required programs are the qemu-nbd tool and (optionally) kpartx for detecting partitions.

On Debian-based distros the packages are called qemu-utils and kpartx.

On RHEL 6 nbd support is not available out-of-the-box but you can build from source if you wish.

Please leave package names for other distros in the comments!

Remember to back up important data


Consider making a backup copy of the image file before trying this out. Especially if you don't work with disk images often it can be easy to lose data with a wrong command.

Attaching an image file


The goal is to make an image file appear on the host as a block device so it can be mounted or accessed with tools like fdisk or fsck. The image file can be in any format that QEMU supports including raw, qcow2, qed, vdi, vmdk, vpc, and others.

1. Ensure the nbd driver is loaded


The Network Block Device driver in Linux needs to be loaded:

sudo modprobe nbd

The qemu-nbd tool will use the nbd driver to create block devices and perform I/O.

2. Connect qemu-nbd


Before you do this, make sure the virtual machine is not running! It is generally not safe to access file systems from two machines at once and this applies for virtual machines and the host.

There should be many /dev/nbdX devices available now and you can pick an unused one as the block device through which to access the image:

sudo qemu-nbd -c /dev/nbd0 path/to/image/file

Don't be surprised that there is no output from this command. On success the qemu-nbd tool exits and leaves a daemon running in the background to perform I/O. You can now access /dev/nbd0 or whichever nbd device you picked like a regular block device using mount, fdisk, fsck, and other tools.
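Picking an unused device by hand can be error-prone, so here is a minimal sketch of a helper that does it. It assumes that the Linux nbd driver exposes a pid file under /sys/block/nbdX while a device is connected. The function name find_free_nbd and its optional directory argument are my own inventions for illustration.

```shell
#!/bin/sh
# find_free_nbd [SYSFS_BLOCK_DIR]
# Print the first nbd device (in shell glob order) that has no client attached.
# Assumption: a connected device has a "pid" file in its sysfs directory.
# The optional argument is a hypothetical override that makes testing possible.
find_free_nbd() {
    sysfs_block="${1:-/sys/block}"
    for dir in "$sysfs_block"/nbd*; do
        # Skip the unexpanded glob when no nbd devices exist
        [ -d "$dir" ] || continue
        if [ ! -e "$dir/pid" ]; then
            echo "/dev/$(basename "$dir")"
            return 0
        fi
    done
    # Nothing free, or the nbd driver is not loaded
    return 1
}
```

It could then be used as sudo qemu-nbd -c "$(find_free_nbd)" path/to/image/file. There is a small window between the check and the connect, so treat it as a convenience rather than a guarantee.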

3. (Optionally) detect partitions


The kpartx utility automatically sets up partitions for the disk image:

sudo kpartx -a /dev/nbd0

Because kpartx creates device-mapper entries, the partitions typically show up as /dev/mapper/nbd0p1, /dev/mapper/nbd0p2, and so on.

Detaching an image file


When all block devices and partitions are no longer mounted or in use you can clean up as follows.

1. (Optionally) forget partitions


sudo kpartx -d /dev/nbd0

2. Disconnect qemu-nbd


sudo qemu-nbd -d /dev/nbd0

3. Remove the nbd driver


Once there are no more attached nbd devices you may wish to unload the nbd driver:

sudo rmmod nbd

More features: read-only, throwaway snapshots, and friends


The qemu-nbd tool has more features that are worth looking at:
  • Ensuring read-only access using the --read-only option.
  • Allowing write access but not saving changes to the image file using the --snapshot option. You can think of this as throwaway snapshots.
  • Exporting the image file over the network using the --bind and --port options. Drop the -c option because no local nbd device is used in this case. Only do this on secure private networks because there is no access control.
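Putting the attach, read-only, and detach steps together, a whole inspection session might be sketched like this. It assumes the image holds a filesystem directly with no partition table; for partitioned images you would add the kpartx steps and mount a partition device instead. The run and inspect_image helpers and the DRY_RUN variable are hypothetical names, not part of QEMU.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# inspect_image IMAGE NBD_DEVICE MOUNTPOINT
# Attach IMAGE read-only, mount it, list the top-level directory,
# then unmount and disconnect again.
inspect_image() {
    image=$1 dev=$2 mnt=$3
    run sudo modprobe nbd || return 1
    run sudo qemu-nbd --read-only -c "$dev" "$image" || return 1
    if run sudo mount -o ro "$dev" "$mnt"; then
        run ls -l "$mnt"
        run sudo umount "$mnt"
    fi
    # Always disconnect, even when the mount failed
    run sudo qemu-nbd -d "$dev"
}
```

Running DRY_RUN=1 inspect_image disk.img /dev/nbd0 /mnt prints the commands without touching anything, which is a cheap way to double-check before doing it for real.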

Hopefully this has helped you quickstart qemu-nbd for accessing image files from the host. Feel free to leave questions in the comments below.

Wednesday, 23 February 2011

Observability using QEMU tracing

I am going to describe the tracing feature in QEMU and KVM.

Overview of QEMU tracing


Tracing is available for the first time in QEMU 0.14.0 and qemu-kvm 0.14.0. It's an optional feature and may not be enabled in distro packages yet, but it's there if you are willing to build from source.

QEMU tracing is geared towards answering questions about running virtual machines:
  • What I/O accesses are being made to emulated devices?
  • How long are disk writes taking to complete inside QEMU?
  • Is QEMU leaking memory or other resources by not freeing them?
  • Are network packets being received but filtered at the QEMU level?

In order to find answers to these questions we place trace events into the QEMU source code at strategic points. For example, every qemu_malloc() and qemu_free() call can be traced so we know what heap memory allocations are going on.
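To give a feel for the kind of analysis this enables, here is a rough sketch that pairs up allocation and free events to spot memory that was never freed. The one-event-per-line text format shown in the comments is a made-up simplification for illustration, not the actual output of any trace backend, and find_unfreed is a hypothetical name.

```shell
#!/bin/sh
# find_unfreed: read trace lines on stdin and print each pointer that was
# allocated but never freed, together with its size field.
# Assumed (hypothetical) input format, one event per line:
#   qemu_malloc size=<bytes> ptr=<address>
#   qemu_free ptr=<address>
find_unfreed() {
    awk '
        # Remember every allocation, keyed by pointer
        $1 == "qemu_malloc" { sub(/^ptr=/, "", $3); live[$3] = $2 }
        # Forget the allocation once it is freed
        $1 == "qemu_free"   { sub(/^ptr=/, "", $2); delete live[$2] }
        # Whatever is left was never freed
        END { for (p in live) print p, live[p] }
    ' | sort
}
```

Feeding it a converted trace, for example find_unfreed <trace.txt, would list the pointers that are still live at the end of the trace. Allocations that are legitimately long-lived show up too, so the output is a starting point for investigation rather than proof of a leak.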

Current status


Today QEMU tracing is useful to developers and anyone troubleshooting or investigating bugs.

The set of trace events that comes with QEMU is limited but already useful for observing the block layer and certain emulated hardware. Developers are adding trace events to new code and converting some existing debug printfs to trace events. I expect the default set of trace events to grow and become more useful in the future.

Trace events are currently not a stable API so scripts that work with one version of QEMU are not guaranteed to work with another version. There is also no documentation on the semantics of particular trace events, so it is necessary to understand the code which contains the trace event to know its meaning. In the future we can make stable trace events with explicit semantics like "packet received from host".

QEMU tracing cross-platform support


You have a choice of trace backends: SystemTap, LTTng Userspace Tracer, and a built-in "simple" tracer are supported. DTrace could be added with little effort on Solaris, Mac OS X, and FreeBSD host platforms.

The available set of trace events is the same no matter which trace backend you choose.

Where to find out more


If you want to get started, check out the documentation that comes as part of QEMU.

Also check out the excellent QEMU 0.14.0 changelog for pointers related to tracing.

I am looking forward to writing more about tracing in the future and sharing trace analysis scripts. In fact, I just submitted a patch to provide a Python API for processing trace files generated by the "simple" trace backend. It makes analyzing trace files quick and fun :).

Monday, 21 February 2011

Near instant kernel development cycle with KVM

I want to share my setup for rapid kernel development using KVM.

A fast development cycle makes a huge difference to productivity. For firmware and kernel development many areas can be efficiently tested inside virtual machines.

Traditionally physical test machines were used but virtualization lets you take the lab with you. This means working offline without giving up on testing.

In the past I used QEMU when working on the gPXE network bootloader. Now I am using KVM to test Linux kernel changes in less than 30 seconds and it's a really pleasant setup.

What can't be tested under KVM?


A lot of code can be tested in a virtual machine but device drivers or hardware-specific code often require physical machines. But with PCI device assignment, or passing physical PCI devices through into the virtual machine, it is becoming possible to test device drivers in a virtual machine too.

Testing kernels without disk images


Most virtual machines are booted from a disk image or an ISO file, but KVM can directly load a Linux kernel into memory skipping the bootloader. This means you don't need an image file containing the kernel and boot files. Instead, you can run a kernel directly like this:

qemu-kvm -kernel arch/x86/boot/bzImage -initrd initramfs.gz -append "console=ttyS0" -nographic

These flags directly load a kernel and initramfs from the host filesystem without the need to generate a disk image or configure a bootloader.

The optional -initrd flag loads an initramfs for the kernel to use as the root filesystem.

The -append flag adds kernel parameters and can be used to enable the serial console.

The -nographic option restricts the virtual machine to just a serial console and therefore keeps all test kernel output in your terminal rather than in a graphical window.
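Not every distro ships a binary called qemu-kvm. As a small sketch, the helper below prefers qemu-kvm and otherwise falls back to the upstream qemu-system-x86_64 binary with the -enable-kvm option. The function name qemu_cmd is hypothetical, and the fallback assumes an x86-64 host.

```shell
#!/bin/sh
# qemu_cmd: print the command prefix to use for a KVM-accelerated guest.
# Prefers qemu-kvm as used in this post; falls back to the upstream binary
# with -enable-kvm when qemu-kvm is not in PATH.
qemu_cmd() {
    if command -v qemu-kvm >/dev/null 2>&1; then
        echo "qemu-kvm"
    elif command -v qemu-system-x86_64 >/dev/null 2>&1; then
        echo "qemu-system-x86_64 -enable-kvm"
    else
        echo "no QEMU binary found in PATH" >&2
        return 1
    fi
}
```

The result is meant to be used unquoted so the shell splits it into words, for example $(qemu_cmd) -kernel arch/x86/boot/bzImage -initrd initramfs.gz -append "console=ttyS0" -nographic.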

Building an initramfs


I don't use a distro initramfs generation utility because I like to control which files get included and the init script. Instead I use the linux-2.6/usr/gen_init_cpio utility to build an initramfs cpio archive from a specification file. A neat feature of gen_init_cpio is that you don't need to be root in order to create device files or set ownership inside the initramfs. The specification file syntax looks like this:

# a comment
file <name> <location> <mode> <uid> <gid> [<hard links>]
dir <name> <mode> <uid> <gid>
nod <name> <mode> <uid> <gid> <dev_type> <maj> <min>
slink <name> <target> <mode> <uid> <gid>
pipe <name> <mode> <uid> <gid>
sock <name> <mode> <uid> <gid>

The kernel will execute the file at /init. I include busybox in the initramfs and have the following script:

#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
mount -t configfs none /sys/kernel/config
mount -t debugfs none /sys/kernel/debug
mount -t tmpfs none /tmp

# Test setup commands here:
insmod /lib/modules/$(uname -r)/kernel/...

exec /bin/sh -i

Instead of building out a full /lib/modules directory tree I just include those kernel module dependencies that I need. This means I use insmod(8) instead of modprobe(8) because I skip generating depmod(8) dependency metadata.
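To make the specification syntax concrete, here is a sketch of a helper that writes a minimal specification file for a busybox-based initramfs like the one described. The function name write_initramfs_spec and the host paths passed to it are placeholders, and the exact set of directories and busybox symlinks is an assumption about what the init script above needs.

```shell
#!/bin/sh
# write_initramfs_spec SPEC BUSYBOX INIT
# Write a minimal gen_init_cpio specification to SPEC.
# BUSYBOX and INIT are host paths to a statically linked busybox binary and
# the init script; both are placeholders to adjust for your own setup.
write_initramfs_spec() {
    spec=$1 busybox=$2 init=$3
    cat >"$spec" <<EOF
# Directories used as mount points by the init script
dir /proc 755 0 0
dir /sys 755 0 0
dir /tmp 1777 0 0
dir /dev 755 0 0
dir /bin 755 0 0
# Console device so the shell has a terminal
nod /dev/console 600 0 0 c 5 1
# busybox provides the shell and basic utilities via symlinks
file /bin/busybox $busybox 755 0 0
slink /bin/sh busybox 777 0 0
slink /bin/mount busybox 777 0 0
slink /bin/insmod busybox 777 0 0
slink /bin/uname busybox 777 0 0
# The kernel executes /init
file /init $init 755 0 0
EOF
}
```

For example, write_initramfs_spec initramfs /path/to/busybox init.sh produces a file that can be passed to gen_init_cpio. Kernel modules under test would be added with further file lines.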

Tying it all together


Here are the steps I take to build and test a kernel:

cd linux-2.6
[...make some changes...]
make
usr/gen_init_cpio initramfs | gzip >initramfs.gz
qemu-kvm -kernel arch/x86/boot/bzImage -initrd initramfs.gz -append "console=ttyS0" -nographic

It takes about 28 seconds to reach the shell prompt inside the virtual machine with ccache and a hot page cache on this laptop. This keeps development fun :)!