Thursday, December 19, 2013

Distribute and provision disk images with virt-builder

I recently learnt about the virt-builder tool that was added in libguestfs 1.24. This is a really significant addition that makes publishing and using template disk images safe, quick, and efficient.

The best way to understand virt-builder is by looking at typical use cases.

Quick disk image creation from template images

For casual users there is a public repository of CentOS, Debian, Fedora, Scientific Linux, and Ubuntu releases. Now you can create a Debian disk image with a single command. By the way, you don't need to be root:

$ virt-builder debian-7
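
By default the image should land in debian-7.img in the current directory (use -o to choose a different output file). A quick way to sanity-check the result is to boot it directly with QEMU; the memory size here is just an example:

$ qemu-system-x86_64 -enable-kvm -m 1024 \
                     -drive file=debian-7.img,format=raw,if=virtio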

Customization and configuration management

Whammo, you have a Debian 7 disk image. But folks who wish to customize the default image can use command-line options to add & delete files, create directories, install packages, and set up firstboot scripts. This makes virt-builder a great tool for bootstrapping your Puppet/Chef configuration management.
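
As a sketch of what that can look like (the hostname, package name, and file paths below are placeholders, and the exact set of options depends on your virt-builder version), a customized build might be:

$ virt-builder debian-7 \
      --hostname web1.example.com \
      --install puppet \
      --mkdir /var/lib/myapp \
      --upload site.pp:/etc/puppet/manifests/site.pp \
      --firstboot-command 'puppet agent --enable'

The firstboot command runs once inside the guest on its first boot, which makes it a convenient hook for handing control over to Puppet or Chef.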

Publishing template images

Heavy-duty users will publish their own disk images, perhaps a library of images available to hosting customers or a private cloud environment. virt-builder is also a handy way for development teams to share template images. All of this is possible using the cryptographically signed "index file" that catalogues the template images (a rough sketch of an index entry appears after the listing below). Users can list and inspect images like this:

$ virt-builder --list
centos-6                 CentOS 6.5
debian-6                 Debian 6 (Squeeze)
debian-7                 Debian 7 (Wheezy)
fedora-18                Fedora® 18
fedora-19                Fedora® 19
fedora-20                Fedora® 20
scientificlinux-6        Scientific Linux 6.4
ubuntu-10.04             Ubuntu 10.04 (Lucid)
ubuntu-12.04             Ubuntu 12.04 (Precise)
ubuntu-13.10             Ubuntu 13.10 (Saucy)
$ virt-builder --notes fedora-20
Fedora 20.

This Fedora image contains only unmodified @Core group packages.

It is thus very minimal.  The kickstart and install script can be
found in the libguestfs source tree:

builder/website/fedora.sh

Fedora and the Infinity design logo are trademarks of Red Hat, Inc.
Source and further information is available from http://fedoraproject.org/
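
If you want to publish your own template images, the index file is essentially a catalogue of entries like the one sketched below, clearsigned with GPG so that users can verify it. This is only an illustration: the field names follow the virt-builder index format as I understand it and the values are made up, so check the virt-builder documentation for the authoritative list.

[mytemplate-1.0]
name=My Template 1.0
file=mytemplate-1.0.xz
format=raw
size=6442450944
compressed_size=180000000
expand=/dev/sda1
notes=Minimal internal template image.

$ gpg --clearsign --armor index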

Conclusion

virt-builder is a much-needed tool for consuming and publishing template VM images for KVM. It automates a lot of the low-level commands normally used to deploy template images. Vagrant and Docker don't need to worry just yet, but I think virt-builder is enough to satisfy anyone who is already working with virt-manager, virsh, and friends.

By the way, virt-builder is included in the libguestfs-tools Fedora package.

Saturday, October 12, 2013

Google Summer of Code 2013 has finished!

Google funded 9 students to contribute to QEMU, KVM, and libvirt during the summer of 2013. We had a successful Google Summer of Code that has now come to a close.

Osier Yang (mentor), Michael Roth (mentor), and I wrote a blog post that highlights two projects from this summer:

http://google-opensource.blogspot.de/2013/10/google-summer-of-code-veteran-orgs-qemu.html

Gabriel Kerneis (mentor), Charlie Shepherd (student), and I also collaborated on a paper that describes the QEMU/CPC project that we had this summer. The paper is titled "QEMU/CPC: static analysis and CPS conversion for safe, portable, and efficient coroutines" and is available at http://gabriel.kerneis.info/research/files/qemu-cpc.pdf. There is also a mailing list discussion here.

Monday, June 24, 2013

virtio standardization has begun

The virtio paravirtualized I/O interfaces have been widely used in Linux and QEMU. Rusty Russell maintained a specification that the community organized around, and he has now kicked off standardization through the OASIS standards body.

Follow virtio specification activity and participate on the VIRTIO Technical Committee page.

Today virtio devices include block (disk), SCSI, net (NIC), rng (random number generator), serial, and 9P (host<->guest filesystem). These devices can operate over PCI (used by x86 KVM), MMIO (used for ARM), and other transports.
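
As an illustration (an x86 command-line fragment; guest.img is a placeholder path), several of these devices can be attached over the PCI transport like this:

$ qemu-system-x86_64 -enable-kvm -m 1024 \
      -netdev user,id=net0 \
      -device virtio-net-pci,netdev=net0 \
      -device virtio-rng-pci \
      -drive if=none,id=drive0,file=guest.img,format=raw \
      -device virtio-blk-pci,drive=drive0

On ARM the same device models are typically exposed through the MMIO transport (virtio-mmio) instead of a PCI slot.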

I'm participating in the VIRTIO TC and hope this new level of virtio activity leads to even wider adoption of open source virtualized I/O devices.

Monday, May 6, 2013

Pictures from CERN

I went on a tour of CERN, the European nuclear research center that is home of the Large Hadron Collider (LHC). The facilities are split over multiple sites because the LHC is 27 kilometers long and 100 meters underground. I had a chance to see some of the smaller particle accelerators as well as the CMS experimental site in the LHC. Here are the best pictures from the tour.

LHC is currently offline for hardware upgrades

There are screens around the campus, even in the cafeteria, showing particle accelerator activity. Currently the LHC is offline due to hardware upgrades and will be coming back around 2014-2015 with higher particle energy. It's actually a good time to visit since it's possible to see experiment sites that would be inaccessible during operation.

CERN houses particle accelerators of different sizes

CERN is home to several different particle accelerators. Some are linear accelerators while others are ring-shaped to allow the particles to loop continuously (like the LHC). Particle beams are built up in "packets" or bursts; the low-energy accelerators may be used to spin them up before injecting them into larger accelerators.

Linac 2 and LEIR: Smaller particle accelerators

Linac 2 and the Low Energy Ion Ring (LEIR) are smaller particle accelerators. Their length is in the tens of meters, which means you can see the whole thing. The principle seems to be similar to that of the big accelerators: force particles to collide by sending them through a vacuum guided by magnetic fields. The point of collision is equipped with sensors that measure the particles produced by the collision.

Inside the LHC ring pipe

The LHC operates at a much larger scale than Linac 2 or LEIR, so it has unique tricks up its sleeve. The electromagnets used to keep the particle beam on its course require so much energy that superconductivity is used to eliminate resistance. This means the pipe is cooled close to absolute zero and has insulation and a vacuum to shield it from the external environment.

The collisions are produced by accelerating two particle beams in opposite directions - clockwise and counterclockwise. You can see the two particle beam pipes in the picture above. The beams are kept separate for most of the ring; only at the experiment sites, which contain the detectors, do the beams cross to create collisions.

The CMS experiment site

The Compact Muon Solenoid is one of the experiment sites on the LHC ring. It has a huge chamber filled with sensors that measure particle collisions. It is 15 meters in diameter and hard to get a picture of due to its size. There is also some datacenter space above with machines that process the data generated by the experiments.

Tux makes an appearance

The experiments produce a huge amount of data - only a tiny fraction of the collisions produce interesting events, like a Higgs boson. The incoming data is processed and discarded unless an interesting event is detected. This is called "triggering" and avoids storing huge amounts of unnecessary data. When walking past the racks I saw custom hardware, circuit boards dangling from machines here and there, which is probably used for filtering or classifying the data.

Finally, I spotted the Linux mascot, Tux, on a monitor. Nice to see :-).

Saturday, April 13, 2013

QEMU.org is in Google Summer of Code 2013!

As recently announced on Google+, QEMU.org has been accepted to Google Summer of Code 2013.

We have an exciting list of project ideas for QEMU, libvirt, and the KVM kernel module. Students should choose a project idea and contact the mentor to discuss the requirements. The easiest way to get in touch is via the #qemu-gsoc IRC channel on irc.oftc.net.

Student applications formally open on April 22 but it's best to get in touch with the mentor now. See the timeline for details.

I've shared my advice on applying to Summer of Code on this blog. Check it out if you're looking for a guide to a successful application from someone who has been both a student and a mentor.

Tuesday, April 9, 2013

QEMU Code Overview slides available

I recently gave a high-level overview of QEMU aimed at new contributors and people working with the codebase. The slides are now available here:

QEMU Code Overview (pdf)

Topics covered include:

  • External interfaces (command-line, QMP monitor, HMP monitor, UI, logging)
  • Architecture (process model, main loop, threads)
  • Device emulation (KVM accelerator, guest/host device split, hardware emulation)
  • Development (build process, contributing)

It is a short presentation and stays at a high level, but it can be useful for getting your bearings before digging into QEMU source code, debugging, or performance analysis.

Enjoy!

Wednesday, March 13, 2013

New in QEMU 1.4: high performance virtio-blk data plane implementation

QEMU 1.4 includes an experimental feature for improved high IOPS disk I/O scalability called virtio-blk data plane. It extends QEMU to perform disk I/O in a dedicated thread that is optimized for scalability with high IOPS devices and many disks. IBM and Red Hat have published a whitepaper presenting the highest IOPS achieved to date under virtualization using virtio-blk data plane:

KVM Virtualized I/O Performance [PDF]

Update

Much of this post is now obsolete! The virtio-blk dataplane feature was integrated with QEMU's block layer (live migration and block layer features are now supported), virtio-scsi dataplane support was added, and libvirt XML syntax was added.

If you have a RHEL 7.2 or later host, please use the following:

QEMU syntax:

$ qemu-system-x86_64 -object iothread,id=iothread0 \
                     -drive if=none,id=drive0,file=vm.img,format=raw,cache=none,aio=native \
                     -device virtio-blk-pci,iothread=iothread0,drive=drive0

Libvirt domain XML syntax:

<domain>
    <iothreads>1</iothreads>
    <cputune>  <!-- optional -->
        <iothreadpin iothread="1" cpuset="5,6"/>
    </cputune>
    <devices>
        <disk type="file">
            <driver iothread="1" ... />
        </disk>
    </devices>
</domain>

When can virtio-blk data plane be used?

Data plane is suitable for LVM or raw image file configurations where live migration and advanced block features are not needed. This covers many configurations where performance is the top priority.

Data plane is still an experimental feature because it only supports a subset of QEMU configurations. The QEMU 1.4 feature has the following limitations:

  • Disk image formats other than raw are not supported (qcow2, qed, etc.).
  • Live migration is not supported.
  • QEMU I/O throttling is not supported, but the cgroups blkio controller can be used (see the sketch after this list).
  • Only the default "report" I/O error policy is supported (-drive werror=,rerror=).
  • Hot unplug is not supported.
  • Block jobs (block-stream, drive-mirror, block-commit) are not supported.
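
One workaround for the missing throttling support is the host kernel's blkio controller, as mentioned in the list above. This is a rough sketch using the cgroup v1 interface; the cgroup name, the device major:minor numbers, the IOPS limits, and the $QEMU_PID variable are all placeholders for your own setup:

$ sudo mkdir /sys/fs/cgroup/blkio/vm1
$ echo "253:0 1000" | sudo tee /sys/fs/cgroup/blkio/vm1/blkio.throttle.read_iops_device
$ echo "253:0 1000" | sudo tee /sys/fs/cgroup/blkio/vm1/blkio.throttle.write_iops_device
$ echo $QEMU_PID | sudo tee /sys/fs/cgroup/blkio/vm1/tasks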

How to use virtio-blk data plane

The following libvirt domain XML enables virtio-blk data plane:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='path/to/disk.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
...
  </devices>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.scsi=off'/>
  </qemu:commandline>
  <!-- config-wce=off is not needed in RHEL 6.4 -->
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.config-wce=off'/>
  </qemu:commandline>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
  </qemu:commandline>
</domain>

Note that <qemu:commandline> must be added directly inside <domain> and not inside a child tag like <devices>.

If you do not use libvirt, the QEMU command-line is:

qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=path/to/disk.img \
     -device virtio-blk,drive=drive0,scsi=off,config-wce=off,x-data-plane=on

What is the roadmap for virtio-blk data plane

The limitations of virtio-blk data plane in QEMU 1.4 will be lifted in future releases. My goal is for QEMU virtio-blk to simply use the data plane approach behind the scenes so that the x-data-plane option can be dropped.

Reaching the point where data plane becomes the default requires teaching the QEMU event loop and all the core infrastructure to be thread-safe. In the past there has been a big lock that allows a lot of code to simply ignore multi-threading. This creates scalability problems that data plane avoids by using a dedicated thread. Work is underway to reduce the scope of the big lock and allow the data plane thread to work with live migration and other QEMU features that are not yet supported.

Patches have also been posted upstream to convert the QEMU net subsystem and virtio-net to data plane. This demonstrates the possibility of converting other performance-critical devices.

With these developments happening, 2013 will be an exciting year for QEMU I/O performance.