Wednesday, 28 November 2018

Software Freedom Conservancy donations are being matched again!

Donations to Software Freedom Conservancy, the charity that acts as the legal home for QEMU and many other popular open source projects that don't run their own foundations or charities, are being matched again this year. That means your donation is doubled thanks to a group of donors who have pledged to match donations.

Software Freedom Conservancy helps projects with the details of running an open source project (legal advice, handling expenses, organizing conferences, etc.) as well as taking a leading position on open source licensing and enforcement. Their work is not-for-profit and in the interest of the entire open source community.

If you want more projects like QEMU, Git, Samba, Inkscape, and Selenium to succeed as healthy open source communities, then donating to Software Freedom Conservancy is a good way to help.

Find out about becoming a Supporter here.

Tuesday, 27 November 2018

QEMU Advent Calendar 2018 is coming!

QEMU Advent Calendar is running again this year. Each day from December 1st through 24th a surprise QEMU disk image will be released for your entertainment.

Check out the website on December 1st for the first disk image:

Thomas Huth is organizing QEMU Advent Calendar 2018 with the help of others from the QEMU community. If you want to contribute a disk image, take a look at the call for images email.

Sunday, 4 November 2018

Video and slides available for "Security in QEMU"

I gave a talk about security in QEMU at KVM Forum 2018. It covers the architecture of QEMU and focusses on the attack surfaces that are exposed to guests. I hope it will be useful to anyone auditing or writing device emulation code. It also describes the key design principles for isolating the QEMU process and limiting the damage that can be done if a guest escapes.

The video of the talk is now available:

The slides are available here (PDF).

Friday, 26 January 2018

How to modify kernel modules without recompiling the whole Linux kernel

Do you need to recompile your Linux kernel in order to make a change to a module? What if you just want to try to fix a small bug while running a distro kernel package?

The Linux kernel source tree is large and rebuilding from scratch is a productivity killer. This article covers how to make changes to kernel modules without rebuilding the entire kernel.

I need to preface this by saying that I don't know if this is a "best practice". Maybe there are better ways but here is what I've been using recently.

Step by step

In most cases you can safely modify one or more kernel modules without rebuilding the whole kernel. Follow these steps:

1. Get the kernel sources

Download the kernel source tree corresponding to your current kernel version. How to get the kernel sources for the exact kernel package version you are currently running depends on your Linux distribution. On Fedora do the following:

$ dnf download --source kernel # or specify the exact kernel-X.Y.Z-R package you need
kernel-4.14.14-300.fc27.src.rpm            1.9 MB/s |  98 MB     00:50
$ rpmbuild -rp kernel-4.14.14-300.fc27.src.rpm
$ cd ~/rpmbuild/BUILD/kernel-4.14.fc27/linux-4.14.14-300.fc27.x86_64/

If you can't figure out how to get the corresponding kernel sources, use uname -r to find the kernel version and grab the vanilla sources from git. This will work as long as the kernel package you are running hasn't been patched too heavily by the package maintainers:

$ git clone git://
$ cd linux-stable
$ uname -r
$ git checkout v4.14.14 # let's hope this is close to what we're running!

2. Get the kernel config file

It is critical that you use the same .config file as your distro as some configuration options will build incompatible kernel modules. All will be good if your .config file matches your kernel's configuration, so grab it from /boot:

$ cp /boot/config-$(uname -r) .config
$ make oldconfig # this shouldn't produce much output

3. Set the version string

Kernel module versioning relies on a version string that is compiled into each kernel module. If the version string does not match your kernel's version then the module cannot be loaded. Be sure to set CONFIG_LOCALVERSION to match uname -r in the .config file:

$ uname -r # we only want the stuff after the X.Y.Z version number
$ sed -i 's/^CONFIG_LOCALVERSION=.*$/CONFIG_LOCALVERSION="-300.fc27.x86_64"/' .config
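
The suffix can also be derived from the release string instead of being typed by hand. The following is a minimal sketch: localversion_of is a hypothetical helper name, and it assumes the release string starts with a numeric X.Y or X.Y.Z version.

```shell
# localversion_of is a hypothetical helper, not part of any kernel tooling.
# It strips the leading X.Y or X.Y.Z version and prints whatever follows.
localversion_of() {
    printf '%s\n' "$1" | sed -E 's/^[0-9]+\.[0-9]+(\.[0-9]+)?//'
}

# Example with the Fedora release string used in this article:
localversion_of "4.14.14-300.fc27.x86_64"   # prints -300.fc27.x86_64

# On a real system: localversion_of "$(uname -r)"
```

The printed value is what goes between the quotes in CONFIG_LOCALVERSION.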

4. Build your modules

Use the out-of-tree build syntax to compile just the modules you need. In this example let's rebuild drivers/virtio modules:

$ make modules_prepare
$ make -j4 M=drivers/virtio modules # or whatever directory you want
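
Before copying anything, it can be worth confirming that a rebuilt module carries the expected version string. modinfo -F vermagic prints it, and the first word should equal the output of uname -r. A small sketch, where vermagic_matches is a hypothetical helper and the module path in the comment is only an example:

```shell
# vermagic_matches is a hypothetical helper: it succeeds when the first
# word of a vermagic string equals the given kernel release.
vermagic_matches() {
    vermagic=$1
    release=$2
    [ "${vermagic%% *}" = "$release" ]
}

# Typical use (example path; adjust to the module you rebuilt):
#   vermagic_matches "$(modinfo -F vermagic drivers/virtio/virtio.ko)" "$(uname -r)"
if vermagic_matches "4.14.14-300.fc27.x86_64 SMP mod_unload" "4.14.14-300.fc27.x86_64"; then
    echo "vermagic OK"
else
    echo "vermagic mismatch: check CONFIG_LOCALVERSION"
fi
```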

5. Install and copy your modules

It can be useful to install the modules in a staging directory so they can be copied to remote machines or installed locally:

$ mkdir /tmp/staging
$ make M=drivers/virtio INSTALL_MOD_PATH=/tmp/staging modules_install
$ scp /tmp/staging/lib/modules/4.14.14-300.fc27.x86_64/extra/* root@remote-host:/lib/modules/4.14.14-300.fc27.x86_64/kernel/drivers/virtio/

Beware that some distros ship compressed kernel modules. Set CONFIG_MODULE_COMPRESS_XZ=y in the .config file to get .ko.xz files, for example.

6. Reload modules or reboot the test machine

Now that the new modules are in /lib/modules/... it's time to load them. If the old modules are currently loaded you may be able to rmmod them after terminating processes that rely on those modules. Then load the new modules using modprobe. If the old modules cannot be unloaded because the system depends on them, you need to reboot.

If the modules you modified are loaded during early boot, you'll need to rebuild the initramfs. Make sure you have a backup initramfs in case the system fails to boot!
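
As a rough sketch of this step, the commands look like the following. Both helpers are hypothetical and only print commands so they can be reviewed before being run as root. virtio_balloon is just an example module name, and the initramfs path and the use of dracut assume a Fedora-style system.

```shell
# reload_plan prints the commands for swapping in a rebuilt module.
# Nothing is executed here; run the printed commands as root yourself.
reload_plan() {
    echo "rmmod $1"
    echo "modprobe $1"
}

# initramfs_plan prints a backup-then-rebuild sequence for the given
# kernel release (assumes Fedora's /boot/initramfs-<release>.img layout).
initramfs_plan() {
    echo "cp /boot/initramfs-$1.img /boot/initramfs-$1.img.bak"
    echo "dracut --force /boot/initramfs-$1.img $1"
}

reload_plan virtio_balloon
initramfs_plan "4.14.14-300.fc27.x86_64"
```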


This approach has limitations that mean it's mostly useful for debugging and development. For quality assurance testing it is better to follow a full build process that produces the same output that end users will install.

Here are some things to be aware of:

  • Don't make .config changes unless you are sure they are compatible with the running kernel.
  • Do not introduce new module dependencies since this approach doesn't rebuild dependency information.
  • Do not change exported symbols if other kernel modules depend on the code you are changing, unless you also rebuild the modules that depend on yours.
  • Your modified modules will not be cryptographically signed and will taint the kernel if your distro kernel package is signed.

What happens if things go wrong? Either you'll get an error when attempting to load the kernel module, or you might just get an oops when there is a crash due to ABI breakage.


This may seem like a long process but it's faster than recompiling a full kernel from scratch. Once you've got it working you can keep modifying code and rebuilding from Step 4.

Saturday, 2 December 2017

My favorite software engineering books

The programming books that I find most interesting are neither about computer science theory nor the latest technology fads. Instead they are about the thought process behind building software and the best practices for doing so.

Here are some of my favorite books on software engineering topics:

The Practice of Programming

The Practice of Programming is a great round-trip through the struggle of writing programs. It covers many aspects that come together when writing software, like design, algorithms, coding style, and testing. It is especially useful early on in your programming journey as an overview of the challenges that you'll face.

Programming Pearls

Programming Pearls is a collection of essays by Jon Bentley from Communications of the Association for Computing Machinery. There are "Aha!" moments throughout the essays as they discuss how to analyze problems and come up with good solutions.

Bonus link: if you enjoy the problem solving in this book, then check out Hacker's Delight for clever problem solving and optimizations using bit-twiddling.

The Pragmatic Programmer

The Pragmatic Programmer covers the mindset of systematic and mindful software development. It goes beyond best practices and explains key qualities and trade-offs in programming. Thinking about these issues allows you to customize your approach to software development and produce better programs.

Code Complete

Code Complete is a survey of programming best practices. It draws on research evidence on code quality and provides guidelines on coding style. A great book if you're thinking about how to improve the quality, clarity, and maintainability of your programs.

Applying UML and Patterns

Applying UML and Patterns helped me learn to break down requirements and come up with software designs. This is a great book to help you get past the stage where programs you write from scratch are unmaintainable spaghetti code. I don't know how well this book has aged and would probably ignore the details of UML and Use Cases but the essence remains valuable.

Bonus link: if this book helps you design programs from scratch, then Refactoring will help you recognize that software is "soft" and can be changed substantially in a safe way by following a disciplined approach.

Writing Secure Code

Writing Secure Code discusses software security and the various classes of bugs that lead to security holes. Security is essential for writing code because the majority of programs have a security boundary where they process untrusted inputs. It's important to have a background in security as well as language and technology specifics of secure coding.

Producing Open Source Software

Producing Open Source Software explains how open source projects and communities work. It covers topics like licenses, project governance, source control, code review, and more. If you are getting started contributing to open source or considering running an open source project then this book will prepare you.


These books all contributed to how I think about software development. Let me know which practical programming books you like in the comments!

Wednesday, 15 November 2017

Video and slides available for "Applying Polling Techniques to QEMU: Reducing virtio-blk I/O Latency"

At KVM Forum 2017 I gave a talk about the AioContext polling optimization that was merged in QEMU 2.9. It reduces latency for virtio-blk and virtio-scsi devices with the iothread= option on high IOPS devices like recent NVMe PCIe SSDs. It increases performance for latency-sensitive workloads and has been designed to avoid interfering with workloads that do not benefit from polling thanks to a self-tuning algorithm.

The video of the talk is now available:

The slides are available here (PDF).

Monday, 13 November 2017

Common disk benchmarking mistakes

Collecting benchmark results is the first step to solving disk I/O performance problems. Unfortunately, many bug reports and performance investigations fall down at the first step because bogus benchmark data is collected. This post explains common mistakes when running disk I/O benchmarks.

Disk I/O patterns

Skip this section if you are already familiar with these terms. Before we begin, it is important to understand the different I/O patterns and how they are used in benchmarking.

Sequential vs random I/O is the access pattern in which data is read or written. Sequential I/O is in-order data access commonly found in workloads like streaming multimedia or writing log files. Random I/O is access of non-adjacent data commonly found when accessing many small files or on systems running multiple applications that access the disk at the same time. It is easy to prefetch sequential I/O so both disk read caches and operating system page caches may keep the next piece of data ready even before it is accessed. Random I/O does not offer opportunities for prefetching and is therefore a harder access pattern to optimize.

Block or request size is the amount of data transferred by a single access. Small request sizes are 512 B through 4 KB, large request sizes are 64 KB through 128 KB, while very large request sizes could be 1 MB (although the maximum allowed request size ultimately depends on the hardware). Fewer requests are needed to transfer the same amount of data when the request size is larger. Therefore, throughput is usually higher at larger request sizes because less per-request overhead is incurred for the same amount of data.
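
To put numbers on this, here is a small worked example. The requests_needed helper is hypothetical and the 1 GiB transfer size is an arbitrary choice for illustration.

```shell
# requests_needed is a hypothetical helper: the number of requests required
# to transfer total_bytes using requests of request_bytes each (rounded up).
requests_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

GIB=$((1024 * 1024 * 1024))
requests_needed "$GIB" 4096      # 4 KB requests:   prints 262144
requests_needed "$GIB" 131072    # 128 KB requests: prints 8192
```

The same amount of data takes 32 times fewer requests at 128 KB than at 4 KB, so the fixed per-request cost is paid 32 times less often.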

Read vs write is the request type that determines whether data is transferred to or from the storage medium. Reads can be completed cheaply if data is already in the disk read cache and, failing that, the access time depends on the storage medium. Traditional spinning disks have significant average seek times in the range of 4-15 milliseconds, depending on the drive, when the head is not positioned in the read location, while solid-state storage devices might just take on the order of 10 microseconds. Writes can be completed cheaply by leaving data in the disk write cache unless the cache is full or the cache is disabled.

Queue depth is the number of in-flight I/O requests at a given time. Latency-sensitive workloads submit one request and wait for it to complete before submitting the next request. This is queue depth 1. Parallel workloads submit many requests without waiting for earlier requests to complete first. The maximum queue depth depends on the hardware with 64 being a common number. Maximum throughput is usually achieved when queue depth is fairly high because the disk can keep busy without waiting for the next request to be submitted and it may optimize the order in which requests are processed.
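
At queue depth 1 only one request completes per latency period, so the request rate cannot exceed 1/latency. A back-of-the-envelope sketch using the access times mentioned above (qd1_iops is a made-up helper name, and real devices add other overheads on top):

```shell
# qd1_iops is a hypothetical helper: an upper bound on requests per second
# at queue depth 1, given the per-request latency in microseconds.
qd1_iops() {
    echo $(( 1000000 / $1 ))
}

qd1_iops 4000    # 4 ms seek on a spinning disk: prints 250
qd1_iops 10      # 10 microsecond SSD access:    prints 100000
```

Higher queue depths lift this bound because several requests are in flight during each latency period.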

Random reads are a good way to force storage medium access and minimize cache hit rates. Sequential reads are a good way to maximize cache hit rates. Which I/O pattern is appropriate depends on your goals.

Real-life workloads are usually a mixture of sequential vs random, block sizes, reads vs writes, and the queue depth may vary over time. It is simplest to benchmark a specific I/O pattern in isolation but benchmark tools can also be configured to produce mixed I/O patterns like 70% reads/30% writes. The goal when configuring a benchmark is to produce the I/O pattern that is critical for real-life workload performance.

1. Use a real benchmarking tool

It is often tempting to use file utilities instead of real benchmarking tools because file utilities report I/O throughput like real benchmarking tools and time taken can be easily measured. Therefore it might seem like there is no need to install a real benchmarking tool when file utilities are already available on every system.

Do not use cp(1), scp(1), or even dd(1). Instead, use a real benchmark like fio(1).

What's the difference? Real benchmarking tools can be configured to produce specific I/O patterns, like 4 KB random reads with queue depth 8, whereas file utilities offer limited or no ability to choose the I/O pattern. Since disk performance varies depending on the I/O pattern, it is hard to understand or compare results between systems without full control over the I/O pattern.

The second reason why real benchmarking tools are necessary is that file utilities are not designed to exercise the disk; they are designed to manipulate files. This means file utilities spend time doing things that do not involve disk I/O and therefore produce misleading performance results. The most important example of this is that file utilities use the operating system's page cache and this can result in no disk I/O activity at all!

2. Bypass the page cache

One of the most common mistakes is forgetting to bypass the operating system's page cache. Files and block devices opened with the O_DIRECT flag perform I/O to the disk without going through the page cache. This is the best way to guarantee that the disk actually gets I/O requests. Files opened without this flag are in "buffered I/O" mode and that means I/O may be fulfilled entirely within the page cache in RAM without any disk I/O activity. If the goal is to benchmark disk performance then the page cache needs to be eliminated.

fio(1) jobs must use the direct=1 parameter to exercise the disk.
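
A quick way to catch a forgotten direct=1 is to check job files before running them. The sketch below is only an illustration: job_uses_direct is a hypothetical helper that does a plain text match, so it does not understand per-section overrides.

```shell
# job_uses_direct is a hypothetical helper: it succeeds if the fio job file
# contains a line setting direct=1 (simple text match, not a full parser).
job_uses_direct() {
    grep -Eq '^[[:space:]]*direct=1([^0-9]|$)' "$1"
}

# Demonstrate on a throwaway job file.
job=$(mktemp)
printf '[global]\ndirect=1\n\n[job1]\nreadwrite=randread\n' > "$job"
if job_uses_direct "$job"; then
    echo "direct I/O enabled"
else
    echo "warning: buffered I/O, results may only reflect the page cache"
fi
rm -f "$job"
```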

It is not sufficient to echo 3 > /proc/sys/vm/drop_caches before running the benchmark instead of using O_DIRECT. Although this command is often used to make non-disk benchmarks produce more consistent results between runs, it does not guarantee that the disk will actually receive I/O requests. In addition, the page cache interferes with the desired benchmark I/O pattern since page cache prefetch and writeback will alter the actual I/O pattern that the disk sees.

3. Bypass file systems and device mapper

fio(1) can do both file I/O and disk I/O benchmarking, so it's often mistakenly used in file I/O mode instead of disk I/O mode. When benchmarking disk performance it is best to eliminate file systems and device mapper targets to isolate raw disk I/O performance. File systems and device mapper targets may have their own internal bottlenecks, such as software locks, that are unrelated to disk performance. File systems and device mapper targets are also likely to modify the I/O pattern because they submit their own metadata I/O.

fio(1) jobs must use the filename=/path/to/disk parameter to do disk I/O benchmarking.

Without a block device filename parameter, the benchmark would create regular files on whatever file system is in use. Remember to double- and triple-check the block device filename before running benchmarks that write to the disk to avoid accidentally overwriting important data like the system root disk!
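
One of those checks can be automated. The sketch below uses a hypothetical is_mounted helper that looks for the device name as a prefix of any mount source. It is a prefix match only, and it will not notice a disk that is in use through device mapper, LVM, or swap, so treat it as a first line of defense rather than a guarantee.

```shell
# is_mounted is a hypothetical helper: it succeeds if the device (or one of
# its partitions) appears as a mount source in the given mounts table.
# The second argument defaults to /proc/mounts on Linux.
is_mounted() {
    awk -v dev="$1" 'index($1, dev) == 1 { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Demonstrate with a fake mounts table so nothing real is touched.
mounts=$(mktemp)
printf '/dev/sda1 / ext4 rw 0 0\n/dev/sda2 /home ext4 rw 0 0\n' > "$mounts"
for dev in /dev/sda /dev/sdb; do
    if is_mounted "$dev" "$mounts"; then
        echo "$dev is in use, refusing to benchmark"
    else
        echo "$dev is not mounted"
    fi
done
rm -f "$mounts"
```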

Example benchmark configurations

Here are a few example fio(1) jobs that you can use as a starting point.

High-throughput parallel reads

This job is a read-heavy workload with lots of parallelism that is likely to show off the device's best throughput:

ramp_time=10            # start measuring after warm-up time

offset_increment=128m   # each job starts at a different offset
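
Only the two commented lines of this job appear above. As a rough sketch of what a complete job along these lines could look like, the snippet below writes one to a file. Every value other than ramp_time=10 and offset_increment=128m is an assumption, and /path/to/disk is a placeholder.

```shell
# Write a hypothetical high-throughput sequential read job. Only ramp_time
# and offset_increment come from the article; the rest are assumed values.
cat > parallel-reads.fio <<'EOF'
[global]
# placeholder: replace with the block device under test
filename=/path/to/disk
direct=1
ioengine=libaio
runtime=120
ramp_time=10            # start measuring after warm-up time

[read]
readwrite=read
blocksize=64k
iodepth=16
numjobs=16
offset_increment=128m   # each job starts at a different offset
EOF

# Run it with: fio parallel-reads.fio
```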

Latency-sensitive random reads

This job is a latency-sensitive workload that stresses per-request overhead and seek times:

ramp_time=10            # start measuring after warm-up time
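
Again only one line of the job is shown. A latency-oriented job would typically use small random reads at queue depth 1. The following sketch assumes 4 KB requests and the libaio engine, neither of which is confirmed by the article.

```shell
# Write a hypothetical latency-sensitive random read job. Only ramp_time
# comes from the article; the other values are assumptions.
cat > latency-randread.fio <<'EOF'
[global]
# placeholder: replace with the block device under test
filename=/path/to/disk
direct=1
ioengine=libaio
runtime=120
ramp_time=10            # start measuring after warm-up time

[randread]
readwrite=randread
blocksize=4k
iodepth=1
EOF

# Run it with: fio latency-randread.fio
```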


Mixed workload

This job simulates a more real-life workload with an I/O pattern that contains both reads and writes:

ramp_time=10            # start measuring after warm-up time
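
For the mixed case, fio's rwmixread option sets the percentage of reads in a randrw workload. The sketch below uses the 70% reads/30% writes split mentioned earlier; the block size and queue depth are assumptions. Because this job writes to the disk, the filename placeholder must be replaced with a device whose contents can be destroyed.

```shell
# Write a hypothetical mixed read/write job. Only ramp_time comes from the
# article; the 70/30 split follows its earlier example, the rest is assumed.
cat > mixed.fio <<'EOF'
[global]
# placeholder: replace with a block device that may be overwritten
filename=/path/to/disk
direct=1
ioengine=libaio
runtime=120
ramp_time=10            # start measuring after warm-up time

[mixed]
readwrite=randrw
rwmixread=70
blocksize=8k
iodepth=32
EOF

# Run it with: fio mixed.fio
```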



There are several common issues with disk benchmarking that can lead to useless results. Using a real benchmarking tool and bypassing the page cache and file system are the basic requirements for useful disk benchmark results. If you have questions or suggestions about disk benchmarking, feel free to post a comment.