Tuesday, November 19, 2019

Video and slides available for "virtio-fs: A Shared File System for Virtual Machines"

This year I presented virtio-fs at KVM Forum 2019 with David Gilbert and Miklos Szeredi. virtio-fs is a host<->guest file system that allows guests to access a shared directory on the host. We've been working on virtio-fs together with Vivek Goyal and community contributors since 2018 and are excited that it is now being merged upstream in Linux and QEMU.

virtio-fs gives guests file system access without the need for disk image files or copying files between the guest and host. You can even boot a guest from a directory on the host without a disk image file. Kata Containers 1.7 and later ship with virtio-fs support for running VM-isolated containers.

What is new and interesting about virtio-fs is that it takes advantage of the co-location of guests and the hypervisor to avoid file server communication and to provide local file system semantics. The guest can map the contents of files from the host page cache. This bypasses the guest page cache to reduce memory footprint and avoid copying data into guest RAM. Network file systems and earlier attempts at paravirtualized file systems, like virtio-9p, cannot do this since they are designed for message-passing communication only.

To learn more about virtio-fs, check out the video or slides (PDF) from the presentation.

Monday, August 5, 2019

Determining why a Linux syscall failed

One is often left wondering what caused an errno value when a system call fails. Figuring out the reason can be tricky because a single errno value can have multiple causes. Applications get an errno integer and no additional information about what went wrong in the kernel.

There are several ways to determine the reason for a system call failure (from easiest to most involved):

  1. Check the system call's man page for the meaning of the errno value. Sometimes this is enough to explain the failure.
  2. Check the kernel log using dmesg(1). If something went seriously wrong (like a hardware error) then there may be a detailed error information. It may help to increase the kernel log level.
  3. Read the kernel source code to understand various error code paths and identify the most relevant one.
  4. Use the function graph tracer to see which code path was taken.
  5. Add printk() calls, recompile the kernel (module), and rerun to see the output.

Reading the man page and checking dmesg(1) are fairly easy for application developers and do not require knowledge of kernel internals. If this does not produce an answer then it is necessary to look closely at the kernel source code to understand a system call's error code paths.

This post discusses the function graph tracer and how it can be used to identify system call failures without recompiling the kernel. This is useful because running a custom kernel may not be possible (e.g. due to security or reliability concerns) and recompiling the kernel is slow.

An example

In order to explore some debugging techniques let's take the io_uring_setup(2) system call as an example. It is failing with ENOMEM but the system is not under memory pressure, so ENOMEM is not expected.

The io_uring_setup(2) source code (fs/io_uring.c) contains many ENOMEM locations but it is not possible to conclusively identify which one is failing. The next step is to determine which code path is taken using dynamic instrumentation.

The function graph tracer

The Linux function graph tracer records kernel function entries and returns so that function call relationships are made apparent. The io_uring_setup(2) system call is failing with ENOMEM but it is unclear at which point in the system call this happens. It is possible to find the answer by studying the function call graph produced by the tracer and following along in the Linux source code.

Since io_uring_setup(2) is a system call it's not an ordinary C function definition and has a special symbol name in the kernel ELF file. It is possible to look up the (architecture-specific) symbol for the currently running kernel:

# grep io_uring_setup /proc/kallsyms
ffffffffbd357130 T __x64_sys_io_uring_setup

Let's trace all __x64_sys_io_uring_setup calls:

# cd /sys/kernel/debug/tracing
# echo '__x64_sys_io_uring_setup' > set_graph_function
# echo 'function_graph' >current_tracer
# cat trace_pipe >/tmp/trace.log
...now run the application in another terminal...
The trace contains many successful io_uring_setup(2) calls that look like this:
 1)               |  __x64_sys_io_uring_setup() {
 1)               |    io_uring_setup() {
 1)               |      capable() {
 1)               |        ns_capable_common() {
 1)               |          security_capable() {
 1)   0.199 us    |            cap_capable();
 1)   7.095 us    |          }
 1)   7.594 us    |        }
 1)   8.059 us    |      }
 1)               |      kmem_cache_alloc_trace() {
 1)               |        _cond_resched() {
 1)   0.244 us    |          rcu_all_qs();
 1)   0.708 us    |        }
 1)   0.208 us    |        should_failslab();
 1)   0.220 us    |        memcg_kmem_put_cache();
 1)   2.201 us    |      }
 1)               |      fd_install() {
 1)   0.223 us    |        __fd_install();
 1)   0.643 us    |      }
 1) ! 190.396 us  |    }
 1) ! 216.236 us  |  }

Although the goal is to understand system call failures, looking at a successful invocation can be useful too. Failed calls in trace output can be identified on the basis that they differ from successful calls. This knowledge can be valuable when searching through large trace files. A failed io_uring_setup(2) call aborts early and does not invoke fd_install(). Now it is possible to find a failed call amongst all the io_uring_setup(2) calls:

 2)               |  __x64_sys_io_uring_setup() {
 2)               |    io_uring_setup() {
 2)               |      capable() {
 2)               |        ns_capable_common() {
 2)               |          security_capable() {
 2)   0.236 us    |            cap_capable();
 2)   0.872 us    |          }
 2)   1.419 us    |        }
 2)   1.951 us    |      }
 2)   0.419 us    |      free_uid();
 2)   3.389 us    |    }
 2) + 48.769 us   |  }

The fs/io_uring.c code shows the likely error code paths:

        account_mem = !capable(CAP_IPC_LOCK);

        if (account_mem) {
                ret = io_account_mem(user,
                                ring_pages(p->sq_entries, p->cq_entries));
                if (ret) {
                        return ret;

        ctx = io_ring_ctx_alloc(p);
        if (!ctx) {
                if (account_mem)
                        io_unaccount_mem(user, ring_pages(p->sq_entries,
                return -ENOMEM;

But is there enough information in the trace to determine which of these return statements is executed? The trace shows free_uid() so we can be confident that both these code paths are valid candidates. By looking back at the success code path we can use the kmem_cache_alloc_trace() as a landmark. It is called by io_ring_ctx_alloc() so we should see kmem_cache_alloc_trace() in the trace before free_uid() if the second return statement is taken. Since it does not appear in the trace output we conclude that the first return statement is being taken!

When trace output is inconclusive

Function graph tracer output only shows functions in the ELF file. When the compiler inlines code, no entry or return is recorded in the function graph trace. This can make it hard to identify the exact return statement taken in a long function. Functions containing few function calls and many conditional branches are also difficult to analyze from just a function graph trace.

We can enhance our understanding of the trace by adding dynamic probes that record function arguments, local variables, and/or return values via perf-probe(2). By knowing these values we can make inferences about the code path being taken.

If this is not enough to infer which code path is being taken, detailed code coverage information is necessary.

One way to approximate code coverage is using a sampling CPU profiler, like perf(1), and letting it run under load for some time to gather statistics on which code paths are executed frequently. This is not as precise as code coverage tools, which record each branch encountered in a program, but it can be enough to observe code paths in functions that are not amenable to the function graph tracer due to the low number of function calls.

This is done as follows:

  1. Run the system call in question in a tight loop so the CPU is spending a significant amount of time in the code path you wish to observe.
  2. Start perf record -a and let it run for 30 seconds.
  3. Stop perf-record(1) and run perf-report(1) to view the annotated source code of the function in question.

The error code path should have a significant number of profiler samples and it should be prominent in the pref-report(1) annotated output.


Determining the cause for a system call failure can be hard work. The function graph tracer is helpful in shedding light on the code paths being taken by the kernel. Additional debugging is possible using perf-probe(2) and the sampling profiler, so that in most cases it's not necessary to recompile the kernel with printk() just to learn why a system call is failing.

Thursday, April 18, 2019

What's new in VIRTIO 1.1?

The VIRTIO 1.1 specification has been published! This article covers the major new features in this specification.

New Devices

The following new devices are defined:

  • virtio-input is a Linux evdev input device (mouse, keyboard, joystick)
  • virtio-gpu is a 2D graphics device (with 3D support planned)
  • virtio-vsock is a host<->guest socket communications device
  • virtio-crypto is a cryptographic accelerator device

New Device Features




New Core Features

There is a new virtqueue memory layout called packed virtqueues. The old layout is called split virtqueues because the avail and used rings are separate from the descriptor table. The new packed virtqueue layout uses just a single descriptor table as the single ring. The layout is optimized for a friendlier CPU cache footprint and there are several features that devices can exploit for better peformance.

The VIRTIO_F_NOTIFICATION_DATA feature is an optimization mainly for hardware implementations of VIRTIO. The driver writes extra information as part of the Available Buffer Notification. Thanks to the information included in the notification, the device does not need to fetch this information from memory anymore. This is useful for PCI hardware implementations where minimizing DMA operations improves performance significantly.

Thursday, February 28, 2019

QEMU accepted into Google Summer of Code and Outreachy 2019

QEMU is participating in the Google Summer of Code and Outreachy open source internship programs again this year. These 12-week, full-time, paid, remote work internships allow people interested in contributing to QEMU get started. Each intern works with one or more mentors who can answer questions and are experienced developers. This is a great way to try out working on open source if you are considering it as a career.

For more information (including eligibility requirements), see our GSoC and our Outreachy pages.

Friday, January 25, 2019

VIRTIO 1.1 is available for public review until Feb 21st 2019

The VIRTIO 1.1 specification for paravirtualized I/O devices includes the new packed vring layout and the GPU, input, crypto, and socket device types. In addition to this there are other improvements and new features in the specification. The new vring layout will increase performance and offers new features that devices can take advantage of.

You can review the specification and post comments until February 21st 2019: VIRTIO 1.1 csprd01.

Sunday, January 6, 2019

mute-thread: a script to mute email threads with notmuch

Ever get included on an email thread that isn't relevant? It can be distracting to see new emails appear on a thread you already know is not interesting. You could mark them as read manually, but that is tedious.

This mute-thread script silences email threads that you don't want to read, even after new emails are received.

Download it here.


It relies on the awesome notmuch(1) email utility, so make sure you have that set up in order to use this script.

The following .muttrc macro integrates this with the mutt(1) email client. When you press M the entire email thread is muted:

macro index M "<enter-command>unset wait_key<enter><pipe-message>~/.mutt/mute-thread add<enter><enter-command>set wait_key<enter><read-thread>" "Mute thread"

After fetching new emails, run notmuch and then mute-thread apply.

Unmuting threads

If you change your mind, run mute-thread remove MESSAGE-ID to unmute a thread again. Future emails will not be silenced.