Monday, August 5, 2019

Determining why a Linux syscall failed

One is often left wondering what caused an errno value when a system call fails. Figuring out the reason can be tricky because a single errno value can have multiple causes. Applications get an errno integer and no additional information about what went wrong in the kernel.

There are several ways to determine the reason for a system call failure (from easiest to most involved):

  1. Check the system call's man page for the meaning of the errno value. Sometimes this is enough to explain the failure.
  2. Check the kernel log using dmesg(1). If something went seriously wrong (like a hardware error) then there may be a detailed error information. It may help to increase the kernel log level.
  3. Read the kernel source code to understand various error code paths and identify the most relevant one.
  4. Use the function graph tracer to see which code path was taken.
  5. Add printk() calls, recompile the kernel (module), and rerun to see the output.

Reading the man page and checking dmesg(1) are fairly easy for application developers and do not require knowledge of kernel internals. If this does not produce an answer then it is necessary to look closely at the kernel source code to understand a system call's error code paths.

This post discusses the function graph tracer and how it can be used to identify system call failures without recompiling the kernel. This is useful because running a custom kernel may not be possible (e.g. due to security or reliability concerns) and recompiling the kernel is slow.

An example

In order to explore some debugging techniques let's take the io_uring_setup(2) system call as an example. It is failing with ENOMEM but the system is not under memory pressure, so ENOMEM is not expected.

The io_uring_setup(2) source code (fs/io_uring.c) contains many ENOMEM locations but it is not possible to conclusively identify which one is failing. The next step is to determine which code path is taken using dynamic instrumentation.

The function graph tracer

The Linux function graph tracer records kernel function entries and returns so that function call relationships are made apparent. The io_uring_setup(2) system call is failing with ENOMEM but it is unclear at which point in the system call this happens. It is possible to find the answer by studying the function call graph produced by the tracer and following along in the Linux source code.

Since io_uring_setup(2) is a system call it's not an ordinary C function definition and has a special symbol name in the kernel ELF file. It is possible to look up the (architecture-specific) symbol for the currently running kernel:

# grep io_uring_setup /proc/kallsyms
ffffffffbd357130 T __x64_sys_io_uring_setup

Let's trace all __x64_sys_io_uring_setup calls:

# cd /sys/kernel/debug/tracing
# echo '__x64_sys_io_uring_setup' > set_graph_function
# echo 'function_graph' >current_tracer
# cat trace_pipe >/tmp/trace.log run the application in another terminal...
The trace contains many successful io_uring_setup(2) calls that look like this:
 1)               |  __x64_sys_io_uring_setup() {
 1)               |    io_uring_setup() {
 1)               |      capable() {
 1)               |        ns_capable_common() {
 1)               |          security_capable() {
 1)   0.199 us    |            cap_capable();
 1)   7.095 us    |          }
 1)   7.594 us    |        }
 1)   8.059 us    |      }
 1)               |      kmem_cache_alloc_trace() {
 1)               |        _cond_resched() {
 1)   0.244 us    |          rcu_all_qs();
 1)   0.708 us    |        }
 1)   0.208 us    |        should_failslab();
 1)   0.220 us    |        memcg_kmem_put_cache();
 1)   2.201 us    |      }
 1)               |      fd_install() {
 1)   0.223 us    |        __fd_install();
 1)   0.643 us    |      }
 1) ! 190.396 us  |    }
 1) ! 216.236 us  |  }

Although the goal is to understand system call failures, looking at a successful invocation can be useful too. Failed calls in trace output can be identified on the basis that they differ from successful calls. This knowledge can be valuable when searching through large trace files. A failed io_uring_setup(2) call aborts early and does not invoke fd_install(). Now it is possible to find a failed call amongst all the io_uring_setup(2) calls:

 2)               |  __x64_sys_io_uring_setup() {
 2)               |    io_uring_setup() {
 2)               |      capable() {
 2)               |        ns_capable_common() {
 2)               |          security_capable() {
 2)   0.236 us    |            cap_capable();
 2)   0.872 us    |          }
 2)   1.419 us    |        }
 2)   1.951 us    |      }
 2)   0.419 us    |      free_uid();
 2)   3.389 us    |    }
 2) + 48.769 us   |  }

The fs/io_uring.c code shows the likely error code paths:

        account_mem = !capable(CAP_IPC_LOCK);

        if (account_mem) {
                ret = io_account_mem(user,
                                ring_pages(p->sq_entries, p->cq_entries));
                if (ret) {
                        return ret;

        ctx = io_ring_ctx_alloc(p);
        if (!ctx) {
                if (account_mem)
                        io_unaccount_mem(user, ring_pages(p->sq_entries,
                return -ENOMEM;

But is there enough information in the trace to determine which of these return statements is executed? The trace shows free_uid() so we can be confident that both these code paths are valid candidates. By looking back at the success code path we can use the kmem_cache_alloc_trace() as a landmark. It is called by io_ring_ctx_alloc() so we should see kmem_cache_alloc_trace() in the trace before free_uid() if the second return statement is taken. Since it does not appear in the trace output we conclude that the first return statement is being taken!

When trace output is inconclusive

Function graph tracer output only shows functions in the ELF file. When the compiler inlines code, no entry or return is recorded in the function graph trace. This can make it hard to identify the exact return statement taken in a long function. Functions containing few function calls and many conditional branches are also difficult to analyze from just a function graph trace.

We can enhance our understanding of the trace by adding dynamic probes that record function arguments, local variables, and/or return values via perf-probe(2). By knowing these values we can make inferences about the code path being taken.

If this is not enough to infer which code path is being taken, detailed code coverage information is necessary.

One way to approximate code coverage is using a sampling CPU profiler, like perf(1), and letting it run under load for some time to gather statistics on which code paths are executed frequently. This is not as precise as code coverage tools, which record each branch encountered in a program, but it can be enough to observe code paths in functions that are not amenable to the function graph tracer due to the low number of function calls.

This is done as follows:

  1. Run the system call in question in a tight loop so the CPU is spending a significant amount of time in the code path you wish to observe.
  2. Start perf record -a and let it run for 30 seconds.
  3. Stop perf-record(1) and run perf-report(1) to view the annotated source code of the function in question.

The error code path should have a significant number of profiler samples and it should be prominent in the pref-report(1) annotated output.


Determining the cause for a system call failure can be hard work. The function graph tracer is helpful in shedding light on the code paths being taken by the kernel. Additional debugging is possible using perf-probe(2) and the sampling profiler, so that in most cases it's not necessary to recompile the kernel with printk() just to learn why a system call is failing.

Thursday, April 18, 2019

What's new in VIRTIO 1.1?

The VIRTIO 1.1 specification has been published! This article covers the major new features in this specification.

New Devices

The following new devices are defined:

  • virtio-input is a Linux evdev input device (mouse, keyboard, joystick)
  • virtio-gpu is a 2D graphics device (with 3D support planned)
  • virtio-vsock is a host<->guest socket communications device
  • virtio-crypto is a cryptographic accelerator device

New Device Features




New Core Features

There is a new virtqueue memory layout called packed virtqueues. The old layout is called split virtqueues because the avail and used rings are separate from the descriptor table. The new packed virtqueue layout uses just a single descriptor table as the single ring. The layout is optimized for a friendlier CPU cache footprint and there are several features that devices can exploit for better peformance.

The VIRTIO_F_NOTIFICATION_DATA feature is an optimization mainly for hardware implementations of VIRTIO. The driver writes extra information as part of the Available Buffer Notification. Thanks to the information included in the notification, the device does not need to fetch this information from memory anymore. This is useful for PCI hardware implementations where minimizing DMA operations improves performance significantly.

Thursday, February 28, 2019

QEMU accepted into Google Summer of Code and Outreachy 2019

QEMU is participating in the Google Summer of Code and Outreachy open source internship programs again this year. These 12-week, full-time, paid, remote work internships allow people interested in contributing to QEMU get started. Each intern works with one or more mentors who can answer questions and are experienced developers. This is a great way to try out working on open source if you are considering it as a career.

For more information (including eligibility requirements), see our GSoC and our Outreachy pages.

Friday, January 25, 2019

VIRTIO 1.1 is available for public review until Feb 21st 2019

The VIRTIO 1.1 specification for paravirtualized I/O devices includes the new packed vring layout and the GPU, input, crypto, and socket device types. In addition to this there are other improvements and new features in the specification. The new vring layout will increase performance and offers new features that devices can take advantage of.

You can review the specification and post comments until February 21st 2019: VIRTIO 1.1 csprd01.

Sunday, January 6, 2019

mute-thread: a script to mute email threads with notmuch

Ever get included on an email thread that isn't relevant? It can be distracting to see new emails appear on a thread you already know is not interesting. You could mark them as read manually, but that is tedious.

This mute-thread script silences email threads that you don't want to read, even after new emails are received.

Download it here.


It relies on the awesome notmuch(1) email utility, so make sure you have that set up in order to use this script.

The following .muttrc macro integrates this with the mutt(1) email client. When you press M the entire email thread is muted:

macro index M "<enter-command>unset wait_key<enter><pipe-message>~/.mutt/mute-thread add<enter><enter-command>set wait_key<enter><read-thread>" "Mute thread"

After fetching new emails, run notmuch and then mute-thread apply.

Unmuting threads

If you change your mind, run mute-thread remove MESSAGE-ID to unmute a thread again. Future emails will not be silenced.

Wednesday, November 28, 2018

Software Freedom Conservancy donations are being matched again!

Donations to Software Freedom Conservancy, the charity that acts as the legal home for QEMU and many other popular open source projects that don't run their own foundations or charities, are being matched again this year. That means your donation is doubled thanks to a group of donors who have pledged to match donations.

Software Freedom Conservancy helps projects with the details of running an open source project (legal advice, handling expenses, organizing conferences, etc) as well as taking a leading position on open source licensing and enforcement. Their work is not-for-profit and in the interest of the entire open source community.

If you want more projects like QEMU, Git, Samba, Inkscape, and Selenium to succeed as healthy open source communities, then donating to Software Freedom Conservancy is a good way to help.

Find out about becoming a Supporter here.

Tuesday, November 27, 2018

QEMU Advent Calendar 2018 is coming!

QEMU Advent Calendar is running again this year. Each day from December 1st through 24th a surprise QEMU disk image will be released for your entertainment.

Check out the website on December 1st for the first disk image:

Thomas Huth is organizing QEMU Advent Calendar 2018 with the help of others from the QEMU community. If you want to contribute a disk image, take a look at the call for images email.