Wednesday, March 22, 2023

How to debug stuck VIRTIO devices in QEMU

Every once in a while a bug comes along where a guest hangs while communicating with a QEMU VIRTIO device. In this blog post I'll share some debugging approaches that can help QEMU developers who are trying to understand why a VIRTIO device is stuck.

There are a number of reasons why communication with a VIRTIO device might cease, so it helps to identify the nature of the hang:

  • Did the QEMU device see the requests that the guest driver submitted?
  • Did the QEMU device complete the request?
  • Did the guest driver see the requests that the device completed?

The case I will talk about is when QEMU itself is still responsive (the QMP/HMP monitor works) and the guest may or may not be responsive.

Finding requests that are stuck

There is a QEMU monitor command to inspect virtqueues called x-query-virtio-queue-status (QMP) and info virtio-queue-status (HMP). This is a quick way to extract information about a virtqueue from QEMU.

This command allows us to answer the question of whether the QEMU device completed its requests. The shadow_avail_idx and used_idx values in the output are the Available Ring index and Used Ring index, respectively. When they are equal the device has completed all requests. When they are not equal there are still requests in flight and the request must be stuck inside QEMU.

Here is a little more background on the index values. Remember that VIRTIO Split Virtqueues have an Available Ring index and a Used Ring index. The Available Ring index is incremented by the driver whenever it submits a request. The Used Ring index is incremented by the device whenever it completes a request. If the Available Ring index is equal to the Used Ring index then all requests have been completed.

Note that shadow_avail_idx is not the vring Available Ring index in guest RAM but just the last cached copy that the device saw. That means we cannot tell if there are new requests that the device hasn't seen yet. We need to take another approach to figure that out.

Finding requests that the device has not seen yet

Maybe the device has not seen new requests recently and this is why the guest is stuck. That can happen if the device is not receiving Buffer Available Notifications properly (normally this is done by reading a virtqueue kick ioeventfd, also known as a host notifier in QEMU).

We cannot use QEMU monitor commands here, but attaching the GDB debugger to QEMU will allow us to peak at the Available Ring index in guest RAM. The following GDB Python script loads the Available Ring index for a given VirtQueue:

$ cat avail-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())
avail_idx = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[1]
if avail_idx != vq['shadow_avail_idx']:
  print('Device has not seen all available buffers: avail_idx {} shadow_avail_idx {} in {}'.format(avail_idx, vq['shadow_avail_idx'], vq.dereference()))

You can run the script using the source avail-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.

Finding completions that the guest has not seen

If requests are not stuck inside QEMU and the device has seen the latest request, then the guest driver might have missed the Used Buffer Notification from the device (normally an interrupt handler or polling loop inside the guest detects completed requests).

In VIRTIO the driver's current index in the Used Ring is not visible to the device. This means we have no general way of knowing whether the driver has seen completions. However, there is a cool trick for modern devices that have the VIRTIO_RING_F_EVENT_IDX feature enabled.

The trick is that the Linux VIRTIO driver code updates the Used Event Index every time a completed request is popped from the virtqueue. So if we look at the Used Event Index we know the driver's index into the Used Ring and can find out whether it has seen request completions.

The following GDB Python script loads the Used Event Index for a given VirtQueue:

$ cat used-event-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())
used_event = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[2 + vq['vring']['num']]
if used_event != vq['used_idx']:
  print('Driver has not seen all used buffers: used_event {} used_idx {} in {}'.format(used_event, vq['used_idx'], vq.dereference()))

You can run the script using the source avail-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.

Conclusion

I hope this helps anyone who has to debug a VIRTIO device that seems to have gotten stuck.

Monday, February 20, 2023

Writing a C library in Rust

I started working on libblkio in 2020 with the goal of creating a high-performance block I/O library. The internals are written in Rust while the library exposes a public C API for easy integration into existing applications. Most languages have a way to call C APIs, often called a Foreign Function Interface (FFI). It's the most universal way to call into code written in different languages within the same program. The choice of building a C API was a deliberate one in order to make it easy to create bindings in many programming languages. However, writing a library in Rust that exposes a C API is relatively rare (librsvg is the main example I can think of), so I wanted to share what I learnt from this project.

Calling Rust code from C

Rust has good support for making functions callable from C. The documentation on calling Rust code from C covers the basics. Here is the Rust implementation of void blkioq_set_completion_fd_enabled(struct blkioq *q, bool enable) from libblkio:

#[no_mangle]
pub extern "C" fn blkioq_set_completion_fd_enabled(q: &mut Blkioq, enable: bool) {
    q.set_completion_fd_enabled(enable);
}

A C program just needs a function prototype for blkioq_set_completion_fd_enabled() and can call it directly like a C function.

What's really nice is that most primitive Rust types can be passed between languages without special conversion code in Rust. That means the function can accept arguments and return values that map naturally from Rust to C. In the code snippet above you can see that the Rust bool argument can be used without explicit conversion.

C pointers are converted to Rust pointers or references automatically by the compiler. If you want them to be nullable, just wrap them in Rust Option and the C NULL value becomes Rust None while a non-NULL pointer becomes Some. This makes it a breeze to pass data between Rust and C. In the example above, the Rust &mut Blkioq argument is a C struct blkioq *.

Rust structs also map to C nicely when they are declared with repr(C). The Rust compiler lays out the struct in memory so that its representation is compatible with the equivalent C struct.

Limitations of Rust FFI

It's not all roses though. There are fundamental differences between Rust and C that make FFI challenging. Not all language constructs are supported by FFI and some that are require manual work.

Rust generics and dynamically sized types (DST) cannot be used in extern "C" function signatures. Generics require that Rust compiler to generate code, which does not make sense in a C API because there is no Rust compiler involved. DSTs have no mapping to C and so they need to be wrapped in something that can be expressed in C, like a struct. DSTs include trait objects, so you cannot directly pass trait objects across the C/Rust language boundary.

Two extremes in library design

The limitations of FFI raise the question of how to design the library. The first extreme is to use the lowest common denominator language features supported by FFI. In the worst case this means writing C in Rust with frequent use of unsafe (because pointers and unpacked DSTs are passed around). This is obviously a bad approach because it foregoes the safety and expressiveness benefits of Rust. I think few human programmers would follow this approach although code generators or translators might output Rust code of this sort.

The other extreme is to forget about C and focus on writing an idiomatic Rust crate and then build a C API afterwards. Although this sounds nice, it's not entirely a good idea either because of the FFI limitations I mentioned. The Rust crate might be impossible to express as a C API and require significant glue code and possibly performance sacrifices if it values cannot be passed across language boundaries efficiently.

Lessons learnt

When I started libblkio I thought primarily in terms of the C API. Although the FFI code was kept isolated and the rest of the codebase was written in acceptably nice Rust, the main mistake was that I didn't think of what the native Rust crate API should look like. Only thinking of the C API meant that some of the key design decisions were suboptimal for a native Rust crate. Later on, when we began experimenting with a native Rust crate, it became clear where assumptions from the unsafe C API had crept in. It is hard to change them now, although Alberto Faria has done great work in revamping the codebase for a natural Rust API.

I erred too much on the side of the C API. In the future I would try to stay closer to the middle or slightly towards the native Rust API (but not to the extreme). That approach is most likely to end up with code that presents an efficient C API while still implementing it in idiomatic Rust. Overall, implementing a C library API in Rust was a success. I would continue to do this instead of writing new libraries in C because Rust's language features are more attractive than C's.

Video and slides available for "vhost-user-blk: a fast userspace block I/O interface"

At FOSDEM '23 I gave a talk about vhost-user-blk and its use as a userspace block I/O interface. The video and slides are now available here. Enjoy!

Wednesday, January 25, 2023

Speaking at FOSDEM '23 about "vhost-user-blk: A fast userspace block I/O interface"

vhost-user-blk has connected hypervisors to software-defined storage since around 2017, but it was mainly seen as virtualization technology. Did you know that vhost-user-blk is not specific to virtual machines? I think it's time to use it more generally as a userspace block I/O interface because it's fast, unprivileged, and avoids exposing kernel attack surfaces.

My LWN.net article about Accessing QEMU storage features without a VM already hinted at this, but now it's time to focus on what vhost-user-blk is and why it's easy to integrate into your applications. libblkio is a simple and familiar block I/O API with vhost-user-blk support. You can connect to existing SPDK-based software-defined storage applications, qemu-storage-daemon, and other vhost-user-blk back-ends.

Come see my FOSDEM '23 talk about vhost-user-blk as a fast userspace block I/O interface live on Saturday Feb 4 2023, 11:15 CET. It will be streamed on the FOSDEM website and recordings will be available later. Slides are available here.