Friday, November 18, 2022

LWN article on "Accessing QEMU storage features without a VM"

At KVM Forum 2022 Kevin Wolf and Stefano Garzarella gave a talk on qemu-storage-daemon, a way to get QEMU's storage functionality without running a VM. It's great for accessing disk images, basically taking the older qemu-nbd to the next level. The cool thing is this makes QEMU's software-defined storage functionality - block devices with snapshots, incremental backup, image file formats, etc - available to other programs. Backup and forensics tools as well as other types of programs can take advantage of qemu-storage-daemon.

Here is the full article about Accessing QEMU storage features without a VM. Enjoy!

Thursday, November 10, 2022

Using qemu-img to access vhost-user-blk storage

vhost-user-blk is a high-performance storage protocol that connects virtual machines to software-defined storage like SPDK or qemu-storage-daemon. Until now, tool support for vhost-user-blk has been lacking. Accessing vhost-user-blk devices involved running a virtual machine, which requires more setup than one would like.

QEMU 7.2 adds vhost-user-blk support to the qemu-img tool. This is possible thanks to libblkio, a library that other programs besides QEMU can use too.

Check for vhost-user-blk support in your installed qemu-img version like this (if it says 0 then you need to update qemu-img or compile it from source with libblkio enabled):

$ qemu-img --help | grep virtio-blk-vhost-user | wc -l
1

You can copy a raw disk image file into a vhost-user-blk device like this:

$ qemu-img convert \
      --target-image-opts \
      -n \
      test.img \
      driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on

The contents of the vhost-user-blk device can be saved as a qcow2 image file like this:

$ qemu-img convert \
      --image-opts \
      -O qcow2 \
      driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on out.qcow2

The size of the virtual disk can be read:

$ qemu-img info \
      --image-opts \
      driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on
image: json:{"driver": "virtio-blk-vhost-user"}
file format: virtio-blk-vhost-user
virtual size: 4 GiB (4294967296 bytes)
disk size: unavailable

Other qemu-img sub-commands like bench and dd are also available for quickly accessing the vhost-user-blk device without running a virtual machine:

$ qemu-img bench \
      --image-opts \
      driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on
Sending 75000 read requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 4096)
Run completed in 1.443 seconds.

Being able to access vhost-user-blk devices from qemu-img makes vhost-user-blk a little easier to work with.

Thursday, June 30, 2022

Comparing VIRTIO, NVMe, and io_uring queue designs

Queues and their implementation using shared memory ring buffers are a standard tool for communicating with I/O devices and between CPUs. Although ring buffers are widely used, there is no standard memory layout and it's interesting to compare the differences between designs. When defining libblkio's APIs, I surveyed the ring buffer designs in VIRTIO, NVMe, and io_uring. This article examines some of the differences between the ring buffers and queue semantics in VIRTIO, NVMe, and io_uring.

Ring buffer basics

A ring buffer is a circular array where new elements are written or produced on one side and read or consumed on the other side. Often terms such as head and tail or reader and writer are used to describe the array indices at which the next element is accessed. When the end of the array is reached, one moves back to the start of the array. The empty and full conditions are special states that must be checked to avoid underflow and overflow.

VIRTIO, NVMe, and io_uring all use single producer, single consumer shared memory ring buffers. This allows a CPU and an I/O device or two CPUs to communicate across a region of memory to which both sides have access.

Embedding data in descriptors

At a minimum a ring buffer element, or descriptor, contains the memory address and size of a data buffer:

OffsetTypeName
0x0u64buf
0x8u64len

In a storage device the data buffer contains a request structure with information about the I/O request (logical block address, number of sectors, etc). In order to process a request, the device first loads the descriptor and then loads the request structure described by the descriptor. Performing two loads is sub-optimal and it would be faster to fetch the request structure in a single load.

Embedding the data buffer in the descriptor is a technique that reduces the number of loads. The descriptor layout looks like this:

OffsetTypeName
0x0u64remainder_buf
0x8u64remainder_len
0x10...request structure

The descriptor is extended to make room for the data. If the size of the data varies and is sometimes too large for a descriptor, then the remainder is put into an external buffer. The common case will only require a single load but larger variable-sized buffers can still be handled with 2 loads as before.

VIRTIO does not embed data in descriptors due to its layered design. The data buffers are defined by the device type (net, blk, etc) and virtqueue descriptors are one layer below device types. They have no knowledge of the data buffer layout and therefore cannot embed data.

NVMe embeds the request structure into the Submission Queue Entry. The Command Dword 10, 11, 12, 13, 14, and 15 fields contain the request data and their meaning depends on the Opcode (request type). I/O buffers are still external and described by Physical Region Pages (PRPs) or Scatter Gather Lists (SGLs).

io_uring's struct io_uring_sqe embeds the request structure. Only I/O buffer(s) need to be external as their size varies, would be too large for the ring buffer, and typically zero-copy is desired due to the size of the data.

It seems that VIRTIO could learn from NVMe and io_uring. Instead of having small 16-byte descriptors, it could embed part of the data buffer into the descriptor so that devices need to perform fewer loads during request processing. The 12-byte struct virtio_net_hdr and 16-byte struct virtio_blk_req request headers would fit into a new 32-byte descriptor layout. I have not prototyped and benchmarked this optimization, so I don't know how effective it is.

Descriptor chaining vs external descriptors

I/O requests often include variable size I/O buffers that require scatter-gather lists similar to POSIX struct iovec arrays. Long arrays don't fit into a descriptor so descriptors have fields that point to an external array of descriptors.

Another technique for scatter-gather lists is to chain descriptors together within the ring buffer instead of relying on memory external to the ring buffer. When descriptor chaining is used, I/O requests that don't fit into a single descriptor can occupy multiple descriptors.

Advantages of chaining are better cache locality when a sequence of descriptors is used and no need to allocate separate per-request external descriptor memory.

A consequence of descriptor chaining is that the maximum queue size, or queue depth, becomes variable. It is not possible to guarantee space for specific number of I/O requests because the available number of descriptors depends on the chain size of requests placed into the ring buffer.

VIRTIO supports descriptor chaining although drivers usually forego it when VIRTIO_F_RING_INDIRECT_DESC is available.

NVMe and io_uring do not support descriptor chaining, instead relying on embedded and external descriptors.

Limits on in-flight requests

The maximum number of in-flight requests depends on the ring buffer design. Designs where descriptors are occupied from submission until completion prevent descriptor reuse for other requests while the current request is in flight.

An alternative design is where the device processes submitted descriptors and they are considered free again as soon as the device has looked at them. This approach is natural when separate submission and completion queues are used and there is no relationship between the two descriptor rings.

VIRTIO requests occupy descriptors for the duration of their lifetime, at least in the Split Virtqueue format. Therefore the number of in-flight requests is influenced by the descriptor table size.

NVMe has separate Submission Queues and Completion Queues, but its design still limits the number of in-flight requests to the queue size. The Completion Queue Entry's SQ Head Pointer (SQHD) field precludes having more requests in flight than the Submission Queue size because the field would no longer be unique. Additionally, the driver has no way of detecting Submission Queue Head changes, so it only knows there is space for more submissions when completions occur.

io_uring has independent submission (SQ) and completions queues (CQ) with support for more in-flight requests than the ring buffer size. When there are more in-flight requests than CQ capacity, it's possible to overflow the CQ. io_uring has a backlog mechanism for this case, although the intention is for applications to properly size queues to avoid hitting the backlog often.

Conclusion

VIRTIO, NVMe, and io_uring have slightly different takes on queue design. The semantics and performance vary due to these differences. VIRTIO lacks data embedding inside descriptors. io_uring supports more in-flight requests than the queue size. NVMe and io_uring rely on external descriptors with no ability to chain descriptors.

Friday, April 29, 2022

Debugging Flatpak applications

Flatpak is a way to distribute applications on Linux. Its container-style approach allows applications to run across Linux distributions. This means native packages (rpm, deb, etc) are not needed and it's relatively easy to get your app to Linux users with fewer worries about distro compatibility. This makes life a lot easier for developers and is also convenient for users.

I've run popular applications like OBS Studio as flatpaks and even publish my own on Flathub, a popular hosting site for applications. Today I figured out how to debug flatpaks, which requires some extra steps that I'll share below so I don't forget them myself!

Bonus Tip: Testing local flatpaks

If you're building a flatpak of your own application it's handy to use the dir sources type in the manifest to compile your application's source code from a local directory instead of a git tag or tarball URL. This way you can make changes to the source code and test them quickly inside Flatpak.

Put something along these lines in the manifest's modules object where /home/user/my-app is you local directory with your app's source code:

{
     "name": "my-app",
     "sources": [
         {
             "type": "dir",
             "path": "/home/user/my-app"
         }
     ],
     ...
}

Building and installing apps with debuginfo

flatpak-builder(1) automatically creates a separate .Debug extension for your flatpak that contains your application's debuginfo. You'll need the .Debug extension if you want proper backtraces and source level debugging. At the time of writing the Flatpak documentation did not mention how to install the locally-built .Debug extension. Here is how:

$ flatpak-builder --user --force-clean --install build my.org.app.json
$ flatpak install --user --reinstall --assumeyes "$(pwd)/.flatpak-builder/cache" my.org.app.Debug

It might be a good idea to install debuginfo for the system libraries in your SDK too in case it's not already installed:

$ flatpak install org.kde.Sdk.Debug # or your runtime's SDK

Running applications for debugging

There is a flatpak(1) option that launches the application with the SDK instead of the Runtime:

$ flatpak run --user --devel my.org.app

The SDK contains development tools whereas the Runtime just has the files needed to run applications.

It can also be handy to launch a shell so you can control the launch of your app and maybe use gdb or strace:

$ flatpak run --user --devel --command=sh my.org.app
[📦 my.org.app ~]$ gdb /app/bin/my-app

Working with core dumps

If your application crashes it will dump core like any other process. However, existing ways of inspecting core dumps like coredumpctl(1) are not fully functional because the process ran inside namespaces and debuginfo is located inside flatpaks instead of the usual system-wide /usr/lib/debug location. coredumpctl(1), gdb, etc aren't Flatpak-aware and need extra help.

Use the flatpak-coredumpctl wrapper to launch gdb:

$ flatpak-coredumpctl -m <PID> my.org.app

You can get PID from the list printed by coredumpctl(1).

Conclusion

This article showed how to install locally-built .Debug extensions and inspect core dumps when using Flatpak. I hope that over time these manual steps will become unnecessary as flatpak-builder(1) and coredumpctl(1) are extended to automatically install .Debug extensions and handle Flatpak core dumps. For now it just takes a few extra commands compared to debugging regular applications.

Monday, March 7, 2022

vhost-user is coming to non-Linux hosts!

Sergio Lopez sent a QEMU patch series and vhost-user protocol specification update that maps vhost-user to non-Linux POSIX host operating systems. This is great news because vhost-user has become a popular way to develop emulated devices in any programming language that execute as separate processes with their own security sandboxing. Until now they have only been available on Linux hosts.

At the moment the BSD and macOS implementation is slower than the Linux implementation because the KVM ioeventfd and irqfd primitives are unavailable on those operating systems. Instead POSIX pipes is used and the VMM (QEMU) needs to acts as a forwarder for MMIO/PIO accesses and interrupt injections. On Linux the kvm.ko kernel module has direct support for this, bypassing the VMM process and achieving higher efficiency. However, similar mechanisms could be added to non-KVM virtualization drivers in the future.

This means that vhost-user devices can now start to support multiple host operating systems and I'm sure they will be used in new ways that no one thought about before.

Friday, February 4, 2022

Speaking at FOSDEM '22 about "What's coming in VIRTIO 1.2"

I will give a talk titled What's coming in VIRTIO 1.2: New virtual I/O devices and features on Saturday, February 5th 2022 at 10:00 CET at the FOSDEM virtual conference (it's free and there is no registration!). The 9 new device types will be covered, as well as some of the other features that have been added to the upcoming 1.2 release of the VIRTIO specification. I hope to see you there and if you miss it there will be slides and video available afterwards.

Wednesday, December 8, 2021

How to add debuginfo to perf(1)

Sometimes it's necessary to add debuginfo so perf-report(1) and similar commands can display human-readable function names instead of raw addresses. For instance, if a program was built from source and stripped of symbols when installing into /usr/local/bin/ then perf(1) does not have the symbol information available.

perf(1) maintains a cache of debuginfos keyed by the build-id (also known as .note.gnu.build-id) that uniquely identifies executables and shared objects on Linux. perf.data files contain the build-ids involved when the data was recorded. This allows perf-report(1) and similar commands to look up the required debuginfo from the build-id cache for address to function name translation.

If perf-report(1) displays raw addresses instead of human-readable function names, then we need to get the debuginfo for the build-ids in the perf.data file and add it to the build-id cache. You can show the build-ids required by a perf.data file with perf-buildid-list(1):

$ perf buildid-list # reads ./perf.data by default
b022da126fad1e0a287a6a25016f6c7c996e68c9 /lib/modules/5.14.11-200.fc34.x86_64/kernel/arch/x86/kvm/kvm-intel.ko.xz
f8aa9d9bf047e67b76f22426ad4af310f9b0325a /lib/modules/5.14.11-200.fc34.x86_64/kernel/arch/x86/kvm/kvm.ko.xz
6740f24c4733268d03b41f9483282297dde6b286 [vdso]

Your build-id cache may be missing debuginfo or have limited debuginfo with less symbol information than you need. For example, if data was collected from a stripped /usr/local/bin/my-program executable and you now want to update the build-id cache with the executable that contains full debuginfo, use the perf-buildid-cache(1) command:

$ perf buildid-cache --update=path/to/my-program-with-symbols

There is also an --add=path/to/debuginfo option for adding new build-ids that are not yet in the cache.

Now perf-report(1) and similar tools will display human-readable function names from path/to/my-program-with-symbols instead of the stripped /usr/local/bin/my-program executable. If that doesn't work, verify that the build-ids in my-program-with-symbols and my-program match.