Stefan Hajnoczi

Wednesday, June 25, 2025

Profiling tools I use for QEMU storage performance optimization

A fair amount of the development work I do is related to storage performance in QEMU/KVM. Although I have written about disk I/O benchmarking and my performance analysis workflow in the past, I haven't covered the performance tools that I frequently rely on. In this post I'll go over what's in my toolbox and hopefully this will be helpful to others.

Performance analysis is hard when the system is slow but there is no clear bottleneck. If a CPU profile shows that a function is consuming significant amounts of time, then that's a good target for optimizations. On the other hand, if the profile is uniform and each function only consumes a small fraction of time, then it is difficult to gain much by optimizing just one function (although taking function call nesting into account may point towards parent functions that can be optimized):

If you are measuring just one metric, then eventually the profile will become uniform and there isn't much left to optimize. It helps to measure at multiple layers of the system in order to increase the chance of finding bottlenecks.

Here are the tools I like to use when hunting for QEMU storage performance optimizations.

kvm_stat

kvm_stat is a tool that runs on the host and counts events from the kvm.ko kernel module, including device register accesses (MMIO), interrupt injections, and more. These are often associated with vmentry/vmexit events where control passes between the guest and the hypervisor. Each time this happens there is a cost and it is preferrable to minimize the number of vmentry/vmexit events.

kvm_stat will let you identify inefficiencies when guest drivers are accessing devices as well as low-level activity (MSRs, nested page tables, etc) that can be optimized.

Here is output from an I/O intensive workload:

 Event                                         Total %Total CurAvg/s
 kvm_entry                                   1066255   20.3    43513
 kvm_exit                                    1066266   20.3    43513
 kvm_hv_timer_state                           954944   18.2    38754
 kvm_msr                                      487702    9.3    19878
 kvm_apicv_accept_irq                         278926    5.3    11430
 kvm_apic_accept_irq                          278920    5.3    11430
 kvm_vcpu_wakeup                              250055    4.8    10128
 kvm_pv_tlb_flush                             250000    4.8    10123
 kvm_msi_set_irq                              229439    4.4     9345
 kvm_ple_window_update                        213309    4.1     8836
 kvm_fast_mmio                                123433    2.3     5000
 kvm_wait_lapic_expire                         39855    0.8     1628
 kvm_apic_ipi                                   9614    0.2      457
 kvm_apic                                       9614    0.2      457
 kvm_unmap_hva_range                               9    0.0        1
 kvm_fpu                                          42    0.0        0
 kvm_mmio                                         28    0.0        0
 kvm_userspace_exit                               21    0.0        0
 vcpu_match_mmio                                  19    0.0        0
 kvm_emulate_insn                                 19    0.0        0
 kvm_pio                                           2    0.0        0
 Total                                       5258472          214493

Here I'm looking for high CurAvg/s rates. Any counters at 100k/s are well worth investigating.

Important events:

kvm_entry/kvm_exit: vmentries and vmexits are when the CPU transfers control between guest mode and the hypervisor.
kvm_msr: Model Specific Register accesses
kvm_msi_set_irq: Interrupt injections (Message Signalled Interrupts)
kvm_fast_mmio/kvm_mmio/kvm_pio: Device register accesses

sysstat

sysstat is a venerable suite of performance monitoring tools cover CPU, network, disk, memory activity. It can be used equally within guests and on the host. It is like an extended version of the classic vmstat(8) tool.

The mpstat, pidstat, and iostat tools are the ones I use most often:

mpstat reports CPU consumption, including %usr (userspace), %sys (kernel), %guest (guest mode), %steal (hypervisor), and more. Use this to find CPUs that are maxed out or poor use of parallelism/multi-threading.
pidstat reports per-process and per-thread statistics. This is useful for identifying specific threads that are consuming resources.
iostat reports disk statistics. This is useful for comparing I/O statistics between bare metal and guests.

blktrace

blktrace monitors Linux block I/O activity. This can be used equally within guests and on the host. Often it's interesting to capture traces in both the guest and on the host so they can be compared. If the I/O pattern is different then something in the I/O stack is modifying requests. That can be a sign of a misconfiguration.

The blktrace data can be analyzed and plotted with the btt command. For example, the latencies from driver submission to completion can be summarized to find the overhead compared to bare metal.

perf-top

In its default mode, perf-top is a sampling CPU profiler. It periodically collects the CPU's program counter value so that a profile can be generated showing hot functions. It supports call graph recording with the -g option if you want to aggregate nested function calls and find out which parent functions are responsible for the most CPU usage.

perf-top (and its non-interactive perf-record/perf-report cousin) is good at identifying inner loops of programs, excessive memcpy/memset, expensive locking instructions, instructions with poor cache hit rates, etc. When I use it to profile QEMU it shows the host kernel and QEMU userspace activity. It does not show guest activity.

Here is the output without call graph recording where we can see vmexit activity, QEMU virtqueue processing, and excessive memset at the top of the profile:

   3.95%  [kvm_intel]                 [k] vmx_vmexit
   3.37%  qemu-system-x86_64          [.] virtqueue_split_pop
   2.64%  libc.so.6                   [.] __memset_evex_unaligned_erms
   2.50%  [kvm_intel]                 [k] vmx_spec_ctrl_restore_host
   1.57%  [kernel]                    [k] native_irq_return_iret
   1.13%  qemu-system-x86_64          [.] bdrv_co_preadv_part
   1.13%  [kernel]                    [k] sync_regs
   1.09%  [kernel]                    [k] native_write_msr
   1.05%  [nvme]                      [k] nvme_poll_cq

The goal is to find functions or families of functions that consume significant amounts of CPU time so they can be optimized. If the profile is uniform with most functions taking less then 1%, then the bottlenecks are more likely to be found with other tools.

perf-trace

perf-trace is an strace-like tool for monitoring system call activity. For performance monitoring it has a --summary option that shows time spent in system calls and the counts. This is how you can identify system calls that block for a long time or that are called too often.

Here is an example from the host showing a summary of a QEMU IOThread that uses io_uring for disk I/O and ioeventfd/irqfd for VIRTIO device activity:

 IO iothread1 (332737), 1763284 events, 99.9%

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   io_uring_enter    351680      0  8016.308     0.003     0.023     0.209      0.11%
   write             390189      0  1238.501     0.002     0.003     0.098      0.08%
   read              121057      0   305.355     0.002     0.003     0.019      0.14%
   ppoll              18707      0    62.228     0.002     0.003     0.023      0.32%

Conclusion

We looked at kvm_stat, sysstat, blktrace, perf-top, and perf-trace. They provide performance metrics from different layers of the system. Another worthy mention is the bcc collection of eBPF-based tools that offers a huge array of performance monitoring and troubleshooting tools. Let me know which tools you like to use!

Tuesday, October 1, 2024

Video and slides available for "IOThread Virtqueue Mapping" talk at KVM Forum 2024

My KVM Forum 2024 talk "IOThread Virtqueue Mapping: Improving virtio-blk SMP scalability in QEMU" is now available on YouTube. The slides are also available here.

IOThread Virtqueue Mapping is a new QEMU feature for configuring multiple IOThreads that will handle a virtio-blk device's virtqueues. This means QEMU can take advantage of the host's Linux multi-queue block layer and assign CPUs to I/O processing. Giving additional resources to virtio-blk emulation allows QEMU to achieve higher IOPS and saturate fast local NVMe drives. This is especially important for applications that submit I/O on many vCPUs simultaneously - a workload that QEMU had trouble keeping up with in the past.

You can read more about IOThread Virtqueue Mapping in this Red Hat blog post.

Sunday, March 31, 2024

Where are the Supply Chain Safe Programming Languages?

Programming languages currently offer few defences against supply chain attacks where a malicious third-party library compromises a program. As I write this, the open source community is trying to figure out the details of the xz-utils backdoor, but there is a long history of supply chain attacks. High profile incidents have made plain the danger of shipping software built from large numbers dependencies, many of them unaudited and under little scrutiny for malicious code. In this post I will share ideas on future supply chain safe programming languages.

Supply Chain Safe Programming Languages?

I'm using the term Supply Chain Safe Programming Languages for languages that defend against supply chain attacks and allow library dependencies to be introduced with strong guarantees about what the dependencies can and cannot do. This type of programming language is not yet widely available as of March 2024, to the best of my knowledge.

Supply chain safety is often associated with software packaging and distribution techniques for verifying that software was built from known good inputs. Although adding supply chain safety tools on top of existing programming languages is a pragmatic solution, I think future progress requires addressing supply chain safety directly in the programming language.

Why today's languages are not supply chain safe

Many existing languages have a module system that gives the programmer control over the visibility of variables and functions. By hiding variable and functions from other modules, one might hope to achieve isolation so that a component like a decompression library could not read a sensitive variable from the program. Unfortunately this level of isolation between components is not really available in popular programming languages today even in languages with public/private visibility features. Visibility is more of a software engineering tool for keeping programs decoupled than an isolation mechanism that actually protects components of a program from each other. There are many ways to bypass visibility.

The fundamental problem is that existing programming languages do not even acknowledge that programs often consist of untrusted components. Compilers and interpreters currently treat the entire input source code as having more or less the same level of trust. Here are some of the ways in which today's programming languages fall short:

Unsafe programming languages like C, C++, and even Rust allow the programmer to bypass the type system to do pretty much anything.
Dynamic languages like Python and JavaScript have introspection and monkey patching abilities that allow the programmer to hook other parts of the program and escape attempts at isolation.
Build systems and metaprogramming facilities like macros allow untrusted components to generate code that executes in the context of another component.
Standard libraries provide access to spawning new programs, remapping virtual memory, loading shared libraries written in unsafe languages, hijacking function calls through the linker, raw access to the program's memory space with /proc/self/mem, and so on. All of these can bypass language-level mechanisms for isolating components in a program.

Whatever your current language, it's unlikely that the language itself allows you to isolate components of a program. The best approach we have today for run-time isolation is through sandboxing. Examples of sandboxing approaches include seccomp(2), v8 Isolates for JavaScript, invoking untrusted code in a WebAssembly runtime, or the descendents of chroot(2).

Sandboxes are not supported directly by the programming language and have a number of drawbacks and limitations. Integrating sandboxing into programs is tedious so they are primarily used in the most critical attack surfaces like web browsers or hypervisors. There is usually a performance overhead associated with interacting with the sandbox because data needs to be marshalled or copied. Sandboxing is an opt-in mechanism that doesn't raise the bar of software in general. I believe that supply chain safe programming languages could offer similar isolation but as the default for most software.

What a Supply Chain Safe Programming Language looks like

The goal of a supply chain safe programming language is to isolate components of a program by default. Rather than leaving supply chain safety outside the scope of the language, the language should allow components to be integrated with strong guarantees about what effects they can have on each other. There may be practical reasons to offer an escape hatch to unsafe behavior, but the default needs to be safe.

At what level of granularity should isolation operate? I think modules are too coarse grained because they are often collections of functions that perform very different types of computation requiring different levels of access to resources. The level of granularity should at least go down to the function level within a component, although even achieving module-level granularity would be a major improvement over today's standards.

An example is that a hash table lookup function should be unable to connect to the internet. That way the function can be used without fear of it becoming a liability if it contains bugs or its source code is manipulated by an attacker.

A well-known problem in programming language security is that the majority of languages expose ambient capabilities to all components in a program. Ambient capabilities provide access to resources that are not explicitly passed in to the component. Think of a file descriptor in a POSIX process that is available to any function in the program, including a string compare function that has no business manipulating file descriptors.

Capability-based security approaches are a solution to the ambient capabilities problem in languages today. Although mainstream programming languages do not offer capabilities as part of the language, there have been special-purpose and research languages that demonstrated that this approach works. In a type safe programming language with capability-based security it becomes possible to give components access to only those resources that they require. Usually type safety is the mechanism that prevents capabilities from being created out of thin air, although other approaches may be possible for dynamic languages. The type system will not allow a component to create itself a new capability that the component does not already possess.

Capability-based security addresses safety at runtime, but it does not address safety at compile time. If we want to compose programs from untrusted components then it is not possible to rely on today's build scripts, code generators, or macro systems. The problem is that they can be abused by a component to execute code in the context of another component.

Compile-time supply chain safety means isolating components so their code stays within their component. For example, a "leftpad" macro that pads a string literal with leading spaces would be unsafe if it can generate code that is compiled as part of the main program using the macro. Similarly, a build script for the leftpad module must not be able to affect or escape the build environment.

Macros, build scripts, code generators, and so on are powerful tools that programmers find valuable. The challenge for supply chain safe programming languages is to harness that power so that it remains convenient to use without endangering safety. One example solution is running build scripts in an isolated environment that cannot affect other components in the program. This way a component can take advantage of custom build-time behavior without endangering the program. However, it is unclear to me how far inter-component facilities like macros can be made safe, if at all.

Conclusion

I don't have the answers or even a prototype, but I think supply chain safe programming languages are an inevitability. Modern programs are composed of many third-party components yet we do not have effective techniques for confining components. Languages treat the entire program as trusted rather than as separate untrusted components that must be isolated.

Hopefully we will begin to see new mainstream programming languages emerge that are supply chain safe, not just memory safe!

Wednesday, March 6, 2024

How to access libvirt domains in KubeVirt

KubeVirt makes it possible to run virtual machines on Kubernetes alongside container workloads. Virtual machines are configured using VirtualMachineInstance YAML. But under the hood of KubeVirt lies the same libvirt tooling that is commonly used to run KVM virtual machines on Linux. Accessing libvirt can be convenient for development and troubleshooting.

Note that bypassing KubeVirt must be done carefully. Doing this in production may interfere with running VMs. If a feature is missing from KubeVirt, then please request it.

The following diagram shows how the user's VirtualMachineInstance is turned into a libvirt domain:

Accessing virsh

Libvirt's virsh command-line tool is available inside the virt-launcher Pod that runs a virtual machine. First determine vm1's virt-launcher Pod name by filtering on its label (thanks to Alice Frosi for this trick!):

$ kubectl get pod -l vm.kubevirt.io/name=vm1
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vm1-5gxvg   2/2     Running   0          8m13s

Find the name of the libvirt domain (this is guessable but it doesn't hurt to check):

$ kubectl exec virt-launcher-vm1-5gxvg -- virsh list
 Id   Name          State
-----------------------------
 1    default_vm1   running

Arbitrary virsh commands can be invoked. Here is an example of dumping the libvirt domain XML:

$ kubectl exec virt-launcher-vm1-5gxvg -- virsh dumpxml default_vm1
<domain type='kvm' id='1'>
  <name>default_vm1</name>
...

Viewing libvirt logs and full the QEMU command-line

The libvirt logs are captured by Kubernetes so you can view them with kubectl log <virt-launcher-pod-name>. If you don't know the virt-launcher pod name, check with kubectl get pod and look for your virtual machine's name.

The full QEMU command-line is part of the libvirt logs, but unescaping the JSON string is inconvenient. Here is another way to get the full QEMU command-line:

$ kubectl exec <virt-launcher-pod-name> -- ps aux | grep qemu

Customizing KubeVirt's libvirt domain XML

KubeVirt has a feature for customizing libvirt domain XML called hook sidecars. After the libvirt XML is generated, it is sent to a user-defined container that processes the XML and returns it back. The libvirt domain is defined using this processed XML. To learn more about how it works, check out the documentation.

Hook sidecars are available when the Sidecar feature gate is enabled in the kubevirt/kubevirt custom resource. Normally only the cluster administrator can modify the kubevirt CR, so be sure to check when trying this feature:

$ kubectl auth can-i update  kubevirt/kubevirt -n kubevirt
yes

Although you can provide a complete container image for the hook sidecar, there is a shortcut if you just want to run a script. A generic hook sidecar image is available that launches a script which can be provided as a ConfigMap. Here is example YAML including a ConfigMap that I've used to test the libvirt IOThread Virtqueue Mapping feature:

---
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration: 
      featureGates:
        - Sidecar
---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: "fedora"
spec:
  storage:
    accessModes:
        - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi
  source:
    http:
      url: "https://download.fedoraproject.org/pub/fedora/linux/releases/38/Cloud/x86_64/images/Fedora-Cloud-Base-38-1.6.x86_64.raw.xz"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: sidecar-script
data:
  my_script.sh: |
    #!/usr/bin/env python3
    import xml.etree.ElementTree as ET
    import os.path
    import sys
    
    NUM_IOTHREADS = 4
    VOLUME_NAME = 'data' # VirtualMachine volume name
    
    def main(xml):
        domain = ET.fromstring(xml)
    
        domain.find('iothreads').text = str(NUM_IOTHREADS)
    
        disk = domain.find(f"./devices/disk/alias[@name='ua-{VOLUME_NAME}']..")
        driver = disk.find('driver')
        del driver.attrib['iothread']
        iothreads = ET.SubElement(driver, 'iothreads')
        for i in range(NUM_IOTHREADS):
            iothread = ET.SubElement(iothreads, 'iothread')
            iothread.set('id', str(i + 1))
    
        ET.dump(domain)
    
    if __name__ == "__main__":
        # Workaround for https://github.com/kubevirt/kubevirt/issues/11276
        if os.path.exists('/tmp/ran-once'):
            main(sys.argv[4])
        else:
            open('/tmp/ran-once', 'wb')
            print(sys.argv[4])
---
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  creationTimestamp: 2018-07-04T15:03:08Z
  generation: 1
  labels:
    kubevirt.io/os: linux
  name: vm1
  annotations:
    hooks.kubevirt.io/hookSidecars: '[{"args": ["--version", "v1alpha3"],
      "image": "kubevirt/sidecar-shim:20240108_99b6c4bdb",
      "configMap": {"name": "sidecar-script",
                    "key": "my_script.sh",
                    "hookPath": "/usr/bin/onDefineDomain"}}]'
spec:
  domain:
    ioThreadsPolicy: auto
    cpu:
      cores: 8
    devices:
      blockMultiQueue: true
      disks:
      - disk:
          bus: virtio
        name: disk0
      - disk:
          bus: virtio
        name: data
    machine:
      type: q35
    resources:
      requests:
        memory: 1024M
  volumes:
  - name: disk0
    persistentVolumeClaim:
      claimName: fedora
  - name: data
    emptyDisk:
      capacity: 8Gi

If you need to go down one level further and customize the QEMU command-line, see my post on passing QEMU command-line options in libvirt domain XML.

More KubeVirt debugging tricks

The official KubeVirt documentation has a Virtualization Debugging section with more tricks for customizing libvirt logging, launching QEMU with strace or gdb, etc. Thanks to Alice Frosi for sharing the link!

Conclusion

It is possible to get libvirt access in KubeVirt for development and testing. This can make troubleshooting easier and it gives you the full range of libvirt domain XML if you want to experiment with features that are not yet exposed by KubeVirt.

Thursday, January 25, 2024

Key-Value Stores: The Foundation of File Systems and Databases

File systems and relational databases are like cousins. They share more than is apparent at first glance.

It's not immediately obvious that relational databases and file systems rely upon the same underlying concept. That underlying concept is the key-value store and this article explores how both file systems and databases can be implemented on top of key-value stores.

The key-value store interface

Key-value stores provide an ordered map data structure. A map is a data structure that supports storing and retrieving from a collection of pairs. It's called a map because it is like a mathematical relation from a given key to an associated value. These are the key-value pairs that a key-value store holds. Finally, ordered means that the collection can be traversed in sorted key order. Not all key-value store implementations support ordered traversal, but both file systems and databases need this property as we shall see.

Here is a key-value store with an integer key and a string value:

Notice that the keys can be enumerated in sorted order: 2 → 14 → 17.

A key-value store provides the following interface for storing and retrieving values by a given key:

put(Key, Value) - an insert/update operation that stores a value for a given key
get(Key) -> Value - a lookup operation that retrieves the most recently stored value for a given key
first() -> Key, last() -> Key, next(Key) -> Key, prev(Key) -> Key - a cursor API that enumerates keys in sorted order

You've probably seen this sort of API if you have explored libraries like LevelDB, RocksDB, LMDB, BoltDB, etc or used NoSQL key-value stores. File systems and databases usually implement their own customized key-value stores rather than use these off-the-shelf solutions.

Why key-value stores are necessary

Let's look at how the key-value store interface relates to disks. Disks present a range of blocks that can be read or written at their block addresses. Disks can be thought of like arrays in programming. They have O(1) lookup and update time complexity but inserting or removing a value before the end of the array is O(n) because subsequent elements need to be copied. They are efficient for dense datasets where every element is populated but inefficient for sparse datasets that involve insertion and removal.

Workloads that involve insertion or removal are not practical when the cost is O(n) for realistic sizes of n. That's why programs often use in-memory data structures like hash tables or balanced trees instead of arrays. Key-value stores can be thought of as the on-disk equivalent to these in-memory data structures. Inserting or removing values from a key-value store takes sub-linear time, perhaps O(log n) or even better amortized time. We won't go into the data structures used to implement key-value stores, but B+ trees and Log-Structured Merge-Trees are popular choices.

This gives us an intuition about when key-value stores are needed and why they are an effective tool. Now let's look at how file systems and databases can be built on top of key-value stores next.

Building a file system on a key-value store

First let's start with how data is stored in files. A file system locates file data on disk by translating file offsets to Logical Block Addresses (LBAs). This is necessary because file data may not be stored contiguously on disk and files can be sparse with unallocated "holes" where nothing has been written yet. Thus, each file can be implemented as a key-value store with <Offset, <LBA, Length>> key-value pairs that comprise the translations needed to locate data on disk:

Reading and writing to the file involves looking up Offset -> LBA translations and inserting new translations when new blocks are allocated for the file. This is a good fit for a key-value store, but it's not the only place where file systems employ key-value stores.

File systems track free blocks that are not in used by files or metadata so that the block allocator can quickly satisfy allocation requests. This can be implemented as a key-value store with <LBA, Length> key-value pairs representing all free LBA ranges.

If the block allocator needs to satisfy contiguous allocation requests then a second key-value store with <Length, LBA> key-value pairs can serve as an efficient lookup or index. A best-fit allocator uses this key-value store by looking up the requested contiguous allocation size. Either an free LBA range of the matching size will be found or the next ordered key can be traversed when lookup fails to find a bigger free range capable of satisfying this allocation request. This is an important pattern with key-value stores: we can have one main key-value store plus one or more indices that are derived from the same key-value pairs but use a different datum as the key than the primary key-value store, allowing efficient lookups and ordered traversal. The same pattern will come up in databases too.

Next, let's look at how to represent directory metadata in a key-value store. Files are organized into a hierarchy of directories (or folders). The file system stores the directory entries belonging to each directory. Each directory can be organized as a key-value store with filenames as keys and inode numbers as values. Path traversal consists of looking up directory entries in each directory along file path components like home, user, and file in the path /home/user/file. When a file is created, a new directory entry is inserted. When a file is deleted, its directory entry is removed. The contents of a directory can be listed by traversing the keys.

Some file systems like BTRFS use key-value stores for other on-disk structures such as snapshots, checksums, etc, too. There is even a root key-value store in BTRS from which all these other key-value stores can be looked up. We'll see that the same concept of a "forest of trees" or a root key-value store that points to other key-value stores also appears in databases below.

Building a database on a key-value store

The core concept in relational databases is the table, which contains the rows of the data we wish to store. The table columns are the various fields that are stored by each row. One or more columns make up the primary key by which table lookups are typically performed. The table can be implemented as a key-value store using the primary key columns as the key and the remainder of the columns as the value:

This key-value store can look up rows in the table by their Id. What if we want to look up a row by Username instead?

To enable efficient lookups by Username, a secondary key-value store called an index maintains a mapping from Username to Id. The index does not duplicate all the columns in the table, just the Username and Id. To perform a query like SELECT * FROM Users WHERE Username = 'codd', the index is first used to look up the Id and then the remainder of the columns are looked up from the table.

SQLite's file format documentation shows the details of how data is organized along these lines and the power of key-value stores. The file format has a header the references the "table b-tree" that points to the roots of all tables. This means there is an entry point key-value store that points to all the other key-value stores associated with tables, indices, etc in the database. This is similar to the forest of trees we saw in the BTRFS file system where the key-value store acts as the central data structure tying everything together.

Conclusion

If a disk is like an array in programming, then a key-value store is like a dict. It offers a convenient interface for storing and retrieving sparse data with good performance. Both file systems and databases are abundant with sparse data and therefore fit naturally on top of key-value store. The actual key-value store implementations inside file systems and databases may be specialized variants of B-trees and other data structures that don't even call themselves key-value stores, but the fundamental abstraction upon which file systems and databases are built is the key-value store.

Tuesday, January 2, 2024

QEMU AioContext removal and how it was done

This post is about the AioContext lock removal in QEMU 9.0 (planned for release in 2024), how we got here, and what it means for multi-threaded code in QEMU.

Early QEMU as a single-threaded program

Until 2009 QEMU was largely a single-threaded program. This had the benefit that the code didn't need to consider thread-safety and was thus simpler and less bug-prone. The main loop interleaved running the next piece of guest code and handling external events such as timers, disk I/O, and network I/O. This architecture had the downside that emulating multi-processor guests was bottlenecked by the single host CPU on which QEMU ran. There was no parallelism and this became problematic as multi-processor guests became popular.

Multi-threading with vCPU threads and the Big QEMU Lock

The architecture was modified to support running dedicated vCPU threads for KVM guests. This made parallelism possible for multi-processor guests but the feature was initially only available for KVM guests. The Multi-Threaded TCG (MTTCG) feature eventually allowed translated code to also take advantage of vCPU threads in 2016.

A straightforward approach to making all existing code thread-safe was taken: the Big QEMU Lock (BQL) was introduced to serialize access to QEMU's internal state. The BQL is a single global mutex that is used to protect the majority of QEMU's internal state. KVM vCPU threads do not need access to QEMU's internal state while executing guest code, so they don't hold the BQL most of the time. The main loop thread drops the BQL while blocking in ppoll(2) and this allows vCPU threads to acquire the lock when they come out of guest code.

Multi-threading with IOThreads and the AioContext lock

Although the vCPU bottleneck had been solved, device emulation still ran with the BQL held. This meant that only a single QEMU thread could process I/O requests at a time. For I/O bound workloads this was a bottleneck and especially disk I/O performance suffered due to this limitation. My first attempt at removing the bottleneck in 2012 amounted to writing a new "dataplane" code path outside the BQL, but it lacked the features that users needed like disk image file formats, I/O throttling, etc because it couldn't use the existing code that relied on the BQL. The long term solution would be introducing thread-safety to the existing code and that led to the creation of the AioContext lock.

The AioContext lock was like a mini-BQL but for an event loop (QEMU calls this an AioContext) instead of the entire program. Initially the event loop would acquire the lock while running event handlers, thereby ensuring mutual exclusion for all handlers associated with the event loop. Another thread could acquire the lock to stop the event loop from running and safely access variables. This was a crude approach though and propagated the BQL way of thinking further. QEMU began to suffer from deadlocks and race conditions now that multi-threading was possible. Although I wrote developer documentation about how the model worked, it became tricky to gain confidence in the safety of the code as the whole QEMU block layer needed to grapple with AioContext locking and did so incompletely and inconsistently.

The upshot of all of this was that disk I/O processing could run in a dedicated event loop thread (QEMU calls this an IOThread) while the QEMU monitor could acquire the AioContext lock for a brief moment to inspect the emulated disk for an "info block" monitor command, for example. Unlike the earlier "dataplane" approach, it was now possible for the QEMU block layer to run outside the BQL and instead rely on the AioContext lock.

Removing the AioContext lock

Paolo Bonzini had the idea to gradually eliminate the AioContext lock in favor of fine-grained locks because we kept hitting problems with the AioContext lock that I described above. His insight was to change the model so that handler functions would explicitly take their AioContext's lock instead acquiring the lock around the entire event loop iteration. The advantage to letting handlers take the lock was that they could also replace it with another mechanism. Eventually it would be possible to move away from the AioContext lock.

What came after was a multi-year journey that I credit to Paolo's vision. Emanuele Giuseppe Esposito worked with Paolo on putting fine-grained locking into practice and on sorting through the entire QEMU block layer to determine under which threads and locks variables were accessed. This was a massive effort and required a lot of persistence. Kevin Wolf figured out how to use clang's Thread Safety Analysis (TSA) to check some of the locking rules at compile time. Kevin also spent a lot of time protecting the block driver graph with a reader/writer lock so that in-flight I/O does not crash amidst modifications to the graph. Emanuele and Kevin gave a talk at KVM Forum 2023 about the larger QEMU multi-queue block layer effort and the slides are available here (PDF).

Once everything that previously relied on the AioContext lock had switched to another form of thread-safety, it was possible to remove the AioContext lock as nothing used it anymore. The BQL is still widely used and covers global state that is accessed from few threads. Code that can run in any IOThread now uses its own locks or other mechanisms. The complexity of the codebase is still roughly the same as with the AioContext lock, but now there are fine-grained locks, which are easier to understand and there are fewer undocumented locking assumptions that led to deadlocks and races in the past.

Conclusion

QEMU's AioContext lock enabled multi-threading but was also associated with deadlocks and race conditions due to its ambient nature. From QEMU 9.0 onwards, QEMU will switch to fine-grained locks that are more localized and make thread-safety more explicit. Changing locking in a large program is time-consuming and difficult. It took a multi-person multi-year effort to complete this work, but it forms the basis for further work including the QEMU multi-queue block layer effort that push multi-threading further in QEMU.

Monday, January 1, 2024

Storage literature notes on free space management and snapshots

I recently looked at papers about free space management and snapshots in storage systems like file systems, volume managers, and key-value stores. I'm publishing my notes in case you find them useful, but the real value might simply be the links to papers in this field. They might be a useful starting point for someone wishing to read into this field.

My aim was to get an overview of data structures and algorithms used in modern storage systems for tracking free space and snapshotting objects.

Literature

For a 50-year overview of file systems, see this blog series.

The Zettabyte File system (2003)
- The Storage Pool Allocator (SPA) provides allocation and freeing of blocks across physical disks. It deals in disk virtual addresses (DVAs) so the caller is unaware of which disk storage is located. Blocks can be migrated between devices without changing their DVA because the SPA can just update translation metadata.
  - A slab allocator is used to satisfy contiguous block allocation requests of power-of-2 sizes (see details). Each device is divided into ~200 “metaslabs” (i.e. 0.5% of the device).
  - Allocations in a metaslab are written into a log called a space map and rewritten when the log becomes too long (see details). In memory, range trees are built from the on-disk log so that free space can be looked up by offset or length (see details).
- All blocks are checksummed. Checksums are stored along with the block pointer, so the integrity of the entire tree is protected via the checksum. When data is mirrored across drives it is possible to fix checksum failures.
- The Data Management Unit (DMU) provides an object storage interface for creating, accessing, and deleting objects on top of the SPA.
- The ZFS POSIX Layer (ZPL) implements POSIX file system semantics using the DMU to create objects for directories, files, etc.
- When there are too many data blocks to store the block pointers, ZFS uses indirect blocks (up to 6 levels). Indirect blocks are blocks containing block pointers.

B-trees, Shadowing, and Clones (2006)
- Uses a copy-on-write B+-tree to implement an object storage device (OSD).
- Requests are written to a log for recovery in between B+-tree checkpoints.
- B+-tree pages are kept cached in memory until checkpoint write-out so that multiple updates to the same page are batched.
- Hierarchical reference counts are used on tree nodes. This makes refcounts lazy and avoids having to increment/decrement refcounts on all blocks upfront.

FlexVol: Flexible, Efficient File Volume Virtualization in WAFL (2008)
- Introduces logical volumes into WAFL so that multiple file systems can be managed on the same physical storage with separate snapshots, policies, etc.
- Delayed Block Freeing: do not actually free blocks and instead defer until 2% of blocks are ready to be freed in the background.
- Cloning Volumes from Snapshots works like backing file chains in qcow2 or VMDK. WAFL knows which Snapshots are referenced and won’t free their metadata and blocks because Clone Volumes may still be using them. Clone Volumes can be detached from their Snapshots by copying out the data blocks to new blocks.

Tracking Back References in a Write-Anywhere File System (2010)
- Log-structured back references are write-optimized so that block allocation, snapshot creation, etc efficiently record users of physical blocks. This information is needed during defragmentation and other data reorganization operations.
- Serves queries from physical block address to logical block (inode, offset).
- Implemented using a log-structured merge tree (requires periodic compaction) and a Bloom filter.

MDB: A Memory-Mapped Database and Backend for OpenLDAP (2011)
- LMDB is a read-optimized key-value store implemented as a copy-on-write B+-tree
- Concurrency model: 1 writer and N readers at the same time
- Entire database file is mmapped but writes and flushes use syscalls
- Freelist B+-tree tracks free pages in database file

BTRFS: The Linux B-tree filesystem (2012)
- Extent-based free space management
  - Extent allocation tree stores back references, allowing extents to be moved later
  - Relies on contiguous free space, so background defragmentation is necessary
- Sub-volume tree nodes are reference counted
- A 4KB write creates new inodes, file extents, checksums, and back references and corresponding b-tree spine nodes. When there are multiple modifications, spatial locality (sequential I/O or inode changes in a directory) helps batch these changes together resulting in fewer than N new nodes for N operations. Random I/O is less efficient.

GCTrees: Garbage Collecting Snapshots (2015)
- Rodeh's hierarchical reference counting delays refcount updates by keep refcounts on tree nodes and updating only the node's refcount closest to the root. Further tree modifications might eventually make it necessary to update subsets of refcounts in tree leaves. This can be combined with a refcount log to reduce the random I/O involved in updating many scattered refcounts.
- GCTrees node store an offset to the parent GCTree node and a borrowed bitmap tracking which blocks are shared with the parent.
  - When a GCTree is deleted:
    - Blocks are ignored when the borrowed bit is set
    - The borrowed bit is checked in immediate child GCTree nodes to determine if the remaining blocks are still in use:
      - If not in use, free the block
      - If in use, clear the borrowed bit in the child to transfer ownership of the block to the child (paper doesn't explain how this works when multiple immediate children borrow the same block because this research only considers read-only snapshots without writeable clone support)
    - The linked list (relationship between GCTree nodes) is updated

Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL (2017)
- WAFL keeps free space metadata up-to-date instead of eventually consistent (relying on scanning metadata in the background to identify free space).
- Free space is recorded in a bitmap called activemap. Blocks are allocated near each other (e.g. contiguous), if possible, to minimize updates to the activemap.
- WAFL implements background and inline defragmentation to make contiguous free space available.
- File deletion does not instantly clear bits in the activemap because doing so would be expensive on large files. Deleted files are incrementally freed across checkpoints.
- The Batched Free Log (BFLog) holds deleted blocks and sorts them so they can be deleted incrementally.

How to Copy Files (2020)
- Aims to create what they call "nimble clones" (fast creation, fast read/write I/O, and efficient space utilization)
- Read performance with btrfs, ZFS, xfs degrades after successive rounds of clone + write. The intuition is that at some point it's better to copy the blocks to avoid fragmentation instead of sharing them.
  - They call this Copy-on-Abundant-Write (CAW)
- Implemented in BetrFS, a file system based on a Bε-tree key-value store that uses path names as keys instead of inode numbers.
  - Uses hierarchical reference counts to track nodes
  - Free space is tracked in a bitmap in the node translation table, which is used for indirection to avoid rewriting nodes when physical block locations are updated
  - Didn't look in detail at the Bε-tree DAG technique introduced to implement efficient copies

Data structures

B+ trees: common in file systems and databases for ordered indexes

Bitmaps: widely used to track block allocation

Log-structured merge trees: write-optimized key-value stores that require periodic compaction

Bloom filters: probabilistic data structure for set membership tests sacrificing accuracy (there can be false positives) for low space requirements

Skip lists: probabilistic O(log n) multi-level linked list data structure atop a sorted array but not as popular as B+ trees for on-disk structures