Stefan Hajnoczi: January 2024

Thursday, January 25, 2024

Key-Value Stores: The Foundation of File Systems and Databases

File systems and relational databases are like cousins. They share more than is apparent at first glance.

It's not immediately obvious that relational databases and file systems rely upon the same underlying concept. That underlying concept is the key-value store and this article explores how both file systems and databases can be implemented on top of key-value stores.

The key-value store interface

Key-value stores provide an ordered map data structure. A map is a data structure that supports storing and retrieving from a collection of pairs. It's called a map because it is like a mathematical relation from a given key to an associated value. These are the key-value pairs that a key-value store holds. Finally, ordered means that the collection can be traversed in sorted key order. Not all key-value store implementations support ordered traversal, but both file systems and databases need this property as we shall see.

Here is a key-value store with an integer key and a string value:

Notice that the keys can be enumerated in sorted order: 2 → 14 → 17.

A key-value store provides the following interface for storing and retrieving values by a given key:

put(Key, Value) - an insert/update operation that stores a value for a given key
get(Key) -> Value - a lookup operation that retrieves the most recently stored value for a given key
first() -> Key, last() -> Key, next(Key) -> Key, prev(Key) -> Key - a cursor API that enumerates keys in sorted order

You've probably seen this sort of API if you have explored libraries like LevelDB, RocksDB, LMDB, BoltDB, etc or used NoSQL key-value stores. File systems and databases usually implement their own customized key-value stores rather than use these off-the-shelf solutions.

Why key-value stores are necessary

Let's look at how the key-value store interface relates to disks. Disks present a range of blocks that can be read or written at their block addresses. Disks can be thought of like arrays in programming. They have O(1) lookup and update time complexity but inserting or removing a value before the end of the array is O(n) because subsequent elements need to be copied. They are efficient for dense datasets where every element is populated but inefficient for sparse datasets that involve insertion and removal.

Workloads that involve insertion or removal are not practical when the cost is O(n) for realistic sizes of n. That's why programs often use in-memory data structures like hash tables or balanced trees instead of arrays. Key-value stores can be thought of as the on-disk equivalent to these in-memory data structures. Inserting or removing values from a key-value store takes sub-linear time, perhaps O(log n) or even better amortized time. We won't go into the data structures used to implement key-value stores, but B+ trees and Log-Structured Merge-Trees are popular choices.

This gives us an intuition about when key-value stores are needed and why they are an effective tool. Now let's look at how file systems and databases can be built on top of key-value stores next.

Building a file system on a key-value store

First let's start with how data is stored in files. A file system locates file data on disk by translating file offsets to Logical Block Addresses (LBAs). This is necessary because file data may not be stored contiguously on disk and files can be sparse with unallocated "holes" where nothing has been written yet. Thus, each file can be implemented as a key-value store with <Offset, <LBA, Length>> key-value pairs that comprise the translations needed to locate data on disk:

Reading and writing to the file involves looking up Offset -> LBA translations and inserting new translations when new blocks are allocated for the file. This is a good fit for a key-value store, but it's not the only place where file systems employ key-value stores.

File systems track free blocks that are not in used by files or metadata so that the block allocator can quickly satisfy allocation requests. This can be implemented as a key-value store with <LBA, Length> key-value pairs representing all free LBA ranges.

If the block allocator needs to satisfy contiguous allocation requests then a second key-value store with <Length, LBA> key-value pairs can serve as an efficient lookup or index. A best-fit allocator uses this key-value store by looking up the requested contiguous allocation size. Either an free LBA range of the matching size will be found or the next ordered key can be traversed when lookup fails to find a bigger free range capable of satisfying this allocation request. This is an important pattern with key-value stores: we can have one main key-value store plus one or more indices that are derived from the same key-value pairs but use a different datum as the key than the primary key-value store, allowing efficient lookups and ordered traversal. The same pattern will come up in databases too.

Next, let's look at how to represent directory metadata in a key-value store. Files are organized into a hierarchy of directories (or folders). The file system stores the directory entries belonging to each directory. Each directory can be organized as a key-value store with filenames as keys and inode numbers as values. Path traversal consists of looking up directory entries in each directory along file path components like home, user, and file in the path /home/user/file. When a file is created, a new directory entry is inserted. When a file is deleted, its directory entry is removed. The contents of a directory can be listed by traversing the keys.

Some file systems like BTRFS use key-value stores for other on-disk structures such as snapshots, checksums, etc, too. There is even a root key-value store in BTRS from which all these other key-value stores can be looked up. We'll see that the same concept of a "forest of trees" or a root key-value store that points to other key-value stores also appears in databases below.

Building a database on a key-value store

The core concept in relational databases is the table, which contains the rows of the data we wish to store. The table columns are the various fields that are stored by each row. One or more columns make up the primary key by which table lookups are typically performed. The table can be implemented as a key-value store using the primary key columns as the key and the remainder of the columns as the value:

This key-value store can look up rows in the table by their Id. What if we want to look up a row by Username instead?

To enable efficient lookups by Username, a secondary key-value store called an index maintains a mapping from Username to Id. The index does not duplicate all the columns in the table, just the Username and Id. To perform a query like SELECT * FROM Users WHERE Username = 'codd', the index is first used to look up the Id and then the remainder of the columns are looked up from the table.

SQLite's file format documentation shows the details of how data is organized along these lines and the power of key-value stores. The file format has a header the references the "table b-tree" that points to the roots of all tables. This means there is an entry point key-value store that points to all the other key-value stores associated with tables, indices, etc in the database. This is similar to the forest of trees we saw in the BTRFS file system where the key-value store acts as the central data structure tying everything together.

Conclusion

If a disk is like an array in programming, then a key-value store is like a dict. It offers a convenient interface for storing and retrieving sparse data with good performance. Both file systems and databases are abundant with sparse data and therefore fit naturally on top of key-value store. The actual key-value store implementations inside file systems and databases may be specialized variants of B-trees and other data structures that don't even call themselves key-value stores, but the fundamental abstraction upon which file systems and databases are built is the key-value store.

Tuesday, January 2, 2024

QEMU AioContext removal and how it was done

This post is about the AioContext lock removal in QEMU 9.0 (planned for release in 2024), how we got here, and what it means for multi-threaded code in QEMU.

Early QEMU as a single-threaded program

Until 2009 QEMU was largely a single-threaded program. This had the benefit that the code didn't need to consider thread-safety and was thus simpler and less bug-prone. The main loop interleaved running the next piece of guest code and handling external events such as timers, disk I/O, and network I/O. This architecture had the downside that emulating multi-processor guests was bottlenecked by the single host CPU on which QEMU ran. There was no parallelism and this became problematic as multi-processor guests became popular.

Multi-threading with vCPU threads and the Big QEMU Lock

The architecture was modified to support running dedicated vCPU threads for KVM guests. This made parallelism possible for multi-processor guests but the feature was initially only available for KVM guests. The Multi-Threaded TCG (MTTCG) feature eventually allowed translated code to also take advantage of vCPU threads in 2016.

A straightforward approach to making all existing code thread-safe was taken: the Big QEMU Lock (BQL) was introduced to serialize access to QEMU's internal state. The BQL is a single global mutex that is used to protect the majority of QEMU's internal state. KVM vCPU threads do not need access to QEMU's internal state while executing guest code, so they don't hold the BQL most of the time. The main loop thread drops the BQL while blocking in ppoll(2) and this allows vCPU threads to acquire the lock when they come out of guest code.

Multi-threading with IOThreads and the AioContext lock

Although the vCPU bottleneck had been solved, device emulation still ran with the BQL held. This meant that only a single QEMU thread could process I/O requests at a time. For I/O bound workloads this was a bottleneck and especially disk I/O performance suffered due to this limitation. My first attempt at removing the bottleneck in 2012 amounted to writing a new "dataplane" code path outside the BQL, but it lacked the features that users needed like disk image file formats, I/O throttling, etc because it couldn't use the existing code that relied on the BQL. The long term solution would be introducing thread-safety to the existing code and that led to the creation of the AioContext lock.

The AioContext lock was like a mini-BQL but for an event loop (QEMU calls this an AioContext) instead of the entire program. Initially the event loop would acquire the lock while running event handlers, thereby ensuring mutual exclusion for all handlers associated with the event loop. Another thread could acquire the lock to stop the event loop from running and safely access variables. This was a crude approach though and propagated the BQL way of thinking further. QEMU began to suffer from deadlocks and race conditions now that multi-threading was possible. Although I wrote developer documentation about how the model worked, it became tricky to gain confidence in the safety of the code as the whole QEMU block layer needed to grapple with AioContext locking and did so incompletely and inconsistently.

The upshot of all of this was that disk I/O processing could run in a dedicated event loop thread (QEMU calls this an IOThread) while the QEMU monitor could acquire the AioContext lock for a brief moment to inspect the emulated disk for an "info block" monitor command, for example. Unlike the earlier "dataplane" approach, it was now possible for the QEMU block layer to run outside the BQL and instead rely on the AioContext lock.

Removing the AioContext lock

Paolo Bonzini had the idea to gradually eliminate the AioContext lock in favor of fine-grained locks because we kept hitting problems with the AioContext lock that I described above. His insight was to change the model so that handler functions would explicitly take their AioContext's lock instead acquiring the lock around the entire event loop iteration. The advantage to letting handlers take the lock was that they could also replace it with another mechanism. Eventually it would be possible to move away from the AioContext lock.

What came after was a multi-year journey that I credit to Paolo's vision. Emanuele Giuseppe Esposito worked with Paolo on putting fine-grained locking into practice and on sorting through the entire QEMU block layer to determine under which threads and locks variables were accessed. This was a massive effort and required a lot of persistence. Kevin Wolf figured out how to use clang's Thread Safety Analysis (TSA) to check some of the locking rules at compile time. Kevin also spent a lot of time protecting the block driver graph with a reader/writer lock so that in-flight I/O does not crash amidst modifications to the graph. Emanuele and Kevin gave a talk at KVM Forum 2023 about the larger QEMU multi-queue block layer effort and the slides are available here (PDF).

Once everything that previously relied on the AioContext lock had switched to another form of thread-safety, it was possible to remove the AioContext lock as nothing used it anymore. The BQL is still widely used and covers global state that is accessed from few threads. Code that can run in any IOThread now uses its own locks or other mechanisms. The complexity of the codebase is still roughly the same as with the AioContext lock, but now there are fine-grained locks, which are easier to understand and there are fewer undocumented locking assumptions that led to deadlocks and races in the past.

Conclusion

QEMU's AioContext lock enabled multi-threading but was also associated with deadlocks and race conditions due to its ambient nature. From QEMU 9.0 onwards, QEMU will switch to fine-grained locks that are more localized and make thread-safety more explicit. Changing locking in a large program is time-consuming and difficult. It took a multi-person multi-year effort to complete this work, but it forms the basis for further work including the QEMU multi-queue block layer effort that push multi-threading further in QEMU.

Monday, January 1, 2024

Storage literature notes on free space management and snapshots

I recently looked at papers about free space management and snapshots in storage systems like file systems, volume managers, and key-value stores. I'm publishing my notes in case you find them useful, but the real value might simply be the links to papers in this field. They might be a useful starting point for someone wishing to read into this field.

My aim was to get an overview of data structures and algorithms used in modern storage systems for tracking free space and snapshotting objects.

Literature

For a 50-year overview of file systems, see this blog series.

The Zettabyte File system (2003)
- The Storage Pool Allocator (SPA) provides allocation and freeing of blocks across physical disks. It deals in disk virtual addresses (DVAs) so the caller is unaware of which disk storage is located. Blocks can be migrated between devices without changing their DVA because the SPA can just update translation metadata.
  - A slab allocator is used to satisfy contiguous block allocation requests of power-of-2 sizes (see details). Each device is divided into ~200 “metaslabs” (i.e. 0.5% of the device).
  - Allocations in a metaslab are written into a log called a space map and rewritten when the log becomes too long (see details). In memory, range trees are built from the on-disk log so that free space can be looked up by offset or length (see details).
- All blocks are checksummed. Checksums are stored along with the block pointer, so the integrity of the entire tree is protected via the checksum. When data is mirrored across drives it is possible to fix checksum failures.
- The Data Management Unit (DMU) provides an object storage interface for creating, accessing, and deleting objects on top of the SPA.
- The ZFS POSIX Layer (ZPL) implements POSIX file system semantics using the DMU to create objects for directories, files, etc.
- When there are too many data blocks to store the block pointers, ZFS uses indirect blocks (up to 6 levels). Indirect blocks are blocks containing block pointers.

B-trees, Shadowing, and Clones (2006)
- Uses a copy-on-write B+-tree to implement an object storage device (OSD).
- Requests are written to a log for recovery in between B+-tree checkpoints.
- B+-tree pages are kept cached in memory until checkpoint write-out so that multiple updates to the same page are batched.
- Hierarchical reference counts are used on tree nodes. This makes refcounts lazy and avoids having to increment/decrement refcounts on all blocks upfront.

FlexVol: Flexible, Efficient File Volume Virtualization in WAFL (2008)
- Introduces logical volumes into WAFL so that multiple file systems can be managed on the same physical storage with separate snapshots, policies, etc.
- Delayed Block Freeing: do not actually free blocks and instead defer until 2% of blocks are ready to be freed in the background.
- Cloning Volumes from Snapshots works like backing file chains in qcow2 or VMDK. WAFL knows which Snapshots are referenced and won’t free their metadata and blocks because Clone Volumes may still be using them. Clone Volumes can be detached from their Snapshots by copying out the data blocks to new blocks.

Tracking Back References in a Write-Anywhere File System (2010)
- Log-structured back references are write-optimized so that block allocation, snapshot creation, etc efficiently record users of physical blocks. This information is needed during defragmentation and other data reorganization operations.
- Serves queries from physical block address to logical block (inode, offset).
- Implemented using a log-structured merge tree (requires periodic compaction) and a Bloom filter.

MDB: A Memory-Mapped Database and Backend for OpenLDAP (2011)
- LMDB is a read-optimized key-value store implemented as a copy-on-write B+-tree
- Concurrency model: 1 writer and N readers at the same time
- Entire database file is mmapped but writes and flushes use syscalls
- Freelist B+-tree tracks free pages in database file

BTRFS: The Linux B-tree filesystem (2012)
- Extent-based free space management
  - Extent allocation tree stores back references, allowing extents to be moved later
  - Relies on contiguous free space, so background defragmentation is necessary
- Sub-volume tree nodes are reference counted
- A 4KB write creates new inodes, file extents, checksums, and back references and corresponding b-tree spine nodes. When there are multiple modifications, spatial locality (sequential I/O or inode changes in a directory) helps batch these changes together resulting in fewer than N new nodes for N operations. Random I/O is less efficient.

GCTrees: Garbage Collecting Snapshots (2015)
- Rodeh's hierarchical reference counting delays refcount updates by keep refcounts on tree nodes and updating only the node's refcount closest to the root. Further tree modifications might eventually make it necessary to update subsets of refcounts in tree leaves. This can be combined with a refcount log to reduce the random I/O involved in updating many scattered refcounts.
- GCTrees node store an offset to the parent GCTree node and a borrowed bitmap tracking which blocks are shared with the parent.
  - When a GCTree is deleted:
    - Blocks are ignored when the borrowed bit is set
    - The borrowed bit is checked in immediate child GCTree nodes to determine if the remaining blocks are still in use:
      - If not in use, free the block
      - If in use, clear the borrowed bit in the child to transfer ownership of the block to the child (paper doesn't explain how this works when multiple immediate children borrow the same block because this research only considers read-only snapshots without writeable clone support)
    - The linked list (relationship between GCTree nodes) is updated

Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL (2017)
- WAFL keeps free space metadata up-to-date instead of eventually consistent (relying on scanning metadata in the background to identify free space).
- Free space is recorded in a bitmap called activemap. Blocks are allocated near each other (e.g. contiguous), if possible, to minimize updates to the activemap.
- WAFL implements background and inline defragmentation to make contiguous free space available.
- File deletion does not instantly clear bits in the activemap because doing so would be expensive on large files. Deleted files are incrementally freed across checkpoints.
- The Batched Free Log (BFLog) holds deleted blocks and sorts them so they can be deleted incrementally.

How to Copy Files (2020)
- Aims to create what they call "nimble clones" (fast creation, fast read/write I/O, and efficient space utilization)
- Read performance with btrfs, ZFS, xfs degrades after successive rounds of clone + write. The intuition is that at some point it's better to copy the blocks to avoid fragmentation instead of sharing them.
  - They call this Copy-on-Abundant-Write (CAW)
- Implemented in BetrFS, a file system based on a Bε-tree key-value store that uses path names as keys instead of inode numbers.
  - Uses hierarchical reference counts to track nodes
  - Free space is tracked in a bitmap in the node translation table, which is used for indirection to avoid rewriting nodes when physical block locations are updated
  - Didn't look in detail at the Bε-tree DAG technique introduced to implement efficient copies

Data structures

B+ trees: common in file systems and databases for ordered indexes

Bitmaps: widely used to track block allocation

Log-structured merge trees: write-optimized key-value stores that require periodic compaction

Bloom filters: probabilistic data structure for set membership tests sacrificing accuracy (there can be false positives) for low space requirements

Skip lists: probabilistic O(log n) multi-level linked list data structure atop a sorted array but not as popular as B+ trees for on-disk structures