Thursday, January 25, 2024

Key-Value Stores: The Foundation of File Systems and Databases

File systems and relational databases are like cousins. They share more than is apparent at first glance.

It's not immediately obvious that relational databases and file systems rely upon the same underlying concept. That underlying concept is the key-value store and this article explores how both file systems and databases can be implemented on top of key-value stores.

The key-value store interface

Key-value stores provide an ordered map data structure. A map is a data structure that supports storing and retrieving from a collection of pairs. It's called a map because it is like a mathematical relation from a given key to an associated value. These are the key-value pairs that a key-value store holds. Finally, ordered means that the collection can be traversed in sorted key order. Not all key-value store implementations support ordered traversal, but both file systems and databases need this property as we shall see.

Here is a key-value store with an integer key and a string value:

Notice that the keys can be enumerated in sorted order: 2 → 14 → 17.

A key-value store provides the following interface for storing and retrieving values by a given key:

  • put(Key, Value) - an insert/update operation that stores a value for a given key
  • get(Key) -> Value - a lookup operation that retrieves the most recently stored value for a given key
  • first() -> Key, last() -> Key, next(Key) -> Key, prev(Key) -> Key - a cursor API that enumerates keys in sorted order

You've probably seen this sort of API if you have explored libraries like LevelDB, RocksDB, LMDB, BoltDB, etc or used NoSQL key-value stores. File systems and databases usually implement their own customized key-value stores rather than use these off-the-shelf solutions.

Why key-value stores are necessary

Let's look at how the key-value store interface relates to disks. Disks present a range of blocks that can be read or written at their block addresses. Disks can be thought of like arrays in programming. They have O(1) lookup and update time complexity but inserting or removing a value before the end of the array is O(n) because subsequent elements need to be copied. They are efficient for dense datasets where every element is populated but inefficient for sparse datasets that involve insertion and removal.

Workloads that involve insertion or removal are not practical when the cost is O(n) for realistic sizes of n. That's why programs often use in-memory data structures like hash tables or balanced trees instead of arrays. Key-value stores can be thought of as the on-disk equivalent to these in-memory data structures. Inserting or removing values from a key-value store takes sub-linear time, perhaps O(log n) or even better amortized time. We won't go into the data structures used to implement key-value stores, but B+ trees and Log-Structured Merge-Trees are popular choices.

This gives us an intuition about when key-value stores are needed and why they are an effective tool. Now let's look at how file systems and databases can be built on top of key-value stores next.

Building a file system on a key-value store

First let's start with how data is stored in files. A file system locates file data on disk by translating file offsets to Logical Block Addresses (LBAs). This is necessary because file data may not be stored contiguously on disk and files can be sparse with unallocated "holes" where nothing has been written yet. Thus, each file can be implemented as a key-value store with <Offset, <LBA, Length>> key-value pairs that comprise the translations needed to locate data on disk:

Reading and writing to the file involves looking up Offset -> LBA translations and inserting new translations when new blocks are allocated for the file. This is a good fit for a key-value store, but it's not the only place where file systems employ key-value stores.

File systems track free blocks that are not in used by files or metadata so that the block allocator can quickly satisfy allocation requests. This can be implemented as a key-value store with <LBA, Length> key-value pairs representing all free LBA ranges.

If the block allocator needs to satisfy contiguous allocation requests then a second key-value store with <Length, LBA> key-value pairs can serve as an efficient lookup or index. A best-fit allocator uses this key-value store by looking up the requested contiguous allocation size. Either an free LBA range of the matching size will be found or the next ordered key can be traversed when lookup fails to find a bigger free range capable of satisfying this allocation request. This is an important pattern with key-value stores: we can have one main key-value store plus one or more indices that are derived from the same key-value pairs but use a different datum as the key than the primary key-value store, allowing efficient lookups and ordered traversal. The same pattern will come up in databases too.

Next, let's look at how to represent directory metadata in a key-value store. Files are organized into a hierarchy of directories (or folders). The file system stores the directory entries belonging to each directory. Each directory can be organized as a key-value store with filenames as keys and inode numbers as values. Path traversal consists of looking up directory entries in each directory along file path components like home, user, and file in the path /home/user/file. When a file is created, a new directory entry is inserted. When a file is deleted, its directory entry is removed. The contents of a directory can be listed by traversing the keys.

Some file systems like BTRFS use key-value stores for other on-disk structures such as snapshots, checksums, etc, too. There is even a root key-value store in BTRS from which all these other key-value stores can be looked up. We'll see that the same concept of a "forest of trees" or a root key-value store that points to other key-value stores also appears in databases below.

Building a database on a key-value store

The core concept in relational databases is the table, which contains the rows of the data we wish to store. The table columns are the various fields that are stored by each row. One or more columns make up the primary key by which table lookups are typically performed. The table can be implemented as a key-value store using the primary key columns as the key and the remainder of the columns as the value:

This key-value store can look up rows in the table by their Id. What if we want to look up a row by Username instead?

To enable efficient lookups by Username, a secondary key-value store called an index maintains a mapping from Username to Id. The index does not duplicate all the columns in the table, just the Username and Id. To perform a query like SELECT * FROM Users WHERE Username = 'codd', the index is first used to look up the Id and then the remainder of the columns are looked up from the table.

SQLite's file format documentation shows the details of how data is organized along these lines and the power of key-value stores. The file format has a header the references the "table b-tree" that points to the roots of all tables. This means there is an entry point key-value store that points to all the other key-value stores associated with tables, indices, etc in the database. This is similar to the forest of trees we saw in the BTRFS file system where the key-value store acts as the central data structure tying everything together.

Conclusion

If a disk is like an array in programming, then a key-value store is like a dict. It offers a convenient interface for storing and retrieving sparse data with good performance. Both file systems and databases are abundant with sparse data and therefore fit naturally on top of key-value store. The actual key-value store implementations inside file systems and databases may be specialized variants of B-trees and other data structures that don't even call themselves key-value stores, but the fundamental abstraction upon which file systems and databases are built is the key-value store.