Stefan Hajnoczi: Storage literature notes on free space management and snapshots

I recently read papers about free space management and snapshots in storage systems such as file systems, volume managers, and key-value stores. I'm publishing my notes in case you find them useful, but the real value might simply be the links to the papers, which could be a starting point for anyone wishing to read up on this field.

My aim was to get an overview of data structures and algorithms used in modern storage systems for tracking free space and snapshotting objects.

Literature

For a 50-year overview of file systems, see this blog series.

The Zettabyte File system (2003)
- The Storage Pool Allocator (SPA) provides allocation and freeing of blocks across physical disks. It deals in disk virtual addresses (DVAs) so the caller is unaware of which disk the storage is located on. Blocks can be migrated between devices without changing their DVA because the SPA can just update translation metadata.
  - A slab allocator is used to satisfy contiguous block allocation requests of power-of-2 sizes (see details). Each device is divided into ~200 “metaslabs” (i.e. each metaslab covers about 0.5% of the device).
  - Allocations in a metaslab are written into a log called a space map and rewritten when the log becomes too long (see details). In memory, range trees are built from the on-disk log so that free space can be looked up by offset or length (see details).
- All blocks are checksummed. Each checksum is stored along with the block pointer that references the block, so the integrity of the entire tree is protected from the root down. When data is mirrored across drives, a block that fails its checksum can be repaired from a good copy.
- The Data Management Unit (DMU) provides an object storage interface for creating, accessing, and deleting objects on top of the SPA.
- The ZFS POSIX Layer (ZPL) implements POSIX file system semantics using the DMU to create objects for directories, files, etc.
- When an object has more data blocks than it has room for block pointers, ZFS uses indirect blocks (up to 6 levels). Indirect blocks are blocks containing block pointers.
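
The space map idea above (an on-disk log of allocations and frees that is replayed into an in-memory view of free space) can be sketched roughly as follows. This is a toy illustration and not ZFS's on-disk format or its range tree: the record layout and the names `replay_space_map` and `find_by_length` are invented here, and a plain sorted list stands in for the range tree.

```python
from typing import List, Optional, Tuple

# Hypothetical log record: ("A" for alloc or "F" for free, offset, length)
Record = Tuple[str, int, int]


def replay_space_map(size: int, log: List[Record]) -> List[Tuple[int, int]]:
    """Replay the log and return sorted, coalesced free (offset, length) ranges."""
    free = [True] * size  # one flag per block; fine for a toy metaslab
    for op, offset, length in log:
        for block in range(offset, offset + length):
            free[block] = (op == "F")

    ranges = []
    start = None
    for block, is_free in enumerate(free):
        if is_free and start is None:
            start = block                      # a free run begins
        elif not is_free and start is not None:
            ranges.append((start, block - start))
            start = None                       # the free run ended
    if start is not None:
        ranges.append((start, size - start))
    return ranges


def find_by_length(ranges: List[Tuple[int, int]], length: int) -> Optional[int]:
    """First-fit lookup by length; a real range tree would also index by size."""
    for offset, run in ranges:
        if run >= length:
            return offset
    return None


log = [("A", 0, 4), ("A", 4, 8), ("F", 2, 4), ("A", 12, 2)]
free_ranges = replay_space_map(16, log)
```

Rewriting the log when it grows too long would amount to emitting one record per range in `free_ranges` instead of keeping the full history.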

B-trees, Shadowing, and Clones (2006)
- Uses a copy-on-write B+-tree to implement an object storage device (OSD).
- Requests are written to a log for recovery in between B+-tree checkpoints.
- B+-tree pages are kept cached in memory until checkpoint write-out so that multiple updates to the same page are batched.
- Hierarchical reference counts are used on tree nodes. This makes refcounts lazy and avoids having to increment/decrement refcounts on all blocks upfront.
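
To make the lazy refcount idea concrete, here is a simplified in-memory sketch in the spirit of the paper. It is not the paper's algorithm verbatim: the `Node` class and the function names are made up, and there is no B+-tree balancing, logging, or disk I/O. Cloning touches only the root's immediate children, and refcounts further down are adjusted only when a shared node is actually shadowed.

```python
class Node:
    def __init__(self, children=None, data=None):
        self.children = list(children or [])  # private copy of the pointer list
        self.data = data
        self.refcount = 1


def clone(root):
    """Create a new root that shares every subtree; only immediate children are touched."""
    for child in root.children:
        child.refcount += 1
    return Node(children=root.children, data=root.data)


def shadow(node):
    """Return node if it is exclusively owned, else a private copy (copy-on-write)."""
    if node.refcount == 1:
        return node
    node.refcount -= 1               # the caller stops pointing at the shared node
    for child in node.children:
        child.refcount += 1          # the copy now also points at each child
    return Node(children=node.children, data=node.data)


def write_leaf(root, path, data):
    """Follow child indexes in path from an exclusively owned root, shadowing as needed."""
    node = root
    for index in path:
        child = shadow(node.children[index])
        node.children[index] = child
        node = child
    node.data = data


def release(node, freed):
    """Drop one reference; recurse into children only when the node really dies."""
    node.refcount -= 1
    if node.refcount == 0:
        for child in node.children:
            release(child, freed)
        freed.append(node)


a1, a2, b1 = Node(data="a1"), Node(data="a2"), Node(data="b1")
A, B = Node([a1, a2]), Node([b1])
root = Node([A, B])

snap = clone(root)                 # only A and B get refcount 2
leaf_refcounts_after_clone = (a1.refcount, a2.refcount, b1.refcount)
write_leaf(snap, [0, 1], "new")    # shadows A and a2 on the way down
```

After the write, the untouched subtree under `B` is still shared through a single refcount on `B`, which is the point of keeping the counts hierarchical.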

FlexVol: Flexible, Efficient File Volume Virtualization in WAFL (2008)
- Introduces logical volumes into WAFL so that multiple file systems can be managed on the same physical storage with separate snapshots, policies, etc.
- Delayed Block Freeing: blocks are not freed right away. Instead, frees are deferred until 2% of blocks are ready to be freed and are then processed in the background.
- Cloning Volumes from Snapshots works like backing file chains in qcow2 or VMDK. WAFL knows which Snapshots are referenced and won’t free their metadata and blocks because Clone Volumes may still be using them. Clone Volumes can be detached from their Snapshots by copying out the data blocks to new blocks.
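
The backing-file analogy for Clone Volumes can be illustrated with a small sketch. This models the analogy only, not WAFL's implementation: the `Volume` class, its fields, and `detach` are hypothetical names, and blocks live in a dictionary rather than on disk.

```python
class Volume:
    """Toy volume: a sparse block map plus an optional read-only backing snapshot."""

    def __init__(self, backing=None):
        self.blocks = {}         # logical block address -> data written to this volume
        self.backing = backing   # snapshot this clone was created from (hypothetical field)

    def write(self, lba, data):
        self.blocks[lba] = data  # writes never touch the backing snapshot

    def read(self, lba):
        vol = self
        while vol is not None:   # fall through the chain until some layer has the block
            if lba in vol.blocks:
                return vol.blocks[lba]
            vol = vol.backing
        return None              # never written anywhere in the chain

    def detach(self):
        """Copy still-shared blocks into this volume, then drop the backing link."""
        vol = self.backing
        while vol is not None:
            for lba, data in vol.blocks.items():
                self.blocks.setdefault(lba, data)  # nearer layers win
            vol = vol.backing
        self.backing = None


snapshot = Volume()
snapshot.write(0, "base-0")
snapshot.write(1, "base-1")

clone = Volume(backing=snapshot)
clone.write(1, "clone-1")
before = (clone.read(0), clone.read(1))

clone.detach()
after = (clone.read(0), clone.read(1))
```

While the backing link exists, the snapshot's blocks must not be freed. After `detach` the clone reads the same data without depending on the snapshot.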

Tracking Back References in a Write-Anywhere File System (2010)
- Log-structured back references are write-optimized so that block allocation, snapshot creation, etc. efficiently record users of physical blocks. This information is needed during defragmentation and other data reorganization operations.
- Serves queries from physical block address to logical block (inode, offset).
- Implemented using a log-structured merge tree (requires periodic compaction) and a Bloom filter.

MDB: A Memory-Mapped Database and Backend for OpenLDAP (2011)
- LMDB is a read-optimized key-value store implemented as a copy-on-write B+-tree
- Concurrency model: 1 writer and N readers at the same time
- Entire database file is mmapped but writes and flushes use syscalls
- Freelist B+-tree tracks free pages in database file

BTRFS: The Linux B-tree filesystem (2012)
- Extent-based free space management
  - Extent allocation tree stores back references, allowing extents to be moved later
  - Relies on contiguous free space, so background defragmentation is necessary
- Sub-volume tree nodes are reference counted
- A 4KB write creates new inodes, file extents, checksums, and back references, plus the corresponding B-tree spine nodes. When there are multiple modifications, spatial locality (sequential I/O or inode changes in a directory) helps batch these changes together, resulting in fewer than N new nodes for N operations. Random I/O is less efficient.

GCTrees: Garbage Collecting Snapshots (2015)
- Rodeh's hierarchical reference counting delays refcount updates by keeping refcounts on tree nodes and updating only the refcount of the node closest to the root. Further tree modifications might eventually make it necessary to update subsets of refcounts in tree leaves. This can be combined with a refcount log to reduce the random I/O involved in updating many scattered refcounts.
- GCTree nodes store an offset to the parent GCTree node and a borrowed bitmap tracking which blocks are shared with the parent.
  - When a GCTree is deleted:
    - Blocks are ignored when the borrowed bit is set
    - The borrowed bit is checked in immediate child GCTree nodes to determine if the remaining blocks are still in use:
      - If not in use, free the block
      - If in use, clear the borrowed bit in the child to transfer ownership of the block to the child (paper doesn't explain how this works when multiple immediate children borrow the same block because this research only considers read-only snapshots without writeable clone support)
    - The linked list (relationship between GCTree nodes) is updated
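
The deletion steps above can be sketched as follows, assuming a linear chain of read-only snapshots in which each node has at most one immediate child (the case these notes describe). The class and field names are made up for illustration, and Python lists stand in for the on-disk bitmap and block pointers.

```python
class GCTreeNode:
    def __init__(self, blocks, borrowed, parent=None):
        self.blocks = blocks        # physical block number per logical index
        self.borrowed = borrowed    # True where the block is shared with the parent
        self.parent = parent
        self.child = None
        if parent is not None:
            parent.child = self     # assumption: at most one child per node


def delete_snapshot(node, freed):
    child = node.child
    for index, block in enumerate(node.blocks):
        if node.borrowed[index]:
            continue                         # owned by an ancestor, nothing to do
        if child is not None and child.borrowed[index]:
            child.borrowed[index] = False    # child takes over ownership
        else:
            freed.append(block)              # nobody else uses the block
    # Update the linked list so the child now points at the deleted node's parent.
    if node.parent is not None:
        node.parent.child = child
    if child is not None:
        child.parent = node.parent
    node.parent = node.child = None


s1 = GCTreeNode([10, 11, 12], [False, False, False])
s2 = GCTreeNode([10, 21, 12], [True, False, True], parent=s1)
s3 = GCTreeNode([10, 21, 32], [True, True, False], parent=s2)

freed = []
delete_snapshot(s2, freed)          # block 21 is handed to s3, nothing is freed
borrowed_after_first = list(s3.borrowed)
delete_snapshot(s1, freed)          # blocks 11 and 12 are freed, block 10 goes to s3
```

Blocks that `s2` had borrowed from `s1` stay borrowed in `s3` after `s2` is removed, which remains correct because `s3`'s parent is now `s1`.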

Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL (2017)
- WAFL keeps free space metadata up-to-date instead of making it eventually consistent (which would rely on scanning metadata in the background to identify free space).
- Free space is recorded in a bitmap called activemap. Blocks are allocated near each other (e.g. contiguously), if possible, to minimize updates to the activemap.
- WAFL implements background and inline defragmentation to make contiguous free space available.
- File deletion does not instantly clear bits in the activemap because doing so would be expensive on large files. Deleted files are incrementally freed across checkpoints.
- The Batched Free Log (BFLog) holds the blocks of deleted files and sorts them so they can be freed incrementally.
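
A rough sketch of the batching idea: deleted blocks are queued, and each checkpoint clears only a bounded number of activemap bits, in sorted order so that neighboring bits are updated together. This is a toy model and not WAFL's on-disk BFLog: the class name, the `budget` parameter, and the use of one byte per block are assumptions made for illustration.

```python
class BatchedFreeLog:
    def __init__(self, activemap):
        self.activemap = activemap   # bytearray, 1 = block in use, 0 = free
        self.pending = []            # blocks of deleted files not yet freed

    def delete_file(self, blocks):
        """Deleting a file only queues its blocks; the activemap is untouched."""
        self.pending.extend(blocks)

    def checkpoint(self, budget):
        """Free at most budget blocks, lowest block numbers first."""
        self.pending.sort()
        batch, self.pending = self.pending[:budget], self.pending[budget:]
        for block in batch:
            self.activemap[block] = 0
        return batch


activemap = bytearray([1] * 8)
bflog = BatchedFreeLog(activemap)
bflog.delete_file([7, 2, 5])
bflog.delete_file([3])

first = bflog.checkpoint(budget=2)
map_after_first = list(activemap)
second = bflog.checkpoint(budget=2)
```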

How to Copy Files (2020)
- Aims to create what they call "nimble clones" (fast creation, fast read/write I/O, and efficient space utilization)
- Read performance with btrfs, ZFS, xfs degrades after successive rounds of clone + write. The intuition is that at some point it's better to copy the blocks to avoid fragmentation instead of sharing them.
  - They call this Copy-on-Abundant-Write (CAW)
- Implemented in BetrFS, a file system based on a Bε-tree key-value store that uses path names as keys instead of inode numbers.
  - Uses hierarchical reference counts to track nodes
  - Free space is tracked in a bitmap in the node translation table, which is used for indirection to avoid rewriting nodes when physical block locations are updated
  - Didn't look in detail at the Bε-tree DAG technique introduced to implement efficient copies

Data structures

B+ trees: common in file systems and databases for ordered indexes

Bitmaps: widely used to track block allocation

Log-structured merge trees: write-optimized key-value stores that require periodic compaction

Bloom filters: probabilistic data structure for set membership tests sacrificing accuracy (there can be false positives) for low space requirements
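
A minimal Bloom filter sketch shows where the false positives come from: every key sets a few bit positions, and a lookup only checks whether those positions are set. The sizes and the use of SHA-256 to derive positions are arbitrary choices for illustration, not taken from any of the papers above.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter()
empty_lookup = bf.might_contain(b"block:42")
bf.add(b"block:42")
present_lookup = bf.might_contain(b"block:42")
```

A key that was added is always reported as possibly present. A key that was never added is usually reported absent, but it can collide with bits set by other keys, so the caller must fall back to the real index on a positive answer.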

Skip lists: probabilistic multi-level linked list data structure layered atop a sorted linked list, giving expected O(log n) search, but not as popular as B+ trees for on-disk structures

Monday, January 1, 2024
