Thursday, April 30, 2020

How the Linux VFS, block layer, and device drivers fit together

The Linux kernel storage stack consists of several components including the Virtual File System (VFS) layer, the block layer, and device drivers. This article gives an overview of the main objects that a device driver interacts with and their relationships to each other. Actual I/O requests are not covered; instead, the focus is on the objects representing the disk.

Let's start with a diagram of the key data structures and then an explanation of how they work together.

The Virtual File System (VFS) layer

The VFS layer is where file system concepts like files and directories are handled. The VFS provides an interface that file systems like ext4, XFS, and NFS implement to register themselves with the kernel and participate in file system operations. The struct file_operations interface is the most interesting for device drivers as we are about to see.

System calls like open(2), read(2), etc. are handled by the VFS and dispatched to the appropriate struct file_operations functions.
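To make the dispatch idea concrete, here is a minimal userspace sketch. It is not kernel code: struct toy_file, struct toy_file_operations, and every toy_* function are hypothetical stand-ins that I made up for illustration, and the real struct file_operations has many more members with different signatures. The only point is that the entry point looks up the operations table attached to the open file and calls through it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-ins for struct file and
 * struct file_operations. Names and signatures are illustrative only. */
struct toy_file;

struct toy_file_operations {
    int  (*open)(struct toy_file *f);
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_operations *f_op; /* chosen when the file is opened */
    const char *data;                       /* backing bytes for this toy */
    size_t pos;                             /* current read offset */
};

static int toy_blkdev_open(struct toy_file *f)
{
    f->pos = 0;
    return 0;
}

static long toy_blkdev_read(struct toy_file *f, char *buf, size_t len)
{
    size_t avail = strlen(f->data) - f->pos;

    if (len > avail)
        len = avail;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (long)len;
}

/* The operations table that a "block device node" installs in this model. */
static const struct toy_file_operations toy_blkdev_fops = {
    .open = toy_blkdev_open,
    .read = toy_blkdev_read,
};

/* open(2)-style entry point: attach the table, then call its open hook. */
int toy_open_blkdev(struct toy_file *f, const char *data)
{
    assert(f != NULL && data != NULL);
    f->f_op = &toy_blkdev_fops;
    f->data = data;
    return f->f_op->open(f);
}

/* read(2)-style entry point: dispatch through the file's table. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && buf != NULL);
    if (!f->f_op || !f->f_op->read)
        return -1; /* no read hook: report an error */
    return f->f_op->read(f, buf, len);
}
```

Calling toy_open_blkdev() on a struct toy_file and then toy_vfs_read() in a loop returns the backing bytes until the data runs out, at which point it returns 0, much like read(2) at end of file.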

Block device nodes like /dev/sda are implemented in fs/block_dev.c, which forms a bridge between the VFS and the Linux block layer. The block layer handles the actual I/O requests and is aware of disk-specific information like capacity and block size.

The main VFS concepts that device drivers need to be aware of are struct block_device_operations and the struct block_device instances that represent block devices in Linux. A struct block_device connects the VFS inode and struct file_operations interface with the block layer struct gendisk and struct request_queue.

In Linux there are separate device nodes for the whole device (/dev/sda) and its partitions (/dev/sda1, /dev/sda2, etc.). struct block_device handles this: the struct block_device of a partition points to its parent, the whole-device instance, through the bd_contains field.
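The parent pointer can be modeled in a few lines of userspace C. This is only a sketch: struct toy_block_device and the toy_* helpers are hypothetical, and the assumption that a whole-device instance points to itself through bd_contains reflects my reading of kernels from this era rather than anything stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of struct block_device. */
struct toy_block_device {
    struct toy_block_device *bd_contains; /* whole device this belongs to */
    int partno;                           /* 0 for the whole device */
};

/* Set up a whole-device instance. Assumption: it contains itself. */
void toy_init_whole(struct toy_block_device *disk)
{
    assert(disk != NULL);
    disk->bd_contains = disk;
    disk->partno = 0;
}

/* Set up a partition instance that points at its parent. */
void toy_init_partition(struct toy_block_device *part,
                        struct toy_block_device *disk, int partno)
{
    assert(part != NULL && disk != NULL && partno > 0);
    part->bd_contains = disk;
    part->partno = partno;
}

/* A partition is any instance whose bd_contains is not itself. */
bool toy_is_partition(const struct toy_block_device *bdev)
{
    return bdev->bd_contains != bdev;
}
```

In this model /dev/sda would be set up with toy_init_whole() and /dev/sda1 with toy_init_partition(&sda1, &sda, 1), so following sda1.bd_contains leads back to sda.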

The block layer

The block layer handles I/O request queues, disk partitions, and other disk-specific functionality. Each disk is represented by a struct gendisk and may have multiple struct hd_struct partitions. There is always part0, a special "partition" covering the entire block device.

I/O requests are placed into queues for processing. Requests can be merged and scheduled by the block layer. Ultimately a device driver receives a request for submission to the physical device. Queues are represented by struct request_queue.

The device driver

The disk device driver registers a struct gendisk with the block layer and sets up the struct request_queue to receive requests that need to be submitted to the physical device.

There is one struct gendisk for the entire device even though userspace may open struct block_device instances for multiple partitions on the disk. Disk partitions are not visible at the driver level because I/O requests have already had their Logical Block Address (LBA) adjusted with the partition start offset.
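The LBA adjustment is simple arithmetic, sketched below in userspace C. struct toy_partition and toy_partition_remap() are hypothetical names; the two fields are loosely modeled on the start and length that a partition records, and the bounds check is my assumption about what a careful implementation would do before handing the request on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Hypothetical, simplified partition descriptor. */
struct toy_partition {
    sector_t start_sect; /* first sector of the partition on the disk */
    sector_t nr_sects;   /* partition length in sectors */
};

/*
 * Translate a partition-relative sector into a whole-disk sector.
 * Returns 0 on success, or -1 if the request would run past the end
 * of the partition. On success the driver only ever sees *disk_sector.
 */
int toy_partition_remap(const struct toy_partition *p,
                        sector_t sector, sector_t nr_sectors,
                        sector_t *disk_sector)
{
    assert(p != NULL && disk_sector != NULL);

    /* Written this way to avoid overflow in sector + nr_sectors. */
    if (sector > p->nr_sects || nr_sectors > p->nr_sects - sector)
        return -1;

    *disk_sector = p->start_sect + sector;
    return 0;
}
```

For a partition starting at sector 2048, a request for partition sector 0 becomes disk sector 2048 by the time the driver sees it, which is why the driver never needs to know that partitions exist.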

How it all fits together

The VFS is aware of the block layer struct gendisk. The device driver is aware of both the block layer and the VFS struct block_device. The block layer does not have direct connections to the other components but the device driver provides callbacks.

One of the interesting aspects is that a device driver may drop its reference to struct gendisk but struct block_device instances may still have their references. In this case no I/O can occur anymore because the driver has stopped the disk and the struct request_queue, but userspace processes can still call into the VFS and struct block_device_operations callbacks in the device driver can still be invoked.
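Here is a rough userspace model of that lifetime situation. Everything in it is hypothetical (struct toy_gendisk, struct toy_open_bdev, the alive flag, and the toy_* functions); the kernel's actual reference counting and teardown paths are more involved. It only illustrates that the disk object outlives the driver's reference while an opener still holds one, so a callback reached through the VFS has to cope with a stopped disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a disk object shared by driver and openers. */
struct toy_gendisk {
    int refcount; /* one per holder: the driver plus each open device */
    bool alive;   /* false once the driver has stopped the disk */
};

/* Hypothetical model of an open struct block_device. */
struct toy_open_bdev {
    struct toy_gendisk *disk;
};

void toy_disk_get(struct toy_gendisk *d)
{
    assert(d != NULL);
    d->refcount++;
}

/* Drop one reference; returns true if it was the last one. */
bool toy_disk_put(struct toy_gendisk *d)
{
    assert(d != NULL && d->refcount > 0);
    return --d->refcount == 0;
}

/* Userspace opens the device node: take a reference on the disk. */
void toy_bdev_open(struct toy_open_bdev *bdev, struct toy_gendisk *d)
{
    assert(bdev != NULL);
    toy_disk_get(d);
    bdev->disk = d;
}

/* Driver teardown: stop the disk, then drop the driver's reference. */
bool toy_driver_remove(struct toy_gendisk *d)
{
    assert(d != NULL);
    d->alive = false;
    return toy_disk_put(d);
}

/* A callback still reachable through the VFS after teardown. */
int toy_bdev_ioctl(struct toy_open_bdev *bdev)
{
    assert(bdev != NULL && bdev->disk != NULL);
    if (!bdev->disk->alive)
        return -1; /* disk is gone: fail instead of touching hardware */
    return 0;
}
```

With one opener, toy_driver_remove() returns false because the opener still holds a reference, and a later toy_bdev_ioctl() fails cleanly instead of touching a device that is no longer there.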

Thinking about this case is why I drew the diagram and ended up writing about this topic!

Monday, April 20, 2020

virtio-fs has landed in QEMU 5.0!

The virtio-fs shared host<->guest file system has landed in QEMU 5.0! It consists of two parts: the QEMU -device vhost-user-fs-pci and the actual file server called virtiofsd. Guests need to have a virtio-fs driver in order to access shared file systems. In Linux the driver is called virtiofs.ko and has been upstream since Linux v5.4.

Using virtio-fs

Thanks to libvirt virtio-fs support, it's possible to share directory trees from the host with the guest like this:

<filesystem type='mount' accessmode='passthrough'>
    <driver type='virtiofs'/>
    <binary xattr='on'>
       <lock posix='on' flock='on'/>
    </binary>
    <source dir='/path/on/host'/>
    <target dir='mount_tag'/>
</filesystem>

The host /path/on/host directory tree can be mounted inside the guest like this:

# mount -t virtiofs mount_tag /mnt

Applications inside the guest can then access the files as if they were local files. For more information about virtio-fs, see the project website.

How it works

For the most part, -device vhost-user-fs-pci just facilitates the connection to virtiofsd where the real work happens. When guests submit file system requests they are handled directly by the virtiofsd process on the host and don't need to go via the QEMU process.

virtiofsd is a FUSE file system daemon with virtio-fs extensions. virtio-fs is built on top of the FUSE protocol and therefore supports the POSIX file system semantics that applications expect from a native Linux file system. The Linux guest driver shares a lot of code with the traditional fuse.ko kernel module.

Resources on virtio-fs

I have given a few presentations on virtio-fs:

Future features

A key feature of virtio-fs is the ability to directly access the host page cache, eliminating the need to copy file contents into guest RAM. This so-called DAX support is not upstream yet.

Live migration is not yet implemented. It is a little challenging to transfer all file system state to the destination host and seamlessly continue file system operation without remounting, but it should be doable.

There is a Rust implementation of virtiofsd that is close to reaching maturity and will replace the C implementation. The advantage is that Rust has better memory and thread safety than C so entire classes of bugs can be eliminated. Also, the codebase is written from scratch whereas the C implementation was a combination of several existing pieces of software that were not designed together.