Queues implemented as shared memory ring buffers are a standard tool for communicating with I/O devices and between CPUs. Although ring buffers are widely used, there is no standard memory layout, and it's interesting to compare the differences between designs. While defining libblkio's APIs, I surveyed the ring buffer designs in VIRTIO, NVMe, and io_uring. This article examines some of the differences in their ring buffers and queue semantics.
Ring buffer basics
A ring buffer is a circular array where new elements are written or produced on one side and read or consumed on the other side. Often terms such as head and tail or reader and writer are used to describe the array indices at which the next element is accessed. When the end of the array is reached, one moves back to the start of the array. The empty and full conditions are special states that must be checked to avoid underflow and overflow.
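The head/tail bookkeeping described above can be sketched in a few lines of C. This is a minimal single-threaded illustration, not the layout used by any of the three interfaces: the names `struct ring`, `ring_push()`, and `ring_pop()` and the capacity `RING_SIZE` are made up for this example, and a real shared memory ring would also need memory barriers between writing an element and publishing the index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical capacity, must be a power of two */

struct ring {
    uint64_t slots[RING_SIZE];
    uint32_t head; /* free-running index of the next element to consume */
    uint32_t tail; /* free-running index of the next element to produce */
};

static void ring_init(struct ring *r)
{
    /* A power-of-two size keeps "index % RING_SIZE" correct when the
     * free-running 32-bit indices wrap around. */
    assert((RING_SIZE & (RING_SIZE - 1u)) == 0);
    r->head = 0;
    r->tail = 0;
}

/* Empty: the consumer has caught up with the producer. */
static bool ring_empty(const struct ring *r)
{
    return r->head == r->tail;
}

/* Full: the producer is a whole ring ahead of the consumer. */
static bool ring_full(const struct ring *r)
{
    return r->tail - r->head == RING_SIZE;
}

/* Producer side: returns false instead of overflowing. */
static bool ring_push(struct ring *r, uint64_t value)
{
    if (ring_full(r)) {
        return false;
    }
    r->slots[r->tail % RING_SIZE] = value;
    r->tail++;
    return true;
}

/* Consumer side: returns false instead of underflowing. */
static bool ring_pop(struct ring *r, uint64_t *value)
{
    if (ring_empty(r)) {
        return false;
    }
    *value = r->slots[r->head % RING_SIZE];
    r->head++;
    return true;
}
```

Free-running indices are only one way to tell empty from full. Another common choice is to keep indices within the array bounds and leave one slot unused.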
VIRTIO, NVMe, and io_uring all use single producer, single consumer shared memory ring buffers. This allows a CPU and an I/O device or two CPUs to communicate across a region of memory to which both sides have access.
Embedding data in descriptors
At a minimum a ring buffer element, or descriptor, contains the memory address and size of a data buffer:
| Offset | Type | Name |
|---|---|---|
| 0x0 | u64 | buf |
| 0x8 | u64 | len |
In a storage device the data buffer contains a request structure with information about the I/O request (logical block address, number of sectors, etc.). To process a request, the device first loads the descriptor and then loads the request structure that the descriptor points to. Performing two loads is suboptimal; it would be faster to fetch the request structure in a single load.
Embedding the data buffer in the descriptor is a technique that reduces the number of loads. The descriptor layout looks like this:
| Offset | Type | Name |
|---|---|---|
| 0x0 | u64 | remainder_buf |
| 0x8 | u64 | remainder_len |
| 0x10 | ... | request structure |
The descriptor is extended to make room for the data. If the size of the data varies and is sometimes too large for a descriptor, then the remainder is put into an external buffer. The common case requires only a single load, while larger variable-sized buffers can still be handled with two loads as before.
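As a rough illustration of the embedded layout, the sketch below fills such a descriptor and reports how many loads the consumer would need. The 16-byte embedded area (`EMBED_MAX`), the `struct embed_desc` name, and the `embed_desc_fill()` helper are assumptions made for this example and are not taken from VIRTIO, NVMe, or io_uring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EMBED_MAX 16u /* hypothetical size of the embedded request area */

struct embed_desc {
    uint64_t remainder_buf;      /* 0x00: address of data that did not fit */
    uint64_t remainder_len;      /* 0x08: length of that data, 0 if none */
    uint8_t  request[EMBED_MAX]; /* 0x10: embedded request structure */
};

/*
 * Copy as much of the request as fits into the descriptor and point
 * remainder_buf/remainder_len at whatever is left over.
 * Returns the number of loads a consumer needs: 1 or 2.
 */
static int embed_desc_fill(struct embed_desc *d, const void *req, size_t len)
{
    size_t inline_len = len < EMBED_MAX ? len : EMBED_MAX;

    memset(d, 0, sizeof(*d));
    memcpy(d->request, req, inline_len);

    if (len > EMBED_MAX) {
        /* Uncommon case: the tail of the request stays in external memory. */
        d->remainder_buf = (uint64_t)(uintptr_t)((const uint8_t *)req + EMBED_MAX);
        d->remainder_len = len - EMBED_MAX;
        return 2;
    }
    return 1; /* common case: everything is in the descriptor */
}
```

With these assumed sizes the descriptor is 32 bytes, so a request of up to 16 bytes costs the consumer one load and anything larger costs two.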
VIRTIO does not embed data in descriptors due to its layered design. The data buffers are defined by the device type (net, blk, etc.) and virtqueue descriptors are one layer below device types. Virtqueue descriptors have no knowledge of the data buffer layout and therefore cannot embed data.
NVMe embeds the request structure into the Submission Queue Entry. The
Command Dword 10, 11, 12, 13, 14, and 15 fields contain the request data and
their meaning depends on the Opcode (request type). I/O buffers are still
external and described by Physical Region Pages (PRPs) or Scatter Gather Lists
(SGLs).
io_uring's struct io_uring_sqe embeds the request structure. Only
I/O buffer(s) need to be external as their size varies, would be too large for
the ring buffer, and typically zero-copy is desired due to the size of the
data.
It seems that VIRTIO could learn from NVMe and io_uring. Instead of having
small 16-byte descriptors, it could embed part of the data buffer into the
descriptor so that devices need to perform fewer loads during request
processing. The 12-byte struct virtio_net_hdr and 16-byte struct
virtio_blk_req request headers would fit into a new 32-byte descriptor
layout. I have not prototyped or benchmarked this optimization, so I don't
know how effective it would be.
Descriptor chaining vs external descriptors
I/O requests often include variable size I/O buffers that require
scatter-gather lists similar to POSIX struct iovec arrays. Long arrays
don't fit into a descriptor so descriptors have fields that point to an
external array of descriptors.
Another technique for scatter-gather lists is to chain descriptors
together within the ring buffer instead of relying on memory external to the
ring buffer. When descriptor chaining is used, I/O requests that don't fit into
a single descriptor can occupy multiple descriptors.
Advantages of chaining are better cache locality when a sequence of
descriptors is used and no need to allocate separate
per-request external descriptor memory.
A consequence of descriptor chaining is that the maximum queue size, or
queue depth, becomes variable. It is not possible to guarantee space
for a specific number of I/O requests because the available number of descriptors
depends on the chain size of requests placed into the ring buffer.
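The effect of chaining on queue depth can be made concrete with a small sketch. The descriptor below is loosely modeled on a chained descriptor with a "next" field and a continuation flag. The names `struct chain_desc`, `DESC_F_NEXT`, `chain_length()`, and `max_requests()` and the table size are hypothetical and chosen for this example only.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_F_NEXT 1u /* hypothetical flag: the chain continues at .next */
#define TABLE_SIZE  8u /* hypothetical number of descriptors in the ring */

struct chain_desc {
    uint64_t addr;  /* buffer address */
    uint32_t len;   /* buffer length */
    uint16_t flags; /* DESC_F_NEXT if another descriptor follows */
    uint16_t next;  /* index of the next descriptor in the chain */
};

/*
 * Count the descriptors in the chain starting at head.
 * Returns 0 for a malformed chain (index out of range or a loop).
 */
static unsigned chain_length(const struct chain_desc *table, uint16_t head)
{
    unsigned n = 0;
    uint16_t i = head;

    for (;;) {
        if (i >= TABLE_SIZE || n == TABLE_SIZE) {
            return 0;
        }
        n++;
        if (!(table[i].flags & DESC_F_NEXT)) {
            return n;
        }
        i = table[i].next;
    }
}

/* How many requests fit if every request needs chain_len descriptors. */
static unsigned max_requests(unsigned chain_len)
{
    assert(chain_len > 0);
    return TABLE_SIZE / chain_len;
}
```

With eight descriptors, single-descriptor requests give a queue depth of 8, but requests that each chain three descriptors leave room for only 2.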
VIRTIO supports descriptor chaining, although drivers usually forgo it when VIRTIO_F_RING_INDIRECT_DESC is available.
NVMe and io_uring do not support descriptor chaining, instead relying on embedded and external descriptors.
Limits on in-flight requests
The maximum number of in-flight requests depends on the ring buffer design. Designs where descriptors are occupied from submission until completion prevent descriptor reuse for other requests while the current request is in flight.
An alternative design is where the device processes submitted descriptors and they are considered free again as soon as the device has looked at them. This approach is natural when separate submission and completion queues are used and there is no relationship between the two descriptor rings.
VIRTIO requests occupy descriptors for the duration of their lifetime, at least in the Split Virtqueue format. Therefore the number of in-flight requests is influenced by the descriptor table size.
NVMe has separate Submission Queues and Completion Queues, but its design still limits the number of in-flight requests to the queue size. The Completion Queue Entry's SQ Head Pointer (SQHD) field precludes having more requests in flight than the Submission Queue size because the field would no longer be unique. Additionally, the driver has no way of detecting Submission Queue Head changes, so it only knows there is space for more submissions when completions occur.
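The driver-side bookkeeping implied by this can be sketched as follows. The sketch assumes the convention that one slot is always left unused so that equal head and tail values mean "empty"; the `struct sq_state` type and the helper names are hypothetical, and doorbell writes and the Completion Queue itself are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sq_state {
    uint16_t size; /* number of Submission Queue entries */
    uint16_t tail; /* next entry the driver will fill */
    uint16_t head; /* last SQ Head Pointer value seen in a completion */
};

/* Free entries as far as the driver can tell from its cached head. */
static uint16_t sq_free_slots(const struct sq_state *sq)
{
    /* One entry stays unused so that head == tail always means empty. */
    return (uint16_t)((sq->head + sq->size - sq->tail - 1) % sq->size);
}

/* Claim one entry; fails when the queue looks full to the driver. */
static bool sq_submit(struct sq_state *sq)
{
    if (sq_free_slots(sq) == 0) {
        return false;
    }
    sq->tail = (uint16_t)((sq->tail + 1) % sq->size);
    return true;
}

/* The only point at which the driver learns that the head has moved. */
static void sq_on_completion(struct sq_state *sq, uint16_t sqhd)
{
    assert(sqhd < sq->size);
    sq->head = sqhd;
}
```

Even if the device has already fetched every submitted entry, `sq_submit()` keeps failing in this model until a completion delivers a newer head value.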
io_uring has independent submission (SQ) and completion (CQ) queues with support for more in-flight requests than the ring buffer size. When there are more in-flight requests than CQ capacity, it's possible to overflow the CQ. io_uring has a backlog mechanism for this case, although the intention is for applications to size queues properly so that they rarely hit the backlog.
Conclusion
VIRTIO, NVMe, and io_uring have slightly different takes on queue design. The semantics and performance vary due to these differences. VIRTIO lacks data embedding inside descriptors. io_uring supports more in-flight requests than the queue size. NVMe and io_uring rely on external descriptors with no ability to chain descriptors.