Saturday, July 18, 2020

Rethinking event loop integration for libraries

APIs for operations that take a long time are often asynchronous so that applications can continue with other tasks while an operation is running. Asynchronous APIs initiate an operation and then return immediately. The application is notified when the operation completes through a callback or by monitoring a file descriptor for activity (for example, when data arrives on a TCP socket).

Asynchronous applications are usually built around an event loop that waits for the next event and invokes a function to handle the event. Since the details of event loops differ between applications, libraries need to be designed carefully to integrate well with a variety of event loops.

The current model

A popular library with asynchronous APIs is the libcurl file transfer library that is used for making HTTP requests. It has the following (slightly simplified) event loop integration API:

#define CURL_WAIT_POLLIN    0x0001   /* Ready to read? */
#define CURL_WAIT_POLLOUT   0x0004   /* Ready to write? */

int socket_callback(CURL *easy,      /* easy handle */
                    int fd,          /* socket */
                    int what,        /* describes the socket */
                    void *userp,     /* private callback pointer */
                    void *socketp);  /* private socket pointer */

libcurl invokes the application's socket_callback() to start or stop monitoring file descriptors. When the application's event loop detects file descriptor activity, the application invokes libcurl's curl_multi_socket_action() API to let the library process the file descriptor.

There are variations on this theme but generally libraries expose file descriptors and event flags (read/write/error) so the application can monitor file descriptors from its own event loop. The library then performs the read(2) or write(2) call when the file descriptor becomes ready.
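The division of labor described above can be sketched with poll(2). This is a minimal illustration, not libcurl's API: lib_handle_readable() and app_run_once() are hypothetical names, and the event loop watches only a single file descriptor. Note that two syscalls are needed per read, one by the application and one by the library.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical library side: called once the application reports that
 * fd is readable. The library itself performs the read(2). */
static ssize_t lib_handle_readable(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);            /* syscall 2: perform the I/O */
    } while (n < 0 && errno == EINTR);

    return n < 0 ? -errno : n;
}

/* Hypothetical application side: one iteration of a minimal event loop.
 * Returns the number of bytes the library read, -ETIMEDOUT if the fd did
 * not become readable in time, or another negative errno on failure. */
ssize_t app_run_once(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    assert(buf != NULL);

    do {
        ready = poll(&pfd, 1, timeout_ms); /* syscall 1: wait for activity */
    } while (ready < 0 && errno == EINTR);

    if (ready < 0) {
        return -errno;
    }
    if (ready == 0) {
        return -ETIMEDOUT;
    }
    return lib_handle_readable(fd, buf, len);
}
```

A real application would monitor many file descriptors at once and dispatch each ready one to the library that owns it, but the wait-then-read pattern stays the same.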

How io_uring changes the picture

The Linux io_uring API can be used to implement traditional event loops that monitor file descriptors. But it also supports asynchronous system calls like read(2) and write(2) (best used when IORING_FEAT_FAST_POLL is available). The latter is interesting because it combines two syscalls into a single efficient syscall:

  1. Waiting for file descriptor activity.
  2. Reading/writing the file descriptor.

Existing applications use syscalls like epoll_wait(2), poll(2), or the old select(2) to wait for file descriptor activity. They can also use io_uring's IORING_OP_POLL_ADD to achieve the same effect.

After the file descriptor becomes ready, a second syscall like read(2) or write(2) is required to actually perform I/O.

io_uring's asynchronous IORING_OP_READ or IORING_OP_WRITE (including variants for vectored I/O or sockets) only requires a single io_uring_enter(2) call. If io_uring sqpoll is enabled then a syscall may not even be required to submit these operations!

To summarize, it's more efficient to perform a single asynchronous read/write instead of first monitoring file descriptor activity and then performing a read(2) or write(2).

A new model

Existing library APIs do not fit the asynchronous read/write approach because they expect the application to wait for file descriptor activity and then for the library to invoke a syscall to perform I/O. A new model is needed where the library tells the application about I/O instead of asking the application to monitor file descriptors for activity.

The library can use a new callback API that lets the application perform asynchronous I/O:

/*
 * The application invokes this callback when an aio operation has completed.
 *
 * @cb_arg: the cb_arg passed to a struct aio_operations function by the library
 * @ret: the return value of the aio operation (negative errno for failure)
 */
typedef void aio_completion_fn(void *cb_arg, ssize_t ret);

/*
 * Asynchronous I/O operation callbacks provided to the library by the
 * application.
 *
 * These functions initiate an I/O operation and then return immediately. When
 * the operation completes the @cb callback is invoked with @cb_arg. Note that
 * @cb may be invoked before the function returns (typically in the case of an
 * early error).
 */
struct aio_operations {
    void (*read)(int fd, void *data, size_t len, aio_completion_fn *cb,
                 void *cb_arg);
    void (*write)(int fd, void *data, size_t len, aio_completion_fn *cb,
                  void *cb_arg);
};

The concept of monitoring file descriptor activity is gone. Instead the API focusses on asynchronous I/O operations that can be implemented by the application however it sees fit.

Applications using io_uring can use IORING_OP_READ and IORING_OP_WRITE to implement asynchronous operations efficiently. Traditional applications can still use their event loops but now also perform read(2), write(2), and similar syscalls on behalf of the library.
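As a rough sketch of the traditional case, here is how an application with a poll(2) event loop might implement the callbacks. Everything beyond aio_completion_fn and struct aio_operations is an assumption made for illustration: the single pending-operation slot, app_run_once(), app_aio_ops, and the lib_store_result() callback standing in for the library are all hypothetical. The struct members are written as function pointers so that the example compiles as C.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

typedef void aio_completion_fn(void *cb_arg, ssize_t ret);

struct aio_operations {
    void (*read)(int fd, void *data, size_t len, aio_completion_fn *cb,
                 void *cb_arg);
    void (*write)(int fd, void *data, size_t len, aio_completion_fn *cb,
                  void *cb_arg);
};

/* Hypothetical: the application tracks one in-flight operation only. */
struct pending_op {
    int fd;
    short events;              /* POLLIN for read, POLLOUT for write */
    void *data;
    size_t len;
    aio_completion_fn *cb;
    void *cb_arg;
    int active;
};

static struct pending_op pending;

static void submit(int fd, short events, void *data, size_t len,
                   aio_completion_fn *cb, void *cb_arg)
{
    assert(cb != NULL);
    if (pending.active) {
        cb(cb_arg, -EBUSY);    /* early error: cb runs before we return */
        return;
    }
    pending = (struct pending_op){ .fd = fd, .events = events, .data = data,
                                   .len = len, .cb = cb, .cb_arg = cb_arg,
                                   .active = 1 };
}

static void app_read(int fd, void *data, size_t len, aio_completion_fn *cb,
                     void *cb_arg)
{
    submit(fd, POLLIN, data, len, cb, cb_arg);
}

static void app_write(int fd, void *data, size_t len, aio_completion_fn *cb,
                      void *cb_arg)
{
    submit(fd, POLLOUT, data, len, cb, cb_arg);
}

const struct aio_operations app_aio_ops = {
    .read = app_read,
    .write = app_write,
};

/* One event loop iteration. The application, not the library, performs the
 * read(2) or write(2). Returns 1 if an operation completed, 0 otherwise. */
int app_run_once(int timeout_ms)
{
    struct pending_op op = pending;
    struct pollfd pfd = { .fd = op.fd, .events = op.events };
    ssize_t ret;
    int rc;

    if (!op.active) {
        return 0;
    }
    rc = poll(&pfd, 1, timeout_ms);
    if (rc == 0 || (rc < 0 && errno == EINTR)) {
        return 0;              /* not ready yet, try again later */
    }
    if (rc < 0) {
        ret = -errno;          /* poll itself failed */
    } else {
        ret = (op.events == POLLIN) ? read(op.fd, op.data, op.len)
                                    : write(op.fd, op.data, op.len);
        if (ret < 0) {
            ret = -errno;
        }
    }
    pending.active = 0;        /* clear first so cb may submit a new op */
    op.cb(op.cb_arg, ret);
    return 1;
}

/* Hypothetical stand-in for a library completion callback. */
void lib_store_result(void *cb_arg, ssize_t ret)
{
    *(ssize_t *)cb_arg = ret;
}
```

An io_uring application could keep the same struct aio_operations interface and queue IORING_OP_READ or IORING_OP_WRITE requests instead of calling poll(2) followed by read(2) or write(2); the library would not need to change.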

Some libraries don't need a full set of struct aio_operations callbacks because they only perform I/O in limited ways. For example, a library that only has a Linux eventfd can instead present this simplified API:

/*
 * Return an eventfd(2) file descriptor that the application must read from and
 * call lib_eventfd_fired() when a non-zero value was read.
 */
int lib_get_eventfd(struct libobject *obj);

/*
 * The application must call this function when the eventfd returned by
 * lib_get_eventfd() read a non-zero value.
 */
void lib_eventfd_fired(struct libobject *obj);

Although this simplified API is similar to traditional event loop integration APIs it is now the application's responsibility to perform the eventfd read(2), not the library's. This way applications using io_uring can implement the read efficiently.
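The application side of this contract might look like the following sketch. The contents of struct libobject and the body of lib_eventfd_fired() are placeholders invented for the example, as is app_handle_eventfd(); only the two function signatures come from the API above. The example uses a plain non-blocking read(2), which is where an io_uring application would substitute an asynchronous read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical library object; its real contents are up to the library. */
struct libobject {
    int efd;                   /* eventfd owned by the library */
    unsigned int fired_count;  /* placeholder for real library work */
};

int lib_get_eventfd(struct libobject *obj)
{
    return obj->efd;
}

void lib_eventfd_fired(struct libobject *obj)
{
    obj->fired_count++;
}

/* Application side: read the eventfd and notify the library if a non-zero
 * value was read. Returns 1 if the library was notified, 0 if nothing was
 * pending, or a negative errno on failure. */
int app_handle_eventfd(struct libobject *obj)
{
    uint64_t value = 0;
    ssize_t n;

    assert(obj != NULL);

    n = read(lib_get_eventfd(obj), &value, sizeof(value));
    if (n < 0) {
        return (errno == EAGAIN || errno == EINTR) ? 0 : -errno;
    }
    if (n != (ssize_t)sizeof(value) || value == 0) {
        return 0;
    }
    lib_eventfd_fired(obj);
    return 1;
}
```

The library never touches the file descriptor after handing it out, so the application is free to choose how the read is performed.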

Does an extra syscall matter?

Whether it is worth eliminating the extra syscall depends on one's performance requirements. When I/O is relatively infrequent then the overhead of the additional syscall may not matter.

While working on QEMU I found that the extra read(2) on eventfds causes measurable overhead.

Conclusion

Splitting file descriptor monitoring from I/O is suboptimal for Linux io_uring applications. Unfortunately, existing library APIs are often designed in this way. Letting the application perform asynchronous I/O on behalf of the library allows a more efficient implementation with io_uring while still supporting applications that use older event loops.

Thursday, July 2, 2020

Avoiding bitrot in C macros

A common approach to debug messages that can be toggled at compile-time in C programs is:

#ifdef ENABLE_DEBUG
#define DPRINTF(fmt, ...) do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
#else
#define DPRINTF(fmt, ...)
#endif

Usually the ENABLE_DEBUG macro is not defined in normal builds, so the C preprocessor expands the debug printfs to nothing. No messages are printed at runtime and the program's binary size is smaller since no instructions are generated for the debug printfs.

This approach has the disadvantage that it suffers from bitrot, the tendency for source code to break over time when it is not actively built and used. Consider what happens when one of the variables used in the debug printf is not updated after being renamed:

- int r;
+ int radius;
  DPRINTF("radius %d\n", r);

The code continues to compile after r is renamed to radius because the DPRINTF() macro expands to nothing. The compiler does not syntax check the debug printf and misses that the outdated variable name r is still in use. When someone defines ENABLE_DEBUG months or years later, the compiler error becomes apparent and that person is confronted with fixing a new bug on top of whatever they were trying to debug when they enabled the debug printf!

It's actually easy to avoid this problem by writing the macro differently:

#define ENABLE_DEBUG 0
#define DPRINTF(fmt, ...) do { \
        if (ENABLE_DEBUG) { \
            fprintf(stderr, fmt, ## __VA_ARGS__); \
        } \
    } while (0)

When ENABLE_DEBUG is 0 the macro expands to:

do {
    if (0) {
        fprintf(stderr, fmt, ...);
    }
} while (0)

What is the difference? This time the compiler parses and syntax checks the debug printf even when it is disabled. Luckily compilers are smart enough to eliminate dead code, that is, code that can never be executed, so the binary size remains small.
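A complete, compilable version of this pattern could look like the sketch below. The #ifndef wrapper (so that a build can pass -DENABLE_DEBUG=1) and the circle_area_approx() function are additions made for illustration. Like the macro above, it relies on the GNU ## __VA_ARGS__ extension supported by GCC and Clang.

```c
#include <assert.h>
#include <stdio.h>

/* Default to disabled; a debug build can pass -DENABLE_DEBUG=1 instead. */
#ifndef ENABLE_DEBUG
#define ENABLE_DEBUG 0
#endif

/* The fprintf() call is always parsed and type checked. When ENABLE_DEBUG
 * is 0 the optimizer can drop the unreachable branch. */
#define DPRINTF(fmt, ...) do { \
        if (ENABLE_DEBUG) { \
            fprintf(stderr, fmt, ## __VA_ARGS__); \
        } \
    } while (0)

/* Hypothetical function with a debug printf. Renaming the parameter
 * without updating the DPRINTF() line is now a compile error in every
 * build, not only in debug builds. */
int circle_area_approx(int radius)
{
    assert(radius >= 0);
    DPRINTF("radius %d\n", radius);
    return 3 * radius * radius;
}
```

Changing the DPRINTF() argument back to a stale name such as r makes the build fail immediately, which is exactly the early warning the empty macro could not give.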

This applies not just to debug printfs. More generally, all preprocessor conditionals suffer from bitrot. If an #if ... #else ... #endif can be replaced with equivalent unconditional code then it's often worth doing.
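The same idea applied to a non-debug conditional might look like this sketch, where a compile-time switch selects between two implementations. The USE_UNROLLED_SUM macro and all function names are hypothetical. Both implementations are compiled in every build, so neither can silently rot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build-time switch, always defined as 0 or 1. */
#ifndef USE_UNROLLED_SUM
#define USE_UNROLLED_SUM 0
#endif

static long sum_simple(const int *v, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}

static long sum_unrolled(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {        /* two elements per iteration */
        total += (long)v[i] + v[i + 1];
    }
    if (i < n) {                       /* odd element left over */
        total += v[i];
    }
    return total;
}

/* An ordinary if on a constant replaces #if ... #else ... #endif. The
 * compiler checks both branches and discards the one that is not taken. */
long sum(const int *v, size_t n)
{
    assert(v != NULL || n == 0);

    if (USE_UNROLLED_SUM) {
        return sum_unrolled(v, n);
    }
    return sum_simple(v, n);
}
```

This only works when both branches are valid on every target. Code that depends on headers or types missing on some platforms still needs the preprocessor.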