Friday, October 15, 2021

A new approach to usermode networking with passt

There is a new project called passt that Stefano Brivio has been working on to implement usermode networking, the magic that forwards network packets between QEMU guests and the host network.

passt is designed as a replacement for QEMU's --netdev user (also known as slirp), a feature that is commonly used but not really considered production-ready. What passt improves on is security and performance, finally making usermode networking production-ready. That's all you need to know to try it out but I thought the internals of how passt works are interesting, so this article explains the approach.

Why usermode networking is necessary

Guests send and receive Ethernet frames through emulated network interface cards like virtio-net. Those frames need to be injected into the host network, but popular operating systems don't provide an API for sending and receiving Ethernet frames because that poses a security risk (spoofing) or could simply interfere with other applications on the host.

Actually, that's not quite true. Operating systems do provide specialized APIs for injecting Ethernet frames, but they come with limitations. For example, the Linux tun/tap driver requires additional network configuration steps as well as administrator privileges. Sometimes these limitations rule out tap and we really need a solution for unprivileged users. That's what usermode networking is about.

Transmuting Ethernet frames to Socket API calls

Since an unprivileged user cannot inject Ethernet frames into the host network, we have to make do with the POSIX Sockets API that is available to unprivileged users. Each Ethernet frame sent by the guest needs to be converted into equivalent Sockets API calls on the host so that the desired effect is achieved even though we weren't able to transmit the original Ethernet frame byte-for-byte. Incoming packets from the external network need to be received via the Sockets API and repackaged into Ethernet frames that the guest network interface card can receive.

In networking parlance this conversion between Ethernet frames and Sockets API calls is a Layer 2 (Data Link Layer)/Layer 4 (Transport Layer) conversion. The Ethernet frames have additional packet headers including the Ethernet header, IP header, and the TCP/UDP header that the Sockets API calls don't include. Careful use of the Sockets API makes it possible to synthesize Ethernet frames that are similar enough to the original ones that the guest can communicate successfully.
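To make the Layer 2 to Layer 4 direction concrete, here is a minimal Python sketch (not passt's code, which is written in C) that strips the Ethernet, IPv4, and UDP headers from a guest frame and returns what a sendto() call on the host would need. The function name parse_udp_frame is made up for this illustration, and the sketch ignores VLAN tags, IPv6, and fragmentation.

```python
import socket
import struct

ETH_HDR_LEN = 14          # dst MAC (6) + src MAC (6) + EtherType (2)
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
UDP_HDR_LEN = 8


def parse_udp_frame(frame):
    """Return (dst_ip, dst_port, payload) for an Ethernet/IPv4/UDP frame.

    Returns None for anything else. Hypothetical helper for illustration.
    """
    if len(frame) < ETH_HDR_LEN + 20 + UDP_HDR_LEN:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return None

    ver_ihl = frame[ETH_HDR_LEN]
    if ver_ihl >> 4 != 4:
        return None
    ip_hdr_len = (ver_ihl & 0x0F) * 4      # IHL is counted in 32-bit words
    if ip_hdr_len < 20:
        return None
    if frame[ETH_HDR_LEN + 9] != IPPROTO_UDP:
        return None
    dst_ip = socket.inet_ntoa(frame[ETH_HDR_LEN + 16:ETH_HDR_LEN + 20])

    udp_off = ETH_HDR_LEN + ip_hdr_len
    if len(frame) < udp_off + UDP_HDR_LEN:
        return None
    _src_port, dst_port, udp_len, _csum = struct.unpack_from("!HHHH", frame, udp_off)
    if udp_len < UDP_HDR_LEN or udp_off + udp_len > len(frame):
        return None
    payload = frame[udp_off + UDP_HDR_LEN:udp_off + udp_len]
    return dst_ip, dst_port, payload
```

The returned tuple maps directly onto sock.sendto(payload, (dst_ip, dst_port)) on an AF_INET SOCK_DGRAM socket. The reverse direction has to rebuild all three headers around the data that the host socket returns.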

For the most part this conversion is straightforward: packet headers are parsed on the way out of the guest and built on the way in. TCP makes things more interesting, though, because a TCP connection involves non-trivial state that is normally handled by the TCP/IP stack. For example, data sent over a TCP connection might arrive out of order or some chunks may have been dropped. There is also a state machine for the TCP connection lifecycle including its famous three-way handshake. This means TCP connections must be carefully tracked so that these low-level protocol features can be simulated correctly.
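As a rough idea of what "tracking a connection" means, the following toy Python sketch follows a guest-initiated connection through the three-way handshake by looking only at the TCP flags of segments sent by the guest. The state names and the next_state function are assumptions made for this illustration. The real TCP state machine has more states, and this is not how passt represents them.

```python
from enum import Enum, auto

# TCP header flag bits
FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10


class State(Enum):
    CLOSED = auto()
    SYN_RECEIVED = auto()   # guest sent SYN; a SYN+ACK goes back to the guest
    ESTABLISHED = auto()    # guest acknowledged the SYN+ACK
    FIN_RECEIVED = auto()   # guest started closing the connection


def next_state(state, flags):
    """Advance the toy state machine for one segment received from the guest."""
    if flags & RST:
        return State.CLOSED
    if state is State.CLOSED and (flags & SYN) and not (flags & ACK):
        return State.SYN_RECEIVED
    if state is State.SYN_RECEIVED and (flags & ACK) and not (flags & SYN):
        return State.ESTABLISHED
    if state is State.ESTABLISHED and (flags & FIN):
        return State.FIN_RECEIVED
    return state   # anything else leaves the state unchanged in this sketch
```

A usermode networking implementation needs one such record per connection, together with sequence and acknowledgement numbers, so that it can answer the guest the way a remote TCP peer would.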

How passt works

Passt runs as an unprivileged host userspace process that is connected to QEMU through --netdev socket, a way to transfer Ethernet frames from QEMU's network interface card emulation to another process like passt. When passt reads an Ethernet frame from the guest that carries, say, a UDP datagram, it sends the payload onwards through an equivalent AF_INET SOCK_DGRAM socket on the host. It also keeps the socket open so replies can be read on the host and then packaged into Ethernet frames that are written to the guest. The effect is that guest network traffic appears to come from the passt process on the host and integrates nicely into host networking.
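Because a stream socket has no message boundaries, the frames travelling between QEMU and the other process need some framing. To my understanding QEMU's stream-mode socket netdev prefixes each Ethernet frame with its length as a 32-bit big-endian integer; treat that as an assumption to verify against your QEMU version. The helper names below are invented for this sketch, which shows how a receiver could split such a byte stream back into frames.

```python
import struct

LEN_PREFIX = struct.Struct("!I")   # assumed framing: 32-bit big-endian length


def add_length_prefix(frame):
    """Wrap one Ethernet frame for transmission over the stream socket."""
    return LEN_PREFIX.pack(len(frame)) + frame


def split_frames(buf):
    """Pop every complete frame from the bytearray buf.

    A trailing partial frame is left in buf until more bytes arrive.
    """
    frames = []
    while len(buf) >= LEN_PREFIX.size:
        (length,) = LEN_PREFIX.unpack_from(buf, 0)
        end = LEN_PREFIX.size + length
        if len(buf) < end:
            break                      # frame not fully received yet
        frames.append(bytes(buf[LEN_PREFIX.size:end]))
        del buf[:end]
    return frames
```

A receive loop would append whatever recv() returns to buf and call split_frames(buf) after each read. One read may yield several frames, which is also where batching comes from.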

How TCP works is a bit more interesting. Since TCP connections require acknowledgement messages for reliable delivery, passt uses the recvmmsg(2) MSG_PEEK flag to fetch data while keeping it queued in the host network stack's rcvbuf until the guest acknowledges it. This avoids extra buffer management code in passt and is part of its strategy of implementing only a subset of TCP. There is no need to duplicate the full TCP/IP stack since the host and guest already have them, but achieving this requires detailed knowledge of TCP so that passt can maintain just enough state.
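The MSG_PEEK idea can be tried out in a few lines of Python. This sketch uses plain recv() rather than recvmmsg(2), which Python's socket module does not expose, and the helper names peek and discard are made up for the example. Peeking copies data out while leaving it queued in the kernel, and data is only consumed once the (imagined) guest has acknowledged it.

```python
import socket


def peek(sock, max_len):
    """Copy up to max_len queued bytes without removing them from the socket."""
    return sock.recv(max_len, socket.MSG_PEEK)


def discard(sock, count):
    """Drop count bytes from the socket's receive queue, e.g. after a guest ACK."""
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            break                      # peer closed the connection
        count -= len(chunk)
```

With a connected pair such as host_side, peer = socket.socketpair(), sending b"hello world" from peer and calling peek(host_side, 5) twice returns b"hello" both times. After discard(host_side, 6) the next peek returns b"world". The kernel's receive buffer doubles as the retransmission buffer, so the forwarding process does not need one of its own.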

Incoming connections are handled by port forwarding. This means passt can bind to port 2222 on the host and forward connections to port 22 inside the guest. This is very useful for usermode networking since the user may not have permission to bind to low-numbered ports on the host or there might already be host services listening on those ports. If you don't want to mess with port forwarding you can use passt's all mode, which simply listens on all non-ephemeral ports (basically a brute force approach).
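Conceptually, port forwarding is a lookup table from host ports to guest ports. The Python sketch below builds such a table from strings like "2222:22". Both the spec format and the function names are invented for this example and should not be read as passt's command-line syntax.

```python
def parse_forward(spec):
    """Parse 'HOST_PORT:GUEST_PORT' or 'PORT' into (host_port, guest_port)."""
    host, sep, guest = spec.partition(":")
    host_port = int(host)
    guest_port = int(guest) if sep else host_port
    for port in (host_port, guest_port):
        if not 1 <= port <= 65535:
            raise ValueError("port out of range: %d" % port)
    return host_port, guest_port


def build_forward_table(specs):
    """Map each host port that should be bound to its guest destination port."""
    return dict(parse_forward(spec) for spec in specs)
```

For example, build_forward_table(["2222:22", "8080"]) yields {2222: 22, 8080: 8080}. The forwarder would bind a listening socket for each key and rewrite the destination port to the matching value when it builds frames for the guest.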

A few basic network protocols are necessary for network communication: ARP, ICMP, DHCP, DNS, and IPv6 services. Passt offers these because the guest cannot talk to those services on the host network directly. They can be disabled when the guest has knowledge of the network configuration and doesn't need them.

Why passt is unique

By running in a separate process from QEMU and taking a minimalist approach, passt is able to tighten security. Its seccomp filters are stronger than anything the larger QEMU process could do. The code is clean and specifically designed for security and simplicity. I think writing passt in C was a missed opportunity. Some users may rule it out entirely for this reason. Userspace parsing of untrusted network packets should be done in a memory-safe programming language nowadays. Nevertheless, it's a step up from slirp, which has a poor track record of security issues and runs as part of the QEMU process.

I'm excited to see how passt performs in comparison to slirp. passt uses techniques like recvmmsg(2)/sendmmsg(2) to batch message transfer and reads multiple Ethernet frames from the guest in a single syscall to amortize the cost of syscalls over multiple packets. There is no dynamic memory allocation and packet headers are pre-populated to minimize the number of CPU cycles spent in the data path. This is promising, although QEMU's --netdev socket isn't the fastest transport (packets first take a trip through QEMU's net subsystem queues). It is still a good trade-off between performance and simplicity/security. Based on reading the code, I think passt will be faster than slirp but I haven't benchmarked it myself.
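The pre-populated header trick can be illustrated with a small Python sketch: fill in an Ethernet + IPv4 + UDP header once per flow, then patch only the length and checksum fields for each packet. The class name UdpFrameTemplate is an assumption of this example, the UDP checksum is left at zero (permitted for IPv4), and Python still allocates memory on every call, so this shows the idea rather than the performance.

```python
import struct


def ipv4_checksum(header):
    """Internet checksum (one's complement sum) over a 20-byte IPv4 header."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


class UdpFrameTemplate:
    """Ethernet/IPv4/UDP headers filled in once and reused for every packet."""

    def __init__(self, dst_mac, src_mac, src_ip, dst_ip, src_port, dst_port):
        # src_ip and dst_ip are 4-byte packed addresses.
        self.hdr = bytearray(14 + 20 + 8)
        struct.pack_into("!6s6sH", self.hdr, 0, dst_mac, src_mac, 0x0800)
        struct.pack_into("!BBHHHBBH4s4s", self.hdr, 14,
                         0x45, 0, 0, 0, 0, 64, 17, 0, src_ip, dst_ip)
        struct.pack_into("!HHHH", self.hdr, 34, src_port, dst_port, 0, 0)

    def build(self, payload):
        hdr = self.hdr
        struct.pack_into("!H", hdr, 16, 20 + 8 + len(payload))  # IP total length
        struct.pack_into("!H", hdr, 24, 0)                      # clear old checksum
        struct.pack_into("!H", hdr, 24, ipv4_checksum(hdr[14:34]))
        struct.pack_into("!H", hdr, 38, 8 + len(payload))       # UDP length
        return bytes(hdr) + payload
```

Only three 16-bit fields change between packets of the same flow, so the per-packet work stays small no matter how many header fields there are.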

There is another mode that passt supports for containers instead of virtualization. Although it's not relevant to QEMU, this so-called pasta mode is a cool feature for container networking. In this mode pasta connects a network namespace with the outside world (init namespace) through a tap device. This might become passt's killer feature, because the same software can be used for both virtualization and containers, so why bother investing in two separate solutions?

Conclusion

Passt is a promising replacement for slirp (on Linux hosts at least). It looks like there will finally be a production-ready usermode networking feature for QEMU that is fast and secure. Passt's functionality is generic enough that other projects besides QEMU will be able to use it, which is great because this kind of networking code is non-trivial to develop. I look forward to passt becoming available for use with QEMU in Linux distributions soon!