Stefan Hajnoczi: 2020

Friday, December 11, 2020

Building a git forge using git apps

The two previous blog posts about why git forges are von Neumann machines and the Radicle peer-to-peer git forge explored models for git forges. In this final post I want to cover yet another model that draws from the previous ones but has its own unique twist.

Peer-to-peer git apps

I previously showed how applications can be built on centralized git forges using CI/CD functionality for executing code, webhooks for interacting with the outside world, and disjoint branches for storing data.

A more elegant architecture is a peer-to-peer one where instead of many clients and one server there are just peers. Each peer has full access to the data. There is no client/server application code split, instead each peer runs an application for itself.

First, this makes it easier to move the data to new hosting infrastructure or fork a project since all data resides in the git repository. Merge requests, issues, wikis, and even the app settings are all stored in the git repo itself.

Second, this gives more power to the users who can process data however they want without being limited by the server's API. All peers are on equal footing and users don't need permission to alter applications, because they run locally.

Finally, it is easier to develop a local application than a client/server application. Being able to open a file and tweak the code is immediate and less hassle than testing and deploying a server-side application.

Internet peer-to-peer systems typically still require some central point for bootstrapping and this is no exception. A publicly-accessible git repository is still needed so that peers can fetch and push changes. However, in this model the git server does not run application code but "git apps" like merge requests, issue trackers, wikis, etc can still be implemented. Here is how it works...

The anti-application server

The git server is not allowed to run application code in our model, so apps like merge requests won't be processing data on the server side. However, the repository does need some primitives to make peer-to-peer git apps possible. These primitives are access control policies for refs and directories/files.

Peers run applications locally and the git server is "dumb" with the sole job of enforcing access control. You can imagine this like a multi-user UNIX machine where users have access to a shared directory. UNIX file permissions determine how processes can access the data. By choosing permissions carefully, multiple users can collaborate in the shared directory in a safe and controlled manner.

This is an anti-application server because no application code runs on the server side. The server is just a git repository that stores data and enforces access control on git push.

Access control

Repositories that accept push requests need a pre-receive hook (see githooks(5)) that checks incoming requests against the access control policy. If the request complies with the access control policy then the git push is accepted. Otherwise the git push is rejected and changes are not made to the git repository.

The first type of access control is on git refs. Git refs are the namespace where branches and tags are stored in a git repository. If a regular expression matches the ref and the operation type (create, fast-forward, force, delete) then it is allowed. For example, this policy rule allows any user to push to refs/heads/foo but force pushes and deletion are not allowed:

anyone create,fast-forward ^heads/foo$

The operations available on refs include:

Operation	Description
`create-branch`	Push a new branch that doesn't exist yet
`create-tag`	Push a new tag that doesn't exist yet
`fast-forward`	Push a commit that is a descendent of the current commit
`force`	Push a commit or tag replacing the previous ref
`delete`	Delete a ref

What's more interesting is that $user_id is expanded to the git push user's identifier so we can write rules to limit access to per-user ref namespaces:

anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$

This would allow Alice to push her own branches but Alice could not push to Bob's branches.

We have covered how to define access control policies on refs. Access control policies are also needed on branches so that multiple users can modify the same branch in a controlled and safe manner. The syntax is similar but the policy applies to changes made by commits to directories/files (what git calls a tree). The following allows users to create files in a directory but not delete or modify them (somewhat similar to the UNIX restricted deletion or "sticky" bit on world-writable directories):

anyone create-file ^shared-dir/.*$

The operations available on branches include:

Operation	Description
`create-directory`	Create a new directory
`create-file`	Create a new file
`create-symlink`	Create a symlink
`modify`	Change an existing file or symlink
`delete-file`	Delete a file
...

$user_id expansion is also available for branch access control. Here the user can create, modify, and delete files in a per-user directory:

anyone create-file,modify,delete-file ^$user_id/.*$

User IDs

You might be wondering how user identifiers work. Git supports GPG-signed push requests with git push --signed. We can use the GPG key ID as the user identifier, eliminating the need for centralized user accounts. Remember that the GPG key ID is based on the public key. Key pairs are randomly generated and it is improbable that the same key will be generated by two different users. That said, GPG key ID uniqueness has been weak in the past when the default size was 32 bits. Git explicitly enables long 64-bit GPG key IDs but I wonder if collisions could be a problem. Maybe an ID with more bits based on the public key should be used instead, but for now let's assume the GPG key ID is unique.

The downside of this approach is that user IDs are not human-friendly. Git apps can allow the user to assign aliases to avoid displaying raw user IDs. Doing this automatically either requires an external ID issuer like confirming email address ownership, which is tedious for new users, or by storing a registry of usernames in the git repo, which means a first-come-first-server policy for username allocation and possible conflicts when merging from two repositories that don't share history. Due to these challenges I think it makes sense to use raw GPG key IDs at the data storage level and make them prettier at the user interface level.

The GPG key ID approach works well for desktop clients but not for web clients. The web application (even if implemently on the client side) would need access to the private key so it can push to the git repository. Users should not trust remotely hosted web applications with their private keys. Maybe there is a standard Web API that can help but I'm not aware one. More thought is needed here.

The pre-receive git hook checks that signature verification passed and has access to the GPG key ID in the GIT_PUSH_CERT_KEY environment variable. Then the access control policy can be checked.

Access control is a git app

Access control is the first and most fundamental git app. The access control policies that were described above are stored as files in the apps/access-control branch in the repository. Pushes to that branch are also subject to access control checks. Here is the branch's initial layout:

branches/ - access control policies for branches
  owner.conf
groups/ - group definitions (see below)
  ...
refs/ - access control policies for refs
  owner.conf

The default branches/owner.conf access control policy is as follows:

owner create-file,create-directory,modify,delete ^.*$

The default refs/owner.conf access control policy is as follows:

owner create-branch,create-tag,fast-foward,force,delete ^.*$

This gives the owner the ability to push refs and modify branches as they wish. The owner can grant other users access by pushing additional access control policy files or changing exsting files on the apps/access-control branch.

Each access control policy file in refs/ or branches/ is processed in turn. If no access control rule matches the operation then the entire git push is rejected.

Groups can be defined to alias one or more user identifiers. This avoids duplicating access control rules when more than one user should have the same access. There are two automatic groups: owner contains just the user who owns the git repository and anyone is the group of all users.

This completes the description of the access control app. Now let's look at how other functionality is built on top of this.

The merge requests app

A merge requests app can be built on top of this model. The refs access control policy is as follows:

# The data branch contains the titles, comments, etc
anyone modify ^apps/merge-reqs/data$

# Each merge request revision is pushed as a tag in a per-user namespace
anyone create-tag ^apps/merge-reqs/$user_id/[0-9]+-v[0-9]+$

The branch access control policy is:

# Merge requests are per-user and numbered
anyone create-directory ^merge-reqs/$user_id/[0-9]+$

# Title string
anyone create-file,modify ^merge-reqs/$user_id/[0-9]+/title$

# Labels (open, needs-review, etc) work like this:
#
#   merge-reqs/<user-id>/<merge-req-num>/labels/
#     needs-review -> /labels/needs-review
#     ...
#   labels/
#     needs-review/
#       <user-id>/
#         <merge-req-num> -> /merge-reqs/<user-id>/<merge-req-num>
#         ...
#       ...
#     ...
#
# This directory and symlink layout makes it possible to enumerate labels for a
# given merge request and to enumerate merge requests for a given label.
#
# Both the merge request author and maintainers can add/remove labels to/from a
# merge request.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/labels$
anyone create-symlink,delete ^merge-reqs/$user_id/[0-9]+/labels/.*$
maintainers create-symlink,delete ^merge-reqs/[^/]+/[0-9]+/labels/.*$
maintainers create-directory ^labels/[^/]+$
anyone create-symlink,delete ^labels/[^/]+/$user_id/[0-9]+$
maintainers create-symlink,delete ^labels/[^/]+/[^/]+/[0-9]+$

# Comments are stored as individual files in per-user directories. Each file
# contains a timestamp and the contents of the comment. The timestamp can be
# used to sort comments chronologically.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments$
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments/$user_id$
anyone create-file,modify ^merge-reqs/[^/]+/[0-9]+/comments/$user_id/[0-9]+$

When a user creates a merge request they provide a title, an initial comment, apply labels, and push a v1 tag for review and merging. Other users can comment by adding files into the merge request's per-user comments directory. Labels can be added and removed by changing symlinks in the labels directories.

The user can publish a new revision of the merge request by pushing a v2 tag and adding a comment describing the changes. Once the maintainers are satisfied they merge the final revision tag into the relevant branch (e.g. "main") and relabel the merge request from open/needs-review to closed/merged.

This workflow can be implemented by a tool that performs the necessary git operations so users do not need to understand the git app's internal data layout. Users just need to interact with the tool that displays merge requests, allows commenting, provides searches, etc. A natural way to implement this tool is as a git alias so it integrates alongside git's built-in commands.

One issue with this approach is that it uses the file system as a database. Performance and scalability are likely to be worst than using a database or application-specific file format. However, the reason for this approach is that it allows the access control app to enforce a policy that ensures users cannot modify or delete other user's data without running application-specific code on the server and while keeping everything stored in a git repository.

An example where this approach performs poorly is for full-text search. The application would need to search all title and comment files for a string. There is no index for efficient lookups. However, if applications find that git-grep(1) does not perform well they can maintain their own index and cache files locally.

I hope that this has shown how git apps can be built without application code running on the server.

Continuous integration bots

Now that we have the merge requests app it's time to think how a continuous integration service could interface with it. The goal is to run tests on each revision of a merge request and report failures so the author of the merge request can rectify the situation.

A CI bot watches the repository for changes. In particular, it needs to watch for tags created with the ref name apps/merge-reqs/[^/]+/[0-9]+-v[0-9]+.

When a new tag is found the CI bot checks it out and runs tests. The results of the tests are posted as a comment by creating a file in merge-regs/<user-id>/>merge-req-num>/comments/<ci-bot-user-id>/0 on the apps/merge-reqs/data branch. A ci-pass or ci-fail label can also be applied to the merge request so that the CI status can be easily queried by users and tools.

Going further

There are many loose ends. How can non-git users participate on issue trackers and wikis? It might be possible to implement a full peer as a client-side web application using isomorphic-git, a JavaScript git implementation. As mentioned above, the GPG key ID approach is not very browser-friendly because it requires revealing the private key to the web page and since keys are user identifiers using temporary keys does not work well.

The data model does not allow efficient queries. A full copy of the data is necessary in order to query it. That's acceptable for local applications because they can maintain their own indexes and are expected to keep the data for a long period of time. It works less well for short-lived web page sessions like a casual user filing a new bug on the issue tracker.

The git push --signed technique is not the only option. Git also supports signed commits and signed tags. The difference between signed pushes and signed tags/commits is significant. The signed push approach only validates the access control policy when the repository is changed and leaves no audit log for future reference. The signed commit/tag approach keeps the signatures in the git history. Signed commits/tags can be propagated in a peer-to-peer network and each peer can validate the access control policy itself. While signed commits/tags apply the access control policy to each object in the repository, signed pushes apply the access control policy to each change made to the repository. The difference is that it's easy to rebase and include work from different authors with signed pushes. Signed commits/tags require re-signing for rebasing and each commit is validated against its signature, which may be different from the user who is making the push request.

There are a lot of interesting approaches and trade-offs to explore here. This model we've discussed fits closely with how I've seen developers use git in open source projects. It is designed around a "main" repository/server that contributors push their code to. But each clone of the repository has all the data and can be published as a new "main" repository, if necessary.

Although these ideas are unfinished I decided to write them up with the knowledge that I probably won't implement them myself. QEMU is moving to GitLab with a traditional centralized git forge. I don't think this is the right time to develop this idea and try to convince the QEMU community to use it. For projects that have fewer infrastructure requirements it would give their contributors more power than being confined to a centralized git forge.

I hope this was an interesting read for anyone thinking about git forges and building git apps.

Sunday, December 6, 2020

Understanding Peer-to-Peer Git Forges with Radicle

Git is a distributed version control system and does not require a central server. Although repositories are usually published at a well-known location for convenient cloning and fetching of the latest changes, this is actually not necessary. Each clone can have the full commit history and evolve independently. Furthermore, code changes can be exchanged via email or other means. Finally, even the clone itself does not need to be made from a well-known domain that hosts a git repository (see git-bundle(1)).

Given that git itself is already fully decentralized one would think there is no further work to do. I came across the Radicle project and its somewhat psychedelic website. Besides having a website with a wild color scheme, the project aims to offer a social coding experiment or git forge functionality using a peer-to-peer network architecture. According to the documentation the motivation seems to be that git's built-in functionality works but is not user-friendly enough to make it accessible. In particular, it lacks social coding features.

The goal is to add git forge features like project and developer discovery, issue trackers, wikis, etc. Additional, distinctly decentralized functionality, is also touched on involving Ethereum as a way to anchor project metadata, pay contributors, etc. Radicle is still in early development so these features are not yet implemented. Here is my take on the How it Works documentation, which is a little confusing due to its early stage and some incomplete sentences or typos. I don't know whether my understanding actually corresponds to the Radicle implementation that exists today or its eventual vision, because I haven't studied the code or tried running the software. However, the ideas that the documentation has brought up are interesting and fruitful in their own right, so I wanted to share them and explain them in my own words in case you also find them worth exploring.

The git data model

Let's quickly review the git data model because it is important for understanding peer-to-peer git forges. A git repository contains a refs/ subdirectory that provides a namespace for local branch heads (refs/heads/), local and remotely fetched tags (refs/tags/), and remotely fetched branches (refs/remotes/<remote>/). Actually this namespace layout is just a convention for everyday git usage and it's possible to use the refs/ namespace differently as we will see. The git client fetches refs from a remote according to a refspec rule that maps remote refs to local refs. This gives the client the power to fetch only certain refs from the server. The client can also put them in a different location in its local refs/ directory than the server. For details, see the git-fetch(1) man page.

Refs files contain the commit hash of an object stored in the repository's object database. An object can be a commit, tree (directory), tag, or a blob (file). Branch refs point to the latest commit object. A commit object refers to a tree object that may refer to further tree objects for sub-directories and finally the blob objects that make up the files being stored. Note that a git repository supports disjoint branches that share no history. Perhaps the most well-known example of disjoint branches are the GitHub Pages and GitLab Pages features where these git forges publish static websites from the HTML/CSS/JavaScript/image files on a specific branch in the repository. That branch shares no version history with other branches and the directories/files typically have no similarity to the repository's main branch.

Now we have covered enough git specifics to talk about peer-to-peer git forges. If you want to learn more about how git objects are actually stored, check out my article on the repository layout and pack files.

Identity and authority

Normally a git repository has one or more owners who are allowed to push refs. No one else has permission to modify the refs namespace. What if we tried to share a single refs namespace with the whole world and everyone could push? There would be chaos due to naming conflicts and malicious users would delete or change other users' refs. So it seems like an unworkable idea unless there is some way to enforce structure on the global refs namespace.

Peer-to-peer systems have solutions to these problems. First, a unique identity can be created by picking a random number with a sufficient number of bits so that the chance of collision is improbable. That unique identity can be used as a prefix in the global ref namespace to avoid accidental collisions. Second, there needs to be a way to prevent unauthorized users from modifying the part of the global namespace that is owned by other users.

Public-key cryptography provides the primitive for achieving both these things. A public key or its hash can serve as the unique identifier that provides identity and prevents accidental collisions. Ownership can be enforced by verifying that changes to the global namespace are signed with the private key corresponding to the unique identity.

For example, we fetch the following refs from a peer:

<identity>/
  heads/
    main
  metadata/
    signed_refs

This is a simplified example based on the Radicle documentation. Here identity is the unique identity based on a public key. Remember no one else in the world has the same identity because the chance of generating the same public key is improbable. The heads/ refs are normal git refs to commit objects - these are published branches. The signed_refs ref points to an git object that contains a list of commit hashes and a signature generated using the public key. The signature can be verified using the public key.

Next we need to verify these changes to check that they were created with the private key that is only known to the identity's owner. First, we check the signature on the object pointed to by the signed_refs ref. If the signature is not valid we reject these changes and do not store them in our local repository. Next, we look up each ref in heads/ against the list in signed_refs. If a ref is missing from the list then we reject these refs and do not allow them into our local repository.

This scheme lends itself to peer-to-peer systems because the refs can be propagated (copied) between peers and verified at each step. The identity owner does not need to be present at each copy step since their cryptographic signature is all we need to be certain that they authorized these refs. So I can receive refs originally created by identity A from peer B and still be sure that peer B did not modify them since identity A's signature is intact.

Now we have a global refs namespace that is partitioned so that each identity is able to publish refs and peers can verify that these changes are authorized.

Gossip

It may not be clear yet that it's not necessary to clone the entire global namespace. In fact, it's possible that no single peer will ever have a full copy of the entire global namespace! That's because this is a distributed system. Peers only fetch refs that they care about from their peers. Peers fetch from each other and this forms a network. The network does not need to be fully connected and it's possible to have multiple clusters of peers running without full global connectivity.

To bootstrap the global namespace there are seed repositories. Seeds are a common concept in peer-to-peer systems. They provide an entry point for new peers to learn about and start participating with other peers. In BitTorrent this is called a "tracker" rather than a "seed".

According to the Radicle documentation it is possible to directly fetch from peers. This probably means a git-daemon(1) or git-http-backend(1) needs to be accessible on the public internet. Many peers will not have sufficient network connectivity due to NAT limitations. I guess Radicle does not expect every user to participate as a repository.

Interestingly, there is a gossip system for propagating refs through the network. Let's revisit the refs for an identity in the global namespace:

<identity>/
  heads/
    main
  metadata/
    signed_refs
  remotes/
    <another-identity>/
      heads/
        main
        foo
      metadata/
        signed_refs
      remotes/
        ...

We can publish identities that we track in remotes/. It's a recursive refs layout. This is how someone tracking our refs can find out about related identities and their refs.

Thanks to git's data model the commit, tree, and blob objects can be shared even though we duplicate refs published by another identity. Since git is a content-addressable object database the data is stored once even though multiple refs point to it.

Now we not only have a global namespace where anyone can publish git refs, but also ways to build a peer-to-peer network and propagate data throughout the network. It's important to note that data is only propagated if peers are interested in fetching it. Peers are not forced to store data that they are not interested in.

How data is stored locally

Let's bring the pieces together and show how the system stores data. The peer creates a local git repository called the monorepo for the purpose of storing portions of the global namespace. It fetches refs from seeds or direct peers to get started. Thanks to the remotes/ refs it also learns about other refs on the network that it did not request directly.

This git repository is just a data store, it is not usable for normal git workflows. The conventional git branch and git tag commands would not work well with the global namespace layout and verification requirements. Instead we can clone a local file:/// repository from the monorepo that fetches a subset of the refs into the conventional git refs layout. The files can be shared because git-clone(1) supports hard links to local repositories. Thanks to githooks(5) and/or extensible git-push(1) remote helper support it's possible to generate the necessary global namespace metadata (e.g. signatures) when we push from the local clone to the local monorepo. The monorepo can then publish the final refs to other peers.

Building a peer-to-peer git forge

There are neat ideas in Radicle and it remains to be seen how well it will grow to support git forge functionality. A number of challenges need to be addressed:

Usability - Radicle is a middle-ground between centralized git forges and email-based decentralized development. The goal is to be easy to use like git forges. Peer-to-peer systems often have challenges providing a human-friendly interface on top of public key identities (having usernames without centralized user accounts). Users will probably prefer to think in terms of repositories, merge requests, issues, and wikis instead of peers, gossip, identities, etc.
Security - The global namespace and peer-to-peer model is a target that malicious users will attack by trying to impersonate or steal identities, flood the system with garbage, game reputation systems with sockpuppets, etc.
Scalability - Peers only care about certain repositories and don't want to be slowed down by all the other refs in the global namespace. The recursive refs layout seems like it could cause performance problems and maybe users will configure clients to limit the depth to a low number like 3. At first glance Radicle should be able to scale well since peers only need to fetch refs they are interested in, but it's a novel way of using git refs, so we can expect scalability problems as the system grows.
Data model - How will this data model grow to handle wikis, issue trackers, etc? Issue tracker comments are an example of a data structure that requires conflict resolution in a distributed system. If two users post comments on an issue, how will this be resolved without a conflict? Luckily there is quite a lot of research on distributed data structures such as Conflict-free Replicated Data Types (CRDTs) and it may be possible to avoid most conflicts by eliminating concepts like linear comment numbering.
CI/CD - As mentioned in my blog post about why git forges are von Neumann machines, git forges are more than just data stores. They also have a computing model, initially used for Continuous Integration and Continuous Delivery, but really a general serverless computing platform. This is hard to do securely and without unwanted resource usage in a peer-to-peer system. Maybe Radicle will use Ethereum for compute credits?

Conclusion

Radicle is a cool idea and I look forward to seeing where it goes. It is still at an early stage but already shows interesting approaches with the global refs namespace and monorepo data store.

Wednesday, December 2, 2020

Software Freedom Conservancy announces donation matching

Software Freedom Conservancy, the non-profit charity home of QEMU, Git, Inkscape, and many other free and open source software projects is running its annual fundraiser. They have announced a generous donation matching pledge so donations made until January 15th 2021 will be doubled! Full details are here.

What makes Software Freedom Conservancy important, besides being a home for numerous high-profile free and open source software projects, is that it is backed by individuals and works for the public interest. It is not a trade association funded by companies to represent their interests. With the success of free and open source software it's important we don't lose these freedoms or use them just to benefit businesses. That's why I support Software Freedom Conservancy.

Monday, November 30, 2020

Git Forge Apps: Why git forges are serverless computing providers

In this post I want to explore an idea for a new type of application made possible by the power of git forges like GitLab and GitHub. I don't have a proof-of-concept and in fact we'll discuss hurdles that could make the idea infeasible. But it's an interesting way of thinking that is worth sharing as it offers a new way of looking at and using git forges.

Git forges offer git repository hosting at their core and then add on related features like pull requests, wikis, issue trackers, static website hosting, continuous integration, and more. They are now a long way away from the initial idea of a hosted git repository where you can publish code. They do so much more. They have a von Neumann architecture and are Turing complete. It's possible to do interesting things with that.

A git forge app (GFA) is an application that runs on a git forge. A hosted git repository is not just a place to publish and back up the source code. It's where the app runs. You can view the git forge as a Platform-as-a-Service (PaaS) or serverless computing provider. The app doesn't need to be deployed elsewhere, because the git forge itself is the execution environment.

A GFA must be able to:

Process data.
Store data.
Interact with the outside world.

This is basically what computer applications do. Git forges have become powerful enough to do this.

Imagine the following: you fork a repository and this instantiates the GFA. The GFA is now running under your git forge account. The web interface is available at https://<user>.gitlab.com/<app>/ and HTTP POST requests can also be used to interact with the application. The GFA could be a todo list, an RSS reader, a blog, etc. To the website visitor it appears like any other web application but everything happens within the git forge and no other hosting provider is necessary.

Let's look at how it's possible to use a git forge as the execution environment for an app.

Processing Data

The first order of business is executing code so that the application can process data. A configuration file is placed into a git repository to define runnable jobs. The job execution systems offered by git forges are GitLab CI and GitHub Actions, respectively. When the job is triggered it executes in a temporary environment, for example a Linux container image. It allows code hosted in the git repository to run on a server somewhere for a little while. There is a free allowance that grants a certain number of hours of unpaid execution time. This is how GFAs process data.

For example, say we are building an RSS reader. Our repo looks like this:

rss-reader/
    .gitlab-ci.yml - the job configuration file
    fetch-new-feeds.sh - download RSS feeds and check for new posts

The .gitlab-ci.yml file will contain a scheduled pipeline that runs fetch-new-feeds.sh every 15 minutes.

Storing Data

Applications need to store data persistently. The primary way of doing that in a git forge is by storing data in the git repository itself. Mutable data can be kept on a separate branch called data and can be force-pushed to avoid forever increasing the repository size when we don't need to store previous revisions of the data.

This works because each git branch has an independent commit history. The file namespace it stores is completely separate from other branches. This means it's possible to have the GFA's code on the main branch and to store data on the data branch without upsetting the commit history or files on the main branch.

Data storage is available to jobs because they are allowed to manipulate their own git repository thanks to an authentication token. On GitHub GITHUB_TOKEN allows jobs to push to their git repository. This gives jobs a way of storing data.

There are other forms of data storage besides the git repository. Artifacts are files produced by a job that can be downloaded via a URL. Artifacts are subject to expiry time and file size limitations. Caching is available to temporarily store files between job runs, although this data can be lost at any time.

At this point you may be thinking that this is nice but there is no way to store private data if the git repository is public. Git forges offer a secrets mechanism where variables can be stored privately and only made available during job execution. This can be used to stash an encryption key so that the data is stored in public but is encrypted. Anyone can download the encrypted data but they will not have the key needed to decrypt it. Some applications can also take advantage of client-side encryption and homomorphic encryption.

Interacting with the Outside World

Git forges offer a wealth of ways to interact with the outside world, called triggers. Triggers can run a job when an HTTP POST request is made. POST requests typically need a trigger token for authentication, but that token can be published if the triggered job is safe to be invoked by anyone. Both browser clients and Webhooks can invoke triggers through HTTP POST requests.

For example, imagine our RSS reader app needs an API to mark feeds as read. We define an HTTP POST trigger that runs a mark-feed-read.sh script that updates the feeds stored in the data branch. Since only the git repository owner should be able to invoke this trigger we do not publish the trigger token. Instead it is stored encrypted in the repository and the user provides a passphrase for decrypting this secret.

At this point it is also useful to offer a web interface. This is possible thanks to the static pages hosting feature called GitLab Pages and GitHub Pages. Pushing HTML, JavaScript, CSS, and images to a special branch publishes the static website at https://<user>.gitlab.com/<app>/.

For the most part GFAs are asynchronous, they cannot handle HTTP requests synchronously like a web application that outputs an HTTP response. Incoming HTTP requests simply schedule GFA jobs that run sometime later. There are a few ways around this. The browser can poll for results using XMLHttpRequest or similar techniques. Alternatively, a GFA can fire up a job that runs for several minutes although I haven't investigated if there is a practical way to communicate with a web application running in a job (I guess incoming HTTPS is tricky to achieve).

Triggers offer a lot more fun than what I've already covered. They can respond to the creation and modification of wiki pages, issues, and pull request comments. This means it's possible to use those entities for interacting with the outside world. The GFA could act as a chat bot on its issue tracker, for example.

Limitations

Git forges weren't designed for GFAs. But then they weren't initially designed to be Turing complete and offering ways to interact with the outside world either. That was added incrementally as demand for that functionality grew. Maybe git forges will evolve into full-blown serverless hosting competitors to today's cloud hosting providers.

GFAs are a hack that uses features like static pages, CI jobs, and webhooks in a creative way to run applications on a git forge. Building GFAs that are actually useful is likely to hit some challenges:

No synchronous requests - it is hard to implement search queries and other dynamic behavior in GFAs because they are primarily asynchronous (and slow!). This limitation matters for certain classes of applications. A lot can be done client-side instead to make up for this deficiency. But maybe someone can figure out how to do synchronous requests in GFAs.
Security - I have outlined how to make data publicly available and also how to encrypt it so only the git repository owner can view the plaintext. This is enough for personal web applications, but it's not sufficient for multi-user applications. Maybe git forges offer a form of authentication that works with GFAs so multiple users can store private data in a single GFA instance (the git repository owner could view all users' data but other users could not).
Free usage tiers - job execution time, storage capacity, and request throttling will limit how resource-intensive a GFA can be before it outgrows the free tier and eventually even the paid tier.

The first generation of GFAs could be written from scratch with each job carefully designed. Then a second generation of GFAs could be built on top of frameworks that abstract the tedious git forge integration with standard APIs and data models. For example, a Vue.js frontend could use a key/value store API and the whole thing can be hosted as a GFA.

Even if GFAs don't become a thing because git forges decide there is not enough demand to make them work really well, changing your perspective to think of git forges in this way is valuable. For example, I have a git repo for building a container image that I deploy on my server and that pushes output files to a web server host. All of that can be replaced with a git repository that runs a job and publishes to GitLab Pages. This simplifies things and frees me from maintaining infrastructure.

Conclusion

Git forges offer jobs for processing data, git repositories and artifacts for storing data, and triggers for interacting with the outside world. It is possible to build applications that exist solely as a git repository on a git forge. There is no longer a need to deploy code because the git forge itself is powerful enough to run non-trivial applications. I look forward to how this evolves and whether git forges eventually become full-blown cloud service providers.

Friday, November 27, 2020

Call for QEMU Advent Calendar 2020 disk images

QEMU Advent Calendar publishes a disk image surprise each day from December 1-24. It's a QEMU community tradition that is back again this year!

If you want to contribute disk images to this year's advent calendar (puzzles, games, niche operating systems, retro computing fun, etc), please check out the call for disk images for details.

Friday, October 9, 2020

Requirements for out-of-process device emulation

Over the past months I have participated in discussions about out-of-process device emulation. This post describes the requirements that have become apparent. I hope this will be a useful guide to understanding the big picture about out-of-process device emulation.

What is out-of-process device emulation?

Device emulation is traditionally implemented in the program that executes guest code. This approach is natural because accesses to device registers are trapped as part of the CPU run loop that sits at the core of an emulator or virtual machine monitor (VMM).

In some use cases it is advantageous to perform device emulation in separate processes. For example, software-defined network switches can minimize data copies by emulating network cards directly in the switch process. Out-of-process device emulation also enables privilege separation and tighter sandboxing for security.

Why are these requirements important?

When emulated devices are implemented in the VMM they use common VMM APIs. Adding new devices is relatively easy because the APIs are already there and the developer can focus on the device specifics. Out-of-process device emulation potentially leaves developers without APIs since the device emulation program is a separate program that literally starts from main(). Developers want to focus on implementing their specific device, not on solving general problems related to out-of-process device emulation infrastructure.

It is not only a lot of work to implement an out-of-process device completely from scratch, but there is also a risk of developing the wrong solution because some subtleties of device emulation are not obvious at first glance.

I hope sharing these requirements will help in the creation of common infrastructure so it's easy to implement high-quality out-of-process devices.

Not all use cases have the full set of requirements. Therefore it's best if requirements are addressed in separate, reusable libraries so that device implementors can pick the ones that are relevant to them.

Device emulation

Device resources

Devices provide resources that drivers interact with such as hardware registers, memory, or interrupts. The fundamental requirement of out-of-process device emulation is exposing device resources.

The following types of device resources are needed:

Synchronous MMIO/PIO accesses

The most basic device emulation operation is the hardware register access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO) access to the device. A read loads a value from a device register. A write stores a value to a device register. These operations are synchronous because the vCPU is paused until completion.

Asynchronous doorbells

Devices often have doorbell registers, allowing the driver to inform the device that new requests are ready for processing. The vCPU does not need to wait since the access is a posted write.

The kvm.ko ioeventfd mechanism can be used to implement asynchronous doorbells.

Shared device memory

Devices may have memory-like regions that the CPU can access (such as PCI Memory BARs). The device emulation process therefore needs to share a region of its memory space with the VMM so the guest can access it. This mechanism also allows device emulation to busy wait (poll) instead of using synchronous MMIO/PIO accesses or asynchronous doorbells for notifications.

Direct Memory Access (DMA)

Devices often require read and write access to a memory address space belonging to the CPU. This allows network cards to transmit packet payloads that are located in guest RAM, for example.

Early out-of-process device emulation interfaces simply shared guest RAM. The allowed DMA to any guest physical memory address. More advanced IOMMU and address space identifier mechanisms are now becoming ubiquitous. Therefore, new out-of-process device emulation interfaces should incorporate IOMMU functionality.

The key requirement for IOMMU mechanisms is allowing the VMM to grant access to a region of memory so the device emulation process can read from and/or write to it.

Interrupts

Devices notify the CPU using interrupts. An interrupt is simply a message sent by the device emulation process to the VMM. Interrupt configuration is flexible on modern devices, meaning the driver may be able to select the number of interrupts and a mapping (using one interrupt with multiple event sources). This can be implemented using the Linux eventfd mechanism or via in-band device emulation protocol messages, for example.

Extensibility for new bus types

It should be possible to support multiple bus types. vhost-user only supports vhost devices. VFIO is more extensible but currently focussed on PCI devices. It is likely that QEMU SysBus devices will be desirable for implementing ad-hoc out-of-process devices (especially for System-on-Chip target platforms).

Bus-level APIs, not protocol bindings

Developers should not need to learn the out-of-process device emulation protocol (vfio-user, etc). APIs should focus on bus-level concepts such as defining VIRTIO or PCI devices rather than protocol bindings for dealing with protocol messages, file descriptor passing, and shared memory.

In other words, developers should be thinking in terms of the problem domain, not worrying about how out-of-process device emulation is implemented. The protocol should be hidden behind bus-level APIs.

Multi-threading support from the beginning

Threading issues arise often in device emulation because asynchronous requests or multi-queue devices can be implemented using threads. Therefore it is necessary to clearly document what threading models are supported and how device lifecycle operations like reset interact with in-flight requests.

Live migration, live upgrade, and crash recovery

There are several related issues around device state and restarting the device emulation program without disrupting the guest.

Live migration

Live migration transfers the state of a device from one device emulation process to another (typically running on another host). This requires the following functionality:

Quiescing the device

Some devices can be live migrated at any point in time without any preparation, while others must be put into a quiescent state to avoid issues. An example is a storage controller that has a write request in flight. It is not safe to live migration until the write request has completed or been canceled. Failure to wait might result in data corruption if the write takes effect after the destination has resumed execution.

Therefore it is necessary to quiesce a device. After this point there is no further device activity and no guest-visible changes will be made by the device.

Saving/loading device state

It must be possible to save and load device state. Device state includes the contents of hardware registers as well as device-internal state necessary for resuming operation.

It is typically necessary to determine whether the device emulation processes on the migration source and destination are compatible before attempting migration. This avoids migration failure when the destination tries to load the device state and discovers it doesn't support it. It may be desirable to support loading device state that was generated by a different implementation of the same device type (for example, two virtio-net implementations).

Dirty memory logging

Pre-copy live migration starts with an iterative phase where dirty memory pages are copied from the migration source to the destination host. Devices need to participate in dirty memory logging so that all written pages are transferred to the destination and no pages are "missed".

Crash recovery

If the device emulation process crashes it should be possible to restart it and resume device emulation without disrupting the guest (aside from a possible pause during reconnection).

Doing this requires maintaining device state (contents of hardware registers, etc) outside the device emulation process. This way the state remains even if the process crashes and it can be resume when a new process starts.

Live upgrade

It must be possible to upgrade the device emulation process and the VMM without disrupting the guest. Upgrading the device emulation process is similar to crash recovery in that the process terminates and a new one resumes with the previous state.

Device versioning

The guest-visible aspects of the device must be versioned. In the simplest case the device emulation program would have a --compat-version=N command-line option that controls which version of the device the guest sees. When guest-visible changes are made to the program the version number must be increased.

By giving control of the guest-visible device behavior it is possible to save/load and live migrate reliably. Otherwise loading device state in a newer device emulation program could affect the running guest. Guest drivers typically are not prepared for the device to change underneath them and doing so could result in guest crashes or data corruption.

Security

The trust model

The VMM must not trust the device emulation program. This is key to implementing privilege separation and the principle of least privilege. If a compromised device emulation program is able to gain control of the VMM then out-of-process device emulation has failed to provide isolation between devices.

The device emulation program must not trust the VMM to the extent that this is possible. For example, it must validate inputs so that the VMM cannot gain control of the device emulation process through memory corruptions or other bugs. This makes it so that even if the VMM has been compromised, access to device resources and associated system calls still requires further compromising the device emulation process.

Unprivileged operation

The device emulation program should run unprivileged to the extent that this is possible. If special permissions are required to access hardware resources then these resources can sometimes be provided via file descriptor passing by a more privileged parent process.

Sandboxing

Operating system sandboxing mechanisms can be applied to device emulation processes more effectively than monolithic VMMs. Seccomp can limit the Linux system calls that may be invoked. SELinux can restrict access to system resources.

Sandboxing is a common task that most device emulation programs need. Therefore it is a good candidate for a library or launcher tool that is shared by device emulation programs.

Management

Command-line interface

A common command-line interface should be defined where possible. For example, vhost-user's standard --socket-path=PATH argument makes it easy to launch any vhost-user device backend. Protocol-specific options (e.g. socket path) and device type-specific options (e.g. virtio-net) can be standardized.

Some options are necessarily specific to the device emulation program and therefore cannot be standardized.

The advantage of standard options is that management tools like libvirt can launch the device emulation programs without further user configuration.

RPC interface

It may be necessary to issue commands at runtime. Examples include adjusting throttling limits, enabling/disabling logging, etc. These operations can be performed over an RPC interface.

Various RPC interfaces are used throughout open source virtualization software. Adopting a widely-used RPC protocol and standardizing commands is beneficial because it makes it easy to communicate with the software and management tools can support them relatively easily.

Conclusion

This was largely a brain dump but I hope it is useful food for thought as out-of-process device emulation interfaces are designed and developed. There is a lot more to it than simply implementing a protocol for device register accesses and guest RAM DMA. Developing open source libraries in Rust and C that can be used as needed will ensure that out-of-process devices are high-quality and easy for users to deploy.

Sunday, September 27, 2020

On unifying vhost-user and VIRTIO

The recent development of a Linux driver framework called VIRTIO Data Path Acceleration (vDPA) has laid the groundwork for exciting new vhost-user features. The implications of vDPA have not yet rippled through the community so I want to share my thoughts on how the vhost-user protocol can take advantage of new ideas from vDPA.

This post is aimed at developers and assumes familiarity with the vhost-user protocol and VIRTIO. No knowledge of vDPA is required.

vDPA helps with writing drivers for hardware that supports VIRTIO offload. Its goal is to enable hybrid hardware/software VIRTIO devices, but as a nice side-effect it has overcome limitations in the kernel vhost interface. It turns out that applying ideas from vDPA to the vhost-user protocol solves the same issues there. In this article I'll show how extending the vhost-user protocol with vDPA has the following benefits:

Allows existing VIRTIO device emulation code to act as a vhost-user device backend.
Removes the need for shim devices in the virtual machine monitor (VMM).
Replaces undocumented conventions with a well-defined device model.

These things can be done while reusing existing vhost-user and VIRTIO code. In fact, this is especially good news for existing codebases like QEMU because they already have a wealth of vhost-user and VIRTIO code that can now finally be connected together!

Let's look at the advantages of extending vhost-user with vDPA first and then discuss how to do it.

Why extend vhost-user with vDPA?

Reusing VIRTIO emulation code for vhost-user backends

It is a common misconception that a vhost device is a VIRTIO device. VIRTIO devices are defined in the VIRTIO specification and consist of a configuration space, virtqueues, and a device lifecycle that includes feature negotiation. A vhost device is a subset of the corresponding VIRTIO device. The exact subset depends on the device type, and some vhost devices are closer to the full functionality of their corresponding VIRTIO device than others. The most well-known example is that vhost-net devices have rx/tx virtqueues and but lack the virtio-net control virtqueue. Also, the configuration space and device lifecycle are only partially available to vhost devices.

This difference makes it impossible to use a VIRTIO device as a vhost-user device and vice versa. There is an impedance mismatch and missing functionality. That's a shame because existing VIRTIO device emulation code is mature and duplicating it to provide vhost-user backends creates additional work.

If there was a way to reuse existing VIRTIO device emulation code it would be easier to move to a multi-process architecture in QEMU. Want to run --netdev user,id=netdev0 --device virtio-net-pci,netdev=netdev0 in a separate, sandboxed process? Easy, run it as a vhost-user-net device instead of as virtio-net.

Making VMM device shims optional

Today each new vhost device type requires a shim device in the VMM. QEMU has --device vhost-user-blk-pci, --device vhost-user-input-pci, and so on. Why can't there be a single --device vhost-user device?

This limitation is a consequence of the fact that vhost devices are not full VIRTIO devices. In fact, a vhost device does not even have a way to report its device type (net, blk, scsi, etc). Therefore it is impossible for today's VMMs to offer a generic device. Each vhost device type requires a shim device.

In some cases a shim device is desirable because it allows the VMM to handle some aspects of the device instead of passing everything through to the vhost device backend. But requiring shims by default involves lots of tedious boilerplate code and prevents new device types from being used by older VMMs.

Providing a well-defined device model in vhost-user

Although vhost-user works well for users, it is difficult for developers to learn and extend. The protocol does not have a well-defined device model. Each device type has its own undocumented set of protocol messages that are used. For example, the vhost-user-blk device uses the configuration space whereas most other device types do not use the configuration space at all.

Since protocol use is not fully documented in the specification, developers might resort to reading Linux, QEMU, and DPDK code in order to figure out how to make their devices work. They typically have to debug vhost-user protocol messages and adjust their code until it appears to work. Hopefully the critical bugs are caught before the code ships. This is problematic because it's hard to produce high-quality vhost-user implementations.

Although the protocol specification can certainly be cleaned up, the problem is more fundamental. vhost-user badly needs a well-defined device model so that protocol usage is clear and uniform for all device types. The best way to do that is to simply adopt the VIRTIO device model. The VIRTIO specification already defines the device lifecycle and details of the device types. By making vhost-user devices full VIRTIO devices there is no need for additional vhost device specifications. The vhost-user specification just becomes a transport for the established VIRTIO device model. Luckily that is effectively what vDPA has done for kernel vhost ioctls.

How to do this in QEMU

The following QEMU changes are needed to implement vhost-user vDPA support. Below I will focus on vhost-user-net but most of the work is generic and benefits all device types.

Import vDPA ioctls into vhost-user

vDPA extends the Linux vhost ioctl interface. It uses a subset of vhost ioctls and adds new vDPA-specific ioctls that are implemented in the vhost_vdpa.ko kernel module. These new ioctls enable the full VIRTIO device model, including device IDs, the status register, configuration space, and so on.

In theory vhost-user could be fixed without vDPA, but it would involve effectively adding the same set of functionality that vDPA has already added onto kernel vhost. Reusing the vDPA ioctls allows VMMs to support both kernel vDPA and vhost-user with minimal code duplication.

This can be done by adding a VHOST_USER_PROTOCOL_F_VDPA feature bit to the vhost-user protocol. If both the vhost-user frontend and backend support vDPA then all vDPA messages are available. Otherwise they can either fall back on legacy vhost-user behavior or drop the connection.

The vhost-user specification could be split into a legacy section and a modern vDPA-enabled section. The modern protocol will remove vhost-user messages that are not needed by vDPA, simplifying the protocol for new developers while allowing existing implementations to support both with minimal changes.

One detail is that vDPA does not use the memory table mechanism for sharing memory. Instead it relies on the richer IOMMU message family that is optional in vhost-user today. This approach can be adopted in vhost-user too, making the IOMMU code path standard for all implementations and dropping the memory table mechanism.

Add vhost-user vDPA to the vhost code

QEMU's hw/virtio/vhost*.c code supports kernel vhost, vhost-user, and kernel vDPA. A vhost-user vDPA mode must be added to implement the new protocol. It can be implemented as a combination of the vhost-user and kernel vDPA code already in QEMU. Most likely the existing vhost-user code can simply be extended to enable vDPA support if the backend supports it.

Only small changes to hw/net/virtio-net.c and net/vhost-user.c are needed to use vhost-user vDPA with net devices. At that point QEMU can connect to a vhost-user-net device backend and use vDPA extensions.

Add a vhost-user vDPA VIRTIO transport

Next a vhost-user-net device backend can be put together using QEMU's virtio-net emulation. A translation layer is needed between the vhost-user protocol and the VirtIODevice type in QEMU. This can be done by implementing a new VIRTIO transport alongside the existing pci, mmio, and ccw transports. The transport processes vhost-user protocol messages from the UNIX domain socket and invokes VIRTIO device emulation code inside QEMU. It acts as a VIRTIO bus so that virtio-net-device, virtio-blk-device, and other device types can be plugged in.

This piece is the most involved but the vhost-user protocol communication part was already implemented in the virtio-vhost-user prototype that I worked on previously. Most of the communication code can be reused and the remainder is implementing the VirtioBusClass interface.

To summarize, a new transport acts as the vhost-user device backend and invokes QEMU VirtIODevice methods in response to vhost-user protocol messages. The syntax would be something like --device virtio-net-device,id=virtio-net0 --device vhost-user-backend,device=virtio-net0,addr.type=unix,addr.path=/tmp/vhost-user-net.sock.

Where this converges with multi-process QEMU

At this point QEMU can run ad-hoc vhost-user backends using existing VIRTIO device models. It is possible to go further by creating a qemu-dev launcher executable that implements the vhost-user spec's "Backend program conventions". This way a minimal device emulator executable hosts the device instead of a full system emulator.

The requirements for this are similar to the multi-process QEMU effort, which aims to run QEMU devices as separate processes. One of the main open questions is how to design build system and Kconfig support for building minimal device emulator executables.

In the case of vhost-user-net the qemu-dev-vhost-user-net executable would contain virtio-net-device, vhost-user-backend, any netdevs the user wishes to include, a QMP monitor, and a vhost-user backend command-line interface.

Where does this leave us? QEMU's existing VIRTIO device models can be used as vhost-user devices and run in a separate processes from the VMM. It's a great way of reusing code and having the flexibility to deploy it in the way that makes most sense for the intended use case.

Conclusion

The vhost-user protocol can be simplified by adopting the vhost vDPA ioctls that have recently been introduced in Linux. This turns vhost-user into a VIRTIO transport and vhost-user devices become full VIRTIO devices. Existing VIRTIO device emulation code can then be reused in vhost-user device backends.

Thursday, September 17, 2020

Initial applications for Outreachy December-March internships Sept 20

QEMU is participating in the Outreachy December-March open source internship program. The internships are 12 weeks of full-time paid remote work contributing to QEMU/KVM. For more information about eligibility and what the internships are like, please see the Outreachy website.

The initial application deadline is September 20 and it only takes 5-30 minutes to complete: https://www.outreachy.org/apply/

If you or someone you know are considering doing an internship with QEMU, Linux kernel, or another participating organization, please remember to submit an initial application by September 20th!

Saturday, August 29, 2020

Using kcov code coverage with meson

The meson build system has built-in code coverage support, making it easy to identify lines of code that are not exercised by tests. Meson's code coverage support works with the gcov-based tools gcovr and lcov. This post shows how to use kcov with meson instead so that code coverage can be reported when gcov is unavailable.

How do code coverage tools work?

The gcov-based tools rely on compiler instrumentation, which both gcc and llvm support. Special compiler options instruct the compiler to emit instrumentation in every compiled function in order to record which lines of code are reached.

The kcov tool takes a different approach that does not require compiler support. It uses run-time instrumentation (like breakpoints) instead of compile-time instrumentation. This makes it possible to use kcov on existing binaries without recompilation, as long as debug information is available. The tool maps program instructions to lines of source code using the debug information.

There are pros and cons regarding exact features, performance, limitations, etc. For the most part the gcov approach works well when recompilation is possible and the compiler supports gcov. In other cases kcov is needed.

How to run meson tests under kcov

Meson's built-in code coverage support is designed for gcov and therefore works as a post-processing step after meson test was run. The workflow is different with kcov since the test itself must be run under kcov so it can instrument the process.

Run meson test as follows to get per-test coverage results:

$ meson test --wrapper='kcov kcov-output'

The $BUILD_DIR/kcov-output/ directory will contain the coverage results, one set for each test that was run.

Merging coverage results

If your goal is a single coverage percentage for the entire test suite, then the per-test results need to be merged. The follow wrapper script can be used:

$ cat kcov-wrapper.sh
#!/bin/sh
test_name=$(basename $1)
exec kcov kcov-runs/$test_name "$@"

And it is invoked like this:

$ rm -rf $BUILD_DIR/kcov-runs
$ mkdir $BUILD_DIR/kcov-runs
$ meson test --wrapper "$SOURCE_DIR/kcov-wrapper.sh"
$ rm -rf $BUILD_DIR/kcov-output
$ kcov --merge $BUILD_DIR/kcov-output $BUILD_DIR/kcov-runs/*

The merged results are located in the $BUILD_DIR/kcov-output/ directory.

Conclusion

Meson already has built-in support for gcov-based code coverage. If you cannot use gcov, then kcov is an alternative that is fairly easy to integrate into a meson project.

Monday, August 24, 2020

QEMU Internals: Event loops

This post explains event loops in QEMU v5.1.0 and their unique features compared to other event loop implementations. The APIs are not covered in detail here since they are explained in doc comments. Instead, the focus is on the big picture and why things work the way they do.

Event loops are central to many I/O-bound applications like network services and graphical desktop applications. QEMU also has I/O-bound work that fits well into an event loop. Examples include the QMP monitor, disk I/O, and timers.

An event loop monitors event sources for activity and invokes a callback function when an event occurs. This makes it possible to process multiple event sources within a single CPU thread. The application can appear to do multiple things at once without multithreading because it switches between handling different event sources. This architecture is common in Javascript, Python Twisted/asyncio, and many other environments. Sometimes the event loop is hidden underneath coroutines or async/await language features (QEMU has coroutines but often the event loop is still used directly).

The most important event sources in QEMU are:

File descriptors such as sockets and character devices.
Event notifiers (implemented as eventfds on Linux).
Timers for delayed function execution.
Bottom-halves (BHs) for invoking a function in another thread or deferring a function call to avoid reentrancy.

Event loops and threads

QEMU has several different types of threads:

vCPU threads that execute guest code and perform device emulation synchronously with respect to the vCPU.
The main loop that runs the event loops (yes, there is more than one!) used by many QEMU components.
IOThreads that run event loops for device emulation concurrently with vCPUs and "out-of-band" QMP monitor commands.

It's worth explaining how device emulation interacts with threads. When guest code accesses a device register the vCPU thread traps the access and dispatches it to an emulated device. The device's read/write function runs in the vCPU thread. The vCPU thread cannot resume guest code execution until the device's read/write function returns. This means long-running operations like emulating a timer chip or disk I/O cannot be performed synchronously in the vCPU thread since they would block the vCPU. Most devices solve this problem using the main loop thread's event loops. They add timer or file descriptor monitoring callbacks to the main loop and return back to guest code execution. When the timer expires or the file descriptor becomes ready the callback function runs in the main loop thread. The final part of emulating a guest timer or disk access therefore runs in the main loop thread and not a vCPU thread.

Some devices perform the guest device register access in the main loop thread or an IOThread thanks to ioeventfd. ioeventfd is a Linux KVM API and also has a userspace fallback implementation for TCG that traps vCPU device accesses and writes to a file descriptor so another thread can handle the access.

The key point is that vCPU threads do not run an event loop. The main loop thread and IOThreads run event loops. vCPU threads can add event sources to the main loop or IOThread event loops. Callbacks run in the main loop thread or IOThreads.

How the main loop and IOThreads differ

The main loop and IOThreads share some code but are fundamentally different. The common code is called AioContext and is QEMU's native event loop API. Commonly-used functions include aio_set_fd_handler(), aio_set_event_handler(), aio_timer_init(), and aio_bh_new().

The main loop actually has a glib GMainContext and two AioContext event loops. QEMU components can use any of these event loop APIs and the main loop combines them all into a single event loop function os_host_main_loop_wait() that calls qemu_poll_ns() to wait for event sources. This makes it possible to combine glib-based code with code using the native QEMU AioContext APIs.

The reason why the main loop has two AioContexts is because one, called iohandler_ctx, is used to implement older qemu_set_fd_handler() APIs whose handlers should not run when the other AioContext, called qemu_aio_context, is run using aio_poll(). The QEMU block layer and newer code uses qemu_aio_context while older code uses iohandler_ctx. Over time it may be possible to unify the two by converting iohandler_ctx handlers to safely execute in qemu_aio_context.

IOThreads have an AioContext and a glib GMainContext. The AioContext is run using the aio_poll() API, which enables the advanced features of the event loop. If a glib event loop is needed then the GMainContext can be run using g_main_loop_run() and the AioContext event sources will be included.

Code that relies on the AioContext aio_*() APIs will work with both the main loop and IOThreads. Older code using qemu_*() APIs only works with the main loop. glib code works with both the main loop and IOThreads.

The key difference between the main loop and IOThreads is that the main loop uses a traditional event loop that calls qemu_poll_ns() while IOThreads AioContext aio_poll() has advanced features that result in better performance.

AioContext features

AioContext has the following event loop features that traditional event loops do not have:

Adaptive polling support for lower latency but slightly higher CPU consumption. AioContext event sources can have a userspace polling function that detects events without performing syscalls (e.g. peeking at a memory location). This allows the event loop to avoid block syscalls that might lead the host kernel scheduler to yield the thread and put the physical CPU into a low power state. Keeping the CPU busy and avoiding entering the kernel minimizes latency.
O(1) time complexity with respect to the number of monitored file descriptors. When there are thousands of file descriptors O(n) APIs like poll(2) spend time scanning over all file descriptors, even those that have no activity. This scalability bottleneck can be avoided with Linux io_uring and epoll APIs, both of which are supported by AioContext aio_poll(2).
Nanosecond timers. glib's event loop only has millisecond timers, which is not sufficient for emulating hardware timers.

These features are required for performance reasons. Unfortunately glib's event loop does not support them, otherwise QEMU could use GMainContext as its only event loop.

Conclusion

QEMU uses both its native AioContext event loop and glib's GMainContext. The QEMU main loop and IOThreads work differently, with IOThreads offering the best performance thanks to its AioContext aio_poll() event loop. Modern QEMU code should use AioContext APIs for optimal performance and so that the code can be used in both the main loop and IOThreads.

Thursday, August 6, 2020

Why QEMU should move from C to Rust

Welcome Redditors and HackerNews folks! This post is getting attention outside the QEMU community, so I'd like to highlight two things that may not be immediately clear: I am a QEMU maintainer and I'm not advocating to Rewrite It In Rust. Enjoy! :)

My KVM Forum 2018 presentation titled Security in QEMU: How Virtual Machines provide Isolation (pdf) (video) reviewed security bugs in QEMU and found the most common causes were C programming bugs. This includes buffer overflows, use-after-free, uninitialized memory, and more. In this post I will argue for using Rust as a safer language that prevents these classes of bugs.

In 2018 the choice of a safer language was not clear. C++ offered safe abstractions without an effective way to prohibit unsafe language features. Go also offered safety but with concerns about runtime costs. Rust looked promising but few people had deep experience with it. In 2018 I was not able to argue confidently for moving away from C in QEMU.

Now in 2020 the situation is clearer. C programming bugs are still the main cause of CVEs in QEMU. Rust has matured, its ecosystem is growing and healthy, and there are virtualization projects like Crosvm, Firecracker, and cloud-hypervisor that prove Rust is an effective language for writing Virtual Machine Monitors (VMM). In the QEMU community Paolo Bonzini and Sergio Lopez's work on rust-vmm and vhost-user code inspired me to look more closely at moving away from C.

Do we need to change programming language?

Most security bugs in QEMU are C programming bugs. This is easy to verify by looking through the CVE listings. Although I have only reviewed CVEs it seems likely that non-security bugs are also mostly C programming bugs.

Eliminating C programming bugs does not necessarily require switching programming languages. Other approaches to reducing bug rates in software include:

Coding style rules that forbid unsafe language features.
Building safe abstractions and prohibiting unsafe language features or library APIs.
Static checkers that scan source code for bugs.
Dynamic sanitizers that run software with instrumentation to identify bugs.
Unit testing and fuzzing.

The problem is, the QEMU community has been doing these things for years but new bugs are still introduced despite these efforts. It is certainly possible to spend more energy on these efforts but the evidence shows that bugs continue to slip through.

There are two issues with these approaches to reducing bugs. First, although these approaches help find existing bugs, eliminating classes of bugs so they cannot exist in the first place is a stronger approach. This is hard to do with C since the language is unsafe, placing the burden of safety on the programmer.

Second, much of the ability to write safe C code comes with experience. Custom conventions, APIs, tooling, and processes to reduce bugs is a hurdle for one-time contributors or newcomers. It makes the codebase inaccessible unless we accept lower standards for some contributors. Code quality should depend as little on experience as possible but C is notorious for being a programming language that requires a lot of practice before you can write production-quality code.

Why Rust?

Safe languages eliminate memory safety bugs (and other classes like concurrency bugs). Rust made this a priority in its design:

Use-after-free, double-free, memory leaks, and other lifetime bugs are prevented at compile-time by the borrow checker where the compiler checks ownership of data.
Buffer overflows and other memory corruptions are prevented by compile-time and runtime bounds-checking.
Pointer deference bugs are prevented by the absense of NULL pointers and strict ownership rules.
Uninitialized memory is prevented because all variables and fields must be initialized.

Rust programs can still "panic" at runtime when safety cannot be proven at compile time but this does not result in undefined behavior as seen in C programs. The program simply aborts with a backtrace. Bugs that could have resulted in arbitrary code execution in C become at most denial-of-service bugs in Rust. This reduces the severity of bugs.

As a result of this language design most C programming bugs that plague QEMU today are either caught by the compiler or turn into a safe program termination. It is reasonable to expect CVEs to reduce in number and in severity when switching to Rust.

At the same time Rust eliminates the need for many of the measures that the QEMU community added onto C because the Rust programming language and its compiler already enforce safety. This means newcomers and one-time contributors will not need QEMU-specific experience, can write production-quality code more easily, and can get their code merged more quickly. It also means reviewers will have to spend less time pointing out C programming bugs or asking for changes that comply with QEMU's way of doing things.

That said, Rust has a reputation for being a scary language due to the borrow checker. Most programmers have not thought about object lifetimes and ownership as systematically and explicitly as required by Rust. This raises the bar to learning the language, but I look at it this way: learning Rust is humanly possible, writing bug-free C code is not.

How can we change programming language?

When I checked in 2018 QEMU was 1.5 million lines of code. It has grown since then. Moving a large codebase to a new programming language is extremely difficult. If people want to convert QEMU to Rust that would be great, but I personally don't have the appetite to do it because I think the integration will be messy, result in a lot of duplication, and there is too much un(der)maintained code that is hard to convert.

The reason I am writing this post is because device emulation, the main security attack surface for VMMs, can be done in a separate program. That program can be written in any language and this is where Rust comes in. For vhost devices it is possible to write Rust device backends today and I hope this will become the default approach to writing new devices.

For non-vhost devices the vfio-user project is working on an interface out-of-process device emulation. It will be possible to implement devices in Rust there too.

If you are implementing new device emulation code please consider doing it in Rust!

Conclusion

Most security bugs in QEMU today are C programming bugs. Switching to a safer programming language will significantly reduce security bugs in QEMU. Rust is now mature and proven enough to use as the language for device emulation code. Thanks to vhost-user and vfio-user using Rust for device emulation does not require a big conversion of QEMU code, it can simply be done in a separate program. This way attack surfaces can be written in Rust to make them less susceptible to security bugs going forward.

Saturday, July 18, 2020

Rethinking event loop integration for libraries

APIs for operations that take a long time are often asynchronous so that applications can continue with other tasks while an operation is running. Asynchronous APIs initiate an operation and then return immediately. The application is notified when the operation completes through a callback or by monitoring a file descriptor for activity (for example, when data arrives on a TCP socket).

Asynchronous applications are usually built around an event loop that waits for the next event and invokes a function to handle the event. Since the details of event loops differ between applications, libraries need to be designed carefully to integrate well with a variety of event loops.

The current model

A popular library with asynchronous APIs is the libcurl file transfer library that is used for making HTTP requests. It has the following (slightly simplified) event loop integration API:

#define CURL_WAIT_POLLIN    0x0001   /* Ready to read? */
#define CURL_WAIT_POLLOUT   0x0004   /* Ready to write? */

int socket_callback(CURL *easy,      /* easy handle */
                    int fd,          /* socket */
                    int what,        /* describes the socket */
                    void *userp,     /* private callback pointer */
                    void *socketp);  /* private socket pointer */

libcurl invokes the applications socket_callback() to start or stop monitoring file descriptors. When the application's event loop detects file descriptor activity, the application invokes libcurl's curl_multi_socket_action() API to let the library process the file descriptor.

There are variations on this theme but generally libraries expose file descriptors and event flags (read/write/error) so the application can monitor file descriptors from its own event loop. The library then performs the read(2) or write(2) call when the file descriptor becomes ready.

How io_uring changes the picture

The Linux io_uring API (pdf) can be used to implement traditional event loops that monitor file descriptors. But it also supports asynchronous system calls like read(2) and write(2) (best used when IORING_FEAT_FAST_POLL is available). The latter is interesting because it combines two syscalls into a single efficient syscall:

Waiting for file descriptor activity.
Reading/writing the file descriptor.

Existing applications use syscalls like epoll_wait(2), poll(2), or the old select(2) to wait for file descriptor activity. They can also use io_uring's IORING_OP_POLL_ADD to achieve the same effect.

After the file descriptor becomes ready, a second syscall like read(2) or write(2) is required to actually perform I/O.

io_uring's asynchronous IORING_OP_READ or IORING_OP_WRITE (including variants for vectored I/O or sockets) only requires a single io_uring_enter(2) call. If io_uring sqpoll is enabled then a syscall may not even be required to submit these operations!

To summarize, it's more efficient to perform a single asynchronous read/write instead of first monitoring file descriptor activity and then performing a read(2) or write(2).

A new model

Existing library APIs do not fit the asynchronous read/write approach because they expect the application to wait for file descriptor activity and then for the library to invoke a syscall to perform I/O. A new model is needed where the library tells the application about I/O instead of asking the application to monitor file descriptors for activity.

The library can use a new callback API that lets the application perform asynchronous I/O:

/*
 * The application invokes this callback when an aio operation has completed.
 *
 * @cb_arg: the cb_arg passed to a struct aio_operations function by the library
 * @ret: the return value of the aio operation (negative errno for failure)
 */
typedef void aio_completion_fn(void *cb_arg, ssize_t ret);

/*
 * Asynchronous I/O operation callbacks provided to the library by the
 * application.
 *
 * These functions initiate an I/O operation and then return immediately. When
 * the operation completes the @cb callback is invoked with @cb_arg. Note that
 * @cb may be invoked before the function returns (typically in the case of an
 * early error).
 */
struct aio_operations {
    void read(int fd, void *data, size_t len, aio_completion_fn *cb,
              void *cb_arg);
    void write(int fd, void *data, size_t len, aio_completion_fn *cb,
               void *cb_arg);
    ...
};

The concept of monitoring file descriptor activity is gone. Instead the API focusses on asynchronous I/O operations that can be implemented by the application however it sees fit.

Applications using io_uring can use IORING_OP_READ and IORING_OP_WRITE to implement asynchronous operations efficiently. Traditional applications can still use their event loops but now also perform the read(2), write(2), etc syscalls on behalf of the library.

Some libraries don't need a full set of struct aio_operations callbacks because they only perform I/O in limited ways. For example, a library that only has a Linux eventfd can instead present this simplified API:

/*
 * Return an eventfd(2) file descriptor that the application must read from and
 * call lib_eventfd_fired() when a non-zero value was read.
 */
int lib_get_eventfd(struct libobject *obj);

/*
 * The application must call this function when the eventfd returned by
 * lib_get_eventfd() read a non-zero value.
 */
void lib_eventfd_fired(struct libobject *obj);

Although this simplified API is similar to traditional event loop integration APIs it is now the application's responsibility to perform the eventfd read(2), not the library's. This way applications using io_uring can implement the read efficiently.

Does an extra syscall matter?

Whether it is worth eliminating the extra syscall depends on one's performance requirements. When I/O is relatively infrequent then the overhead of the additional syscall may not matter.

While working on QEMU I found that the extra read(2) on eventfds causes a measurable overhead.

Conclusion

Splitting file descriptor monitoring from I/O is suboptimal for Linux io_uring applications. Unfortunately, existing library APIs are often designed in this way. Letting the application perform asynchronous I/O on behalf of the library allows a more efficient implementation with io_uring while still supporting applications that use older event loops.

Thursday, July 2, 2020

Avoiding bitrot in C macros

A common approach to debug messages that can be toggled at compile-time in C programs is:

#ifdef ENABLE_DEBUG
#define DPRINTF(fmt, ...) do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
#else
#define DPRINTF(fmt, ...)
#endif

Usually the ENABLE_DEBUG macro is not defined in normal builds, so the C preprocessor expands the debug printfs to nothing. No messages are printed at runtime and the program's binary size is smaller since no instructions are generated for the debug printfs.

This approach has the disadvantage that it suffers from bitrot, the tendency for source code to break over time when it is not actively built and used. Consider what happens when one of the variables used in the debug printf is not updated after being renamed:

- int r;
+ int radius;
  ...
  DPRINTF("radius %d\n", r);

The code continues to compile after r is renamed to radius because the DPRINTF() macro expands to nothing. The compiler does not syntax check the debug printf and misses that the outdated variable name r is still in use. When someone defines ENABLE_DEBUG months or years later, the compiler error becomes apparent and that person is confronted with fixing a new bug on top of whatever they were trying to debug when they enabled the debug printf!

It's actually easy to avoid this problem by writing the macro differently:

#ifndef ENABLE_DEBUG
#define ENABLE_DEBUG 0
#endif
#define DPRINTF(fmt, ...) do { \
        if (ENABLE_DEBUG) { \
            fprintf(stderr, fmt, ## __VA_ARGS__); \
        } \
    } while (0)

When ENABLE_DEBUG is not defined the macro expands to:

do {
    if (0) {
        fprintf(stderr, fmt, ...);
    }
} while (0)

What is the difference? This time the compiler parses and syntax checks the debug printf even when it is disabled. Luckily compilers are smart enough to eliminate deadcode, code that cannot be executed, so the binary size remains small.

This applies not just to debug printfs. More generally, all preprocessor conditionals suffer from bitrot. If an #if ... #else ... #endif can be replaced with equivalent unconditional code then it's often worth doing.

Friday, May 22, 2020

How to check VIRTIO feature bits inside Linux guests

VIRTIO devices have feature bits that indicate the presence of optional features. The feature bit space is divided into core VIRTIO features (e.g. notify on empty), transport-specific features (PCI, MMIO, CCW), and device-specific features (e.g. virtio-net checksum offloading). This article shows how to check whether a feature is enabled inside Linux guests.

The feature bits are used during VIRTIO device initialization to negotiate features between the device and the driver. The device reports a fixed set of features, typically all the features that the device implementors wanted to offer from the VIRTIO specification version that they developed against. The driver also reports features, typically all the features that the driver developers wanted to offer from the VIRTIO specification version that they developed against.

Feature bit negotiation determines the subset of features supported by both the device and the driver. A new driver might not be able to enable all the features it supports if the device is too old. The same is true vice versa. This offers compatibility between devices and drivers. It also means that you don't know which features are enabled until the device and driver have negotiated them at runtime.

Where to find feature bit definitions

VIRTIO feature bits are listed in the VIRTIO specification. You can also grep the linux/virtio-*.h header files:

$ grep VIRTIO.*_F_ /usr/include/linux/virtio_*.h
virtio_ring.h:#define VIRTIO_RING_F_INDIRECT_DESC 28
virtio_ring.h:#define VIRTIO_RING_F_EVENT_IDX  29
virtio_scsi.h:#define VIRTIO_SCSI_F_INOUT                    0
virtio_scsi.h:#define VIRTIO_SCSI_F_HOTPLUG                  1
virtio_scsi.h:#define VIRTIO_SCSI_F_CHANGE                   2
...

Here the VIRTIO_SCSI_F_INOUT (0) constant is for the 1st bit (1ull << 0). Bit-numbering can be confusing because different standards, vendors, and languages express it differently. Here it helps to think of a bit shift operation like 1 << BIT.

How to check feature bits inside the guest

The Linux virtio.ko driver that is used for all VIRTIO devices has a sysfs file called features. This file contains the feature bits in binary representation starting with the 1st bit on the left and more significant bits to the right. The reported bits are the subset that both the device and the driver support.

To check if the virtio-blk device /dev/vda has the VIRTIO_RING_F_EVENT_IDX (29) bit set:

$ python -c "print('$(</sys/block/vda/device/driver/virtio*/features)'[29])"
01100010011101100000000000100010100

Other device types can be found through similar sysfs paths.