Friday, December 11, 2020

Building a git forge using git apps

The two previous blog posts about why git forges are von Neumann machines and the Radicle peer-to-peer git forge explored models for git forges. In this final post I want to cover yet another model that draws from the previous ones but has its own unique twist.

Peer-to-peer git apps

I previously showed how applications can be built on centralized git forges using CI/CD functionality for executing code, webhooks for interacting with the outside world, and disjoint branches for storing data.

A more elegant architecture is a peer-to-peer one where instead of many clients and one server there are just peers. Each peer has full access to the data. There is no client/server application code split; instead, each peer runs an application for itself.

First, this makes it easier to move the data to new hosting infrastructure or fork a project since all data resides in the git repository. Merge requests, issues, wikis, and even the app settings are all stored in the git repo itself.

Second, this gives more power to the users who can process data however they want without being limited by the server's API. All peers are on equal footing and users don't need permission to alter applications, because they run locally.

Finally, it is easier to develop a local application than a client/server application. Being able to open a file and tweak the code is immediate and less hassle than testing and deploying a server-side application.

Internet peer-to-peer systems typically still require some central point for bootstrapping and this is no exception. A publicly-accessible git repository is still needed so that peers can fetch and push changes. However, in this model the git server does not run application code, yet "git apps" like merge requests, issue trackers, wikis, etc. can still be implemented. Here is how it works...

The anti-application server

The git server is not allowed to run application code in our model, so apps like merge requests won't be processing data on the server side. However, the repository does need some primitives to make peer-to-peer git apps possible. These primitives are access control policies for refs and directories/files.

Peers run applications locally and the git server is "dumb" with the sole job of enforcing access control. You can imagine this like a multi-user UNIX machine where users have access to a shared directory. UNIX file permissions determine how processes can access the data. By choosing permissions carefully, multiple users can collaborate in the shared directory in a safe and controlled manner.

This is an anti-application server because no application code runs on the server side. The server is just a git repository that stores data and enforces access control on git push.

Access control

Repositories that accept push requests need a pre-receive hook (see githooks(5)) that checks incoming requests against the access control policy. If the request complies with the access control policy then the git push is accepted. Otherwise the git push is rejected and changes are not made to the git repository.
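To give a feel for what such a hook has to work with, here is a minimal sketch of the input parsing only. githooks(5) documents that pre-receive reads one line per updated ref on standard input in the form "<old-value> <new-value> <ref-name>". The RefUpdate type and the parse_updates() helper are names I made up for this illustration, and the actual policy check is left out.

```python
from typing import Iterable, List, NamedTuple


class RefUpdate(NamedTuple):
    old: str  # object name the ref points to now (all zeros if the ref is new)
    new: str  # object name being pushed (all zeros if the ref is being deleted)
    ref: str  # full ref name, e.g. refs/heads/foo


def parse_updates(lines: Iterable[str]) -> List[RefUpdate]:
    """Parse pre-receive standard input into one RefUpdate per line."""
    updates = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Ref names cannot contain spaces, so splitting twice is enough.
        old, new, ref = line.split(" ", 2)
        updates.append(RefUpdate(old, new, ref))
    return updates
```

In a real hook this would be called as parse_updates(sys.stdin), and the hook would exit with a non-zero status to reject the push.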

The first type of access control is on git refs. Git refs are the namespace where branches and tags are stored in a git repository. If a rule's regular expression matches the ref and the rule lists the operation type (create-branch, create-tag, fast-forward, force, delete) then the operation is allowed. For example, this policy rule allows any user to create and fast-forward refs/heads/foo but force pushes and deletion are not allowed:

anyone create-branch,fast-forward ^heads/foo$

The operations available on refs include:

Operation      Description
create-branch  Push a new branch that doesn't exist yet
create-tag     Push a new tag that doesn't exist yet
fast-forward   Push a commit that is a descendant of the current commit
force          Push a commit or tag replacing the previous ref
delete         Delete a ref
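A hook could derive the operation from the old and new object names it receives for each ref. The following is only a sketch under my own assumptions: classify_ref_update() and its parameters are hypothetical names, and since I have not specified how a tag is told apart from a branch, the caller passes that in as is_tag. The ancestry test uses git merge-base --is-ancestor, which exits with status 0 when the first commit is an ancestor of the second.

```python
import subprocess
from typing import Callable


def git_is_ancestor(old: str, new: str) -> bool:
    """True if commit `old` is an ancestor of `new` (must run inside the repo)."""
    result = subprocess.run(["git", "merge-base", "--is-ancestor", old, new])
    return result.returncode == 0


def is_null(object_name: str) -> bool:
    """Git uses an all-zero object name to mean 'no object'."""
    return set(object_name) == {"0"}


def classify_ref_update(
    old: str,
    new: str,
    is_tag: bool,
    is_ancestor: Callable[[str, str], bool] = git_is_ancestor,
) -> str:
    """Map one ref update onto an operation name from the table above."""
    if is_null(new):
        return "delete"
    if is_null(old):
        return "create-tag" if is_tag else "create-branch"
    if is_ancestor(old, new):
        return "fast-forward"
    # Anything else replaces the ref with unrelated or rewritten history.
    return "force"
```

Any error from git merge-base is treated as "not an ancestor", so the update falls into the stricter force category rather than being waved through.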

What's more interesting is that $user_id is expanded to the git push user's identifier so we can write rules to limit access to per-user ref namespaces:

anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$

This would allow Alice to push her own branches but Alice could not push to Bob's branches.
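Here is a minimal sketch of how a single rule with $user_id expansion might be evaluated. The Rule type, parse_rule() and rule_allows() are hypothetical names for this illustration, the key IDs are made up, and group membership is ignored because the rule applies to anyone.

```python
import re
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    group: str
    operations: FrozenSet[str]
    pattern: str  # regular expression, may contain the literal text $user_id


def parse_rule(line: str) -> Rule:
    """Parse a '<group> <op>[,<op>...] <regex>' policy line."""
    group, ops, pattern = line.split(None, 2)
    return Rule(group, frozenset(ops.split(",")), pattern.strip())


def rule_allows(rule: Rule, user_id: str, operation: str, name: str) -> bool:
    """Check one operation on one ref name against one rule."""
    if operation not in rule.operations:
        return False
    # Substitute the pusher's ID, escaped so it can only match literally.
    pattern = rule.pattern.replace("$user_id", re.escape(user_id))
    return re.search(pattern, name) is not None


ALICE = "1111AAAA2222BBBB"  # made-up key IDs
BOB = "3333CCCC4444DDDD"
rule = parse_rule(
    "anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$"
)
print(rule_allows(rule, ALICE, "force", f"heads/{ALICE}/topic"))  # True
print(rule_allows(rule, ALICE, "force", f"heads/{BOB}/topic"))    # False
```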

We have covered how to define access control policies on refs. Access control policies are also needed on branches so that multiple users can modify the same branch in a controlled and safe manner. The syntax is similar but the policy applies to changes made by commits to directories/files (what git calls a tree). The following allows users to create files in a directory but not delete or modify them (somewhat similar to the UNIX restricted deletion or "sticky" bit on world-writable directories):

anyone create-file ^shared-dir/.*$

The operations available on branches include:

Operation         Description
create-directory  Create a new directory
create-file       Create a new file
create-symlink    Create a symlink
modify            Change an existing file or symlink
delete-file       Delete a file
...
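One way a hook might find out which of these operations a push performs is to compare the old and new trees with git diff-tree -r --raw, which prints lines of the form ":<old-mode> <new-mode> <old-hash> <new-hash> <status>" followed by a tab and the path. The sketch below is only an illustration: classify_raw_line() is a hypothetical name, mapping every deletion to delete-file is my assumption, and paths are assumed not to need quoting. Git does not track directories, so create-directory would have to be inferred from paths and is left out here.

```python
from typing import Tuple

SYMLINK_MODE = "120000"  # git's file mode for symbolic links


def classify_raw_line(line: str) -> Tuple[str, str]:
    """Turn one raw diff line into an (operation, path) pair."""
    meta, path = line.rstrip("\n").split("\t", 1)
    _old_mode, new_mode, _old_hash, _new_hash, status = meta.lstrip(":").split(" ")
    if status == "A":
        operation = "create-symlink" if new_mode == SYMLINK_MODE else "create-file"
    elif status == "M":
        operation = "modify"
    elif status == "D":
        operation = "delete-file"
    else:
        # Type changes and other statuses are not covered by this sketch.
        raise ValueError(f"unsupported change status: {status}")
    return operation, path
```

Each resulting pair can then be checked against the branch policy in the same way as a ref operation.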

$user_id expansion is also available for branch access control. Here the user can create, modify, and delete files in a per-user directory:

anyone create-file,modify,delete-file ^$user_id/.*$

User IDs

You might be wondering how user identifiers work. Git supports GPG-signed push requests with git push --signed. We can use the GPG key ID as the user identifier, eliminating the need for centralized user accounts. Remember that the GPG key ID is based on the public key. Key pairs are randomly generated and it is improbable that the same key will be generated by two different users. That said, GPG key ID uniqueness has been weak in the past when the default size was 32 bits. Git explicitly enables long 64-bit GPG key IDs but I wonder if collisions could be a problem. Maybe an ID with more bits based on the public key should be used instead, but for now let's assume the GPG key ID is unique.
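To make the size concern concrete: for OpenPGP version 4 keys the long key ID is simply the low-order 64 bits of the 160-bit fingerprint, and the old short form is the low-order 32 bits. A small sketch, using a made-up fingerprint and hypothetical helper names:

```python
def long_key_id(fingerprint: str) -> str:
    """Return the 64-bit key ID (16 hex digits) of a v4 key fingerprint."""
    hex_digits = fingerprint.replace(" ", "").upper()
    if len(hex_digits) != 40:
        raise ValueError("expected a 160-bit fingerprint (40 hex digits)")
    return hex_digits[-16:]


def short_key_id(fingerprint: str) -> str:
    """Return the legacy 32-bit key ID (8 hex digits)."""
    return long_key_id(fingerprint)[-8:]


FINGERPRINT = "0123 4567 89AB CDEF 0123 4567 89AB CDEF 0123 4567"  # made up
print(long_key_id(FINGERPRINT))   # 89ABCDEF01234567
print(short_key_id(FINGERPRINT))  # 01234567
```

Using the full fingerprint as the user identifier would be one way to get "an ID with more bits", at the cost of even longer raw IDs.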

The downside of this approach is that user IDs are not human-friendly. Git apps can allow the user to assign aliases to avoid displaying raw user IDs. Doing this automatically requires either an external ID issuer, such as confirming email address ownership, which is tedious for new users, or a registry of usernames stored in the git repo, which means a first-come-first-served policy for username allocation and possible conflicts when merging from two repositories that don't share history. Due to these challenges I think it makes sense to use raw GPG key IDs at the data storage level and make them prettier at the user interface level.

The GPG key ID approach works well for desktop clients but not for web clients. The web application (even if implemented on the client side) would need access to the private key so it can push to the git repository. Users should not trust remotely hosted web applications with their private keys. Maybe there is a standard Web API that can help but I'm not aware of one. More thought is needed here.

The pre-receive git hook checks that signature verification passed and has access to the GPG key ID in the GIT_PUSH_CERT_KEY environment variable. Then the access control policy can be checked.
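A minimal sketch of that check might look like the following. The environment variable names come from git's documentation for signed pushes; the pusher_key_id() helper is a hypothetical name, and accepting only the "G" (good signature) status is my assumption. A server that has imported keys without assigning trust to them may see "U" instead and would need to decide whether to accept it.

```python
from typing import Mapping, Optional


def pusher_key_id(env: Mapping[str, str]) -> Optional[str]:
    """Return the pusher's GPG key ID, or None if the push is not trusted."""
    # Same letters as the %G? format of git log: G means a good signature.
    if env.get("GIT_PUSH_CERT_STATUS") != "G":
        return None
    # The nonce protects against replaying an old push certificate.
    if env.get("GIT_PUSH_CERT_NONCE_STATUS") != "OK":
        return None
    return env.get("GIT_PUSH_CERT_KEY") or None
```

In a hook this would be called as pusher_key_id(os.environ), and a None result would reject the push before any policy rule is consulted.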

Access control is a git app

Access control is the first and most fundamental git app. The access control policies that were described above are stored as files in the apps/access-control branch in the repository. Pushes to that branch are also subject to access control checks. Here is the branch's initial layout:

branches/ - access control policies for branches
  owner.conf
groups/ - group definitions (see below)
  ...
refs/ - access control policies for refs
  owner.conf

The default branches/owner.conf access control policy is as follows:

owner create-file,create-directory,modify,delete ^.*$

The default refs/owner.conf access control policy is as follows:

owner create-branch,create-tag,fast-forward,force,delete ^.*$

This gives the owner the ability to push refs and modify branches as they wish. The owner can grant other users access by pushing additional access control policy files or changing existing files on the apps/access-control branch.

Each access control policy file in refs/ or branches/ is processed in turn. If no access control rule matches the operation then the entire git push is rejected.

Groups can be defined to alias one or more user identifiers. This avoids duplicating access control rules when more than one user should have the same access. There are two automatic groups: owner contains just the user who owns the git repository and anyone is the group of all users.
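Putting groups and the all-or-nothing rule together, the evaluation could be sketched roughly as follows. Everything here is illustrative: in_group() and push_allowed() are hypothetical names, rules are assumed to be already parsed into (group, operations, pattern) tuples, and the user IDs are made up.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple

Rule = Tuple[str, Set[str], str]  # (group, operations, pattern)


def in_group(group: str, user_id: str, owner_id: str,
             groups: Dict[str, Set[str]]) -> bool:
    """Resolve the two automatic groups first, then user-defined ones."""
    if group == "anyone":
        return True
    if group == "owner":
        return user_id == owner_id
    return user_id in groups.get(group, set())


def push_allowed(rules: List[Rule], operations: Iterable[Tuple[str, str]],
                 user_id: str, owner_id: str,
                 groups: Dict[str, Set[str]]) -> bool:
    """Accept the push only if every (operation, name) matches some rule."""
    for operation, name in operations:
        matched = False
        for group, allowed_ops, pattern in rules:
            expanded = pattern.replace("$user_id", re.escape(user_id))
            if (in_group(group, user_id, owner_id, groups)
                    and operation in allowed_ops
                    and re.search(expanded, name)):
                matched = True
                break
        if not matched:
            return False  # one unmatched operation rejects the whole push
    return True
```

A push that mixes an allowed change with a forbidden one is rejected as a whole, which matches the pre-receive hook's accept-or-reject behaviour.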

This completes the description of the access control app. Now let's look at how other functionality is built on top of this.

The merge requests app

A merge requests app can be built on top of this model. The refs access control policy is as follows:

# The data branch contains the titles, comments, etc
anyone fast-forward ^apps/merge-reqs/data$

# Each merge request revision is pushed as a tag in a per-user namespace
anyone create-tag ^apps/merge-reqs/$user_id/[0-9]+-v[0-9]+$

The branch access control policy is:

# Merge requests are per-user and numbered
anyone create-directory ^merge-reqs/$user_id/[0-9]+$

# Title string
anyone create-file,modify ^merge-reqs/$user_id/[0-9]+/title$

# Labels (open, needs-review, etc) work like this:
#
#   merge-reqs/<user-id>/<merge-req-num>/labels/
#     needs-review -> /labels/needs-review
#     ...
#   labels/
#     needs-review/
#       <user-id>/
#         <merge-req-num> -> /merge-reqs/<user-id>/<merge-req-num>
#         ...
#       ...
#     ...
#
# This directory and symlink layout makes it possible to enumerate labels for a
# given merge request and to enumerate merge requests for a given label.
#
# Both the merge request author and maintainers can add/remove labels to/from a
# merge request.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/labels$
anyone create-symlink,delete ^merge-reqs/$user_id/[0-9]+/labels/.*$
maintainers create-symlink,delete ^merge-reqs/[^/]+/[0-9]+/labels/.*$
maintainers create-directory ^labels/[^/]+$
anyone create-symlink,delete ^labels/[^/]+/$user_id/[0-9]+$
maintainers create-symlink,delete ^labels/[^/]+/[^/]+/[0-9]+$

# Comments are stored as individual files in per-user directories. Each file
# contains a timestamp and the contents of the comment. The timestamp can be
# used to sort comments chronologically.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments$
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments/$user_id$
anyone create-file,modify ^merge-reqs/[^/]+/[0-9]+/comments/$user_id/[0-9]+$

When a user creates a merge request they provide a title and an initial comment, apply labels, and push a v1 tag for review and merging. Other users can comment by adding files into the merge request's per-user comments directory. Labels can be added and removed by changing symlinks in the labels directories.

The user can publish a new revision of the merge request by pushing a v2 tag and adding a comment describing the changes. Once the maintainers are satisfied they merge the final revision tag into the relevant branch (e.g. "main") and relabel the merge request from open/needs-review to closed/merged.

This workflow can be implemented by a tool that performs the necessary git operations so users do not need to understand the git app's internal data layout. Users just need to interact with the tool that displays merge requests, allows commenting, provides searches, etc. A natural way to implement this tool is as a git alias so it integrates alongside git's built-in commands.

One issue with this approach is that it uses the file system as a database. Performance and scalability are likely to be worse than using a database or application-specific file format. However, the reason for this approach is that it allows the access control app to enforce a policy that ensures users cannot modify or delete other users' data without running application-specific code on the server and while keeping everything stored in a git repository.

An example where this approach performs poorly is full-text search. The application would need to search all title and comment files for a string. There is no index for efficient lookups. However, if applications find that git-grep(1) does not perform well they can maintain their own index and cache files locally.
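As a rough idea of what such a local index could look like, here is a sketch of a tiny inverted index built from a checked-out copy of the data branch. The build_index() and search() names are hypothetical, the tokenization is deliberately naive, and a real tool would persist the index and update it incrementally instead of rebuilding it.

```python
import os
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(root: str) -> Dict[str, Set[str]]:
    """Map each lowercased word to the set of files (relative to root) using it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.islink(path):
                continue  # label symlinks carry no text of their own
            with open(path, encoding="utf-8", errors="replace") as f:
                for word in re.findall(r"\w+", f.read().lower()):
                    index[word].add(os.path.relpath(path, root))
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the files that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))
```

Since every peer has a full copy of the data, the index is purely a local cache and never needs to be pushed anywhere.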

I hope that this has shown how git apps can be built without application code running on the server.

Continuous integration bots

Now that we have the merge requests app it's time to think about how a continuous integration service could interface with it. The goal is to run tests on each revision of a merge request and report failures so the author of the merge request can rectify the situation.

A CI bot watches the repository for changes. In particular, it needs to watch for tags created with the ref name apps/merge-reqs/[^/]+/[0-9]+-v[0-9]+.

When a new tag is found the CI bot checks it out and runs tests. The results of the tests are posted as a comment by creating a file in merge-reqs/<user-id>/<merge-req-num>/comments/<ci-bot-user-id>/0 on the apps/merge-reqs/data branch. A ci-pass or ci-fail label can also be applied to the merge request so that the CI status can be easily queried by users and tools.
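The bookkeeping part of such a bot is small. The sketch below shows how it might recognize a merge request revision tag and work out where its comment file goes. The Revision type and both helper names are hypothetical, the key IDs are made up, and ref names are assumed to be given relative to refs/ as in the policy patterns. Fetching, running the tests and pushing the result are left out.

```python
import re
from typing import NamedTuple, Optional

TAG_RE = re.compile(r"^apps/merge-reqs/([^/]+)/([0-9]+)-v([0-9]+)$")


class Revision(NamedTuple):
    user_id: str  # merge request author's key ID
    number: int   # merge request number within the author's namespace
    version: int  # revision of the merge request (v1, v2, ...)


def parse_revision_tag(ref: str) -> Optional[Revision]:
    """Return the revision a tag ref describes, or None for unrelated refs."""
    match = TAG_RE.match(ref)
    if match is None:
        return None
    return Revision(match.group(1), int(match.group(2)), int(match.group(3)))


def ci_comment_path(rev: Revision, bot_id: str, seq: int = 0) -> str:
    """Path of the bot's comment file on the apps/merge-reqs/data branch."""
    return f"merge-reqs/{rev.user_id}/{rev.number}/comments/{bot_id}/{seq}"


rev = parse_revision_tag("apps/merge-reqs/89ABCDEF01234567/12-v2")
print(rev.number, rev.version)                    # 12 2
print(ci_comment_path(rev, "FEDCBA9876543210"))
# merge-reqs/89ABCDEF01234567/12/comments/FEDCBA9876543210/0
```

The bot would pick a fresh sequence number for each revision it tests so that earlier results are not overwritten.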

Going further

There are many loose ends. How can non-git users participate on issue trackers and wikis? It might be possible to implement a full peer as a client-side web application using isomorphic-git, a JavaScript git implementation. As mentioned above, the GPG key ID approach is not very browser-friendly because it requires revealing the private key to the web page, and since keys are user identifiers, using temporary keys does not work well.

The data model does not allow efficient queries. A full copy of the data is necessary in order to query it. That's acceptable for local applications because they can maintain their own indexes and are expected to keep the data for a long period of time. It works less well for short-lived web page sessions like a casual user filing a new bug on the issue tracker.

The git push --signed technique is not the only option. Git also supports signed commits and signed tags. The difference between signed pushes and signed tags/commits is significant. The signed push approach only validates the access control policy when the repository is changed and leaves no audit log for future reference. The signed commit/tag approach keeps the signatures in the git history. Signed commits/tags can be propagated in a peer-to-peer network and each peer can validate the access control policy itself. While signed commits/tags apply the access control policy to each object in the repository, signed pushes apply the access control policy to each change made to the repository. The difference is that it's easy to rebase and include work from different authors with signed pushes. Signed commits/tags require re-signing for rebasing, and each commit is validated against its own signer, who may be different from the user making the push request.

There are a lot of interesting approaches and trade-offs to explore here. The model we've discussed fits closely with how I've seen developers use git in open source projects. It is designed around a "main" repository/server that contributors push their code to. But each clone of the repository has all the data and can be published as a new "main" repository, if necessary.

Although these ideas are unfinished I decided to write them up with the knowledge that I probably won't implement them myself. QEMU is moving to GitLab with a traditional centralized git forge. I don't think this is the right time to develop this idea and try to convince the QEMU community to use it. For projects that have fewer infrastructure requirements it would give their contributors more power than being confined to a centralized git forge.

I hope this was an interesting read for anyone thinking about git forges and building git apps.