Friday, December 11, 2020

Building a git forge using git apps

The two previous blog posts about why git forges are von Neumann machines and the Radicle peer-to-peer git forge explored models for git forges. In this final post I want to cover yet another model that draws from the previous ones but has its own unique twist.

Peer-to-peer git apps

I previously showed how applications can be built on centralized git forges using CI/CD functionality for executing code, webhooks for interacting with the outside world, and disjoint branches for storing data.

A more elegant architecture is a peer-to-peer one where instead of many clients and one server there are just peers. Each peer has full access to the data. There is no client/server application code split, instead each peer runs an application for itself.

First, this makes it easier to move the data to new hosting infrastructure or fork a project since all data resides in the git repository. Merge requests, issues, wikis, and even the app settings are all stored in the git repo itself.

Second, this gives more power to the users who can process data however they want without being limited by the server's API. All peers are on equal footing and users don't need permission to alter applications, because they run locally.

Finally, it is easier to develop a local application than a client/server application. Being able to open a file and tweak the code is immediate and less hassle than testing and deploying a server-side application.

Internet peer-to-peer systems typically still require some central point for bootstrapping and this is no exception. A publicly-accessible git repository is still needed so that peers can fetch and push changes. However, in this model the git server does not run application code but "git apps" like merge requests, issue trackers, wikis, etc can still be implemented. Here is how it works...

The anti-application server

The git server is not allowed to run application code in our model, so apps like merge requests won't be processing data on the server side. However, the repository does need some primitives to make peer-to-peer git apps possible. These primitives are access control policies for refs and directories/files.

Peers run applications locally and the git server is "dumb" with the sole job of enforcing access control. You can imagine this like a multi-user UNIX machine where users have access to a shared directory. UNIX file permissions determine how processes can access the data. By choosing permissions carefully, multiple users can collaborate in the shared directory in a safe and controlled manner.

This is an anti-application server because no application code runs on the server side. The server is just a git repository that stores data and enforces access control on git push.

Access control

Repositories that accept push requests need a pre-receive hook (see githooks(5)) that checks incoming requests against the access control policy. If the request complies with the access control policy then the git push is accepted. Otherwise the git push is rejected and changes are not made to the git repository.

The first type of access control is on git refs. Git refs are the namespace where branches and tags are stored in a git repository. If a regular expression matches the ref and the operation type (create, fast-forward, force, delete) then it is allowed. For example, this policy rule allows any user to push to refs/heads/foo but force pushes and deletion are not allowed:

anyone create,fast-forward ^heads/foo$

The operations available on refs include:

OperationDescription
create-branchPush a new branch that doesn't exist yet
create-tagPush a new tag that doesn't exist yet
fast-forwardPush a commit that is a descendent of the current commit
forcePush a commit or tag replacing the previous ref
deleteDelete a ref

What's more interesting is that $user_id is expanded to the git push user's identifier so we can write rules to limit access to per-user ref namespaces:

anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$

This would allow Alice to push her own branches but Alice could not push to Bob's branches.

We have covered how to define access control policies on refs. Access control policies are also needed on branches so that multiple users can modify the same branch in a controlled and safe manner. The syntax is similar but the policy applies to changes made by commits to directories/files (what git calls a tree). The following allows users to create files in a directory but not delete or modify them (somewhat similar to the UNIX restricted deletion or "sticky" bit on world-writable directories):

anyone create-file ^shared-dir/.*$

The operations available on branches include:

OperationDescription
create-directoryCreate a new directory
create-fileCreate a new file
create-symlinkCreate a symlink
modifyChange an existing file or symlink
delete-fileDelete a file
...

$user_id expansion is also available for branch access control. Here the user can create, modify, and delete files in a per-user directory:

anyone create-file,modify,delete-file ^$user_id/.*$

User IDs

You might be wondering how user identifiers work. Git supports GPG-signed push requests with git push --signed. We can use the GPG key ID as the user identifier, eliminating the need for centralized user accounts. Remember that the GPG key ID is based on the public key. Key pairs are randomly generated and it is improbable that the same key will be generated by two different users. That said, GPG key ID uniqueness has been weak in the past when the default size was 32 bits. Git explicitly enables long 64-bit GPG key IDs but I wonder if collisions could be a problem. Maybe an ID with more bits based on the public key should be used instead, but for now let's assume the GPG key ID is unique.

The downside of this approach is that user IDs are not human-friendly. Git apps can allow the user to assign aliases to avoid displaying raw user IDs. Doing this automatically either requires an external ID issuer like confirming email address ownership, which is tedious for new users, or by storing a registry of usernames in the git repo, which means a first-come-first-server policy for username allocation and possible conflicts when merging from two repositories that don't share history. Due to these challenges I think it makes sense to use raw GPG key IDs at the data storage level and make them prettier at the user interface level.

The GPG key ID approach works well for desktop clients but not for web clients. The web application (even if implemently on the client side) would need access to the private key so it can push to the git repository. Users should not trust remotely hosted web applications with their private keys. Maybe there is a standard Web API that can help but I'm not aware one. More thought is needed here.

The pre-receive git hook checks that signature verification passed and has access to the GPG key ID in the GIT_PUSH_CERT_KEY environment variable. Then the access control policy can be checked.

Access control is a git app

Access control is the first and most fundamental git app. The access control policies that were described above are stored as files in the apps/access-control branch in the repository. Pushes to that branch are also subject to access control checks. Here is the branch's initial layout:

branches/ - access control policies for branches
  owner.conf
groups/ - group definitions (see below)
  ...
refs/ - access control policies for refs
  owner.conf

The default branches/owner.conf access control policy is as follows:

owner create-file,create-directory,modify,delete ^.*$

The default refs/owner.conf access control policy is as follows:

owner create-branch,create-tag,fast-foward,force,delete ^.*$

This gives the owner the ability to push refs and modify branches as they wish. The owner can grant other users access by pushing additional access control policy files or changing exsting files on the apps/access-control branch.

Each access control policy file in refs/ or branches/ is processed in turn. If no access control rule matches the operation then the entire git push is rejected.

Groups can be defined to alias one or more user identifiers. This avoids duplicating access control rules when more than one user should have the same access. There are two automatic groups: owner contains just the user who owns the git repository and anyone is the group of all users.

This completes the description of the access control app. Now let's look at how other functionality is built on top of this.

The merge requests app

A merge requests app can be built on top of this model. The refs access control policy is as follows:

# The data branch contains the titles, comments, etc
anyone modify ^apps/merge-reqs/data$

# Each merge request revision is pushed as a tag in a per-user namespace
anyone create-tag ^apps/merge-reqs/$user_id/[0-9]+-v[0-9]+$

The branch access control policy is:

# Merge requests are per-user and numbered
anyone create-directory ^merge-reqs/$user_id/[0-9]+$

# Title string
anyone create-file,modify ^merge-reqs/$user_id/[0-9]+/title$

# Labels (open, needs-review, etc) work like this:
#
#   merge-reqs/<user-id>/<merge-req-num>/labels/
#     needs-review -> /labels/needs-review
#     ...
#   labels/
#     needs-review/
#       <user-id>/
#         <merge-req-num> -> /merge-reqs/<user-id>/<merge-req-num>
#         ...
#       ...
#     ...
#
# This directory and symlink layout makes it possible to enumerate labels for a
# given merge request and to enumerate merge requests for a given label.
#
# Both the merge request author and maintainers can add/remove labels to/from a
# merge request.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/labels$
anyone create-symlink,delete ^merge-reqs/$user_id/[0-9]+/labels/.*$
maintainers create-symlink,delete ^merge-reqs/[^/]+/[0-9]+/labels/.*$
maintainers create-directory ^labels/[^/]+$
anyone create-symlink,delete ^labels/[^/]+/$user_id/[0-9]+$
maintainers create-symlink,delete ^labels/[^/]+/[^/]+/[0-9]+$

# Comments are stored as individual files in per-user directories. Each file
# contains a timestamp and the contents of the comment. The timestamp can be
# used to sort comments chronologically.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments$
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments/$user_id$
anyone create-file,modify ^merge-reqs/[^/]+/[0-9]+/comments/$user_id/[0-9]+$

When a user creates a merge request they provide a title, an initial comment, apply labels, and push a v1 tag for review and merging. Other users can comment by adding files into the merge request's per-user comments directory. Labels can be added and removed by changing symlinks in the labels directories.

The user can publish a new revision of the merge request by pushing a v2 tag and adding a comment describing the changes. Once the maintainers are satisfied they merge the final revision tag into the relevant branch (e.g. "main") and relabel the merge request from open/needs-review to closed/merged.

This workflow can be implemented by a tool that performs the necessary git operations so users do not need to understand the git app's internal data layout. Users just need to interact with the tool that displays merge requests, allows commenting, provides searches, etc. A natural way to implement this tool is as a git alias so it integrates alongside git's built-in commands.

One issue with this approach is that it uses the file system as a database. Performance and scalability are likely to be worst than using a database or application-specific file format. However, the reason for this approach is that it allows the access control app to enforce a policy that ensures users cannot modify or delete other user's data without running application-specific code on the server and while keeping everything stored in a git repository.

An example where this approach performs poorly is for full-text search. The application would need to search all title and comment files for a string. There is no index for efficient lookups. However, if applications find that git-grep(1) does not perform well they can maintain their own index and cache files locally.

I hope that this has shown how git apps can be built without application code running on the server.

Continuous integration bots

Now that we have the merge requests app it's time to think how a continuous integration service could interface with it. The goal is to run tests on each revision of a merge request and report failures so the author of the merge request can rectify the situation.

A CI bot watches the repository for changes. In particular, it needs to watch for tags created with the ref name apps/merge-reqs/[^/]+/[0-9]+-v[0-9]+.

When a new tag is found the CI bot checks it out and runs tests. The results of the tests are posted as a comment by creating a file in merge-regs/<user-id>/>merge-req-num>/comments/<ci-bot-user-id>/0 on the apps/merge-reqs/data branch. A ci-pass or ci-fail label can also be applied to the merge request so that the CI status can be easily queried by users and tools.

Going further

There are many loose ends. How can non-git users participate on issue trackers and wikis? It might be possible to implement a full peer as a client-side web application using isomorphic-git, a JavaScript git implementation. As mentioned above, the GPG key ID approach is not very browser-friendly because it requires revealing the private key to the web page and since keys are user identifiers using temporary keys does not work well.

The data model does not allow efficient queries. A full copy of the data is necessary in order to query it. That's acceptable for local applications because they can maintain their own indexes and are expected to keep the data for a long period of time. It works less well for short-lived web page sessions like a casual user filing a new bug on the issue tracker.

The git push --signed technique is not the only option. Git also supports signed commits and signed tags. The difference between signed pushes and signed tags/commits is significant. The signed push approach only validates the access control policy when the repository is changed and leaves no audit log for future reference. The signed commit/tag approach keeps the signatures in the git history. Signed commits/tags can be propagated in a peer-to-peer network and each peer can validate the access control policy itself. While signed commits/tags apply the access control policy to each object in the repository, signed pushes apply the access control policy to each change made to the repository. The difference is that it's easy to rebase and include work from different authors with signed pushes. Signed commits/tags require re-signing for rebasing and each commit is validated against its signature, which may be different from the user who is making the push request.

There are a lot of interesting approaches and trade-offs to explore here. This model we've discussed fits closely with how I've seen developers use git in open source projects. It is designed around a "main" repository/server that contributors push their code to. But each clone of the repository has all the data and can be published as a new "main" repository, if necessary.

Although these ideas are unfinished I decided to write them up with the knowledge that I probably won't implement them myself. QEMU is moving to GitLab with a traditional centralized git forge. I don't think this is the right time to develop this idea and try to convince the QEMU community to use it. For projects that have fewer infrastructure requirements it would give their contributors more power than being confined to a centralized git forge.

I hope this was an interesting read for anyone thinking about git forges and building git apps.

Sunday, December 6, 2020

Understanding Peer-to-Peer Git Forges with Radicle

Git is a distributed version control system and does not require a central server. Although repositories are usually published at a well-known location for convenient cloning and fetching of the latest changes, this is actually not necessary. Each clone can have the full commit history and evolve independently. Furthermore, code changes can be exchanged via email or other means. Finally, even the clone itself does not need to be made from a well-known domain that hosts a git repository (see git-bundle(1)).

Given that git itself is already fully decentralized one would think there is no further work to do. I came across the Radicle project and its somewhat psychedelic website. Besides having a website with a wild color scheme, the project aims to offer a social coding experiment or git forge functionality using a peer-to-peer network architecture. According to the documentation the motivation seems to be that git's built-in functionality works but is not user-friendly enough to make it accessible. In particular, it lacks social coding features.

The goal is to add git forge features like project and developer discovery, issue trackers, wikis, etc. Additional, distinctly decentralized functionality, is also touched on involving Ethereum as a way to anchor project metadata, pay contributors, etc. Radicle is still in early development so these features are not yet implemented. Here is my take on the How it Works documentation, which is a little confusing due to its early stage and some incomplete sentences or typos. I don't know whether my understanding actually corresponds to the Radicle implementation that exists today or its eventual vision, because I haven't studied the code or tried running the software. However, the ideas that the documentation has brought up are interesting and fruitful in their own right, so I wanted to share them and explain them in my own words in case you also find them worth exploring.

The git data model

Let's quickly review the git data model because it is important for understanding peer-to-peer git forges. A git repository contains a refs/ subdirectory that provides a namespace for local branch heads (refs/heads/), local and remotely fetched tags (refs/tags/), and remotely fetched branches (refs/remotes/<remote>/). Actually this namespace layout is just a convention for everyday git usage and it's possible to use the refs/ namespace differently as we will see. The git client fetches refs from a remote according to a refspec rule that maps remote refs to local refs. This gives the client the power to fetch only certain refs from the server. The client can also put them in a different location in its local refs/ directory than the server. For details, see the git-fetch(1) man page.

Refs files contain the commit hash of an object stored in the repository's object database. An object can be a commit, tree (directory), tag, or a blob (file). Branch refs point to the latest commit object. A commit object refers to a tree object that may refer to further tree objects for sub-directories and finally the blob objects that make up the files being stored. Note that a git repository supports disjoint branches that share no history. Perhaps the most well-known example of disjoint branches are the GitHub Pages and GitLab Pages features where these git forges publish static websites from the HTML/CSS/JavaScript/image files on a specific branch in the repository. That branch shares no version history with other branches and the directories/files typically have no similarity to the repository's main branch.

Now we have covered enough git specifics to talk about peer-to-peer git forges. If you want to learn more about how git objects are actually stored, check out my article on the repository layout and pack files.

Identity and authority

Normally a git repository has one or more owners who are allowed to push refs. No one else has permission to modify the refs namespace. What if we tried to share a single refs namespace with the whole world and everyone could push? There would be chaos due to naming conflicts and malicious users would delete or change other users' refs. So it seems like an unworkable idea unless there is some way to enforce structure on the global refs namespace.

Peer-to-peer systems have solutions to these problems. First, a unique identity can be created by picking a random number with a sufficient number of bits so that the chance of collision is improbable. That unique identity can be used as a prefix in the global ref namespace to avoid accidental collisions. Second, there needs to be a way to prevent unauthorized users from modifying the part of the global namespace that is owned by other users.

Public-key cryptography provides the primitive for achieving both these things. A public key or its hash can serve as the unique identifier that provides identity and prevents accidental collisions. Ownership can be enforced by verifying that changes to the global namespace are signed with the private key corresponding to the unique identity.

For example, we fetch the following refs from a peer:

<identity>/
  heads/
    main
  metadata/
    signed_refs

This is a simplified example based on the Radicle documentation. Here identity is the unique identity based on a public key. Remember no one else in the world has the same identity because the chance of generating the same public key is improbable. The heads/ refs are normal git refs to commit objects - these are published branches. The signed_refs ref points to an git object that contains a list of commit hashes and a signature generated using the public key. The signature can be verified using the public key.

Next we need to verify these changes to check that they were created with the private key that is only known to the identity's owner. First, we check the signature on the object pointed to by the signed_refs ref. If the signature is not valid we reject these changes and do not store them in our local repository. Next, we look up each ref in heads/ against the list in signed_refs. If a ref is missing from the list then we reject these refs and do not allow them into our local repository.

This scheme lends itself to peer-to-peer systems because the refs can be propagated (copied) between peers and verified at each step. The identity owner does not need to be present at each copy step since their cryptographic signature is all we need to be certain that they authorized these refs. So I can receive refs originally created by identity A from peer B and still be sure that peer B did not modify them since identity A's signature is intact.

Now we have a global refs namespace that is partitioned so that each identity is able to publish refs and peers can verify that these changes are authorized.

Gossip

It may not be clear yet that it's not necessary to clone the entire global namespace. In fact, it's possible that no single peer will ever have a full copy of the entire global namespace! That's because this is a distributed system. Peers only fetch refs that they care about from their peers. Peers fetch from each other and this forms a network. The network does not need to be fully connected and it's possible to have multiple clusters of peers running without full global connectivity.

To bootstrap the global namespace there are seed repositories. Seeds are a common concept in peer-to-peer systems. They provide an entry point for new peers to learn about and start participating with other peers. In BitTorrent this is called a "tracker" rather than a "seed".

According to the Radicle documentation it is possible to directly fetch from peers. This probably means a git-daemon(1) or git-http-backend(1) needs to be accessible on the public internet. Many peers will not have sufficient network connectivity due to NAT limitations. I guess Radicle does not expect every user to participate as a repository.

Interestingly, there is a gossip system for propagating refs through the network. Let's revisit the refs for an identity in the global namespace:

<identity>/
  heads/
    main
  metadata/
    signed_refs
  remotes/
    <another-identity>/
      heads/
        main
        foo
      metadata/
        signed_refs
      remotes/
        ...

We can publish identities that we track in remotes/. It's a recursive refs layout. This is how someone tracking our refs can find out about related identities and their refs.

Thanks to git's data model the commit, tree, and blob objects can be shared even though we duplicate refs published by another identity. Since git is a content-addressable object database the data is stored once even though multiple refs point to it.

Now we not only have a global namespace where anyone can publish git refs, but also ways to build a peer-to-peer network and propagate data throughout the network. It's important to note that data is only propagated if peers are interested in fetching it. Peers are not forced to store data that they are not interested in.

How data is stored locally

Let's bring the pieces together and show how the system stores data. The peer creates a local git repository called the monorepo for the purpose of storing portions of the global namespace. It fetches refs from seeds or direct peers to get started. Thanks to the remotes/ refs it also learns about other refs on the network that it did not request directly.

This git repository is just a data store, it is not usable for normal git workflows. The conventional git branch and git tag commands would not work well with the global namespace layout and verification requirements. Instead we can clone a local file:/// repository from the monorepo that fetches a subset of the refs into the conventional git refs layout. The files can be shared because git-clone(1) supports hard links to local repositories. Thanks to githooks(5) and/or extensible git-push(1) remote helper support it's possible to generate the necessary global namespace metadata (e.g. signatures) when we push from the local clone to the local monorepo. The monorepo can then publish the final refs to other peers.

Building a peer-to-peer git forge

There are neat ideas in Radicle and it remains to be seen how well it will grow to support git forge functionality. A number of challenges need to be addressed:

  • Usability - Radicle is a middle-ground between centralized git forges and email-based decentralized development. The goal is to be easy to use like git forges. Peer-to-peer systems often have challenges providing a human-friendly interface on top of public key identities (having usernames without centralized user accounts). Users will probably prefer to think in terms of repositories, merge requests, issues, and wikis instead of peers, gossip, identities, etc.
  • Security - The global namespace and peer-to-peer model is a target that malicious users will attack by trying to impersonate or steal identities, flood the system with garbage, game reputation systems with sockpuppets, etc.
  • Scalability - Peers only care about certain repositories and don't want to be slowed down by all the other refs in the global namespace. The recursive refs layout seems like it could cause performance problems and maybe users will configure clients to limit the depth to a low number like 3. At first glance Radicle should be able to scale well since peers only need to fetch refs they are interested in, but it's a novel way of using git refs, so we can expect scalability problems as the system grows.
  • Data model - How will this data model grow to handle wikis, issue trackers, etc? Issue tracker comments are an example of a data structure that requires conflict resolution in a distributed system. If two users post comments on an issue, how will this be resolved without a conflict? Luckily there is quite a lot of research on distributed data structures such as Conflict-free Replicated Data Types (CRDTs) and it may be possible to avoid most conflicts by eliminating concepts like linear comment numbering.
  • CI/CD - As mentioned in my blog post about why git forges are von Neumann machines, git forges are more than just data stores. They also have a computing model, initially used for Continuous Integration and Continuous Delivery, but really a general serverless computing platform. This is hard to do securely and without unwanted resource usage in a peer-to-peer system. Maybe Radicle will use Ethereum for compute credits?

Conclusion

Radicle is a cool idea and I look forward to seeing where it goes. It is still at an early stage but already shows interesting approaches with the global refs namespace and monorepo data store.

Wednesday, December 2, 2020

Software Freedom Conservancy announces donation matching

Software Freedom Conservancy, the non-profit charity home of QEMU, Git, Inkscape, and many other free and open source software projects is running its annual fundraiser. They have announced a generous donation matching pledge so donations made until January 15th 2021 will be doubled! Full details are here.

What makes Software Freedom Conservancy important, besides being a home for numerous high-profile free and open source software projects, is that it is backed by individuals and works for the public interest. It is not a trade association funded by companies to represent their interests. With the success of free and open source software it's important we don't lose these freedoms or use them just to benefit businesses. That's why I support Software Freedom Conservancy.