Monday, April 5, 2021

Learning programming languages

You may be productive and comfortable in one programming language but find the idea of learning a new programming language daunting. Or you may know and use multiple programming languages but haven't learnt a new one in a while. Or you might be a programming language geek who is just curious about how others dive into new programming languages and get productive quickly. No matter how easy or difficult it is for you to engage in new programming languages, this article explains how I like to learn new programming languages. Although people learn best in different ways, I hope you'll find my thought process interesting even if you decide to take a different approach.

Background

Language N+1

This article isn't aimed at learning to program. Learning your first programming language is much harder than learning an additional one. The reason is that many abstract concepts are involved in computer programming. When you first encounter programming, most languages require you to understand concepts like iteration, scopes, (im)mutability, arrays, modules, functions, and much more. The good news is that when you learn an additional language you'll already be familiar with common concepts and can therefore take a more streamlined approach in order to get up to speed quickly.

Courses, videos, exercises

There is a lot of educational material online that teaches various programming languages, but I don't find structured courses, videos, or exercises efficient. If you already know common programming concepts and have an idea of what you want to build in the new programming language, then it's more efficient to chart your own course. Materials that let you jump/skip around will let you focus on information that is novel and that you actually need. Working through a series of exercises that someone else designed may be time spent practicing the wrong things since usually you are the one with the best idea of what to practice. Courses, videos, and exercises tend to be an "on the rails" experience where you are exposed to information in a linear fashion whether it's useful at the moment or not.

1. Understanding the computational model

The first question about a new programming language is "what is its computational model?". Sadly, many language manuals and websites do not describe the computational model beyond what programming paradigms are supported (object-oriented, concatenative, functional, logic programming, etc). The actual computational model may only become fully apparent later. Or it might be expressed in too much detail in a language standards document to be of use early on. In any case, it's worthwhile reading the programming language's website for information on the computational model to grasp the big picture.

It's the computational model that you need to understand in order to write programs. Often we think about syntax and language features too much when learning a new language. The computational model informs us how to break down requirements into programs. We approach logic programming differently from object-oriented programming in how we organize data and code. The syntax and to an extent even the language features don't matter.

Understanding the computational model also helps you situate the new programming language relative to others, especially programming languages that you already know. It will give you an idea of how different programming will be and where you'll need to learn new concepts.

2. The language tutorial

After familiarizing yourself with the computational model of the programming language, the next step is to learn the basic syntax and concepts. Most modern programming languages have an official tutorial available online. The tutorial introduces the language elements, usually with short examples, and its table of contents gives an overview of what the language consists of. The tutorial can be completed in a few hours or days. Unlike full courses, official programming language tutorials often lend themselves to non-linear reading, which is helpful when certain aspects of the language are already familiar or will not be relevant to you.

I remember reading the Python tutorial in an afternoon years ago, but watch out: at this point you might be able to write valid syntax but you won't be writing idiomatic code yet. There's that saying "you can write FORTRAN in any language". In order to write programs that are expressed naturally and take advantage of the language effectively, more effort will be necessary.

3. Writing toy programs

After becoming aware of the language elements the next step is to explore how the language works. This can be done by writing small programs. Often these toy programs are familiar tasks you've already solved in other languages. If you want to write games, maybe it's Pong. If you write web applications, it could be a todo list. There are lots of different well-known programs to write.

During the course of writing toy programs you'll encounter syntax errors or issues with the program. Learning to interpret common error messages is important because they will come up in more complicated scenarios later where it can be harder to resolve them if you haven't seen them before.

You'll also hit common tasks for which you need to find solutions in the standard library or language reference manual. Whether it's parsing command-line options, regular expression matching, HTTP requests, or error handling, the language probably has a way of doing it. Toy programs present a simple environment in which to explore the basic facilities of a programming language.

4. Gaining a deeper appreciation for the language

Once you have written some toy programs you'll be able to start writing your own programs that solve new problems. At this stage you start being productive but there is still more to learn. In particular, the language's idioms and patterns must be studied in order to write natural code. Once I have experience with the basics of a language I like to read the source code to the standard library, popular libraries, and popular applications. In the beginning this is hard because they use unfamiliar language features or library dependencies, but after following up on unknown parts of one program, you'll find it becomes easier to read other programs because your knowledge of the language has expanded.

At this point it is also worth looking for style guides, manuals on language idioms, and documentation on common gotchas or anti-patterns. These will provide the information about thinking natively in the new programming language. This is what's needed to become fluent in the language and capable of reading and writing real programs confidently.

Although I have presented steps in a linear order, learning complex subjects is often an iterative process. Sometimes I find myself jumping back and forth between steps as my understanding evolves.

Conclusion

Learning a new programming language is time-consuming no matter how you do it. However, it doesn't all need to happen upfront and after a few days of reading the documentation and experimenting with toy programs, it's possible to peform basic tasks. Learning how to use a language effectively by studying popular programs and reading guides is the quickest way I've found to reaching fluency. Finally, it just takes practice!

Friday, March 12, 2021

Overcoming fear of public communication in open source

When given the choice between communicating in private or in public, many people opt for private communication. They send questions or unfinished patches to individuals instead of posting on mailing lists, forums, or chat rooms. This post explains why public communication is a faster and more efficient way for developers to communicate in open source communities than private communication. A big factor in the private vs public decision is psychological and I've provided a checklist to give you confidence when communicating in public.

Time and time again I find people initiating discussions through private channels when I know it would be advantageous to have them in public instead. I even keep an email reply template handy asking the sender to reach out to the public mailing list instead of communicating with me in private. Communicating in public is a good default unless you need confidentiality (security bugs, business reasons, etc). So why do people prefer to ask questions or send patches off-list?

Why we fear public communication

I'm interested in understanding why people avoid public communications channels in open source communities. The purpose of mailing lists, chat rooms, and forums is to engage in discussions and share knowledge. Members of these communities are interested in the topic and want to engage in discussion. But there are some common reasons I've found why people don't make use of public communications channels:

  • I don't want to create noise. Mailing lists are home to important discussions between experienced members of the community. Sending a relatively simple question or unfinished patch that no expert would need to ask can make you doubt whether it deserves to be sent at all. This is a fallacy. As long as the question or patches are relevant to the community and you have done your homework (see the checklist below), there is no need to worry about creating noise.
  • I will look dumb or seem like a bad programmer. When you need help it's likely that your understanding is incomplete. We often hold ourselves to artifically high standards when asking for help in public, yet when we observe others interacting in public we don't hold their questions or unfinished code against them. Only if they show a repeated pattern of sloppy work does it damage their reputation. Asking for help in public won't harm your image.
  • Too much traffic. Big open source communities are often so active that no single person can keep track of everything that is going on. Even core members of the community rely on filtering only topics relevant to them. Since filtering becomes necessary at scale anyway, there's no need to worry about sending too many messages.

A note about tone: in the past people were more likely to be put off by the unfriendly tone in chat rooms and mailing lists. Over time the tone seems to have improved in general, probably due to multiple factors like open source becoming more professional, codes of conduct making participants aware of their behavior, etc. I don't want to cover the pros and cons of these factors, but suffice to say that nowadays tone is less of a problem and that's a good thing for everyone.

Why public communication is faster and more efficient

Next let's look at the reasons why public communication is beneficial:

Including the community from the start avoids waste

If you ask for help in private it's possible that the answer you get will not end up being consensus in the community. When you eventually send patches to the community they might disagree with the approach you settled on in private and you'll have to redo your work to get the patches merged. This can be avoided by including the community from the start.

Similarly, you can avoid duplicating work when multiple people are investigating the same problem in private without knowing about each other. If you discuss what you are doing in public then you can collaborate and avoid spending time creating multiple solutions, only one of which can be merged.

Public discussions create a searchable knowledge base

If you find a solution to your problem in private, that doesn't help others who have the same question. Public discussions are typically archived and searchable on the web. When the next person has the same question they will find the answer online and won't need to ask at all!

Additionally, everyone benefits and learns from each other when questions are asked in public. We soak up knowledge by following these discussions. If they are private then we don't have this opportunity.

Get a reply even if the person you asked is unavailable

The person you intended to reach may be away due to timezones, holidays, etc. A private discussion is blocked until they respond to your message. Public discussions, on the other hand, can progress even when the original participants are unavailable. You can get an answer to your question faster by asking in public.

Public activity gives visibility to your work

Private discussions are invisible to your teammates, managers, and other people affected by your progress. When you communicate in public they will be able plan better, help out when needed, and give you credit for the effort you are putting in.

A checklist before you press Send

Hopefully I have encouraged you to go ahead and ask questions or send unfinished patches in public. Here is a checklist that will help you feel confident about communicating in public:

Before asking questions...

  1. Have you searched the web, documentation, and code?
  2. Have you added printfs, GDB breakpoints, or enabled tracing to understand the behavior of the system?

Before sending patches...

  1. Is the code formatted according to the coding standard?
  2. Are the error cases handled, memory allocation/ownership correctly implemented, and thread safety addressed? Most of the time you can figure these out yourself and forgetting to do so may distract from the topics you want help with.
  3. Did you add todo comments pointing out unfinished aspects of the code? Being upfront about what is missing helps readers understand the status of the code and saves them time trying to distinguish requirements you forgot from things that are simply not yet implemented.
  4. Do the commit descriptions explain the purpose of the code changes and does the cover letter give an overview of the patch series?
  5. Are you using git-format-patch --subject-prefix RFC (same for git-publish) to mark your patches as a "request for comment"? This tells people you are seeking input on unfinished code.

Finally, do you know who to CC on emails or mention in comments/chats? Look up the relevant maintainers or active developers using git-log(1), scripts/get_maintainer.pl, etc.

Conclusion

Private communication can be slower, less efficient, and adds less value to an open source community. For many people public communication feels a little scary and they prefer to avoid it. I hope that by understanding the advantages of public communication you will be motivated to use public communication channels more.

Wednesday, February 24, 2021

Milestone Systems: Software that changes how things are done

Every few years a project comes out with a new approach that becomes influential. Often it involves combining existing concepts in a novel way. People argue about whether the project is actually novel or whether it was just in the right place at the right time and popularized existing technology. Regardless, I find these projects fascinating and try to learn about them because they are milestones that future systems are based on.

Here is a short list of projects that I think fall into this category. I hope you enjoy them (if you haven't already explored them). Send me your picks!

Tor

Tor is an onion router. It enables (mostly) anonymous communication by tunneling encrypted connections. The client does not know the IP address of the server (when connecting to so-called hidden services), the server does not know the IP address of the client, and the intermediate hops only know about their immediate predecessor and successor.

The design of Tor is described in a paper.

BitTorrent

BitTorrent is a decentralized peer-to-peer file sharing protocol that can be used to reduce load on file hosting servers and improve download times. It's commonly used to share copyrighted material, but is also used by Linux distributions to publish ISO images and by software update systems.

A central aspect to BitTorrent is that peers exchange pieces of the file amongst themselves thanks to a Merkle tree. Pieces received from untrusted peers are checked against the file's Markle tree to ensure that data has not been corrupted or manipulated.

A paper about the economics of BitTorrent described some of the ideas behind it. The actual protocol is described by the protocol specification.

git

Git is the most popular version control system as of 2020. It replaced the older CVS and Subversion systems that were widely used before it. Other systems like Mercurial, Darcs, Perforce, and BitKeeper had similar use cases and ideas.

Git is a content-addressable object store with a convention for representing trees of files as well as commits and tags. I wrote about how the object store is implemented here if you want to learn about pack files and deltas.

Bitcoin

Bitcoin is a decentralized currency, also known as cryptocurrency. A network of mutually untrusted nodes maintains a ledger called the blockchain that records transactions. Bitcoin is famous for mining where nodes compete to solve a computationally-expensive problem in order to extend the ledger.

What is interesting about Bitcoin is that the blockchain prevents abuse as long as at least half of the nodes are not controlled or colluding. In other words, it is a decentralized consensus - although there can be short-lived splits where not all nodes agree on the current state.

The Bitcoin paper gives an overview of how the system works.

Conclusion

I hope this was a fun post that motivated you to look at a system you haven't studied yet or made you think about systems that you consider milestone systems. Please get in touch if you want to share yours!

Tuesday, February 16, 2021

Video and slides available for "The Evolution of File Descriptor Monitoring in Linux"

My FOSDEM 2021 talk "The Evolution of File Descriptor Monitoring in Linux: From select(2) to io_uring" is now available:

The talk compares the file descriptor monitoring system calls available in Linux and discusses their design. Benchmark results show how well they scale when there are many file descriptors. I hope this is a useful overview to this important kernel feature that GUI applications, network services, and many other programs rely on.

If you are interested in API design and performance, this talk highlights how different approaches like stateless vs stateful APIs can affect performance and how to minimize the number of API calls through careful design.

Enjoy!

Tuesday, February 2, 2021

Keeping a clean git commit history

Does the commit history of your source repository look like this:

f02af91822 docs: fix incorrect subheadings
dd7bee8b38 cli: add --import option
900ca2936a cli: extract move_topic() helper function

Or like this:

7011cc9868 lunch time
a07c82331d resolve code review comments
331d79a8ff more fixes

?

The first is a clean git commit history where each commit has a clear purpose and is a single logical change. The second is a messy commit history where the commits have no inherent structure:

  • "more fixes" does not describe clearly what is being fixed and the plural ("fixes") hints it may contain multiple logical changes instead of just one.
  • "resolve code review comments" contains changes requested by code reviewers in relation to another commit, it's not a self-contained logical change.
  • "lunch time" is an unfinished commit that was created because the programmer wanted to save their work.

Commit anti-patterns

The example above illustrates several anti-patterns:

Vague commit messages

If the commit message is vague and does not express a clear purpose, then it is hard to know what a commit does from the commit message. If git-log(1) doesn't provide useful information about commits then one has to resort to searching the code diffs. That is very tedious and sometimes it's almost impossible to come up with a good code search query while a clear commit message would have been easy to search. So clear commit messages are the first step towards clean commit history.

Doing too many things in one commit

Commits that make several logical code changes are hard to review and impede backporting fixes to stable branches. For example, a commit that fixes a bug as well as adding a new feature may need to be rewritten for a stable branch. If instead the code had been split into two commits, then the bug fix commit could have been backported easily. Therefore it is good practice to separate distinct bug fixes, features, and other logical code changes into separate commits.

Addressing code review comments

The code review and testing history is usually not useful information once a commit has been merged. For example, if there was a continuous integration (CI) test failure and a pull request needed to be changed, then the change should be made directly to the buggy commit so that the final commit passes the tests. No one needs to know about the code review or testing history once the code is merged and keeping these artifacts makes the commit history unwieldy by spreading a logical code change across multiple incomplete commits.

Saving work

There are valid reasons to temporarily save your work in a commit, but work-in-progress (WIP) commits should be cleaned up before merging them. For example, sometimes people make arbitrary commits to save work at the end of the day. That is fine in a local branch, but those temporary commits can be restructured into clean commits using git-rebase(1). No one else needs to know about temporary commits.

Broken commits

It can be easy to accidentally include a commit that does not build or fails tests if a later commit happens to resolve the issue. Since the later commit hides the issue it may not be apparent when testing the branch. When reordering commits the risk of introducing broken commits increases because those commits were originally written in a different order. I use the git-rebase(1) exec action to build and run tests after every commit to detect broken commits when doing extensive rebases.

Why clean commit history is important

Not all reasons for maintaining a clean commit history are obvious. Unfortunately all the above anti-patterns make commit history less useful so it's interesting to note that if you value any of the following reasons for keeping a clean commit history, then all anti-patterns need to be avoided.

Code review

Reviewers have an easier time reading clean commits than an unstructured series of commits. For example, if there is a broken commit because a function is used before it is defined in a later commit, then that affects code reviewers who read the commits linearly. They will be puzzled by the non-existent function and unable to decide whether it is being used correctly because it has not been defined yet. Although code reviewers could put in extra effort to reread the commits multiple times and try to remember the misordered changes, it's better to let code reviewers spend time on real issues rather than on untangling poorly structured commits.

Capturing the rationale for code changes

When each commit is a single logical change it becomes possible to write good commit descriptions that give the rationale for the code change. Explanations for why a code change is necessary, as well as links to issue trackers, email discussions, etc can be valuable when revisiting the commit history later. If commits contain multiple logical code changes or are incomplete then it is hard to include a good commit description, so the commit history is less useful when referring back to it later on.

Making cherry-picking easy

Many software projects maintain stable branches that still receive bug fixes for some time. This allows development to introduce new features and less mature code while users can run a mature stable release. However, maintaining stable branches can be time-consuming. Maintainers need to identify commits suitable for stable branches and cherry-pick or backport them. This requires clean commit history so that bug fixes can be applied in isolation without dragging in other code changes that do not fit the criteria for stable branches.

Enabling git-bisect(1)

When a bug is observed it may not be clear which commit introduced it. The git-bisect(1) command systematically searches the commit history and identifies the commit that caused the bug. However, git-bisect(1) only works with clean commit history. If there are broken commits then bisection becomes unreliable because some portions of commit history cannot be tested. Poorly structured commits, such as huge changes that do many different things, also make it difficult to identify which line caused the bug even when git-bisect(1) has determined which commit is to blame.

When clean commit history does not matter

The reason I have found that not everyone practices clean commit history is that they may not need any of this. Especially small projects developed by a single author may involve little code review, backporting changes to stable branches, or git-bisect(1). In that case the effort required to split code changes into clean commits and write good commit messages may seem unjustified. Of course this can change but once the commit history is messy there is not much to be done. So it's worth thinking carefully about whether to take shortcuts.

Another factor is poor tooling. Gerrit and GitHub's code review has historically made it hard to practice clean commit history. They were not designed for reviewing commit series and favored anti-patterns like squashing everything into a single commit or adding additional commits to address code review feedback. These are tool limitations and luckily GitHub code review has become better over the years. Tools that encourage you to review a commit series as a single diff are not conducive to clean commit history.

Finally, clean commit history requires proficiency with git-rebase(1) and that you are comfortable with the idea of rewriting your local branch to clean it up before publishing it. It takes a little practice to become competent at reordering, squashing, and splitting commits. The process can be a little scary, although git-reflog(1) makes it possible to undo even the most serious errors where commits were accidentally lost. On a related note, some people falsely believe that a pull or merge request branch should not be rebased. Although it is good practice to avoid rewriting history of branches that other people track, rewriting history and force-pushing a pull request is different. Most of the time no one else will maintain a local branch based on it and therefore force-pushing will not inconvenience anyone. Even if it is necessary to develop branches based on someone else's not-yet-merged branches, one needs to weigh the trade-offs of having to do more work in the short-term with the drawbacks of having a messy commit history forever.

Conclusion

I hope this is a useful summary of why each commit should have a clear purpose and embody a single logical change. For source repositories that are used by more than one person it is especially important to think about commit best practices. Clean commit history facilitates better code review, bug-finding, and maintaining stable branches. Beyond that it also provides a useful form of communication and sharing knowledge about the codebase that is missing when commit history is disregarded.

Saturday, January 30, 2021

Why learning the Vim text editor is worth it

Vim logo, GNU General Public License v2 or Later

Many tools come and go as our software and devices change - or we get bored and want to try something new and shiny. One of the few exceptions for me has been the Vim text editor, which I use for programming, emails, and writing every day. In this post I want to share why Vim is remarkable but more generally why learning a text editor is a great investment.

It may not be obvious why text editors are useful tools. Many programs, like web browsers, email clients, and integrated development environments (IDEs), have built-in text editing functionality. Why use a separate text editor? Text editors go deep. They are much more powerful than text boxes in browsers and email clients while being more general than IDEs. While IDEs often have excellent programming language-specific functionality, they are rarely used for other text editing tasks like writing emails or documents because they are specialized tools. Text editors strike a good balance of powerful editing and support for programming without being boxed into a narrow use-case.

Getting started with a text editor is easy. Vim implements the arcane vi editor user interface but has many videos, cheatsheets, and tutorials that make it fun to try. Really getting familiar with the features and customizing the editor to your own needs takes time though. This is true for any popular text editor because the number of settings, extensions, or plugins available can be huge. However, once you are familiar with a text editor you will have a powerful tool that can be used for most tasks involving writing or manipulating text. The time investment will pay off as you use the editor for todo lists, emails, documentation, programming, configuration files, and more.

This explains why I've found Vim a useful and enduring tool that I use daily. But what makes it a particularly strong text editor compared to the other options? Text editors go in and out of fashion all the time. I remember many that attracted attention for a time but then faded away. Vim has remained popular and I think there are a few reasons for that.

Powerful text editing plus IDE-like functionality. The vi user interface is actually a language of text editing operators. The keys you press aren't just keyboard shortcuts, they are like a bytecode (!) for a text manipulation CPU that is Turing-complete. Years ago I wrote vi macros that solve the stable marriage problem to demonstrate this. For many geeks this alone might be enough to convince you to learn Vim! But on top of this crazy text editing power Vim also has IDE-like functionality including syntax highlighting, completion, code search and navigation, compiler error navigation, and diffing. There is a large collection of plugins and scripts if you want to extend Vim's functionality even further.

Vim is ubiquitous. It runs on all major operating systems but furthermore it is found on devices from tiny Wi-Fi routers to the largest servers. It has a GUI but also a terminal interface if you are connecting to a remote machine over SSH. I always find it strange when I see people use their editor of choice on their laptop but then use another, less-familiar editor when connecting to remote machines. Learn Vim and you can use it everywhere!

Keyboard-friendly. Constantly moving my hand between the mouse and keyboard is tiring and distracting. Vim has excellent keyboard support and many things can be done without leaving the home row on the keyboard, including navigating, inserting, and deleting text. I find there is no need to use the arrow keys, mouse, or anything that is hard to reach.

If you are looking for a powerful text editor that you can use for many years then I recommend Vim. It's also worth looking at Emacs, which has a different angle but is also a good time investment. Looking back on 17 years of using Vim, I'm happy I stopped switching between language-specific IDEs and instead found a text editor capable of handling all tasks.

Friday, December 11, 2020

Building a git forge using git apps

The two previous blog posts about why git forges are von Neumann machines and the Radicle peer-to-peer git forge explored models for git forges. In this final post I want to cover yet another model that draws from the previous ones but has its own unique twist.

Peer-to-peer git apps

I previously showed how applications can be built on centralized git forges using CI/CD functionality for executing code, webhooks for interacting with the outside world, and disjoint branches for storing data.

A more elegant architecture is a peer-to-peer one where instead of many clients and one server there are just peers. Each peer has full access to the data. There is no client/server application code split, instead each peer runs an application for itself.

First, this makes it easier to move the data to new hosting infrastructure or fork a project since all data resides in the git repository. Merge requests, issues, wikis, and even the app settings are all stored in the git repo itself.

Second, this gives more power to the users who can process data however they want without being limited by the server's API. All peers are on equal footing and users don't need permission to alter applications, because they run locally.

Finally, it is easier to develop a local application than a client/server application. Being able to open a file and tweak the code is immediate and less hassle than testing and deploying a server-side application.

Internet peer-to-peer systems typically still require some central point for bootstrapping and this is no exception. A publicly-accessible git repository is still needed so that peers can fetch and push changes. However, in this model the git server does not run application code but "git apps" like merge requests, issue trackers, wikis, etc can still be implemented. Here is how it works...

The anti-application server

The git server is not allowed to run application code in our model, so apps like merge requests won't be processing data on the server side. However, the repository does need some primitives to make peer-to-peer git apps possible. These primitives are access control policies for refs and directories/files.

Peers run applications locally and the git server is "dumb" with the sole job of enforcing access control. You can imagine this like a multi-user UNIX machine where users have access to a shared directory. UNIX file permissions determine how processes can access the data. By choosing permissions carefully, multiple users can collaborate in the shared directory in a safe and controlled manner.

This is an anti-application server because no application code runs on the server side. The server is just a git repository that stores data and enforces access control on git push.

Access control

Repositories that accept push requests need a pre-receive hook (see githooks(5)) that checks incoming requests against the access control policy. If the request complies with the access control policy then the git push is accepted. Otherwise the git push is rejected and changes are not made to the git repository.

The first type of access control is on git refs. Git refs are the namespace where branches and tags are stored in a git repository. If a regular expression matches the ref and the operation type (create, fast-forward, force, delete) then it is allowed. For example, this policy rule allows any user to push to refs/heads/foo but force pushes and deletion are not allowed:

anyone create,fast-forward ^heads/foo$

The operations available on refs include:

OperationDescription
create-branchPush a new branch that doesn't exist yet
create-tagPush a new tag that doesn't exist yet
fast-forwardPush a commit that is a descendent of the current commit
forcePush a commit or tag replacing the previous ref
deleteDelete a ref

What's more interesting is that $user_id is expanded to the git push user's identifier so we can write rules to limit access to per-user ref namespaces:

anyone create-branch,fast-forward,force,delete ^heads/$user_id/.*$

This would allow Alice to push her own branches but Alice could not push to Bob's branches.

We have covered how to define access control policies on refs. Access control policies are also needed on branches so that multiple users can modify the same branch in a controlled and safe manner. The syntax is similar but the policy applies to changes made by commits to directories/files (what git calls a tree). The following allows users to create files in a directory but not delete or modify them (somewhat similar to the UNIX restricted deletion or "sticky" bit on world-writable directories):

anyone create-file ^shared-dir/.*$

The operations available on branches include:

OperationDescription
create-directoryCreate a new directory
create-fileCreate a new file
create-symlinkCreate a symlink
modifyChange an existing file or symlink
delete-fileDelete a file
...

$user_id expansion is also available for branch access control. Here the user can create, modify, and delete files in a per-user directory:

anyone create-file,modify,delete-file ^$user_id/.*$

User IDs

You might be wondering how user identifiers work. Git supports GPG-signed push requests with git push --signed. We can use the GPG key ID as the user identifier, eliminating the need for centralized user accounts. Remember that the GPG key ID is based on the public key. Key pairs are randomly generated and it is improbable that the same key will be generated by two different users. That said, GPG key ID uniqueness has been weak in the past when the default size was 32 bits. Git explicitly enables long 64-bit GPG key IDs but I wonder if collisions could be a problem. Maybe an ID with more bits based on the public key should be used instead, but for now let's assume the GPG key ID is unique.

The downside of this approach is that user IDs are not human-friendly. Git apps can allow the user to assign aliases to avoid displaying raw user IDs. Doing this automatically either requires an external ID issuer like confirming email address ownership, which is tedious for new users, or by storing a registry of usernames in the git repo, which means a first-come-first-server policy for username allocation and possible conflicts when merging from two repositories that don't share history. Due to these challenges I think it makes sense to use raw GPG key IDs at the data storage level and make them prettier at the user interface level.

The GPG key ID approach works well for desktop clients but not for web clients. The web application (even if implemently on the client side) would need access to the private key so it can push to the git repository. Users should not trust remotely hosted web applications with their private keys. Maybe there is a standard Web API that can help but I'm not aware one. More thought is needed here.

The pre-receive git hook checks that signature verification passed and has access to the GPG key ID in the GIT_PUSH_CERT_KEY environment variable. Then the access control policy can be checked.

Access control is a git app

Access control is the first and most fundamental git app. The access control policies that were described above are stored as files in the apps/access-control branch in the repository. Pushes to that branch are also subject to access control checks. Here is the branch's initial layout:

branches/ - access control policies for branches
  owner.conf
groups/ - group definitions (see below)
  ...
refs/ - access control policies for refs
  owner.conf

The default branches/owner.conf access control policy is as follows:

owner create-file,create-directory,modify,delete ^.*$

The default refs/owner.conf access control policy is as follows:

owner create-branch,create-tag,fast-foward,force,delete ^.*$

This gives the owner the ability to push refs and modify branches as they wish. The owner can grant other users access by pushing additional access control policy files or changing exsting files on the apps/access-control branch.

Each access control policy file in refs/ or branches/ is processed in turn. If no access control rule matches the operation then the entire git push is rejected.

Groups can be defined to alias one or more user identifiers. This avoids duplicating access control rules when more than one user should have the same access. There are two automatic groups: owner contains just the user who owns the git repository and anyone is the group of all users.

This completes the description of the access control app. Now let's look at how other functionality is built on top of this.

The merge requests app

A merge requests app can be built on top of this model. The refs access control policy is as follows:

# The data branch contains the titles, comments, etc
anyone modify ^apps/merge-reqs/data$

# Each merge request revision is pushed as a tag in a per-user namespace
anyone create-tag ^apps/merge-reqs/$user_id/[0-9]+-v[0-9]+$

The branch access control policy is:

# Merge requests are per-user and numbered
anyone create-directory ^merge-reqs/$user_id/[0-9]+$

# Title string
anyone create-file,modify ^merge-reqs/$user_id/[0-9]+/title$

# Labels (open, needs-review, etc) work like this:
#
#   merge-reqs/<user-id>/<merge-req-num>/labels/
#     needs-review -> /labels/needs-review
#     ...
#   labels/
#     needs-review/
#       <user-id>/
#         <merge-req-num> -> /merge-reqs/<user-id>/<merge-req-num>
#         ...
#       ...
#     ...
#
# This directory and symlink layout makes it possible to enumerate labels for a
# given merge request and to enumerate merge requests for a given label.
#
# Both the merge request author and maintainers can add/remove labels to/from a
# merge request.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/labels$
anyone create-symlink,delete ^merge-reqs/$user_id/[0-9]+/labels/.*$
maintainers create-symlink,delete ^merge-reqs/[^/]+/[0-9]+/labels/.*$
maintainers create-directory ^labels/[^/]+$
anyone create-symlink,delete ^labels/[^/]+/$user_id/[0-9]+$
maintainers create-symlink,delete ^labels/[^/]+/[^/]+/[0-9]+$

# Comments are stored as individual files in per-user directories. Each file
# contains a timestamp and the contents of the comment. The timestamp can be
# used to sort comments chronologically.
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments$
anyone create-directory ^merge-reqs/[^/]+/[0-9]+/comments/$user_id$
anyone create-file,modify ^merge-reqs/[^/]+/[0-9]+/comments/$user_id/[0-9]+$

When a user creates a merge request they provide a title, an initial comment, apply labels, and push a v1 tag for review and merging. Other users can comment by adding files into the merge request's per-user comments directory. Labels can be added and removed by changing symlinks in the labels directories.

The user can publish a new revision of the merge request by pushing a v2 tag and adding a comment describing the changes. Once the maintainers are satisfied they merge the final revision tag into the relevant branch (e.g. "main") and relabel the merge request from open/needs-review to closed/merged.

This workflow can be implemented by a tool that performs the necessary git operations so users do not need to understand the git app's internal data layout. Users just need to interact with the tool that displays merge requests, allows commenting, provides searches, etc. A natural way to implement this tool is as a git alias so it integrates alongside git's built-in commands.

One issue with this approach is that it uses the file system as a database. Performance and scalability are likely to be worst than using a database or application-specific file format. However, the reason for this approach is that it allows the access control app to enforce a policy that ensures users cannot modify or delete other user's data without running application-specific code on the server and while keeping everything stored in a git repository.

An example where this approach performs poorly is for full-text search. The application would need to search all title and comment files for a string. There is no index for efficient lookups. However, if applications find that git-grep(1) does not perform well they can maintain their own index and cache files locally.

I hope that this has shown how git apps can be built without application code running on the server.

Continuous integration bots

Now that we have the merge requests app it's time to think how a continuous integration service could interface with it. The goal is to run tests on each revision of a merge request and report failures so the author of the merge request can rectify the situation.

A CI bot watches the repository for changes. In particular, it needs to watch for tags created with the ref name apps/merge-reqs/[^/]+/[0-9]+-v[0-9]+.

When a new tag is found the CI bot checks it out and runs tests. The results of the tests are posted as a comment by creating a file in merge-regs/<user-id>/>merge-req-num>/comments/<ci-bot-user-id>/0 on the apps/merge-reqs/data branch. A ci-pass or ci-fail label can also be applied to the merge request so that the CI status can be easily queried by users and tools.

Going further

There are many loose ends. How can non-git users participate on issue trackers and wikis? It might be possible to implement a full peer as a client-side web application using isomorphic-git, a JavaScript git implementation. As mentioned above, the GPG key ID approach is not very browser-friendly because it requires revealing the private key to the web page and since keys are user identifiers using temporary keys does not work well.

The data model does not allow efficient queries. A full copy of the data is necessary in order to query it. That's acceptable for local applications because they can maintain their own indexes and are expected to keep the data for a long period of time. It works less well for short-lived web page sessions like a casual user filing a new bug on the issue tracker.

The git push --signed technique is not the only option. Git also supports signed commits and signed tags. The difference between signed pushes and signed tags/commits is significant. The signed push approach only validates the access control policy when the repository is changed and leaves no audit log for future reference. The signed commit/tag approach keeps the signatures in the git history. Signed commits/tags can be propagated in a peer-to-peer network and each peer can validate the access control policy itself. While signed commits/tags apply the access control policy to each object in the repository, signed pushes apply the access control policy to each change made to the repository. The difference is that it's easy to rebase and include work from different authors with signed pushes. Signed commits/tags require re-signing for rebasing and each commit is validated against its signature, which may be different from the user who is making the push request.

There are a lot of interesting approaches and trade-offs to explore here. This model we've discussed fits closely with how I've seen developers use git in open source projects. It is designed around a "main" repository/server that contributors push their code to. But each clone of the repository has all the data and can be published as a new "main" repository, if necessary.

Although these ideas are unfinished I decided to write them up with the knowledge that I probably won't implement them myself. QEMU is moving to GitLab with a traditional centralized git forge. I don't think this is the right time to develop this idea and try to convince the QEMU community to use it. For projects that have fewer infrastructure requirements it would give their contributors more power than being confined to a centralized git forge.

I hope this was an interesting read for anyone thinking about git forges and building git apps.