Wednesday, February 24, 2021

Milestone Systems: Software that changes how things are done

Every few years a project comes out with a new approach that becomes influential. Often it involves combining existing concepts in a novel way. People argue about whether the project is actually novel or whether it was just in the right place at the right time and popularized existing technology. Regardless, I find these projects fascinating and try to learn about them because they are milestones that future systems are based on.

Here is a short list of projects that I think fall into this category. I hope you enjoy them (if you haven't already explored them). Send me your picks!

Tor

Tor is an onion router. It enables (mostly) anonymous communication by tunneling encrypted connections. The client does not know the IP address of the server (when connecting to so-called hidden services), the server does not know the IP address of the client, and the intermediate hops only know about their immediate predecessor and successor.

The design of Tor is described in a paper.

BitTorrent

BitTorrent is a decentralized peer-to-peer file sharing protocol that can be used to reduce load on file hosting servers and improve download times. It's commonly used to share copyrighted material, but is also used by Linux distributions to publish ISO images and by software update systems.

A central aspect of BitTorrent is that peers exchange pieces of the file amongst themselves, which is made safe by a Merkle tree. Pieces received from untrusted peers are checked against the file's Merkle tree to ensure that data has not been corrupted or manipulated.
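To make the verification step concrete, here is a minimal sketch of checking a single piece against a Merkle root. It is a generic construction and not BitTorrent's actual tree layout or wire format: the choice of SHA-256, the duplicate-last-node padding, and the names merkle_root, merkle_proof, and verify_piece are all assumptions made for illustration.

```python
import hashlib
from typing import List, Tuple


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _pad(level: List[bytes]) -> List[bytes]:
    # Assumption: odd-sized levels are padded by repeating the last node.
    return level + [level[-1]] if len(level) % 2 else level


def _parents(level: List[bytes]) -> List[bytes]:
    return [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: List[bytes]) -> bytes:
    """Hash pairs of nodes level by level until a single root remains."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        level = _parents(_pad(level))
    return level[0]


def merkle_proof(leaves: List[bytes], index: int) -> List[Tuple[bytes, bool]]:
    """Collect the sibling hashes needed to recompute the root from one leaf.

    Each entry is (sibling_hash, sibling_is_left).
    """
    proof = []
    level = list(leaves)
    while len(level) > 1:
        level = _pad(level)
        sibling = index ^ 1  # the other node in the same pair
        proof.append((level[sibling], sibling < index))
        level = _parents(level)
        index //= 2
    return proof


def verify_piece(piece: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from the piece up to the root and compare."""
    node = sha256(piece)
    for sibling, sibling_is_left in proof:
        node = sha256(sibling + node) if sibling_is_left else sha256(node + sibling)
    return node == root
```

A downloader that trusts only the root hash can ask any peer for a piece plus its proof and call verify_piece(); a tampered piece produces a different path and is rejected without needing the rest of the file.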

A paper about the economics of BitTorrent described some of the ideas behind it. The actual protocol is described by the protocol specification.

git

Git is the most popular version control system as of 2020. It replaced the older CVS and Subversion systems that were widely used before it. Other systems like Mercurial, Darcs, Perforce, and BitKeeper had similar use cases and ideas.

Git is a content-addressable object store with a convention for representing trees of files as well as commits and tags. I wrote about how the object store is implemented here if you want to learn about pack files and deltas.
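The idea of content addressing can be sketched in a few lines. Git names a blob by the SHA-1 of a short header ("blob", the content length, and a NUL byte) followed by the content. The in-memory ObjectStore class below is a hypothetical toy for illustration; it ignores trees, commits, tags, pack files, and deltas.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        # Storing identical content twice yields the same key, so
        # duplicates cost nothing extra.
        self._objects[oid] = zlib.compress(content)
        return oid

    def get(self, oid: str) -> bytes:
        return zlib.decompress(self._objects[oid])
```

Because the key is a hash of the content, two files with identical contents share one object, and any corruption is detectable by rehashing.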

Bitcoin

Bitcoin is a decentralized currency, also known as a cryptocurrency. A network of mutually untrusted nodes maintains a ledger called the blockchain that records transactions. Bitcoin is famous for mining, where nodes compete to solve a computationally expensive problem in order to extend the ledger.
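The mining puzzle can be illustrated with a toy proof-of-work loop. This is a simplified sketch, not Bitcoin's real block format: the string payload, the "leading zero hex digits" difficulty rule, and the names block_hash and mine are assumptions made for illustration.

```python
import hashlib
from typing import Tuple


def block_hash(prev_hash: str, transactions: str, nonce: int) -> str:
    """Hash a toy block made of the previous hash, some data, and a nonce."""
    payload = f"{prev_hash}|{transactions}|{nonce}".encode()
    return hashlib.sha256(payload).hexdigest()


def mine(prev_hash: str, transactions: str, difficulty: int) -> Tuple[int, str]:
    """Try nonces until the block hash starts with `difficulty` zero digits."""
    target_prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(prev_hash, transactions, nonce)
        if digest.startswith(target_prefix):
            return nonce, digest
        nonce += 1


def is_valid(prev_hash: str, transactions: str, nonce: int, difficulty: int) -> bool:
    """Checking a solution takes one hash, while finding it takes many."""
    return block_hash(prev_hash, transactions, nonce).startswith("0" * difficulty)
```

Each block includes the hash of its predecessor, so rewriting an old transaction would mean redoing the work for that block and every block after it.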

What is interesting about Bitcoin is that the blockchain prevents abuse as long as a majority of the nodes are neither controlled by a single party nor colluding. In other words, it is a decentralized consensus, although there can be short-lived splits where not all nodes agree on the current state.

The Bitcoin paper gives an overview of how the system works.

Conclusion

I hope this was a fun post that motivated you to look at a system you haven't studied yet or made you think about systems that you consider milestone systems. Please get in touch if you want to share yours!

Tuesday, February 16, 2021

Video and slides available for "The Evolution of File Descriptor Monitoring in Linux"

My FOSDEM 2021 talk "The Evolution of File Descriptor Monitoring in Linux: From select(2) to io_uring" is now available:

The talk compares the file descriptor monitoring system calls available in Linux and discusses their design. Benchmark results show how well they scale when there are many file descriptors. I hope this is a useful overview of this important kernel feature that GUI applications, network services, and many other programs rely on.

If you are interested in API design and performance, this talk highlights how different approaches like stateless vs stateful APIs can affect performance and how to minimize the number of API calls through careful design.
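The stateless vs stateful distinction can be sketched with Python's standard library wrappers around these system calls. This is not code from the talk; the names wait_stateless and StatefulMonitor are hypothetical. With select(2) the caller passes the complete set of file descriptors on every call, whereas an epoll-style API registers each descriptor once and later waits without resending the set.

```python
import select
import selectors
import socket
from typing import List


def wait_stateless(socks: List[socket.socket], timeout: float = 1.0) -> List[socket.socket]:
    """Stateless: the whole interest set is handed over on every call."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable


class StatefulMonitor:
    """Stateful: descriptors are registered once and remembered between waits."""

    def __init__(self) -> None:
        # DefaultSelector picks the best mechanism available (epoll on Linux).
        self._selector = selectors.DefaultSelector()

    def add(self, sock: socket.socket) -> None:
        self._selector.register(sock, selectors.EVENT_READ)

    def wait(self, timeout: float = 1.0) -> List[socket.socket]:
        return [key.fileobj for key, _mask in self._selector.select(timeout)]

    def close(self) -> None:
        self._selector.close()
```

With thousands of mostly idle connections, the stateless call has to copy and scan the full set each time, while the stateful wait only reports the descriptors that are ready.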

Enjoy!

Tuesday, February 2, 2021

Keeping a clean git commit history

Does the commit history of your source repository look like this:

f02af91822 docs: fix incorrect subheadings
dd7bee8b38 cli: add --import option
900ca2936a cli: extract move_topic() helper function

Or like this:

7011cc9868 lunch time
a07c82331d resolve code review comments
331d79a8ff more fixes

?

The first is a clean git commit history where each commit has a clear purpose and is a single logical change. The second is a messy commit history where the commits have no inherent structure:

  • "more fixes" does not describe clearly what is being fixed and the plural ("fixes") hints it may contain multiple logical changes instead of just one.
  • "resolve code review comments" contains changes requested by code reviewers in relation to another commit, so it's not a self-contained logical change.
  • "lunch time" is an unfinished commit that was created because the programmer wanted to save their work.

Commit anti-patterns

The example above illustrates several anti-patterns:

Vague commit messages

A vague commit message that does not express a clear purpose makes it hard to know what the commit does. If git-log(1) doesn't provide useful information about commits, then one has to resort to searching the code diffs. That is very tedious, and sometimes it's almost impossible to come up with a good code search query, whereas a clear commit message would have been easy to search for. So clear commit messages are the first step towards clean commit history.

Doing too many things in one commit

Commits that make several logical code changes are hard to review and impede backporting fixes to stable branches. For example, a commit that fixes a bug as well as adding a new feature may need to be rewritten for a stable branch. If instead the code had been split into two commits, then the bug fix commit could have been backported easily. Therefore it is good practice to separate distinct bug fixes, features, and other logical code changes into separate commits.

Addressing code review comments

The code review and testing history is usually not useful information once a commit has been merged. For example, if there was a continuous integration (CI) test failure and a pull request needed to be changed, then the change should be made directly to the buggy commit so that the final commit passes the tests. No one needs to know about the code review or testing history once the code is merged and keeping these artifacts makes the commit history unwieldy by spreading a logical code change across multiple incomplete commits.

Saving work

There are valid reasons to temporarily save your work in a commit, but work-in-progress (WIP) commits should be cleaned up before merging them. For example, sometimes people make arbitrary commits to save work at the end of the day. That is fine in a local branch, but those temporary commits can be restructured into clean commits using git-rebase(1). No one else needs to know about temporary commits.

Broken commits

It can be easy to accidentally include a commit that does not build or fails tests if a later commit happens to resolve the issue. Since the later commit hides the issue it may not be apparent when testing the branch. When reordering commits the risk of introducing broken commits increases because those commits were originally written in a different order. I use the git-rebase(1) exec action to build and run tests after every commit to detect broken commits when doing extensive rebases.

Why clean commit history is important

Not all reasons for maintaining a clean commit history are obvious. Each of the anti-patterns above makes commit history less useful, so if you value any of the following reasons for keeping a clean commit history, then all of the anti-patterns need to be avoided.

Code review

Reviewers have an easier time reading clean commits than an unstructured series of commits. For example, if there is a broken commit because a function is used before it is defined in a later commit, then that affects code reviewers who read the commits linearly. They will be puzzled by the non-existent function and unable to decide whether it is being used correctly because it has not been defined yet. Although code reviewers could put in extra effort to reread the commits multiple times and try to remember the misordered changes, it's better to let code reviewers spend time on real issues rather than on untangling poorly structured commits.

Capturing the rationale for code changes

When each commit is a single logical change it becomes possible to write good commit descriptions that give the rationale for the code change. Explanations for why a code change is necessary, as well as links to issue trackers, email discussions, etc., can be valuable when revisiting the commit history later. If commits contain multiple logical code changes or are incomplete, then it is hard to include a good commit description, so the commit history is less useful when referring back to it later on.

Making cherry-picking easy

Many software projects maintain stable branches that still receive bug fixes for some time. This allows development to introduce new features and less mature code while users can run a mature stable release. However, maintaining stable branches can be time-consuming. Maintainers need to identify commits suitable for stable branches and cherry-pick or backport them. This requires clean commit history so that bug fixes can be applied in isolation without dragging in other code changes that do not fit the criteria for stable branches.

Enabling git-bisect(1)

When a bug is observed it may not be clear which commit introduced it. The git-bisect(1) command systematically searches the commit history and identifies the commit that caused the bug. However, git-bisect(1) only works with clean commit history. If there are broken commits then bisection becomes unreliable because some portions of commit history cannot be tested. Poorly structured commits, such as huge changes that do many different things, also make it difficult to identify which line caused the bug even when git-bisect(1) has determined which commit is to blame.
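The search that git-bisect(1) performs is essentially a binary search over the commit history. Here is a minimal sketch over a linear history, assuming the oldest commit is known to be good and the newest is known to be bad. The function name find_first_bad and the is_bad callback are hypothetical, and real git-bisect(1) also handles non-linear history and skipped commits.

```python
from typing import Callable, List, Tuple


def find_first_bad(commits: List[str], is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Binary search for the first bad commit in an oldest-to-newest list.

    Assumes commits[0] is good and commits[-1] is bad.
    Returns the first bad commit and the number of commits tested.
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(commits[mid]):
            bad = mid   # the bug was introduced at mid or earlier
        else:
            good = mid  # the bug was introduced after mid
    return commits[bad], tested
```

The is_bad callback has to give a trustworthy answer for every commit it is asked about. A commit that does not build cannot answer the question, which is why broken commits make bisection unreliable.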

When clean commit history does not matter

In my experience, the reason not everyone practices clean commit history is that they may not need any of this. Small projects developed by a single author, especially, may involve little code review, backporting of changes to stable branches, or use of git-bisect(1). In that case the effort required to split code changes into clean commits and write good commit messages may seem unjustified. Of course this can change, but once the commit history is messy there is not much that can be done. So it's worth thinking carefully about whether to take shortcuts.

Another factor is poor tooling. Gerrit's and GitHub's code review tools have historically made it hard to practice clean commit history. They were not designed for reviewing commit series and favored anti-patterns like squashing everything into a single commit or adding additional commits to address code review feedback. These are tool limitations, and luckily GitHub code review has become better over the years. Tools that encourage you to review a commit series as a single diff are not conducive to clean commit history.

Finally, clean commit history requires proficiency with git-rebase(1) and being comfortable with the idea of rewriting your local branch to clean it up before publishing it. It takes a little practice to become competent at reordering, squashing, and splitting commits. The process can be a little scary, although git-reflog(1) makes it possible to undo even the most serious errors where commits were accidentally lost. On a related note, some people falsely believe that a pull or merge request branch should not be rebased. Although it is good practice to avoid rewriting the history of branches that other people track, rewriting history and force-pushing a pull request is different. Most of the time no one else will maintain a local branch based on it, and therefore force-pushing will not inconvenience anyone. Even if it is necessary to develop branches based on someone else's not-yet-merged branches, one needs to weigh the cost of doing more work in the short term against the drawbacks of having a messy commit history forever.

Conclusion

I hope this is a useful summary of why each commit should have a clear purpose and embody a single logical change. For source repositories that are used by more than one person it is especially important to think about commit best practices. Clean commit history facilitates better code review, bug-finding, and maintaining stable branches. Beyond that it also provides a useful form of communication and sharing knowledge about the codebase that is missing when commit history is disregarded.