Monday, November 30, 2020

Git Forge Apps: Why git forges are serverless computing providers

In this post I want to explore an idea for a new type of application made possible by the power of git forges like GitLab and GitHub. I don't have a proof-of-concept and in fact we'll discuss hurdles that could make the idea infeasible. But it's an interesting way of thinking that is worth sharing as it offers a new way of looking at and using git forges.

Git forges offer git repository hosting at their core and then add on related features like pull requests, wikis, issue trackers, static website hosting, continuous integration, and more. They are now a long way away from the initial idea of a hosted git repository where you can publish code. They do so much more. They have a von Neumann architecture and are Turing complete. It's possible to do interesting things with that.

A git forge app (GFA) is an application that runs on a git forge. A hosted git repository is not just a place to publish and back up the source code. It's where the app runs. You can view the git forge as a Platform-as-a-Service (PaaS) or serverless computing provider. The app doesn't need to be deployed elsewhere, because the git forge itself is the execution environment.

A GFA must be able to:

  1. Process data.
  2. Store data.
  3. Interact with the outside world.

This is basically what computer applications do. Git forges have become powerful enough to do this.

Imagine the following: you fork a repository and this instantiates the GFA. The GFA is now running under your git forge account. The web interface is available at https://<user>.gitlab.com/<app>/ and HTTP POST requests can also be used to interact with the application. The GFA could be a todo list, an RSS reader, a blog, etc. To the website visitor it appears like any other web application but everything happens within the git forge and no other hosting provider is necessary.

Let's look at how it's possible to use a git forge as the execution environment for an app.

Processing Data

The first order of business is executing code so that the application can process data. A configuration file is placed into a git repository to define runnable jobs. The job execution systems offered by git forges are GitLab CI and GitHub Actions, respectively. When the job is triggered it executes in a temporary environment, for example a Linux container image. It allows code hosted in the git repository to run on a server somewhere for a little while. There is a free allowance that grants a certain number of hours of unpaid execution time. This is how GFAs process data.

For example, say we are building an RSS reader. Our repo looks like this:

rss-reader/
    .gitlab-ci.yml - the job configuration file
    fetch-new-feeds.sh - download RSS feeds and check for new posts

The .gitlab-ci.yml file will contain a scheduled pipeline that runs fetch-new-feeds.sh every 15 minutes.

Storing Data

Applications need to store data persistently. The primary way of doing that in a git forge is by storing data in the git repository itself. Mutable data can be kept on a separate branch called data and can be force-pushed to avoid forever increasing the repository size when we don't need to store previous revisions of the data.

This works because each git branch has an independent commit history. The file namespace it stores is completely separate from other branches. This means it's possible to have the GFA's code on the main branch and to store data on the data branch without upsetting the commit history or files on the main branch.

Data storage is available to jobs because they are allowed to manipulate their own git repository thanks to an authentication token. On GitHub GITHUB_TOKEN allows jobs to push to their git repository. This gives jobs a way of storing data.

There are other forms of data storage besides the git repository. Artifacts are files produced by a job that can be downloaded via a URL. Artifacts are subject to expiry time and file size limitations. Caching is available to temporarily store files between job runs, although this data can be lost at any time.

At this point you may be thinking that this is nice but there is no way to store private data if the git repository is public. Git forges offer a secrets mechanism where variables can be stored privately and only made available during job execution. This can be used to stash an encryption key so that the data is stored in public but is encrypted. Anyone can download the encrypted data but they will not have the key needed to decrypt it. Some applications can also take advantage of client-side encryption and homomorphic encryption.

Interacting with the Outside World

Git forges offer a wealth of ways to interact with the outside world, called triggers. Triggers can run a job when an HTTP POST request is made. POST requests typically need a trigger token for authentication, but that token can be published if the triggered job is safe to be invoked by anyone. Both browser clients and Webhooks can invoke triggers through HTTP POST requests.

For example, imagine our RSS reader app needs an API to mark feeds as read. We define an HTTP POST trigger that runs a mark-feed-read.sh script that updates the feeds stored in the data branch. Since only the git repository owner should be able to invoke this trigger we do not publish the trigger token. Instead it is stored encrypted in the repository and the user provides a passphrase for decrypting this secret.

At this point it is also useful to offer a web interface. This is possible thanks to the static pages hosting feature called GitLab Pages and GitHub Pages. Pushing HTML, JavaScript, CSS, and images to a special branch publishes the static website at https://<user>.gitlab.com/<app>/.

For the most part GFAs are asynchronous, they cannot handle HTTP requests synchronously like a web application that outputs an HTTP response. Incoming HTTP requests simply schedule GFA jobs that run sometime later. There are a few ways around this. The browser can poll for results using XMLHttpRequest or similar techniques. Alternatively, a GFA can fire up a job that runs for several minutes although I haven't investigated if there is a practical way to communicate with a web application running in a job (I guess incoming HTTPS is tricky to achieve).

Triggers offer a lot more fun than what I've already covered. They can respond to the creation and modification of wiki pages, issues, and pull request comments. This means it's possible to use those entities for interacting with the outside world. The GFA could act as a chat bot on its issue tracker, for example.

Limitations

Git forges weren't designed for GFAs. But then they weren't initially designed to be Turing complete and offering ways to interact with the outside world either. That was added incrementally as demand for that functionality grew. Maybe git forges will evolve into full-blown serverless hosting competitors to today's cloud hosting providers.

GFAs are a hack that uses features like static pages, CI jobs, and webhooks in a creative way to run applications on a git forge. Building GFAs that are actually useful is likely to hit some challenges:

  • No synchronous requests - it is hard to implement search queries and other dynamic behavior in GFAs because they are primarily asynchronous (and slow!). This limitation matters for certain classes of applications. A lot can be done client-side instead to make up for this deficiency. But maybe someone can figure out how to do synchronous requests in GFAs.
  • Security - I have outlined how to make data publicly available and also how to encrypt it so only the git repository owner can view the plaintext. This is enough for personal web applications, but it's not sufficient for multi-user applications. Maybe git forges offer a form of authentication that works with GFAs so multiple users can store private data in a single GFA instance (the git repository owner could view all users' data but other users could not).
  • Free usage tiers - job execution time, storage capacity, and request throttling will limit how resource-intensive a GFA can be before it outgrows the free tier and eventually even the paid tier.

The first generation of GFAs could be written from scratch with each job carefully designed. Then a second generation of GFAs could be built on top of frameworks that abstract the tedious git forge integration with standard APIs and data models. For example, a Vue.js frontend could use a key/value store API and the whole thing can be hosted as a GFA.

Even if GFAs don't become a thing because git forges decide there is not enough demand to make them work really well, changing your perspective to think of git forges in this way is valuable. For example, I have a git repo for building a container image that I deploy on my server and that pushes output files to a web server host. All of that can be replaced with a git repository that runs a job and publishes to GitLab Pages. This simplifies things and frees me from maintaining infrastructure.

Conclusion

Git forges offer jobs for processing data, git repositories and artifacts for storing data, and triggers for interacting with the outside world. It is possible to build applications that exist solely as a git repository on a git forge. There is no longer a need to deploy code because the git forge itself is powerful enough to run non-trivial applications. I look forward to how this evolves and whether git forges eventually become full-blown cloud service providers.

Friday, November 27, 2020

Call for QEMU Advent Calendar 2020 disk images

QEMU Advent Calendar publishes a disk image surprise each day from December 1-24. It's a QEMU community tradition that is back again this year!

If you want to contribute disk images to this year's advent calendar (puzzles, games, niche operating systems, retro computing fun, etc), please check out the call for disk images for details.