Friday, May 13, 2016

Git: Internals of how objects are stored

Git has become the ubiquitous version control tool. A big community has grown around Git over the years. From its origins with the Linux kernel community to GitHub, the popular project hosting service, its core design has scaled. Well it needed to since from day one the Linux source tree and commit history was large by most standards.

I consider Git one of the innovations (like BitTorrent and Bitcoin) that come along every few years to change the way we do things. These tools didn't necessarily invent brand new algorithms but they applied them in a way that was more effective than previous approaches. They have designs worth studying and have something we can learn from.

This post explains the internals of how git show 365e45161384f6dc704ec798828dc927d63e3b22 fetches the commit object for this SHA1. This SHA1 lookup is the main query operation for all object types in Git including commits, tags, and trees.

Repository setup

Let's start with a simple repo:

$ mkdir test && cd test
$ git init .
Initialized empty Git repository in /tmp/test/.git/
$ echo 'Hello world' >test.txt
$ git add test.txt && git commit -s

Here is the initial commit:

$ git show --format=raw
commit 365e45161384f6dc704ec798828dc927d63e3b22
tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700
committer Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700

    Initial checkin
    
    Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>

diff --git a/test.txt b/test.txt
new file mode 100644
index 0000000..802992c
--- /dev/null
+++ b/test.txt
@@ -0,0 +1 @@
+Hello world

Loose objects

The function to look up an object from a binary SHA1 is struct object *parse_object(const unsigned char *sha1) in Git v2.8.2-396-g5fe494c. You can browse the source here. In Git the versioned data and metadata is stored as "objects" identified by their SHA1 hash. There are several types of objects so this function can look up commits, tags, trees, etc.

Git stores objects in several ways. The simplest is called a "loose object", which means that object 365e45161384f6dc704ec798828dc927d63e3b22 is stored zlib DEFLATE compressed in its own file. We can manually view the object like this:

$ openssl zlib -d <.git/objects/36/5e45161384f6dc704ec798828dc927d63e3b22
commit 241tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700
committer Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700

Initial checkin

Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>

There! So this little command-line produces similar output to $ git show --format=raw 365e45161384f6dc704ec798828dc927d63e3b22 without using Git.

Note that openssl(1) is used in this example as a convenient command-line tool for decompressing zlib DEFLATE data. This has nothing to do with OpenSSL cryptography.

Git spreads objects across subdirectories in .git/objects/ by the SHA1's first byte to avoid having a single directory with a huge number of files. This helps file systems and userspace tools scale better when there are many commits.

Despite this directory layout it can still be inefficient to access loose objects. Each object requires open(2)/close(2) syscalls and possibly an access(2) call to check for existence first. This puts a lot of faith in cheap directory lookups in the file system and is not optimal when the number of objects grows.

Pack files

Git's solution to large numbers of objects is found in .git/objects/pack/. Each "pack" consists of an index file (.idx) and the pack itself (.pack). Here is how that looks in our test repository:

$ git repack
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ ls -l .git/object/pack/
total 8
-r--r--r--. 1 stefanha stefanha 1156 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.idx
-r--r--r--. 1 stefanha stefanha  246 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.pack

In order to look up an object from a SHA1, the index file must be used:

The index file provides an efficient way to look up the byte offset into the pack file where the object is located. The index file's 256-entry lookup table takes the first byte of the SHA1 and produces the highest index for objects starting with that SHA1 byte. This lookup table makes it possible to search only objects starting with the same SHA1 byte while excluding all other objects. That's similar to the subdirectory layout used for loose objects!

The SHA1s of all objects inside the pack are stored in a sorted table. From the 1st SHA1 byte lookup we now know the lowest and highest index into the sorted SHA1 table where we can binary search for the SHA1 we are looking for. If the SHA1 cannot be found then the pack file does not contain that object.

Once the SHA1 has been found in the sorted SHA1 table, its index is used with the offsets table. The offset is the location in the pack file where the object is located. There is a minor complication here: offset table entries are only 32-bits so an extension is necessary for 64-bit offsets.

The 64-bit offset extension uses the top-most bit in the 32-bit offset table entry to indicate that a 64-bit offset is needed. This means the 32-bit offset table actually only works for offsets that fit into 31 bits! When the top-most bit is set the remaining bits form an index into the 64-bit offset table where the actual offset is located.

Each object in the pack file has a header describing the object type (commit, tree, tag, etc) and its size. After the header is the zlib DEFLATE compressed data.

Deltas

Although the pack file can contain simple zlib DEFLATE compressed data for objects, it can also contain a chain of delta objects. The chain is ultimately terminated by a non-delta "base object". Deltas avoid storing some of the same bytes again each time a modified object is added. A delta object describes only the changed data between its parent and itself. This can be much smaller than storing the full object.

Think about the case where a single line is added to a long text file. It would be nice to avoid storing all the unmodified lines again when the modified text file is added. This is exactly what deltas do!

In order to unpack a delta object it is necessary to start with the base object and apply the chain of delta objects, one after the other. I won't go into the details of the binary format of delta objects, but it is a memcpy(3) engine rather than a traditional text diff produced by diff(1). Each memcpy(3) operation needs offset and length parameters, as well as a source buffer - either the parent object or the delta data. This memcpy(3) engine approach is enough to delete bytes, add bytes, and copy bytes.

Conclusion

Git objects can be stored either as zlib DEFLATE compressed "loose object" files or inside a "pack file". The pack file reduces the need to access many files and also features deltas to reduce storage requirements.

The pack index file has a fairly straightforward layout for taking a SHA1 and finding the offset into the pack file. The layout is cache-friendly because it co-locates fields into separate tables instead of storing full records in a single table. This way the CPU cache only reads data that is being searched, not the other record fields that are irrelevant to the search.

The pack file format works as an immutable archive. It is optimized for lookups and does not support arbitrary modification well. An advantage of this approach is that the design is relatively simple and multiple git processes can perform lookups without fear of races.

I hope this gives a useful overview of git data storage internals. I'm not a git developer so I hope this description matches reality. Please point out my mistakes in the comments.