Stefan Hajnoczi: 2016

Saturday, December 17, 2016

13 years of using Linux

I've been using Linux on both work and personal machines for 13 years. Over time I've tried various distributions, changed the nature of my work, and revisited other operating systems to arrive back to the same conclusion every time: Linux works best for me.

The reason I started using Linux remains the reason why it's my operating system of choice today:

It's free and easy to install most software under an open source license that allows both commercial and non-commercial use.

That means software to do common tasks is available for free without limitations. The cost of entry for exploring and learning new things is zero.

The amount of packaged software available in major Linux distributions is incredible. Niche open source operating systems don't have this wide selection of high-quality software. Proprietary operating systems have high-quality software but there is constant irritation in dealing with the artificial limitations of closed source software. The strength of Linux is this sweet spot between high-quality mainstream software and the advantages of open source software.

The pain points of Linux have changed over the years. In the beginning hardware support was limited. This has largely been solved for laptops, desktop, and server hardware as vendors began to contribute drivers and publish hardware datasheets free of NDAs. Class-compliant USB devices also cut down on the number of vendor-specific drivers. Nowadays the reputation for limited hardware support is largely unjustified.

Another issue that has subsided is the Windows-only software that kept many people tied to that platform. Two trends killed Windows-only software: the move to the web and the rise of the Mac. A lot of applications migrated to pure web applications without the need for ActiveX or Java applets with platform-specific code - and Adobe Flash is close to its end too. Ever since Macs rose to popularity again it was no longer acceptable to ship Windows-only software. As a result so many things are now on the web or cross-platform applications with Linux support.

Migrating to Linux is still a big change just like switching from Windows to Mac is a big change. It will always be hard to overcome this, even with virtualization, because the virtual machine experience isn't seamless. Ultimately users need to pick native applications and import their existing data. And it's worth it because you get access to applications that can do almost everything without the hassles of proprietary platforms. That's the lasting advantage that Linux has over the competition.

Saturday, September 17, 2016

Making I/O deterministic with blkdebug breakpoints

Recently I was looking for a deterministic way to reproduce a QEMU bug that occurred when a guest reboots while an IDE/ATA TRIM (discard) request is in flight. You can find the bug report here.

A related problem is how to trigger the code path where request A is in flight when request B is issued. Being able to do this is useful for writing test suites that check I/O requests interact correctly with each other.

Both of these scenarios require the ability to put an I/O request into a specific state and keep it there. This makes the request deterministic so there is no chance of it completing too early.

QEMU has a disk I/O error injection framework called blkdebug. Block drivers like qcow2 tell blkdebug about request states, making it possible to fail or suspend requests at certain points. For example, it's possible to suspend an I/O request when qcow2 decides to free a cluster.

The blkdebug documentation mentions the break command that can suspend a request when it reaches a specific state.

Here is how we can suspend a request when qcow2 decides to free a cluster:

$ qemu-system-x86_64 -drive if=ide,id=ide-drive,file=blkdebug::test.qcow2,format=qcow2
(qemu) qemu-io ide-drive "break cluster_free A"

QEMU prints a message when the request is suspended. It can be resumed with:

(qemu) qemu-io ide-drive "resume A"

The tag name 'A' is arbitrary and you can pick your own name or even suspend multiple requests at the same time.

Automated tests need to wait until a request has suspended. This can be done with the wait_break command:

(qemu) qemu-io ide-drive "wait_break A"

For another example of blkdebug breakpoints, see the qemu-iotests 046 test case.

Friday, May 13, 2016

Git: Internals of how objects are stored

Git has become the ubiquitous version control tool. A big community has grown around Git over the years. From its origins with the Linux kernel community to GitHub, the popular project hosting service, its core design has scaled. Well it needed to since from day one the Linux source tree and commit history was large by most standards.

I consider Git one of the innovations (like BitTorrent and Bitcoin) that come along every few years to change the way we do things. These tools didn't necessarily invent brand new algorithms but they applied them in a way that was more effective than previous approaches. They have designs worth studying and have something we can learn from.

This post explains the internals of how git show 365e45161384f6dc704ec798828dc927d63e3b22 fetches the commit object for this SHA1. This SHA1 lookup is the main query operation for all object types in Git including commits, tags, and trees.

Repository setup

Let's start with a simple repo:

$ mkdir test && cd test
$ git init .
Initialized empty Git repository in /tmp/test/.git/
$ echo 'Hello world' >test.txt
$ git add test.txt && git commit -s

Here is the initial commit:

$ git show --format=raw
commit 365e45161384f6dc704ec798828dc927d63e3b22
tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700
committer Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700

    Initial checkin
    
    Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>

diff --git a/test.txt b/test.txt
new file mode 100644
index 0000000..802992c
--- /dev/null
+++ b/test.txt
@@ -0,0 +1 @@
+Hello world

Loose objects

The function to look up an object from a binary SHA1 is struct object *parse_object(const unsigned char *sha1) in Git v2.8.2-396-g5fe494c. You can browse the source here. In Git the versioned data and metadata is stored as "objects" identified by their SHA1 hash. There are several types of objects so this function can look up commits, tags, trees, etc.

Git stores objects in several ways. The simplest is called a "loose object", which means that object 365e45161384f6dc704ec798828dc927d63e3b22 is stored zlib DEFLATE compressed in its own file. We can manually view the object like this:

$ openssl zlib -d <.git/objects/36/5e45161384f6dc704ec798828dc927d63e3b22
commit 241tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700
committer Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700

Initial checkin

Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>

There! So this little command-line produces similar output to $ git show --format=raw 365e45161384f6dc704ec798828dc927d63e3b22 without using Git.

Note that openssl(1) is used in this example as a convenient command-line tool for decompressing zlib DEFLATE data. This has nothing to do with OpenSSL cryptography.

Git spreads objects across subdirectories in .git/objects/ by the SHA1's first byte to avoid having a single directory with a huge number of files. This helps file systems and userspace tools scale better when there are many commits.

Despite this directory layout it can still be inefficient to access loose objects. Each object requires open(2)/close(2) syscalls and possibly an access(2) call to check for existence first. This puts a lot of faith in cheap directory lookups in the file system and is not optimal when the number of objects grows.

Pack files

Git's solution to large numbers of objects is found in .git/objects/pack/. Each "pack" consists of an index file (.idx) and the pack itself (.pack). Here is how that looks in our test repository:

$ git repack
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ ls -l .git/object/pack/
total 8
-r--r--r--. 1 stefanha stefanha 1156 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.idx
-r--r--r--. 1 stefanha stefanha  246 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.pack

In order to look up an object from a SHA1, the index file must be used:

The index file provides an efficient way to look up the byte offset into the pack file where the object is located. The index file's 256-entry lookup table takes the first byte of the SHA1 and produces the highest index for objects starting with that SHA1 byte. This lookup table makes it possible to search only objects starting with the same SHA1 byte while excluding all other objects. That's similar to the subdirectory layout used for loose objects!

The SHA1s of all objects inside the pack are stored in a sorted table. From the 1st SHA1 byte lookup we now know the lowest and highest index into the sorted SHA1 table where we can binary search for the SHA1 we are looking for. If the SHA1 cannot be found then the pack file does not contain that object.

Once the SHA1 has been found in the sorted SHA1 table, its index is used with the offsets table. The offset is the location in the pack file where the object is located. There is a minor complication here: offset table entries are only 32-bits so an extension is necessary for 64-bit offsets.

The 64-bit offset extension uses the top-most bit in the 32-bit offset table entry to indicate that a 64-bit offset is needed. This means the 32-bit offset table actually only works for offsets that fit into 31 bits! When the top-most bit is set the remaining bits form an index into the 64-bit offset table where the actual offset is located.

Each object in the pack file has a header describing the object type (commit, tree, tag, etc) and its size. After the header is the zlib DEFLATE compressed data.

Deltas

Although the pack file can contain simple zlib DEFLATE compressed data for objects, it can also contain a chain of delta objects. The chain is ultimately terminated by a non-delta "base object". Deltas avoid storing some of the same bytes again each time a modified object is added. A delta object describes only the changed data between its parent and itself. This can be much smaller than storing the full object.

Think about the case where a single line is added to a long text file. It would be nice to avoid storing all the unmodified lines again when the modified text file is added. This is exactly what deltas do!

In order to unpack a delta object it is necessary to start with the base object and apply the chain of delta objects, one after the other. I won't go into the details of the binary format of delta objects, but it is a memcpy(3) engine rather than a traditional text diff produced by diff(1). Each memcpy(3) operation needs offset and length parameters, as well as a source buffer - either the parent object or the delta data. This memcpy(3) engine approach is enough to delete bytes, add bytes, and copy bytes.

Conclusion

Git objects can be stored either as zlib DEFLATE compressed "loose object" files or inside a "pack file". The pack file reduces the need to access many files and also features deltas to reduce storage requirements.

The pack index file has a fairly straightforward layout for taking a SHA1 and finding the offset into the pack file. The layout is cache-friendly because it co-locates fields into separate tables instead of storing full records in a single table. This way the CPU cache only reads data that is being searched, not the other record fields that are irrelevant to the search.

The pack file format works as an immutable archive. It is optimized for lookups and does not support arbitrary modification well. An advantage of this approach is that the design is relatively simple and multiple git processes can perform lookups without fear of races.

I hope this gives a useful overview of git data storage internals. I'm not a git developer so I hope this description matches reality. Please point out my mistakes in the comments.

Friday, March 4, 2016

Slides available for "NFS over virtio-vsock: Host/guest file sharing for virtual machines"

I have been working on the virtio-vsock host/guest communications mechanism and the first application is file sharing with NFS. NFS is a mature and stable network file system which can be reused to provide host/guest file sharing in KVM.

If you are interested in learning more, check out the slides from my Connectathon 2016 presentation:

NFS over virtio-vsock: Host/guest file sharing for virtual machines

Wednesday, March 2, 2016

QEMU accepted into Google Summer of Code & Outreachy 2016!

I'm delighted to announce that QEMU has been accepted into Google Summer of Code 2016 and Outreachy May-August 2016.

Both GSoC and Outreachy are 12-week full-time remote work internships. Interns work on a QEMU-related project with the support of a mentor from the QEMU community. Interns are paid for their work by the generous funding from Google (GSoC), Red Hat (Outreachy), and/or IBM (Outreachy).

Find out more about the project ideas that have been proposed:
http://qemu-project.org/Google_Summer_of_Code_2016
http://qemu-project.org/Outreachy_2016_MayAugust

There are two dedicated IRC channels, #qemu-gsoc and #qemu-outreachy, that candidates can use to discuss project ideas with mentors and ask questions.

Good luck to all applicants!

Friday, January 8, 2016

QEMU Internals: How guest physical RAM works

Memory is one of the key aspects of emulating computer systems. Inside QEMU the guest RAM is modelled by several components that work together. This post gives an overview of the design of guest physical RAM in QEMU 2.5 by explaining the most important components without going into all the details. After reading this post you will know enough to dig into the QEMU source code yourself.

Note that guest virtual memory is not covered here since it deserves its own post. KVM virtualization relies on hardware memory translation support and does not use QEMU's software MMU.

Guest RAM configuration

The QEMU command-line option -m [size=]megs[,slots=n,maxmem=size] specifies the initial guest RAM size as well as the maximum guest RAM size and number of slots for memory chips (DIMMs).

The reason for the maximum size and slots is that QEMU emulates DIMM hotplug so the guest operating system can detect when new memory is added and removed using the same mechanism as on real hardware. This involves plugging or unplugging a DIMM into a slot, just like on a physical machine. In other words, changing the amount of memory available isn't done in byte units, it's done by changing the set of DIMMs plugged into the emulated machine.

Hotpluggable guest memory

The "pc-dimm" device (hw/mem/pc-dimm.c) models a DIMM. Memory is hotplugged by creating a new "pc-dimm" device. Although the name includes "pc" this device is also used with ppc and s390 machine types.

As a side-note, the initial RAM that the guest started with might not be modelled with a "pc-dimm" device and it can't be unplugged.

The guest RAM itself isn't contained inside the "pc-dimm" object. Instead the "pc-dimm" must be associated with a "memory-backend" object.

Memory backends

The "memory-backend" device (backends/hostmem.c) contains the actual host memory that backs guest RAM. This can either be anonymous mmapped memory or file-backed mmapped memory. File-backed guest RAM allows Linux hugetlbfs usage for huge pages on the host and also shared-memory so other host applications can access to guest RAM.

The "pc-dimm" and "memory-backend" objects are the user-visible parts of guest RAM in QEMU. They can be managed using the QEMU command-line and QMP monitor interface. This is just the tip of the iceberg though because there are still several aspects of guest RAM internal to QEMU that will be covered next.

The following diagram shows the components explained below:

RAM blocks and the ram_addr_t address space

Memory inside a "memory-backend" is actually mmapped by RAMBlock through qemu_ram_alloc() (exec.c). Each RAMBlock has a pointer to the mmap memory and also a ram_addr_t offset. This ram_addr_t offset is interesting because it is in a global namespace so the RAMBlock can be looked up by the offset.

The ram_addr_t namespace is different from the guest physical memory space. The ram_addr_t namespace is a tightly packed address space of all RAMBlocks. Guest physical address 0x100001000 might not be ram_addr_t 0x100001000 since ram_addr_t does not include guest physical memory regions that are reserved, memory-mapped I/O, etc. Furthermore, the ram_addr_t offset is dependent on the order in which RAMBlocks were created, unlike the guest physical memory space where everything has a fixed location.

All RAMBlocks are in a global list RAMList object called ram_list. ram_list holds the RAMBlocks and also the dirty memory bitmaps.

Dirty memory tracking

When the guest CPU or device DMA stores to guest RAM this needs to be noticed by several users:

The live migration feature relies on tracking dirty memory pages so they can be resent if they change during live migration.
TCG relies on tracking self-modifying code so it can recompile changed instructions.
Graphics card emulation relies on tracking dirty video memory to redraw only scanlines that have changed.

There are dirty memory bitmaps for each of these users in ram_list because dirty memory tracking can be enabled or disabled independently for each of these users.

Address spaces

All CPU architectures have a memory space and some also have an I/O address space. This is represented by AddressSpace, which contains a tree of MemoryRegions (include/exec/memory.h).

The MemoryRegion is the link between guest physical address space and the RAMBlocks containing the memory. Each MemoryRegion has the ram_addr_t offset of the RAMBlock and each RAMBlock has a MemoryRegion pointer.

Note that MemoryRegion is more general than just RAM. It can also represent I/O memory where read/write callback functions are invoked on access. This is how hardware register accesses from a guest CPU are dispatched to emulated devices.

The address_space_rw() function dispatches load/store accesses to the appropriate MemoryRegions. If a MemoryRegion is a RAM region then the data will be accessed from the RAMBlock's mmapped guest RAM. The address_space_memory global variable is the guest physical memory space.

Conclusion

There are a few different layers involved in managing guest physical memory. The "pc-dimm" and "memory-backend" objects are the user-visible configuration objects for DIMMs and memory. The RAMBlock is the mmapped memory chunk. The AddressSpace and its MemoryRegion elements place guest RAM into the memory map.