Saturday, 29 July 2017

Tracing userspace static probes with perf(1)

The perf(1) tool added support for userspace static probes in Linux 4.8. Userspace static probes are pre-defined trace points in userspace applications. Application developers add them so that frequently needed lifecycle events are available for performance analysis, troubleshooting, and development.

Static userspace probes are more convenient than defining your own function probes from scratch. They save you time: the decision of where to place probes has already been made by the application developers.

On my Fedora 26 machine the QEMU, gcc, and nodejs packages ship with static userspace probes. QEMU offers probes for vcpu events, disk I/O activity, device emulation, and more.

Without further ado, here is how to trace static userspace probes with perf(1)!

Scan the binary for static userspace probes

The perf(1) tool needs to scan the application's ELF binaries for static userspace probes and store the information in $HOME/.debug/usr/:

# perf buildid-cache --add /usr/bin/qemu-system-x86_64

List static userspace probes

Once the ELF binaries have been scanned you can list the probes as follows:

# perf list sdt_*:*

List of pre-defined events (to be used in -e):

  sdt_qemu:aio_co_schedule                           [SDT event]
  sdt_qemu:aio_co_schedule_bh_cb                     [SDT event]
  sdt_qemu:alsa_no_frames                            [SDT event]
  ...

Let's trace something!

First add probes for the events you are interested in:

# perf probe sdt_qemu:blk_co_preadv
Added new event:
  sdt_qemu:blk_co_preadv (on %blk_co_preadv in /usr/bin/qemu-system-x86_64)

You can now use it in all perf tools, such as:

 perf record -e sdt_qemu:blk_co_preadv -aR sleep 1

Then capture trace data as follows:

# perf record -a -e sdt_qemu:blk_co_preadv
^C
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 2.274 MB perf.data (4714 samples) ]

The trace can be printed using perf-script(1):

# perf script
 qemu-system-x86  3425 [000]  2183.218343: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
 qemu-system-x86  3425 [001]  2183.310712: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=0 arg4=512 arg5=0
 qemu-system-x86  3425 [001]  2183.310904: sdt_qemu:blk_co_preadv: (55d230272e4b) arg1=94361280966400 arg2=94361282838528 arg3=512 arg4=512 arg5=0
 ...

If you want to get fancy it's also possible to write trace analysis scripts with perf-script(1). That's a topic for another post but see the --gen-script= option to generate a skeleton script.
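For example, to generate a Python starter script from the current perf.data:

# perf script --gen-script=python

This writes a perf-script.py skeleton containing a handler function per event that you can then flesh out.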

Current limitations

As of July 2017 there are a few limitations to be aware of:

- Probe arguments are automatically numbered and do not have human-readable names. You will see arg1, arg2, etc. and need to consult the probe definition in the application source code to learn the meaning of each argument. Some versions of perf(1) may not print arguments at all since this feature was added later.

- The contents of string arguments are not printed, only the memory address of the string.

- Probes called from multiple call sites in the application result in multiple perf probes. For example, if probe foo is called from 3 places you get sdt_myapp:foo, sdt_myapp:foo_1, and sdt_myapp:foo_2 when you run perf probe --add sdt_myapp:foo.

- The SystemTap semaphores feature is not supported, so such probes will not fire unless you manually set the semaphore inside your application or from another tool like GDB. This means that sdt_myapp:foo will not fire if the application guards it with the MYAPP_FOO_ENABLED() macro like this: if (MYAPP_FOO_ENABLED()) MYAPP_FOO(); (see the C sketch after this list).
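For reference, here is roughly what a static userspace probe looks like in application source code using the <sys/sdt.h> header. This is a minimal sketch: myapp, foo, and handle_request are hypothetical names, not from any real application.

#include <stddef.h>
#include <sys/sdt.h>

void handle_request(int fd, size_t len)
{
    /* Emits probe myapp:foo (listed by perf as sdt_myapp:foo) with two
     * arguments. The macro compiles to a single nop instruction plus an
     * ELF note that perf buildid-cache finds when scanning the binary. */
    DTRACE_PROBE2(myapp, foo, fd, len);
}

The MYAPP_FOO()/MYAPP_FOO_ENABLED() macro style mentioned in the last point is typically generated from a provider description by the dtrace(1) compatibility script shipped with SystemTap; the _ENABLED() variant tests the semaphore that perf(1) currently never sets.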

Some history and alternative tools

Static userspace probes were popularized by DTrace's <sys/sdt.h> header. Tracers that came after DTrace implemented the same interface for compatibility.

On Linux the initial tool for static userspace probes was SystemTap. In fact, the <sys/sdt.h> header file on my Fedora 26 system is still part of the systemtap-sdt-devel package.

More recently the GDB debugger gained support for static userspace probes. See the Static Probe Points documentation if you want to use userspace static probes from GDB.

Conclusion

It's very handy to have static userspace probing available alongside all the other perf(1) tracing features. There are a few limitations to keep in mind, but if your tracing workflow is based primarily on perf(1) then you can now use static userspace probes without relying on additional tools.

Thursday, 13 July 2017

Packet capture coming to AF_VSOCK

For anyone interested in AF_VSOCK, the zero-configuration host<->guest communications channel, it's important to be able to observe traffic. Packet capture is commonly used to troubleshoot network problems and debug networking applications. Up until now it hasn't been available for AF_VSOCK.

In 2016 Gerard Garcia created the vsockmon Linux driver that enables AF_VSOCK packet capture. During the course of his excellent Google Summer of Code work he also wrote patches for libpcap, tcpdump, and Wireshark.

Recently I revisited Gerard's work because Linux 4.12 shipped with the new vsockmon driver, making it possible to finalize the userspace support for AF_VSOCK packet capture. And it's working beautifully.

I have sent the latest patches to the tcpdump and Wireshark communities so AF_VSOCK can be supported out-of-the-box in the future. For now you can also find patches in my personal repositories.

The basic flow is as follows:

# ip link add type vsockmon
# ip link set vsockmon0 up
# tcpdump -i vsockmon0
# ip link set vsockmon0 down
# ip link del vsockmon0

It's easiest to wait for distros to package Linux 4.12 and future versions of libpcap, tcpdump, and Wireshark. If you decide to build from source, make sure to build libpcap first and then tcpdump or Wireshark: the patched libpcap provides the AF_VSOCK capture support that tcpdump and Wireshark rely on.

Monday, 6 February 2017

Slides posted for "Using NVDIMM under KVM" talk

I gave a talk on NVDIMM persistent memory at FOSDEM 2017. QEMU has gained support for emulated NVDIMMs and they can be used efficiently under KVM.

Applications inside the guest access the physical NVDIMM directly with native performance when properly configured. These devices are DDR4 RAM modules, so their access times are much lower than those of solid-state drives (SSDs). I'm looking forward to hardware coming onto the market because it will change storage and databases in a big way.

This talk covers what NVDIMM is, the programming model, and how it can be used under KVM. Slides are available here (PDF).

Update: Video is available here.

Saturday, 17 December 2016

13 years of using Linux

I've been using Linux on both work and personal machines for 13 years. Over time I've tried various distributions, changed the nature of my work, and revisited other operating systems, only to arrive back at the same conclusion every time: Linux works best for me.

The reason I started using Linux remains the reason why it's my operating system of choice today:

Most software is free and easy to install, under open source licenses that allow both commercial and non-commercial use.

That means software to do common tasks is available for free without limitations. The cost of entry for exploring and learning new things is zero.

The amount of packaged software available in major Linux distributions is incredible. Niche open source operating systems don't have this wide selection of high-quality software. Proprietary operating systems have high-quality software but there is constant irritation in dealing with the artificial limitations of closed source software. The strength of Linux is this sweet spot between high-quality mainstream software and the advantages of open source software.

The pain points of Linux have changed over the years. In the beginning hardware support was limited. This has largely been solved for laptop, desktop, and server hardware as vendors began to contribute drivers and publish hardware datasheets free of NDAs. Class-compliant USB devices also cut down on the number of vendor-specific drivers. Nowadays the reputation for limited hardware support is largely unjustified.

Another issue that has subsided is the Windows-only software that kept many people tied to that platform. Two trends killed Windows-only software: the move to the web and the rise of the Mac. Many applications became pure web applications, with no need for ActiveX or Java applets containing platform-specific code - and Adobe Flash is close to its end too. And ever since Macs rose back to popularity, shipping Windows-only software has no longer been acceptable. As a result most things are now web applications or cross-platform applications with Linux support.

Migrating to Linux is still a big change, just like switching from Windows to Mac. That will always be hard to overcome, even with virtualization, because the virtual machine experience isn't seamless. Ultimately users need to pick native applications and import their existing data. And it's worth it, because you get access to applications that can do almost everything without the hassles of proprietary platforms. That's the lasting advantage that Linux has over the competition.

Saturday, 17 September 2016

Making I/O deterministic with blkdebug breakpoints

Recently I was looking for a deterministic way to reproduce a QEMU bug that occurs when a guest reboots while an IDE/ATA TRIM (discard) request is in flight. You can find the bug report here.

A related problem is how to trigger the code path where request A is in flight when request B is issued. Being able to do this is useful for writing test suites that check that I/O requests interact correctly with each other.

Both of these scenarios require the ability to put an I/O request into a specific state and keep it there. This makes the request deterministic so there is no chance of it completing too early.

QEMU has a disk I/O error injection framework called blkdebug. Block drivers like qcow2 tell blkdebug about request states, making it possible to fail or suspend requests at certain points. For example, it's possible to suspend an I/O request when qcow2 decides to free a cluster.
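To give a flavor of the error injection side, requests can be failed when they reach an event using a blkdebug configuration file. Here is a minimal sketch: the file name is arbitrary, and the read_aio event with errno 5 (EIO) follows the examples in the blkdebug documentation.

$ cat blkdebug.conf
[inject-error]
event = "read_aio"
errno = "5"
$ qemu-system-x86_64 -drive if=ide,file=blkdebug:blkdebug.conf:test.qcow2,format=qcow2

With this configuration, read requests that reach the qcow2 read_aio event fail with EIO.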

The blkdebug documentation mentions the break command that can suspend a request when it reaches a specific state.

Here is how we can suspend a request when qcow2 decides to free a cluster:

$ qemu-system-x86_64 -drive if=ide,id=ide-drive,file=blkdebug::test.qcow2,format=qcow2
(qemu) qemu-io ide-drive "break cluster_free A"

QEMU prints a message when the request is suspended. It can be resumed with:

(qemu) qemu-io ide-drive "resume A"

The tag name 'A' is arbitrary and you can pick your own name or even suspend multiple requests at the same time.

Automated tests need to wait until a request has suspended. This can be done with the wait_break command:

(qemu) qemu-io ide-drive "wait_break A"

For another example of blkdebug breakpoints, see the qemu-iotests 046 test case.

Friday, 13 May 2016

Git: Internals of how objects are stored

Git has become the ubiquitous version control tool. A big community has grown around Git over the years. From its origins with the Linux kernel community to GitHub, the popular project hosting service, its core design has scaled. It needed to, since from day one the Linux source tree and commit history were large by most standards.

I consider Git one of the innovations (like BitTorrent and Bitcoin) that come along every few years to change the way we do things. These tools didn't necessarily invent brand new algorithms but they applied them in a way that was more effective than previous approaches. Their designs are worth studying and have something to teach us.

This post explains the internals of how git show 365e45161384f6dc704ec798828dc927d63e3b22 fetches the commit object for this SHA1. This SHA1 lookup is the main query operation for all object types in Git including commits, tags, and trees.

Repository setup

Let's start with a simple repo:

$ mkdir test && cd test
$ git init .
Initialized empty Git repository in /tmp/test/.git/
$ echo 'Hello world' >test.txt
$ git add test.txt && git commit -s

Here is the initial commit:

$ git show --format=raw
commit 365e45161384f6dc704ec798828dc927d63e3b22
tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700
committer Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700

    Initial checkin
    
    Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>

diff --git a/test.txt b/test.txt
new file mode 100644
index 0000000..802992c
--- /dev/null
+++ b/test.txt
@@ -0,0 +1 @@
+Hello world

Loose objects

The function to look up an object from a binary SHA1 is struct object *parse_object(const unsigned char *sha1) in Git v2.8.2-396-g5fe494c. You can browse the source here. In Git the versioned data and metadata are stored as "objects" identified by their SHA1 hash. There are several types of objects so this function can look up commits, tags, trees, etc.

Git stores objects in several ways. The simplest is called a "loose object", which means that object 365e45161384f6dc704ec798828dc927d63e3b22 is stored zlib DEFLATE compressed in its own file. We can manually view the object like this:

$ openssl zlib -d <.git/objects/36/5e45161384f6dc704ec798828dc927d63e3b22
commit 241tree 401ce7dbd55d28ea49c1c2f1c1439eb7d2b92427
author Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700
committer Stefan Hajnoczi <stefanha@gmail.com> 1463158107 -0700

Initial checkin

Signed-off-by: Stefan Hajnoczi <stefanha@gmail.com>

There! So this little command line produces similar output to $ git show --format=raw 365e45161384f6dc704ec798828dc927d63e3b22 without using Git. The object begins with a header ("commit 241", the object type and size in bytes, followed by a NUL byte) which is why "241" and "tree" appear fused together in the output above.

Note that openssl(1) is used in this example as a convenient command-line tool for decompressing zlib DEFLATE data. This has nothing to do with OpenSSL cryptography.

Git spreads objects across subdirectories in .git/objects/ by the SHA1's first byte to avoid having a single directory with a huge number of files. This helps file systems and userspace tools scale better when there are many commits.
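In the test repository from above, before repacking, the layout should look like this (the blob 802992c... holds the contents of test.txt):

$ ls .git/objects/
36  40  80  info  pack
$ ls .git/objects/36/
5e45161384f6dc704ec798828dc927d63e3b22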

Despite this directory layout it can still be inefficient to access loose objects. Each object requires open(2)/close(2) syscalls and possibly an access(2) call to check for existence first. This puts a lot of faith in cheap directory lookups in the file system and is not optimal when the number of objects grows.

Pack files

Git's solution to large numbers of objects is found in .git/objects/pack/. Each "pack" consists of an index file (.idx) and the pack itself (.pack). Here is how that looks in our test repository:

$ git repack
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ ls -l .git/objects/pack/
total 8
-r--r--r--. 1 stefanha stefanha 1156 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.idx
-r--r--r--. 1 stefanha stefanha  246 May 13 10:35 pack-c63d94bfadeaba23a38c05c75f48c7535339d685.pack

In order to look up an object from a SHA1, the index file must be used:

The index file provides an efficient way to look up the byte offset into the pack file where the object is located. The index file's 256-entry lookup table takes the first byte of the SHA1 and produces the highest index for objects starting with that SHA1 byte. This lookup table makes it possible to search only objects starting with the same SHA1 byte while excluding all other objects. That's similar to the subdirectory layout used for loose objects!

The SHA1s of all objects inside the pack are stored in a sorted table. From the first-byte lookup we now know the lowest and highest indices into the sorted SHA1 table, and we can binary search within that range for the SHA1 we are looking for. If the SHA1 cannot be found then the pack file does not contain that object.

Once the SHA1 has been found in the sorted SHA1 table, its index is used with the offsets table. The offset is the location in the pack file where the object is located. There is a minor complication here: offset table entries are only 32-bits so an extension is necessary for 64-bit offsets.

The 64-bit offset extension uses the top-most bit in the 32-bit offset table entry to indicate that a 64-bit offset is needed. This means the 32-bit offset table actually only works for offsets that fit into 31 bits! When the top-most bit is set the remaining bits form an index into the 64-bit offset table where the actual offset is located.
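Putting the fan-out table, the sorted SHA1 table, and the offset tables together, the lookup works roughly like this. This is a simplified sketch with hypothetical struct and field names, not Git's actual code, and it ignores the .idx header, CRC table, and trailer:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* ntohl */
#include <endian.h>     /* be64toh */

/* Hypothetical in-memory view of the version-2 .idx tables. */
struct pack_idx {
    const uint32_t *fanout;    /* 256 cumulative counts, big-endian */
    const uint8_t *sha1_table; /* nr_objects * 20 bytes, sorted */
    const uint32_t *offset32;  /* nr_objects entries, big-endian */
    const uint64_t *offset64;  /* extension table for large packs */
};

/* Returns the pack file offset of the object, or -1 if not found. */
static int64_t find_pack_offset(const struct pack_idx *idx,
                                const uint8_t sha1[20])
{
    /* Fan-out table: objects whose SHA1 starts with byte b occupy
     * indices [fanout[b-1], fanout[b]) of the sorted SHA1 table. */
    uint32_t lo = sha1[0] ? ntohl(idx->fanout[sha1[0] - 1]) : 0;
    uint32_t hi = ntohl(idx->fanout[sha1[0]]);

    while (lo < hi) { /* binary search within the first-byte range */
        uint32_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(sha1, idx->sha1_table + (size_t)mid * 20, 20);
        if (cmp == 0) {
            uint32_t off = ntohl(idx->offset32[mid]);
            if (off & 0x80000000u) /* top bit: index into 64-bit table */
                return be64toh(idx->offset64[off & 0x7fffffffu]);
            return off;
        } else if (cmp < 0) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    return -1; /* not in this pack */
}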

Each object in the pack file has a header describing the object type (commit, tree, tag, etc) and its size. After the header is the zlib DEFLATE compressed data.

Deltas

Although the pack file can contain simple zlib DEFLATE compressed data for objects, it can also contain a chain of delta objects. The chain is ultimately terminated by a non-delta "base object". Deltas avoid storing some of the same bytes again each time a modified object is added. A delta object describes only the changed data between its parent and itself. This can be much smaller than storing the full object.

Think about the case where a single line is added to a long text file. It would be nice to avoid storing all the unmodified lines again when the modified text file is added. This is exactly what deltas do!

In order to unpack a delta object it is necessary to start with the base object and apply the chain of delta objects, one after the other. I won't go into the details of the binary format of delta objects, but it is a memcpy(3) engine rather than a traditional text diff produced by diff(1). Each memcpy(3) operation needs offset and length parameters, as well as a source buffer - either the parent object or the delta data. This memcpy(3) engine approach is enough to delete bytes, add bytes, and copy bytes.
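As a conceptual sketch of that memcpy(3) engine (the types and encoding here are made up for illustration; the real delta format encodes its operations much more compactly):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical delta operation: either copy a range from the base
 * object or insert literal bytes carried in the delta itself. */
struct delta_op {
    int from_base;          /* 1: copy from base, 0: insert literal data */
    size_t offset;          /* offset into the base object (copy only) */
    size_t len;             /* number of bytes to copy or insert */
    const uint8_t *literal; /* literal bytes (insert only) */
};

/* Reconstruct an object by replaying delta ops against a base object.
 * Returns the number of bytes written to out. */
static size_t apply_delta(const uint8_t *base,
                          const struct delta_op *ops, size_t nops,
                          uint8_t *out)
{
    size_t written = 0;
    for (size_t i = 0; i < nops; i++) {
        const uint8_t *src = ops[i].from_base ? base + ops[i].offset
                                              : ops[i].literal;
        memcpy(out + written, src, ops[i].len);
        written += ops[i].len;
    }
    return written;
}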

Conclusion

Git objects can be stored either as zlib DEFLATE compressed "loose object" files or inside a "pack file". The pack file reduces the need to access many files and also features deltas to reduce storage requirements.

The pack index file has a fairly straightforward layout for taking a SHA1 and finding the offset into the pack file. The layout is cache-friendly because it co-locates fields into separate tables instead of storing full records in a single table. This way the CPU cache only reads data that is being searched, not the other record fields that are irrelevant to the search.

The pack file format works as an immutable archive. It is optimized for lookups and does not support arbitrary modification well. An advantage of this approach is that the design is relatively simple and multiple git processes can perform lookups without fear of races.

I hope this gives a useful overview of git data storage internals. I'm not a git developer so I hope this description matches reality. Please point out my mistakes in the comments.

Friday, 4 March 2016

Slides available for "NFS over virtio-vsock: Host/guest file sharing for virtual machines"

I have been working on the virtio-vsock host/guest communications mechanism and the first application is file sharing with NFS. NFS is a mature and stable network file system which can be reused to provide host/guest file sharing in KVM.

If you are interested in learning more, check out the slides from my Connectathon 2016 presentation:

NFS over virtio-vsock: Host/guest file sharing for virtual machines