Stefan Hajnoczi: 2019

Friday, November 29, 2019

Visiting the Centre for Computing History

I visited the Centre for Computing History today in Cambridge, UK. It's home to old machines from computer science history, 80s home computers, 90s games consoles, and much more. It was nice to see familiar machines that I used to play with back in the day. This post has pictures from the visit to the museum.

The journey starts with the Megaprocessor, a computer built from 15,000 transistors with countless LEDs that allow you to see what's happening inside the machine while a program runs.

The Megaprocessor has its own instruction set architecture (ISA) with 4 General Purpose Registers (GPRs). The contents of the GPRs are visible on seven-segment displays and LEDs.

The instruction decoder looks fun. Although I didn't look in detail, it seems to be an old-school decoder where each bit in an instruction is hardcoded to enable or disable certain hardware units. No microcoded instructions here!

Ada Lovelace is considered the first programmer thanks to her work on the Analytical Engine. On a Women in Computer Science note, I learnt that Margaret Hamilton coined the term "software engineering". Hedy Lamarr also has an interesting background: movie star and inventor. There are posters throughout the museum featuring profiles on women in computer science that are quite interesting.

The museum is very hands-on with machines available to use and other items like books open to visitors. If nostalgia strikes and you want to sit down and play a game or program in BASIC, or just explore an exotic old machine, you can just do it! That is quite rare for a museum since these are historic items that can be fragile or temperamental.

Moving along in chronological order, here is the PDP-11 minicomputer that UNIX ran on in 1970! I've seen them in museums before but have yet to interact with one.

In the 1980s the MicroVAX ran VMS or ULTRIX. I've read about these machines but they were before my time! It's cool to see one.

This HP Graphics Terminal was amusing. I don't see anything graphical about ASCII art, but I think the machine was connected to a plotter/printer.

The museum has a lot of microcomputers from the 1980s including brands I've never heard of. There were also machines with laserdiscs or more obscure optical media, forerunners of what became the "multi-media" buzzword in the 90s when CD-ROMs went mainstream.

Speaking of optical media, here is a physical example of bitrot, the deterioration of data media or software!

Classic home computers: ZX Spectrum, Commodore 64, Atari ST Mega 2, and Acorn. The museum has machines that were popular in the UK, so the selection is a little different from what you find in the Computer History Museum in Mountain View, CA, USA.

There are games consoles from the 80s, 90s, and 2000s. The Gameboy holds a special place for me. Both the feel of the buttons and the look of the display still seem right in this age of high resolution color displays.

The museum has the popular Nintendo, SEGA, and Sony consoles as well as rarer specimens that I've never seen in real life before. It was cool to see an Intellivision, Jaguar, etc.

Back to UNIX. This SGI Indy brought back memories. I remember getting a used one in order to play with the IRIX operating system. It was already an outdated machine at the time but the high resolution graphics and the camera were clearly ahead of their time.

Say hello to an old friend. I remember having exactly the same 56K modem! What a wonderful dial-up sound :).

And finally, the Palm Pilot. Too bad that the company failed; they had neat hardware before smartphones came along. I remember programming and reverse engineering on the Palm.

Conclusion

If you visit Cambridge, UK, be sure to check out the Centre for Computing History. It has an excellent collection of home computers and games consoles. I hope it will be expanded to cover the 2000s internet era too (old web browsers, big websites that no longer exist, early media streaming, etc.).

Wednesday, November 27, 2019

Software Freedom Conservancy donation matching is back!

Software Freedom Conservancy is a non-profit that provides a home for Git, QEMU, Inkscape, and many other popular open source projects. Conservancy is also influential in promoting free software and open source licenses, including best practices for license compliance. They help administer the Outreachy open source internship program that encourages diversity in open source. They are a small organization with just 5 full-time employees taking on many tasks important in the open source community.

The yearly donation matching event has started again, so now is the best time to become a supporter by donating!

Tuesday, November 19, 2019

Video and slides available for "virtio-fs: A Shared File System for Virtual Machines"

This year I presented virtio-fs at KVM Forum 2019 with David Gilbert and Miklos Szeredi. virtio-fs is a host<->guest file system that allows guests to access a shared directory on the host. We've been working on virtio-fs together with Vivek Goyal and community contributors since 2018 and are excited that it is now being merged upstream in Linux and QEMU.

virtio-fs gives guests file system access without the need for disk image files or copying files between the guest and host. You can even boot a guest from a directory on the host without a disk image file. Kata Containers 1.7 and later ship with virtio-fs support for running VM-isolated containers.

What is new and interesting about virtio-fs is that it takes advantage of the co-location of guests and the hypervisor to avoid file server communication and to provide local file system semantics. The guest can map the contents of files from the host page cache. This bypasses the guest page cache to reduce memory footprint and avoid copying data into guest RAM. Network file systems and earlier attempts at paravirtualized file systems, like virtio-9p, cannot do this since they are designed for message-passing communication only.

To learn more about virtio-fs, check out the video or slides (PDF) from the presentation.

Monday, August 5, 2019

Determining why a Linux syscall failed

One is often left wondering what caused an errno value when a system call fails. Figuring out the reason can be tricky because a single errno value can have multiple causes. Applications get an errno integer and no additional information about what went wrong in the kernel.
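To make this concrete, here is a minimal Python sketch (Python is my choice for illustration, and the path is a made-up placeholder that is assumed not to exist). It shows that a failed system call hands the application an errno integer plus a generic strerror(3) string and nothing else:

```python
import errno
import os

def failing_syscall_info(path):
    """Attempt open(2) on path; return (errno name, message) on failure."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as e:
        # The exception carries only the errno integer. The message is the
        # generic strerror(3) text, not an explanation from the kernel.
        return errno.errorcode[e.errno], os.strerror(e.errno)
    os.close(fd)
    return None

# Hypothetical path, assumed not to exist on this system.
print(failing_syscall_info("/nonexistent/dir/file"))
```

The same errno name would be reported whether the directory or the file is missing, which is exactly the ambiguity discussed here.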

There are several ways to determine the reason for a system call failure (from easiest to most involved):

1. Check the system call's man page for the meaning of the errno value. Sometimes this is enough to explain the failure.
2. Check the kernel log using dmesg(1). If something went seriously wrong (like a hardware error) then there may be detailed error information. It may help to increase the kernel log level.
3. Read the kernel source code to understand various error code paths and identify the most relevant one.
4. Use the function graph tracer to see which code path was taken.
5. Add printk() calls, recompile the kernel (module), and rerun to see the output.

Reading the man page and checking dmesg(1) are fairly easy for application developers and do not require knowledge of kernel internals. If this does not produce an answer then it is necessary to look closely at the kernel source code to understand a system call's error code paths.

This post discusses the function graph tracer and how it can be used to identify system call failures without recompiling the kernel. This is useful because running a custom kernel may not be possible (e.g. due to security or reliability concerns) and recompiling the kernel is slow.

An example

In order to explore some debugging techniques let's take the io_uring_setup(2) system call as an example. It is failing with ENOMEM but the system is not under memory pressure, so ENOMEM is not expected.

The io_uring_setup(2) source code (fs/io_uring.c) contains many ENOMEM locations but it is not possible to conclusively identify which one is failing. The next step is to determine which code path is taken using dynamic instrumentation.

The function graph tracer

The Linux function graph tracer records kernel function entries and returns so that function call relationships are made apparent. The io_uring_setup(2) system call is failing with ENOMEM but it is unclear at which point in the system call this happens. It is possible to find the answer by studying the function call graph produced by the tracer and following along in the Linux source code.

Since io_uring_setup(2) is a system call it's not an ordinary C function definition and has a special symbol name in the kernel ELF file. It is possible to look up the (architecture-specific) symbol for the currently running kernel:

# grep io_uring_setup /proc/kallsyms
...
ffffffffbd357130 T __x64_sys_io_uring_setup
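
The same lookup can be scripted. The following is a rough Python sketch that filters /proc/kallsyms-style text for system call entry points. The helper name is my own, the second and third sample lines are illustrative (their addresses are invented), and the `sys_<name>` suffix match is an assumption about how the architecture names its syscall wrappers:

```python
def find_syscall_symbols(kallsyms_text, name):
    """Return text-section symbols whose name ends with sys_<name>."""
    matches = []
    for line in kallsyms_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        _addr, sym_type, symbol = fields[:3]
        # 'T'/'t' mark global/local symbols in the text (code) section.
        if sym_type in ("T", "t") and symbol.endswith("sys_" + name):
            matches.append(symbol)
    return matches

# First line is from the output above; the others are made-up examples.
sample = """\
ffffffffbd357130 T __x64_sys_io_uring_setup
ffffffffbd3571a0 T __ia32_sys_io_uring_setup
ffffffffbd356f00 t io_uring_setup
"""
print(find_syscall_symbols(sample, "io_uring_setup"))
```

On a real system the text would come from reading /proc/kallsyms instead of the sample string.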

Let's trace all __x64_sys_io_uring_setup calls:

# cd /sys/kernel/debug/tracing
# echo '__x64_sys_io_uring_setup' > set_graph_function
# echo 'function_graph' >current_tracer
# cat trace_pipe >/tmp/trace.log
...now run the application in another terminal...
^C

The trace contains many successful io_uring_setup(2) calls that look like this:

 1)               |  __x64_sys_io_uring_setup() {
 1)               |    io_uring_setup() {
 1)               |      capable() {
 1)               |        ns_capable_common() {
 1)               |          security_capable() {
 1)   0.199 us    |            cap_capable();
 1)   7.095 us    |          }
 1)   7.594 us    |        }
 1)   8.059 us    |      }
 1)               |      kmem_cache_alloc_trace() {
 1)               |        _cond_resched() {
 1)   0.244 us    |          rcu_all_qs();
 1)   0.708 us    |        }
 1)   0.208 us    |        should_failslab();
 1)   0.220 us    |        memcg_kmem_put_cache();
 1)   2.201 us    |      }
...
 1)               |      fd_install() {
 1)   0.223 us    |        __fd_install();
 1)   0.643 us    |      }
 1) ! 190.396 us  |    }
 1) ! 216.236 us  |  }

Although the goal is to understand system call failures, looking at a successful invocation can be useful too. Failed calls in trace output can be identified on the basis that they differ from successful calls. This knowledge can be valuable when searching through large trace files. A failed io_uring_setup(2) call aborts early and does not invoke fd_install(). Now it is possible to find a failed call amongst all the io_uring_setup(2) calls:

 2)               |  __x64_sys_io_uring_setup() {
 2)               |    io_uring_setup() {
 2)               |      capable() {
 2)               |        ns_capable_common() {
 2)               |          security_capable() {
 2)   0.236 us    |            cap_capable();
 2)   0.872 us    |          }
 2)   1.419 us    |        }
 2)   1.951 us    |      }
 2)   0.419 us    |      free_uid();
 2)   3.389 us    |    }
 2) + 48.769 us   |  }
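
Searching a large trace file by eye gets tedious, so this step can be scripted. Below is a rough Python sketch (not part of the kernel tooling) that splits function graph output into individual top-level calls and keeps the ones that never reach fd_install(). It assumes the default function_graph line format shown above, and the two sample traces are abbreviated versions of the ones in this post:

```python
import re

ENTRY = "__x64_sys_io_uring_setup"
FUNC_RE = re.compile(r"([A-Za-z_]\w*)\(\)")

def split_calls(trace_text, entry=ENTRY):
    """Yield the list of functions seen in each top-level call to entry."""
    open_calls = {}  # cpu -> [nesting depth, function names]
    for line in trace_text.splitlines():
        left, sep, code = line.partition("|")
        if not sep:
            continue  # elided lines, task switch markers, etc.
        cpu = left.split(")")[0].strip()
        code = code.strip()
        m = FUNC_RE.match(code)
        if cpu not in open_calls:
            if not (m and m.group(1) == entry):
                continue
            open_calls[cpu] = [0, []]
        state = open_calls[cpu]
        if m:
            state[1].append(m.group(1))
        if code.endswith("{"):
            state[0] += 1      # function entry
        elif code.startswith("}"):
            state[0] -= 1      # function return
        if state[0] == 0:
            yield open_calls.pop(cpu)[1]

SUCCESS = """\
 1)               |  __x64_sys_io_uring_setup() {
 1)               |    io_uring_setup() {
 1)               |      fd_install() {
 1)   0.223 us    |        __fd_install();
 1)   0.643 us    |      }
 1) ! 190.396 us  |    }
 1) ! 216.236 us  |  }
"""
FAILURE = """\
 2)               |  __x64_sys_io_uring_setup() {
 2)               |    io_uring_setup() {
 2)   0.419 us    |      free_uid();
 2)   3.389 us    |    }
 2) + 48.769 us   |  }
"""

failed = [c for c in split_calls(SUCCESS + FAILURE) if "fd_install" not in c]
print(failed)
```

Calls are tracked per CPU because output from different CPUs can be interleaved in the trace.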

The fs/io_uring.c code shows the likely error code paths:

        account_mem = !capable(CAP_IPC_LOCK);

        if (account_mem) {
                ret = io_account_mem(user,
                                ring_pages(p->sq_entries, p->cq_entries));
                if (ret) {
                        free_uid(user);
                        return ret;
                }
        }

        ctx = io_ring_ctx_alloc(p);
        if (!ctx) {
                if (account_mem)
                        io_unaccount_mem(user, ring_pages(p->sq_entries,
                                                                p->cq_entries));
                free_uid(user);
                return -ENOMEM;
        }

But is there enough information in the trace to determine which of these return statements is executed? The trace shows free_uid() so we can be confident that both these code paths are valid candidates. By looking back at the success code path we can use kmem_cache_alloc_trace() as a landmark. It is called by io_ring_ctx_alloc() so we should see kmem_cache_alloc_trace() in the trace before free_uid() if the second return statement is taken. Since it does not appear in the trace output we conclude that the first return statement is being taken!
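
The landmark reasoning can be written down as a small decision function. This Python sketch is only an illustration of the inference above: the function name is my own, and it only distinguishes the two return statements in the excerpt, not any other error path in io_uring_setup():

```python
def which_return(functions):
    """Guess which excerpted return statement ran, given the traced
    function names of one io_uring_setup() call in call order."""
    if "fd_install" in functions:
        return "success"
    if "free_uid" not in functions:
        return "unknown: neither excerpted error path was taken"
    before_free_uid = functions[:functions.index("free_uid")]
    # io_ring_ctx_alloc() is not visible itself, but the allocation it
    # performs is, so kmem_cache_alloc_trace() serves as the landmark.
    if "kmem_cache_alloc_trace" in before_free_uid:
        return "second return: io_ring_ctx_alloc() failed"
    return "first return: io_account_mem() failed"

# Function names from the failed call in the trace above.
failed_call = [
    "__x64_sys_io_uring_setup", "io_uring_setup", "capable",
    "ns_capable_common", "security_capable", "cap_capable", "free_uid",
]
print(which_return(failed_call))
```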

When trace output is inconclusive

Function graph tracer output only shows functions in the ELF file. When the compiler inlines code, no entry or return is recorded in the function graph trace. This can make it hard to identify the exact return statement taken in a long function. Functions containing few function calls and many conditional branches are also difficult to analyze from just a function graph trace.

We can enhance our understanding of the trace by adding dynamic probes that record function arguments, local variables, and/or return values via perf-probe(1). By knowing these values we can make inferences about the code path being taken.

If this is not enough to infer which code path is being taken, detailed code coverage information is necessary.

One way to approximate code coverage is using a sampling CPU profiler, like perf(1), and letting it run under load for some time to gather statistics on which code paths are executed frequently. This is not as precise as code coverage tools, which record each branch encountered in a program, but it can be enough to observe code paths in functions that are not amenable to the function graph tracer due to the low number of function calls.

This is done as follows:

1. Run the system call in question in a tight loop so the CPU is spending a significant amount of time in the code path you wish to observe.
2. Start perf record -a and let it run for 30 seconds.
3. Stop perf-record(1) and run perf-report(1) to view the annotated source code of the function in question.

The error code path should have a significant number of profiler samples and it should be prominent in the perf-report(1) annotated output.
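
The tight loop from the first step could look something like the Python sketch below. The function names are hypothetical, and a failing open(2) on a made-up path stands in for the system call under investigation, so substitute the real failing call:

```python
import collections
import errno
import os
import time

def hammer(syscall, seconds):
    """Call syscall() repeatedly for about `seconds`, counting outcomes."""
    outcomes = collections.Counter()
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        try:
            syscall()
            outcomes["ok"] += 1
        except OSError as e:
            outcomes[errno.errorcode.get(e.errno, str(e.errno))] += 1
    return outcomes

def failing_open():
    # Stand-in for the failing system call; replace with the call under study.
    os.close(os.open("/nonexistent/dir/file", os.O_RDONLY))
```

Running `hammer(failing_open, seconds=30)` while perf record -a is active keeps the CPU in the error path. The Python interpreter adds its own samples to the profile, so a small C loop would probably give a cleaner picture if that noise becomes a problem.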

Conclusion

Determining the cause for a system call failure can be hard work. The function graph tracer is helpful in shedding light on the code paths being taken by the kernel. Additional debugging is possible using perf-probe(1) and the sampling profiler, so that in most cases it's not necessary to recompile the kernel with printk() just to learn why a system call is failing.

Thursday, April 18, 2019

What's new in VIRTIO 1.1?

The VIRTIO 1.1 specification has been published! This article covers the major new features in this specification.

New Devices

The following new devices are defined:

virtio-input is a Linux evdev input device (mouse, keyboard, joystick)
virtio-gpu is a 2D graphics device (with 3D support planned)
virtio-vsock is a host<->guest socket communications device
virtio-crypto is a cryptographic accelerator device

New Device Features

virtio-net

VIRTIO_NET_F_MTU advises the driver of the device's maximum MTU
VIRTIO_NET_F_RSC_EXT adds Receive Segment Coalescing support
VIRTIO_NET_F_STANDBY indicates that this device can act as a failover device for a primary device

virtio-blk

VIRTIO_BLK_F_DISCARD adds discard (aka trim) support for unmapping blocks
VIRTIO_BLK_F_WRITE_ZEROES adds zero offload support and is more efficient than writing buffers of zeroes

virtio-balloon

Several new guest memory statistics are available

New Core Features

There is a new virtqueue memory layout called packed virtqueues. The old layout is called split virtqueues because the avail and used rings are separate from the descriptor table. The new packed virtqueue layout uses just a single descriptor table as the single ring. The layout is optimized for a friendlier CPU cache footprint and there are several features that devices can exploit for better performance.
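
To give a feel for the difference, here is a rough Python ctypes sketch of the two descriptor formats as I understand them from the specification. The class and helper names are my own, and the specification remains the authoritative reference for the layout. In the packed layout the descriptor's flags carry the available/used state that the split layout keeps in separate rings:

```python
import ctypes

class SplitDesc(ctypes.LittleEndianStructure):
    """Split virtqueue descriptor: chained via 'next', with the
    available/used state kept in separate rings."""
    _fields_ = [("addr", ctypes.c_uint64), ("len", ctypes.c_uint32),
                ("flags", ctypes.c_uint16), ("next", ctypes.c_uint16)]

class PackedDesc(ctypes.LittleEndianStructure):
    """Packed virtqueue descriptor: buffer 'id' plus flags that encode
    whether the descriptor is available or used."""
    _fields_ = [("addr", ctypes.c_uint64), ("len", ctypes.c_uint32),
                ("id", ctypes.c_uint16), ("flags", ctypes.c_uint16)]

VIRTQ_DESC_F_AVAIL = 1 << 7
VIRTQ_DESC_F_USED = 1 << 15

def is_available(flags, wrap_counter):
    """Driver made the descriptor available: AVAIL matches the wrap
    counter and USED is its inverse."""
    avail = bool(flags & VIRTQ_DESC_F_AVAIL)
    used = bool(flags & VIRTQ_DESC_F_USED)
    return avail == wrap_counter and used != wrap_counter

def is_used(flags, wrap_counter):
    """Device marked the descriptor used: both bits match the counter."""
    avail = bool(flags & VIRTQ_DESC_F_AVAIL)
    used = bool(flags & VIRTQ_DESC_F_USED)
    return avail == wrap_counter and used == wrap_counter

print(ctypes.sizeof(SplitDesc), ctypes.sizeof(PackedDesc))
```

Both descriptors are the same size, but the packed layout lets driver and device poll a single array instead of touching three separate memory areas.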

The VIRTIO_F_NOTIFICATION_DATA feature is an optimization mainly for hardware implementations of VIRTIO. The driver writes extra information as part of the Available Buffer Notification. Thanks to the information included in the notification, the device does not need to fetch this information from memory anymore. This is useful for PCI hardware implementations where minimizing DMA operations improves performance significantly.
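
As a sketch of what that extra information looks like for a packed virtqueue, the notification is a 32-bit value containing the virtqueue number, the offset of the next available descriptor, and the wrap counter. The Python helper below is my own illustration of that bit layout as I read it in the specification, so check the field positions there before relying on them:

```python
def packed_notification_data(vqn, next_off, next_wrap):
    """Pack the 32-bit notification value for a packed virtqueue:
    bits 0-15 vqn, bits 16-30 next_off, bit 31 next_wrap."""
    if not (0 <= vqn < 1 << 16):
        raise ValueError("vqn must fit in 16 bits")
    if not (0 <= next_off < 1 << 15):
        raise ValueError("next_off must fit in 15 bits")
    if next_wrap not in (0, 1):
        raise ValueError("next_wrap must be 0 or 1")
    return vqn | (next_off << 16) | (next_wrap << 31)

print(hex(packed_notification_data(vqn=1, next_off=5, next_wrap=1)))
```

With this single register write the device learns where the driver's ring position is without a DMA read.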

Thursday, February 28, 2019

QEMU accepted into Google Summer of Code and Outreachy 2019

QEMU is participating in the Google Summer of Code and Outreachy open source internship programs again this year. These 12-week, full-time, paid, remote work internships allow people interested in contributing to QEMU to get started. Each intern works with one or more mentors who can answer questions and are experienced developers. This is a great way to try out working on open source if you are considering it as a career.

For more information (including eligibility requirements), see our GSoC and our Outreachy pages.

Friday, January 25, 2019

VIRTIO 1.1 is available for public review until Feb 21st 2019

The VIRTIO 1.1 specification for paravirtualized I/O devices includes the new packed vring layout and the GPU, input, crypto, and socket device types. In addition to this there are other improvements and new features in the specification. The new vring layout will increase performance and offers new features that devices can take advantage of.

You can review the specification and post comments until February 21st 2019: VIRTIO 1.1 csprd01.

Sunday, January 6, 2019

mute-thread: a script to mute email threads with notmuch

Ever get included on an email thread that isn't relevant? It can be distracting to see new emails appear on a thread you already know is not interesting. You could mark them as read manually, but that is tedious.

This mute-thread script silences email threads that you don't want to read, even after new emails are received.

Download it here.

Setup

It relies on the awesome notmuch(1) email utility, so make sure you have that set up in order to use this script.

The following .muttrc macro integrates this with the mutt(1) email client. When you press M the entire email thread is muted:

macro index M "<enter-command>unset wait_key<enter><pipe-message>~/.mutt/mute-thread add<enter><enter-command>set wait_key<enter><read-thread>" "Mute thread"

After fetching new emails, run notmuch and then mute-thread apply.
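
For the curious, the general idea can be sketched in a few lines of Python. This is not the actual mute-thread script, just a guess at the kind of steps involved: the helper names are hypothetical, and the commands are only built as argument lists, not executed. It relies on notmuch's id: and thread: search terms:

```python
import email

def message_id(raw_message):
    """Return the Message-ID of a raw email without angle brackets."""
    msg = email.message_from_string(raw_message)
    return msg["Message-ID"].strip().strip("<>")

def thread_lookup_command(msgid):
    """notmuch(1) command that prints the thread containing msgid."""
    return ["notmuch", "search", "--output=threads", "id:" + msgid]

def mark_read_command(thread_term):
    """notmuch(1) command that marks a whole thread as read.
    thread_term is a line printed by the lookup, e.g. 'thread:...'."""
    return ["notmuch", "tag", "-unread", "--", thread_term]

# Made-up message for illustration.
raw = "Message-ID: <abc@example.org>\nSubject: hello\n\nbody\n"
print(thread_lookup_command(message_id(raw)))
```

A real script also needs to remember which threads were muted so that the tag change can be reapplied after each mail fetch.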

Unmuting threads

If you change your mind, run mute-thread remove MESSAGE-ID to unmute a thread again. Future emails will not be silenced.