Monday, August 5, 2019

Determining why a Linux syscall failed

When a system call fails, one is often left wondering what caused the errno value. Figuring out the reason can be tricky because a single errno value can have multiple causes. Applications receive an errno integer and no additional information about what went wrong in the kernel.
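For illustration, here is a minimal sketch of everything an application gets to see on failure:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>

int main(void)
{
        int fd = open("/nonexistent", O_RDONLY);

        if (fd < 0) {
                /* errno is all we get - no hint about which kernel
                 * code path produced it */
                fprintf(stderr, "open: %s (errno %d)\n",
                        strerror(errno), errno);
                return 1;
        }
        return 0;
}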

There are several ways to determine the reason for a system call failure (from easiest to most involved):

  1. Check the system call's man page for the meaning of the errno value. Sometimes this is enough to explain the failure.
  2. Check the kernel log using dmesg(1). If something went seriously wrong (like a hardware error) then there may be detailed error information. It may help to increase the kernel log level (see the dmesg(1) example after this list).
  3. Read the kernel source code to understand various error code paths and identify the most relevant one.
  4. Use the function graph tracer to see which code path was taken.
  5. Add printk() calls, recompile the kernel (module), and rerun to see the output.
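For example, dmesg(1) from util-linux can filter the kernel log by severity and raise the console log level:

# dmesg --level=err,warn
# dmesg -n debug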

Reading the man page and checking dmesg(1) are fairly easy for application developers and do not require knowledge of kernel internals. If this does not produce an answer then it is necessary to look closely at the kernel source code to understand a system call's error code paths.

This post discusses the function graph tracer and how it can be used to identify system call failures without recompiling the kernel. This is useful because running a custom kernel may not be possible (e.g. due to security or reliability concerns) and recompiling the kernel is slow.

An example

In order to explore some debugging techniques let's take the io_uring_setup(2) system call as an example. It is failing with ENOMEM but the system is not under memory pressure, so ENOMEM is not expected.
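A minimal reproducer sketch (assuming Linux 5.1+ kernel headers that provide <linux/io_uring.h> and the __NR_io_uring_setup syscall number; there was no libc wrapper at the time, so syscall(2) is used directly):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

int main(void)
{
        struct io_uring_params p;
        int fd;

        /* Reserved fields must be zero */
        memset(&p, 0, sizeof(p));

        /* 4096 entries was the maximum ring size at the time */
        fd = syscall(__NR_io_uring_setup, 4096, &p);
        if (fd < 0) {
                fprintf(stderr, "io_uring_setup: %s (errno %d)\n",
                        strerror(errno), errno);
                return 1;
        }
        return 0;
}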

The io_uring_setup(2) source code (fs/io_uring.c) contains many ENOMEM locations but it is not possible to conclusively identify which one is failing. The next step is to determine which code path is taken using dynamic instrumentation.

The function graph tracer

The Linux function graph tracer records kernel function entries and returns so that function call relationships are made apparent. The io_uring_setup(2) system call is failing with ENOMEM but it is unclear at which point in the system call this happens. It is possible to find the answer by studying the function call graph produced by the tracer and following along in the Linux source code.

Since io_uring_setup(2) is a system call, it is not an ordinary C function definition and has a special symbol name in the kernel ELF file. It is possible to look up the (architecture-specific) symbol for the currently running kernel:

# grep io_uring_setup /proc/kallsyms
...
ffffffffbd357130 T __x64_sys_io_uring_setup

Let's trace all __x64_sys_io_uring_setup calls:

# cd /sys/kernel/debug/tracing
# echo '__x64_sys_io_uring_setup' > set_graph_function
# echo 'function_graph' > current_tracer
# cat trace_pipe > /tmp/trace.log
...now run the application in another terminal...
^C
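
Once enough data has been captured, the tracer can be switched off again to avoid further overhead:

# echo nop > current_tracer
# echo > set_graph_function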

The trace contains many successful io_uring_setup(2) calls that look like this:

 1)               |  __x64_sys_io_uring_setup() {
 1)               |    io_uring_setup() {
 1)               |      capable() {
 1)               |        ns_capable_common() {
 1)               |          security_capable() {
 1)   0.199 us    |            cap_capable();
 1)   7.095 us    |          }
 1)   7.594 us    |        }
 1)   8.059 us    |      }
 1)               |      kmem_cache_alloc_trace() {
 1)               |        _cond_resched() {
 1)   0.244 us    |          rcu_all_qs();
 1)   0.708 us    |        }
 1)   0.208 us    |        should_failslab();
 1)   0.220 us    |        memcg_kmem_put_cache();
 1)   2.201 us    |      }
...
 1)               |      fd_install() {
 1)   0.223 us    |        __fd_install();
 1)   0.643 us    |      }
 1) ! 190.396 us  |    }
 1) ! 216.236 us  |  }

Although the goal is to understand system call failures, looking at a successful invocation can be useful too. Failed calls can be identified in the trace output precisely because they differ from successful ones, and this knowledge is valuable when searching through large trace files. A failed io_uring_setup(2) call aborts early and does not invoke fd_install(). With that in mind it is possible to find a failed call amongst all the io_uring_setup(2) calls:

 2)               |  __x64_sys_io_uring_setup() {
 2)               |    io_uring_setup() {
 2)               |      capable() {
 2)               |        ns_capable_common() {
 2)               |          security_capable() {
 2)   0.236 us    |            cap_capable();
 2)   0.872 us    |          }
 2)   1.419 us    |        }
 2)   1.951 us    |      }
 2)   0.419 us    |      free_uid();
 2)   3.389 us    |    }
 2) + 48.769 us   |  }

The fs/io_uring.c code shows the likely error code paths:

        account_mem = !capable(CAP_IPC_LOCK);

        if (account_mem) {
                ret = io_account_mem(user,
                                ring_pages(p->sq_entries, p->cq_entries));
                if (ret) {
                        free_uid(user);
                        return ret;
                }
        }

        ctx = io_ring_ctx_alloc(p);
        if (!ctx) {
                if (account_mem)
                        io_unaccount_mem(user, ring_pages(p->sq_entries,
                                                                p->cq_entries));
                free_uid(user);
                return -ENOMEM;
        }

But is there enough information in the trace to determine which of these return statements is executed? The trace shows free_uid(), so both code paths are valid candidates. Looking back at the success case, kmem_cache_alloc_trace() serves as a landmark: it is called by io_ring_ctx_alloc(), so kmem_cache_alloc_trace() would appear in the trace before free_uid() if the second return statement were taken. Since it does not appear in the trace output, we can conclude that the first return statement is being taken!

When trace output is inconclusive

Function graph tracer output only shows functions in the ELF file. When the compiler inlines code, no entry or return is recorded in the function graph trace. This can make it hard to identify the exact return statement taken in a long function. Functions containing few function calls and many conditional branches are also difficult to analyze from just a function graph trace.
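One quick check is whether the function still has a symbol at all, using the same kallsyms lookup as before; if nothing is found, the function was inlined and the tracer cannot show it:

# grep io_ring_ctx_alloc /proc/kallsyms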

We can enhance our understanding of the trace by adding dynamic probes that record function arguments, local variables, and/or return values via perf-probe(1). By knowing these values we can make inferences about the code path being taken.
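For example, a return probe can record the value that io_account_mem() returns (a sketch; the probe point is an assumption based on the code above, and probing a static function like this may require kernel debuginfo):

# perf probe 'io_account_mem%return $retval'
# perf record -e probe:io_account_mem__return -a sleep 10
# perf script

A non-zero return value here would confirm the first error code path directly.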

If this is not enough to infer which code path is being taken, detailed code coverage information is necessary.

One way to approximate code coverage is using a sampling CPU profiler, like perf(1), and letting it run under load for some time to gather statistics on which code paths are executed frequently. This is not as precise as code coverage tools, which record each branch encountered in a program, but it can be enough to observe code paths in functions that are not amenable to the function graph tracer due to the low number of function calls.

This is done as follows:

  1. Run the system call in question in a tight loop so the CPU is spending a significant amount of time in the code path you wish to observe.
  2. Start perf record -a and let it run for 30 seconds.
  3. Stop perf-record(1) and run perf-report(1) to view the annotated source code of the function in question (see the example after this list).
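With the failing system call looping in another terminal, a session might look like this:

# perf record -a sleep 30
# perf report

In perf-report(1), selecting the symbol found earlier in /proc/kallsyms and choosing annotate shows per-instruction sample counts; annotated source requires debuginfo for the running kernel.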

The error code path should accumulate a significant number of profiler samples and stand out in the perf-report(1) annotated output.

Conclusion

Determining the cause of a system call failure can be hard work. The function graph tracer is helpful in shedding light on the code paths being taken by the kernel. Additional debugging is possible using perf-probe(1) and a sampling profiler, so in most cases it is not necessary to recompile the kernel with printk() calls just to learn why a system call is failing.