A fair amount of the development work I do is related to storage performance in QEMU/KVM. Although I have written about disk I/O benchmarking and my performance analysis workflow in the past, I haven't covered the performance tools that I frequently rely on. In this post I'll go over what's in my toolbox and hopefully this will be helpful to others.
Performance analysis is hard when the system is slow but there is no clear bottleneck. If a CPU profile shows that a function is consuming a significant amount of time, then that's a good target for optimization. On the other hand, if the profile is uniform and each function only consumes a small fraction of time, then it is difficult to gain much by optimizing just one function (although taking function call nesting into account may point towards parent functions that can be optimized).
If you are measuring just one metric, then eventually the profile will become uniform and there isn't much left to optimize. It helps to measure at multiple layers of the system in order to increase the chance of finding bottlenecks.
Here are the tools I like to use when hunting for QEMU storage performance optimizations.
kvm_stat
kvm_stat is a tool that runs on the host and counts events from the kvm.ko kernel module, including device register accesses (MMIO), interrupt injections, and more. These are often associated with vmentry/vmexit events where control passes between the guest and the hypervisor. Each time this happens there is a cost and it is preferable to minimize the number of vmentry/vmexit events.
kvm_stat will let you identify inefficiencies when guest drivers are accessing devices as well as low-level activity (MSRs, nested page tables, etc) that can be optimized.
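I simply run it as root on the host. As a rough sketch (the --pid option and the pidof lookup are assumptions about your kvm_stat version and setup, not something you must use):

# Run as root on the host; with no arguments kvm_stat shows a continuously
# refreshing table of event counters like the one below.
kvm_stat

# Many versions also accept a PID so the counters are limited to one guest
# (assumption: your kvm_stat supports --pid; the pidof lookup is an example).
kvm_stat --pid "$(pidof qemu-system-x86_64)"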
Here is output from an I/O intensive workload:
Event                     Total  %Total  CurAvg/s
kvm_entry               1066255    20.3     43513
kvm_exit                1066266    20.3     43513
kvm_hv_timer_state       954944    18.2     38754
kvm_msr                  487702     9.3     19878
kvm_apicv_accept_irq     278926     5.3     11430
kvm_apic_accept_irq      278920     5.3     11430
kvm_vcpu_wakeup          250055     4.8     10128
kvm_pv_tlb_flush         250000     4.8     10123
kvm_msi_set_irq          229439     4.4      9345
kvm_ple_window_update    213309     4.1      8836
kvm_fast_mmio            123433     2.3      5000
kvm_wait_lapic_expire     39855     0.8      1628
kvm_apic_ipi               9614     0.2       457
kvm_apic                   9614     0.2       457
kvm_unmap_hva_range           9     0.0         1
kvm_fpu                      42     0.0         0
kvm_mmio                     28     0.0         0
kvm_userspace_exit           21     0.0         0
vcpu_match_mmio              19     0.0         0
kvm_emulate_insn             19     0.0         0
kvm_pio                       2     0.0         0
Total                   5258472            214493
Here I'm looking for high CurAvg/s rates. Any counters at 100k/s are well worth investigating.
Important events:
- kvm_entry/kvm_exit: vmentries and vmexits are when the CPU transfers control between guest mode and the hypervisor.
- kvm_msr: Model Specific Register accesses
- kvm_msi_set_irq: Interrupt injections (Message Signaled Interrupts)
- kvm_fast_mmio/kvm_mmio/kvm_pio: Device register accesses
sysstat
sysstat is a venerable suite of performance monitoring tools covering CPU, network, disk, and memory activity. It can be used equally within guests and on the host. It is like an extended version of the classic vmstat(8) tool.
The mpstat, pidstat, and iostat tools are the ones I use most often (example invocations follow the list):
- mpstat reports CPU consumption, including %usr (userspace), %sys (kernel), %guest (guest mode), %steal (hypervisor), and more. Use this to find CPUs that are maxed out or poor use of parallelism/multi-threading.
- pidstat reports per-process and per-thread statistics. This is useful for identifying specific threads that are consuming resources.
- iostat reports disk statistics. This is useful for comparing I/O statistics between bare metal and guests.
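Here are the invocations I typically start with. This is just a sketch: the one-second interval and the pidof lookup are examples, and any process of interest will do:

# Per-CPU utilization, refreshed every second
mpstat -P ALL 1

# Per-thread statistics for a specific process (the pidof lookup is an
# example; substitute the PID of the process you are investigating)
pidstat -t -p "$(pidof qemu-system-x86_64)" 1

# Extended per-device disk statistics, refreshed every second
iostat -dx 1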
blktrace
blktrace monitors Linux block I/O activity. This can be used equally within guests and on the host. Often it's interesting to capture traces in both the guest and on the host so they can be compared. If the I/O pattern is different then something in the I/O stack is modifying requests. That can be a sign of a misconfiguration.
The blktrace data can be analyzed and plotted with the btt command. For example, the latencies from driver submission to completion can be summarized to find the overhead compared to bare metal.
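A typical capture-and-analyze round trip looks roughly like this. Treat it as a sketch: /dev/vda and the file names are examples, and the commands need root:

# Capture block I/O events for the device under test (stop with Ctrl-C)
blktrace -d /dev/vda -o trace

# Merge the per-CPU trace files into a single binary dump for btt
blkparse -i trace -d trace.bin -O

# Summarize latencies, including D2C (driver submission to completion)
btt -i trace.bin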
perf-top
In its default mode, perf-top is a sampling CPU profiler. It periodically collects the CPU's program counter value so that a profile can be generated showing hot functions. It supports call graph recording with the -g option if you want to aggregate nested function calls and find out which parent functions are responsible for the most CPU usage.
perf-top (and its non-interactive perf-record/perf-report cousin) is good at identifying inner loops of programs, excessive memcpy/memset, expensive locking instructions, instructions with poor cache hit rates, etc. When I use it to profile QEMU it shows the host kernel and QEMU userspace activity. It does not show guest activity.
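I usually run it system-wide first and then narrow it down to the QEMU process. A rough sketch (the pidof lookup is an example):

# Sample all CPUs on the host (interactive; press 'q' to quit)
perf top

# Restrict sampling to the QEMU process and record call graphs so parent
# functions can be expanded
perf top -g -p "$(pidof qemu-system-x86_64)"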
Here is the output without call graph recording where we can see vmexit activity, QEMU virtqueue processing, and excessive memset at the top of the profile:
 3.95%  [kvm_intel]         [k] vmx_vmexit
 3.37%  qemu-system-x86_64  [.] virtqueue_split_pop
 2.64%  libc.so.6           [.] __memset_evex_unaligned_erms
 2.50%  [kvm_intel]         [k] vmx_spec_ctrl_restore_host
 1.57%  [kernel]            [k] native_irq_return_iret
 1.13%  qemu-system-x86_64  [.] bdrv_co_preadv_part
 1.13%  [kernel]            [k] sync_regs
 1.09%  [kernel]            [k] native_write_msr
 1.05%  [nvme]              [k] nvme_poll_cq
The goal is to find functions or families of functions that consume significant amounts of CPU time so they can be optimized. If the profile is uniform with most functions taking less than 1%, then the bottlenecks are more likely to be found with other tools.
perf-trace
perf-trace is an strace-like tool for monitoring system call activity. For performance monitoring it has a --summary option that shows time spent in system calls and the counts. This is how you can identify system calls that block for a long time or that are called too often.
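For example, attaching to a running QEMU process and interrupting with Ctrl-C prints a summary like the one below (the pidof lookup is an example; perf-trace can also attach to individual threads):

# Attach to a running process; on exit (Ctrl-C) print the per-syscall summary
perf trace --summary -p "$(pidof qemu-system-x86_64)"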
Here is an example from the host showing a summary of a QEMU IOThread that uses io_uring for disk I/O and ioeventfd/irqfd for VIRTIO device activity:
 IO iothread1 (332737), 1763284 events, 99.9%

   syscall            calls  errors    total       min       avg       max  stddev
                                      (msec)    (msec)    (msec)    (msec)     (%)
   --------------- -------- ------ -------- --------- --------- --------- -------
   io_uring_enter    351680      0 8016.308     0.003     0.023     0.209   0.11%
   write             390189      0 1238.501     0.002     0.003     0.098   0.08%
   read              121057      0  305.355     0.002     0.003     0.019   0.14%
   ppoll              18707      0   62.228     0.002     0.003     0.023   0.32%
Conclusion
We looked at kvm_stat, sysstat, blktrace, perf-top, and perf-trace. They provide performance metrics from different layers of the system. Another worthy mention is bcc, a collection of eBPF-based tools covering a huge range of performance monitoring and troubleshooting tasks. Let me know which tools you like to use!