A fair amount of the development work I do is related to storage performance in QEMU/KVM. Although I have written about disk I/O benchmarking and my performance analysis workflow in the past, I haven't covered the performance tools that I frequently rely on. In this post I'll go over what's in my toolbox and hopefully this will be helpful to others.
Performance analysis is hard when the system is slow but there is no clear bottleneck. If a CPU profile shows that a function is consuming a significant amount of time, then that's a good target for optimization. On the other hand, if the profile is uniform and each function only consumes a small fraction of time, then it is difficult to gain much by optimizing just one function (although taking function call nesting into account may point towards parent functions that can be optimized).
If you are measuring just one metric, then eventually the profile will become uniform and there isn't much left to optimize. It helps to measure at multiple layers of the system in order to increase the chance of finding bottlenecks.
Here are the tools I like to use when hunting for QEMU storage performance optimizations.
kvm_stat
kvm_stat is a tool that runs on the host and counts events from the kvm.ko kernel module, including device register accesses (MMIO), interrupt injections, and more. These are often associated with vmentry/vmexit events where control passes between the guest and the hypervisor. Each time this happens there is a cost and it is preferable to minimize the number of vmentry/vmexit events.
kvm_stat will let you identify inefficiencies when guest drivers are accessing devices as well as low-level activity (MSRs, nested page tables, etc) that can be optimized.
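I simply run it as root on the host. As a rough sketch (the --pid option and the pidof lookup are assumptions about your kvm_stat version and setup, not something you must use):

# Run as root on the host; with no arguments kvm_stat shows a continuously
# refreshing table of event counters like the one below.
kvm_stat

# Many versions also accept a PID so the counters are limited to one guest
# (assumption: your kvm_stat supports --pid; the pidof lookup is an example).
kvm_stat --pid "$(pidof qemu-system-x86_64)"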
Here is output from an I/O intensive workload:
Event                     Total  %Total  CurAvg/s
kvm_entry               1066255    20.3     43513
kvm_exit                1066266    20.3     43513
kvm_hv_timer_state       954944    18.2     38754
kvm_msr                  487702     9.3     19878
kvm_apicv_accept_irq     278926     5.3     11430
kvm_apic_accept_irq      278920     5.3     11430
kvm_vcpu_wakeup          250055     4.8     10128
kvm_pv_tlb_flush         250000     4.8     10123
kvm_msi_set_irq          229439     4.4      9345
kvm_ple_window_update    213309     4.1      8836
kvm_fast_mmio            123433     2.3      5000
kvm_wait_lapic_expire     39855     0.8      1628
kvm_apic_ipi               9614     0.2       457
kvm_apic                   9614     0.2       457
kvm_unmap_hva_range           9     0.0         1
kvm_fpu                      42     0.0         0
kvm_mmio                     28     0.0         0
kvm_userspace_exit           21     0.0         0
vcpu_match_mmio              19     0.0         0
kvm_emulate_insn             19     0.0         0
kvm_pio                       2     0.0         0
Total                   5258472            214493
Here I'm looking for high CurAvg/s rates. Any counters at 100k/s are well worth investigating.
Important events:
- kvm_entry/kvm_exit: vmentries and vmexits are when the CPU transfers control between guest mode and the hypervisor.
- kvm_msr: Model Specific Register accesses
- kvm_msi_set_irq: Interrupt injections (Message Signaled Interrupts)
- kvm_fast_mmio/kvm_mmio/kvm_pio: Device register accesses
sysstat
sysstat is a venerable suite of performance monitoring tools covering CPU, network, disk, and memory activity. It can be used equally within guests and on the host. It is like an extended version of the classic vmstat(8) tool.
The mpstat, pidstat, and iostat tools are the ones I use most often (example invocations follow the list):
- mpstat reports CPU consumption, including %usr (userspace), %sys (kernel), %guest (guest mode), %steal (hypervisor), and more. Use this to find CPUs that are maxed out or poor use of parallelism/multi-threading.
- pidstat reports per-process and per-thread statistics. This is useful for identifying specific threads that are consuming resources.
- iostat reports disk statistics. This is useful for comparing I/O statistics between bare metal and guests.
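Here are the invocations I typically start with. This is just a sketch: the one-second interval and the pidof lookup are examples, and any process of interest will do:

# Per-CPU utilization, refreshed every second
mpstat -P ALL 1

# Per-thread statistics for a specific process (the pidof lookup is an
# example; substitute the PID of the process you are investigating)
pidstat -t -p "$(pidof qemu-system-x86_64)" 1

# Extended per-device disk statistics, refreshed every second
iostat -dx 1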
blktrace
blktrace monitors Linux block I/O activity. This can be used equally within guests and on the host. Often it's interesting to capture traces in both the guest and on the host so they can be compared. If the I/O pattern is different then something in the I/O stack is modifying requests. That can be a sign of a misconfiguration.
The blktrace data can be analyzed and plotted with the btt command. For example, the latencies from driver submission to completion can be summarized to find the overhead compared to bare metal.
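A typical capture-and-analyze round trip looks roughly like this. Treat it as a sketch: /dev/vda and the file names are examples, and the commands need root:

# Capture block I/O events for the device under test (stop with Ctrl-C)
blktrace -d /dev/vda -o trace

# Merge the per-CPU trace files into a single binary dump for btt
blkparse -i trace -d trace.bin -O

# Summarize latencies, including D2C (driver submission to completion)
btt -i trace.bin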
perf-top
In its default mode, perf-top is a sampling CPU profiler. It periodically collects the CPU's program counter value so that a profile can be generated showing hot functions. It supports call graph recording with the -g option if you want to aggregate nested function calls and find out which parent functions are responsible for the most CPU usage.
perf-top (and its non-interactive perf-record/perf-report cousin) is good at identifying inner loops of programs, excessive memcpy/memset, expensive locking instructions, instructions with poor cache hit rates, etc. When I use it to profile QEMU it shows the host kernel and QEMU userspace activity. It does not show guest activity.
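I usually run it system-wide first and then narrow it down to the QEMU process. A rough sketch (the pidof lookup is an example):

# Sample all CPUs on the host (interactive; press 'q' to quit)
perf top

# Restrict sampling to the QEMU process and record call graphs so parent
# functions can be expanded
perf top -g -p "$(pidof qemu-system-x86_64)"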
Here is the output without call graph recording where we can see vmexit activity, QEMU virtqueue processing, and excessive memset at the top of the profile:
 3.95%  [kvm_intel]         [k] vmx_vmexit
 3.37%  qemu-system-x86_64  [.] virtqueue_split_pop
 2.64%  libc.so.6           [.] __memset_evex_unaligned_erms
 2.50%  [kvm_intel]         [k] vmx_spec_ctrl_restore_host
 1.57%  [kernel]            [k] native_irq_return_iret
 1.13%  qemu-system-x86_64  [.] bdrv_co_preadv_part
 1.13%  [kernel]            [k] sync_regs
 1.09%  [kernel]            [k] native_write_msr
 1.05%  [nvme]              [k] nvme_poll_cq
The goal is to find functions or families of functions that consume significant amounts of CPU time so they can be optimized. If the profile is uniform with most functions taking less than 1%, then the bottlenecks are more likely to be found with other tools.
perf-trace
perf-trace is an strace-like tool for monitoring system call activity. For performance monitoring it has a --summary option that shows time spent in system calls and the counts. This is how you can identify system calls that block for a long time or that are called too often.
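For example, attaching to a running QEMU process and interrupting with Ctrl-C prints a summary like the one below (the pidof lookup is an example; perf-trace can also attach to individual threads):

# Attach to a running process; on exit (Ctrl-C) print the per-syscall summary
perf trace --summary -p "$(pidof qemu-system-x86_64)"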
Here is an example from the host showing a summary of a QEMU IOThread that uses io_uring for disk I/O and ioeventfd/irqfd for VIRTIO device activity:
 IO iothread1 (332737), 1763284 events, 99.9%

   syscall            calls  errors    total       min       avg       max  stddev
                                      (msec)    (msec)    (msec)    (msec)     (%)
   --------------- -------- ------ -------- --------- --------- --------- -------
   io_uring_enter    351680      0 8016.308     0.003     0.023     0.209   0.11%
   write             390189      0 1238.501     0.002     0.003     0.098   0.08%
   read              121057      0  305.355     0.002     0.003     0.019   0.14%
   ppoll              18707      0   62.228     0.002     0.003     0.023   0.32%
Conclusion
We looked at kvm_stat, sysstat, blktrace, perf-top, and perf-trace. They provide performance metrics from different layers of the system. Another worthy mention is bcc, a collection of eBPF-based tools covering a huge range of performance monitoring and troubleshooting tasks. Let me know which tools you like to use!