Sunday, February 9, 2020

Why CPU Utilization Metrics are Confusing

How much CPU is being used? Intuitively we would like to know the percentage of time being consumed. Popular utilities like top(1) and virt-top(1) do show percentages but the numbers can be weird. This post goes into how CPU utilization is accounted and why the numbers can be confusing.

Tools sometimes show CPU utilizations above 100%. Or we know a virtual machine is consuming all its CPU but only 12% CPU utilization is reported. Comparing CPU utilization metrics from different tools often reveals that the numbers they report are wildly different. What's going on?

How CPU Utilization is Measured

Imagine we want to measure the CPU utilization of an application on a simple computer with one CPU. Each time the application is scheduled on the CPU we record the time until it is next descheduled. The utilization is calculated by dividing the total CPU time that the application ran by the time interval being measured:

$$\text{CPU utilization} = \frac{1}{T} \sum_{i=1}^{n} t_i$$

Here t_i is the execution time for the i-th of the n times the application was scheduled and T is the time interval being measured (e.g. 1 second).
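As a small worked example of the calculation described above (total run time divided by the measurement interval), here is a minimal Python sketch. The function name `cpu_utilization` and the sample slice durations are made up for illustration; they do not come from any real tool.

```python
def cpu_utilization(run_times, interval):
    """Return utilization as a fraction: total run time divided by the interval."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return sum(run_times) / interval


# Hypothetical data: the application was scheduled three times
# (0.10 s, 0.25 s and 0.15 s) during a 1-second measurement interval.
slices = [0.10, 0.25, 0.15]
utilization = cpu_utilization(slices, 1.0)
print(f"{utilization:.0%}")  # prints 50%
```

The application ran for 0.5 seconds in total out of a 1-second interval, so it is reported as 50% utilization.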

So far, so good. This is how CPU utilization should work. Now let's look at why the percentages can be confusing.

CPU Utilization on Multi-Processor Systems

Modern computers from mobile phones to laptops to servers typically have multiple logical CPUs. They are called logical CPUs because they appear as a CPU to software regardless of whether they are implemented as a socket, a core, or an SMT hardware thread.

On multi-processor systems we need to adapt the CPU utilization formula to account for CPUs running in parallel. There are two ways to do this:

  1. Treat 100% as full utilization of all CPUs. top(1) calls this Solaris mode.
  2. Treat 100% as full utilization of one CPU. top(1) calls this Irix mode.

By default top(1) reports CPU utilization in Irix mode and virt-top(1) reports Solaris mode.

The implications of Solaris mode are that a single CPU being fully utilized is only reported as 1/N CPU utilization where N is the number of CPUs. On a system with a large number of CPUs the utilization percentages can be very low even though some CPUs are fully utilized. Even on my laptop with 4 logical CPUs that means a single-threaded application consuming a full CPU only reports 25% CPU utilization.

Irix mode produces more intuitive 0-100% numbers for single-threaded applications but multi-threaded applications may consume multiple CPUs and therefore exceed 100%, which looks a bit funny.
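The difference between the two modes comes down to whether the result is divided by the number of CPUs. The following Python sketch illustrates this under that assumption. The function names `irix_percent` and `solaris_percent` are hypothetical and this is not how top(1) is implemented internally.

```python
def irix_percent(cpu_seconds, interval):
    """100% means one fully utilized CPU."""
    return 100.0 * cpu_seconds / interval


def solaris_percent(cpu_seconds, interval, num_cpus):
    """100% means all CPUs fully utilized."""
    return irix_percent(cpu_seconds, interval) / num_cpus


NUM_CPUS = 4  # e.g. a laptop with 4 logical CPUs

# Single-threaded application: 1 CPU-second consumed in a 1-second interval.
print(irix_percent(1.0, 1.0))               # 100.0
print(solaris_percent(1.0, 1.0, NUM_CPUS))  # 25.0

# Multi-threaded application keeping 3 CPUs busy: 3 CPU-seconds in 1 second.
print(irix_percent(3.0, 1.0))               # 300.0
print(solaris_percent(3.0, 1.0, NUM_CPUS))  # 75.0
```

The same workload yields 100% or 25% (and 300% or 75%) depending only on the accounting mode.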

Confused?

Since there are two ways of accounting CPU utilization on multi-processor systems it is always necessary to know which method is being used. A percentage on its own is meaningless and might be misinterpreted.

This also explains why numbers reported by different tools can be so vastly different. It is necessary to check which accounting method each tool uses before comparing them.

Documentation (and source code) often sheds light on which accounting method is used. Another way to check is to run a process that consumes a full CPU, for example with the following shell loop, and observe the CPU utilization that the tool reports for it:

  while true; do true; done
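The same experiment can also be approximated from inside a single process. The sketch below is only an illustration: it spins for a short time, then compares the CPU time the process consumed with the wall-clock time that elapsed and prints the result in both accounting modes. The function name `burn_and_measure` is made up, and the exact numbers will vary with system load.

```python
import os
import time


def burn_and_measure(duration=0.2):
    """Spin for `duration` wall-clock seconds and return both percentages."""
    num_cpus = os.cpu_count() or 1  # cpu_count() may return None
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    while time.perf_counter() - wall_start < duration:
        pass  # busy loop, like `while true; do true; done`
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    irix = 100.0 * cpu / wall    # 100% == one full CPU
    solaris = irix / num_cpus    # 100% == all CPUs
    return irix, solaris, num_cpus


irix, solaris, num_cpus = burn_and_measure()
print(f"Irix mode:    {irix:.0f}%")
print(f"Solaris mode: {solaris:.0f}% (across {num_cpus} CPUs)")
```

On an otherwise idle machine the Irix-mode figure should be close to 100%, while the Solaris-mode figure shrinks as the number of CPUs grows.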

virt-top(1) has another peculiarity that must be taken into account. Its formula divides CPU time consumed by a guest by the total CPU time available on the host. If the guest has 4 vCPUs but the host has 8 physical CPUs, then the guest can only ever reach 50% because it will never use all physical CPUs at once.
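The ceiling follows directly from the ratio of vCPUs to host CPUs. Here is a minimal sketch of that arithmetic. The function name `max_host_percent` is hypothetical, and this is not virt-top's actual code, only the reasoning from the paragraph above.

```python
def max_host_percent(guest_vcpus, host_cpus):
    """Highest percentage a guest can reach when measured against all host CPUs."""
    if guest_vcpus < 1 or host_cpus < 1:
        raise ValueError("CPU counts must be at least 1")
    # A guest cannot run on more physical CPUs at once than it has vCPUs.
    usable = min(guest_vcpus, host_cpus)
    return 100.0 * usable / host_cpus


print(max_host_percent(4, 8))  # 50.0
```

A guest reported at 50% in this setup may therefore already be using every vCPU it has.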

Conclusion

CPU utilization can be confusing on multi-processor systems, which is most computers today. Interpreting CPU utilization metrics requires knowing whether Solaris mode or Irix mode was used for calculation. Be careful with CPU utilization metrics!