Wednesday, November 15, 2017

Video and slides available for "Applying Polling Techniques to QEMU: Reducing virtio-blk I/O Latency"

At KVM Forum 2017 I gave a talk about the AioContext polling optimization that was merged in QEMU 2.9. It reduces latency for virtio-blk and virtio-scsi devices that use the iothread= option on high-IOPS devices like recent NVMe PCIe SSDs. It improves performance for latency-sensitive workloads and has been designed to avoid interfering with workloads that do not benefit from polling, thanks to a self-tuning algorithm.

The video of the talk is now available:

The slides are available here (PDF).

Monday, November 13, 2017

Common disk benchmarking mistakes

Collecting benchmark results is the first step to solving disk I/O performance problems. Unfortunately, many bug reports and performance investigations fall down at the first step because bogus benchmark data is collected. This post explains common mistakes when running disk I/O benchmarks.

Disk I/O patterns

Before we begin, it is important to understand the different I/O patterns and how they are used in benchmarking. Skip this section if you are already familiar with these terms.

Sequential vs random I/O is the access pattern in which data is read or written. Sequential I/O is in-order data access commonly found in workloads like streaming multimedia or writing log files. Random I/O is access of non-adjacent data commonly found when accessing many small files or on systems running multiple applications that access the disk at the same time. It is easy to prefetch sequential I/O so both disk read caches and operating system page caches may keep the next piece of data ready even before it is accessed. Random I/O does not offer opportunities for prefetching and is therefore a harder access pattern to optimize.

Block or request size is the amount of data transferred by a single access. Small request sizes are 512 B through 4 KB, large request sizes are 64 KB through 128 KB, while very large request sizes could be 1 MB (although the maximum allowed request size ultimately depends on the hardware). Fewer requests are needed to transfer the same amount of data when the request size is larger. Therefore, throughput is usually higher at larger request sizes because less per-request overhead is incurred for the same amount of data.

Read vs write is the request type that determines whether data is transferred to or from the storage medium. Reads can be completed cheaply if data is already in the disk read cache; failing that, the access time depends on the storage medium. Traditional spinning disks have significant average seek times in the range of 4-15 milliseconds, depending on the drive, when the head is not positioned at the read location, while solid-state storage devices might take on the order of 10 microseconds. Writes can be completed cheaply by leaving data in the disk write cache unless the cache is full or the cache is disabled.

Queue depth is the number of in-flight I/O requests at a given time. Latency-sensitive workloads submit one request and wait for it to complete before submitting the next request. This is queue depth 1. Parallel workloads submit many requests without waiting for earlier requests to complete first. The maximum queue depth depends on the hardware, with 64 being a common number. Maximum throughput is usually achieved when queue depth is fairly high because the disk can keep busy without waiting for the next request to be submitted and it may optimize the order in which requests are processed.

Random reads are a good way to force storage medium access and minimize cache hit rates. Sequential reads are a good way to maximize cache hit rates. Which I/O pattern is appropriate depends on your goals.

Real-life workloads are usually a mixture of sequential and random access with a range of block sizes and both reads and writes, and the queue depth may vary over time. It is simplest to benchmark a specific I/O pattern in isolation, but benchmark tools can also be configured to produce mixed I/O patterns like 70% reads/30% writes. The goal when configuring a benchmark is to produce the I/O pattern that is critical for real-life workload performance.
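
These I/O pattern terms map directly onto fio(1) job parameters. The following fragment is only an illustrative sketch of where each knob lives; the job name and values are placeholders, not recommendations:

[example]
readwrite=randread      # access pattern: read, write, randread, randwrite, or randrw
blocksize=4k            # request size
iodepth=8               # queue depth (number of in-flight requests)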

1. Use a real benchmarking tool

It is often tempting to use file utilities instead of real benchmarking tools because file utilities report I/O throughput like real benchmarking tools do and the time taken can be easily measured. Therefore it might seem like there is no need to install a real benchmarking tool when file utilities are already available on every system.

Do not use cp(1), scp(1), or even dd(1). Instead, use a real benchmark like fio(1).

What's the difference? Real benchmarking tools can be configured to produce specific I/O patterns, like 4 KB random reads with queue depth 8, whereas file utilities offer limited or no ability to choose the I/O pattern. Since disk performance varies depending on the I/O pattern, it is hard to understand or compare results between systems without full control over the I/O pattern.
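
For example, the "4 KB random reads with queue depth 8" pattern mentioned above can be expressed precisely on the fio(1) command line. This is only a sketch; the device path is a placeholder and must point at a disk you can safely read from:

fio --name=randread-test --filename=/path/to/device --readwrite=randread \
    --blocksize=4k --iodepth=8 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based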

The second reason why real benchmarking tools are necessary is that file utilities are not designed to exercise the disk; they are designed to manipulate files. This means file utilities spend time doing things that do not involve disk I/O, which produces misleading performance results. The most important example of this is that file utilities use the operating system's page cache, and this can result in no disk I/O activity at all!

2. Bypass the page cache

One of the most common mistakes is forgetting to bypass the operating system's page cache. Files and block devices opened with the O_DIRECT flag perform I/O to the disk without going through the page cache. This is the best way to guarantee that the disk actually gets I/O requests. Files opened without this flag are in "buffered I/O" mode and that means I/O may be fulfilled entirely within the page cache in RAM without any disk I/O activity. If the goal is to benchmark disk performance then the page cache needs to be eliminated.

fio(1) jobs must use the direct=1 parameter to exercise the disk.

It is not sufficient to echo 3 > /proc/sys/vm/drop_caches before running the benchmark instead of using O_DIRECT. Although this command is often used to make non-disk benchmarks produce more consistent results between runs, it does not guarantee that the disk will actually receive I/O requests. In addition, the page cache interferes with the desired benchmark I/O pattern since page cache prefetch and writeback will alter the actual I/O pattern that the disk sees.
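
One way to check that the disk is really receiving requests is to watch the device statistics in another terminal while the benchmark runs. This is just a sanity-check sketch; the device name is a placeholder and iostat(1) comes from the sysstat package:

iostat -x /dev/sdX 1    # r/s, w/s, and %util should show activity during the run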

3. Bypass file systems and device mapper

fio(1) can do both file I/O and disk I/O benchmarking, so it's often mistakenly used in file I/O mode instead of disk I/O mode. When benchmarking disk performance it is best to eliminate file systems and device mapper targets to isolate raw disk I/O performance. File systems and device mapper targets may have their own internal bottlenecks, such as software locks, that are unrelated to disk performance. File systems and device mapper targets are also likely to modify the I/O pattern because they submit their own metadata I/O.

fio(1) jobs must use the filename=/path/to/device parameter to do disk I/O benchmarking.

Without a block device filename parameter, the benchmark would create regular files on whatever file system is in use. Remember to double- and triple-check the block device filename before running benchmarks that write to the disk to avoid accidentally overwriting important data like the system root disk!
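
Before running a benchmark that writes to a block device, it is worth confirming that the path refers to the intended disk and that nothing is mounted on it. A quick sanity check, with /dev/sdX as a placeholder, might look like this:

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT /dev/sdX    # verify the size and type, and that no file system is mounted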

Example benchmark configurations

Here are a few example fio(1) jobs that you can use as a starting point.

High-throughput parallel reads

This job is a read-heavy workload with lots of parallelism that is likely to show off the device's best throughput:

[global]
filename=/path/to/device
runtime=120
ioengine=libaio
direct=1
ramp_time=10            # start measuring after warm-up time

[read]
readwrite=read
numjobs=16
blocksize=64k
offset_increment=128m   # each job starts at a different offset

Latency-sensitive random reads

This job is a latency-sensitive workload that stresses per-request overhead and seek times:

[global]
filename=/path/to/device
runtime=120
ioengine=libaio
direct=1
ramp_time=10            # start measuring after warm-up time

[read]
readwrite=randread
blocksize=4k

Mixed workload

This job simulates a more realistic workload with an I/O pattern that contains both reads and writes:

[global]
filename=/path/to/device
runtime=120
ioengine=libaio
direct=1
ramp_time=10            # start measuring after warm-up time

[mixed]
readwrite=randrw
rwmixread=70
rwmixwrite=30
iodepth=4
blocksize=4k
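
To run any of these examples, save the job description to a file and pass the file name to fio(1). The file name below is just a placeholder:

fio mixed-workload.fio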

Conclusion

There are several common issues with disk benchmarking that can lead to useless results. Using a real benchmarking tool and bypassing the page cache and file system are the basic requirements for useful disk benchmark results. If you have questions or suggestions about disk benchmarking, feel free to post a comment.