Collecting benchmark results is the first step to solving disk I/O performance problems. Unfortunately, many bug reports and performance investigations fall down at the first step because bogus benchmark data is collected. This post explains common mistakes when running disk I/O benchmarks.
Disk I/O patterns
Before we begin, it is important to understand the different I/O patterns and how they are used in benchmarking. Skip this section if you are already familiar with these terms.
Sequential vs random I/O describes the access pattern in which data
is read or written. Sequential I/O is in-order data access commonly found in
workloads like streaming multimedia or writing log files. Random I/O is access
of non-adjacent data commonly found when accessing many small files or on
systems running multiple applications that access the disk at the same time.
It is easy to prefetch sequential I/O so both disk read caches and operating
system page caches may keep the next piece of data ready even before it is
accessed. Random I/O does not offer opportunities for prefetching and is
therefore a harder access pattern to optimize.
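For illustration, the fio(1) tool used later in this post selects the access pattern with its readwrite (rw) parameter. A minimal sketch, where /path/to/device is a placeholder for a test device:
fio --name=seqread --filename=/path/to/device --rw=read --direct=1 --runtime=30 --time_based
fio --name=randread --filename=/path/to/device --rw=randread --direct=1 --runtime=30 --time_based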
Block or request size is the amount of data transferred by a single
access. Small request sizes are 512 B through 4 KB, large request sizes are 64
KB through 128 KB, and very large request sizes can reach 1 MB (although the
maximum allowed request size ultimately depends on the hardware). Fewer
requests are needed to transfer the same amount of data when the request size
is larger: reading 1 GB takes 262,144 requests at 4 KB but only 8,192 requests
at 128 KB. Therefore, throughput is usually higher at larger request sizes
because less per-request overhead is incurred for the same amount of data.
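In fio(1), the request size is chosen with the blocksize (bs) parameter. A sketch of the same read job at a small and a large request size, again with a placeholder device path:
fio --name=small-requests --filename=/path/to/device --rw=read --bs=4k --direct=1 --runtime=30 --time_based
fio --name=large-requests --filename=/path/to/device --rw=read --bs=128k --direct=1 --runtime=30 --time_based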
Read vs write is the request type that determines whether data is
transferred to or from the storage medium. Reads can be completed cheaply if
data is already in the disk read cache and, failing that, the access time
depends on the storage medium. Traditional spinning disks have significant
average seek times, typically 4-15 milliseconds depending on the drive, when the head is not positioned at the
read location, while solid-state storage devices might take on the order
of just 10 microseconds. Writes can be completed cheaply by leaving data in the
disk write cache unless the cache is full or the cache is disabled.
Queue depth is the number of in-flight I/O requests at a given time.
Latency-sensitive workloads submit one request and wait for it to complete
before submitting the next request. This is queue depth 1. Parallel workloads
submit many requests without waiting for earlier requests to complete first.
The maximum queue depth depends on the hardware, with 64 being a common
value. Maximum throughput is usually achieved when the queue depth is fairly
high because the disk can keep busy without waiting for the next request to be
submitted and it may optimize the order in which requests are processed.
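In fio(1), queue depth is set with the iodepth parameter, and an asynchronous I/O engine such as libaio is needed so that multiple requests are actually kept in flight. A sketch with a placeholder device path:
fio --name=qd64-randread --filename=/path/to/device --rw=randread --iodepth=64 --ioengine=libaio --direct=1 --runtime=30 --time_based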
Random reads are a good way to force storage medium access and minimize cache hit
rates. Sequential reads are a good way to maximize cache hit rates. Which I/O
pattern is appropriate depends on your goals.
Real-life workloads are usually a mixture of sequential and random access with
a variety of block sizes, a mix of reads and writes, and a queue depth that may
vary over time. It is simplest
to benchmark a specific I/O pattern in isolation but benchmark tools can also
be configured to produce mixed I/O patterns like 70% reads/30% writes. The
goal when configuring a benchmark is to produce the I/O pattern that is
critical for real-life workload performance.
1. Use a real benchmarking tool
It is often tempting to use file utilities instead of real benchmarking tools because file utilities report I/O throughput just like real benchmarking tools do, and the time taken is easy to measure. It might therefore seem like there is no need to install a real benchmarking tool when file utilities are already available on every system.
Do not use cp(1), scp(1), or even dd(1). Instead, use a real benchmark like fio(1).
What's the difference? Real benchmarking tools can be configured to produce specific I/O patterns, like 4 KB random reads with queue depth 8, whereas file utilities offer limited or no ability to choose the I/O pattern. Since disk performance varies depending on the I/O pattern, it is hard to understand or compare results between systems without full control over the I/O pattern.
The second reason why real benchmarking tools are necessary is that file utilities are not designed to exercise the disk, they are designed to manipulate files. This means file utilities spend time doing things that do not involve disk I/O, which produces misleading performance results. The most important example of this is that file utilities use the operating system's page cache, and this can result in no disk I/O activity at all!
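As a concrete example, here is one way to ask fio(1) for the 4 KB random reads with queue depth 8 mentioned above (the device path is a placeholder):
fio --name=randread-qd8 --filename=/path/to/device --rw=randread --bs=4k --iodepth=8 --ioengine=libaio --direct=1 --runtime=60 --time_based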
2. Bypass the page cache
One of the most common mistakes is forgetting to bypass the operating system's page cache. Files and block devices opened with the O_DIRECT flag perform I/O to the disk without going through the page cache. This is the best way to guarantee that the disk actually gets I/O requests. Files opened without this flag are in "buffered I/O" mode and that means I/O may be fulfilled entirely within the page cache in RAM without any disk I/O activity. If the goal is to benchmark disk performance then the page cache needs to be eliminated.
fio(1) jobs must use the direct=1 parameter to exercise the disk.
It is not sufficient to echo 3 > /proc/sys/vm/drop_caches before
running the benchmark instead of using O_DIRECT. Although this command is
often used to make non-disk benchmarks produce more consistent results between
runs, it does not guarantee that the disk will actually receive I/O requests.
In addition, the page cache interferes with the desired benchmark I/O pattern
since page cache prefetch and writeback will alter the actual I/O pattern that
the disk sees.
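A simple sanity check is to watch the device with iostat(1) (from the sysstat package) in another terminal while the benchmark runs; with O_DIRECT in effect, the row for your test device should show request sizes and rates that match the benchmark configuration:
iostat -x 1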
3. Bypass file systems and device mapper
fio(1) can do both file I/O and disk I/O benchmarking, so it's
often mistakenly used in file I/O mode instead of disk I/O mode. When benchmarking disk
performance it is best to eliminate file systems and device mapper targets to
isolate raw disk I/O performance. File systems and device mapper targets may
have their own internal bottlenecks, such as software locks, that are unrelated to
disk performance. File systems and device mapper targets are also likely to
modify the I/O pattern because they submit their own metadata I/O.
fio(1) jobs must use the filename=/path/to/disk parameter to do disk I/O benchmarking.
Without a block device filename parameter, the benchmark would create regular files on whatever file system is in use. Remember to double- and triple-check the block device filename before running benchmarks that write to the disk to avoid accidentally overwriting important data like the system root disk!
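For example, lsblk(8) lists block devices along with their sizes and mount points, which makes it easier to confirm that the device you are about to benchmark is not the system root disk or another disk holding live data:
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT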
Example benchmark configurations
Here are a few example fio(1) jobs that you can use as a starting point.
High-throughput parallel reads
This job is a read-heavy workload with lots of parallelism that is likely to show off the device's best throughput:
[global]
filename=/path/to/device
runtime=120
ioengine=libaio
direct=1
ramp_time=10 # start measuring after warm-up time
[read]
readwrite=read
numjobs=16
blocksize=64k
offset_increment=128m # each job starts at a different offset
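To run this job, save it to a file and pass the file name to fio(1); the file name here is arbitrary. Since the job only reads, fio's --readonly option can be added as an extra safety net:
fio --readonly parallel-reads.fio
The two jobs below are run the same way; drop --readonly for the mixed workload since it writes to the device.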
Latency-sensitive random reads
This job is a latency-sensitive workload that stresses per-request overhead and seek times:
[global]
filename=/path/to/device
runtime=120
ioengine=libaio
direct=1
ramp_time=10 # start measuring after warm-up time
[read]
readwrite=randread
blocksize=4k
Mixed workload
This job simulates a more real-life workload with an I/O pattern that contains both reads and writes:
[global]
filename=/path/to/device
runtime=120
ioengine=libaio
direct=1
ramp_time=10 # start measuring after warm-up time
[read]
readwrite=randrw
rwmixread=70
rwmixwrite=30
iodepth=4
blocksize=4k
Conclusion
There are several common issues with disk benchmarking that can lead to useless results. Using a real benchmarking tool and bypassing the page cache and file system are the basic requirements for useful disk benchmark results. If you have questions or suggestions about disk benchmarking, feel free to post a comment.