Wednesday, 13 March 2013

New in QEMU 1.4: high performance virtio-blk data plane implementation

QEMU 1.4 includes an experimental feature for improved high IOPS disk I/O scalability called virtio-blk data plane. It extends QEMU to perform disk I/O in a dedicated thread that is optimized for scalability with high IOPS devices and many disks. IBM and Red Hat have published a whitepaper presenting the highest IOPS achieved to date under virtualization using virtio-blk data plane:

KVM Virtualized I/O Performance [PDF]

Update

Much of this post is now obsolete! The virtio-blk dataplane feature was integrated with QEMU's block layer (live migration and block layer features are now supported), virtio-scsi dataplane support was added, and libvirt XML syntax was added.

If you have a RHEL 7.2 or later host please use the following:

QEMU syntax:

$ qemu-system-x86_64 -object iothread,id=iothread0 \
                     -drive if=none,id=drive0,file=vm.img,format=raw,cache=none,aio=native \
                     -device virtio-blk-pci,iothread=iothread0,drive=drive0

Libvirt domain XML syntax:

<domain>
    <iothreads>1</iothreads>
    <cputune>  <!-- optional -->
        <iothreadpin iothread="1" cpuset="5,6"/>
    </cputune>
    <devices>
        <disk type="file">
            <driver iothread="1" ... />
        </disk>
    </devices>
</domain>

When can virtio-blk data plane be used?

Data plane is suitable for LVM or raw image file configurations where live migration and advanced block features are not needed. This covers many configurations where performance is the top priority.
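
A raw image file suitable for data plane can be created with qemu-img (the path and size below are just examples):

$ qemu-img create -f raw /var/lib/libvirt/images/vm.img 20G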

Data plane is still an experimental feature because it only supports a subset of QEMU configurations. The QEMU 1.4 feature has the following limitations:

  • Image formats are not supported (qcow2, qed, etc.).
  • Live migration is not supported.
  • QEMU I/O throttling is not supported, but the cgroups blk-io controller can be used (see the sketch after this list).
  • Only the default "report" I/O error policy is supported (-drive werror=,rerror=).
  • Hot unplug is not supported.
  • Block jobs (block-stream, drive-mirror, block-commit) are not supported.
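
As a sketch of the cgroups workaround mentioned in the throttling bullet above: the host's blkio controller can cap the guest's disk bandwidth instead of QEMU I/O throttling. The cgroup path, device numbers, and limit below are examples and depend on your distribution and libvirt version.

# cap reads from the backing device to 10 MB/s for the cgroup that libvirt
# placed the QEMU process in (cgroup v1 blkio controller; paths are examples)
$ lsblk -no MAJ:MIN /dev/sdb
8:16
$ echo "8:16 10485760" > /sys/fs/cgroup/blkio/libvirt/qemu/blkio.throttle.read_bps_device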

How to use virtio-blk data plane

The following libvirt domain XML enables virtio-blk data plane:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='path/to/disk.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
...
  </devices>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.scsi=off'/>
  </qemu:commandline>
  <!-- config-wce=off is not needed in RHEL 6.4 -->
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.config-wce=off'/>
  </qemu:commandline>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
  </qemu:commandline>
</domain>

Note that <qemu:commandline> must be added directly inside <domain> and not inside a child tag like <devices>.

If you do not use libvirt the QEMU command-line is:

qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=path/to/disk.img \
     -device virtio-blk,drive=drive0,scsi=off,config-wce=off,x-data-plane=on
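
To confirm that data plane is actually in use, the "info qtree" monitor command shows the x-data-plane property (the guest name below is an example):

$ virsh qemu-monitor-command --hmp guest0 'info qtree' | grep x-data-plane
    x-data-plane = on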

What is the roadmap for virtio-blk data plane?

The limitations of virtio-blk data plane in QEMU 1.4 will be lifted in future releases. My goal is for QEMU virtio-blk to simply use the data plane approach behind the scenes, so that the x-data-plane option can be dropped.

Reaching the point where data plane becomes the default requires teaching the QEMU event loop and all the core infrastructure to be thread-safe. In the past there has been a big lock that allows a lot of code to simply ignore multi-threading. This creates scalability problems that data plane avoids by using a dedicated thread. Work is underway to reduce the scope of the big lock and allow the data plane thread to work with live migration and other QEMU features that are not yet supported.

Patches have also been posted upstream to convert the QEMU net subsystem and virtio-net to data plane. This demonstrates the possibility of converting other performance-critical devices.

With these developments happening, 2013 will be an exciting year for QEMU I/O performance.

29 comments:

  1. Thanks for the blog post. Do you think it's likely that virtio-net with data plane will become the preferred approach instead of vhost-net?

    ReplyDelete
    Replies
    1. I don't think virtio-net data plane will become the preferred approach in the next year. vhost-net has seen an awful lot of performance work and it does zero-copy transmit. It's hard to do zero-copy transmit from userspace (vhost-net uses a kernel-only interface to achieve this at the moment).

      Delete
  2. Stefan I implemented all the steps you gave in this blog. The VM starts up fine and shows IO in the 100K IOPS range. How can I independently verify that the VM is actually using virtio-blk-data-plane and not just virtio-blk ? Both modes can support 100K IOPS so I'm wondering how I can confirm for certain that it's actually using virtio-blk-data-plane threads? Thank You!

    ReplyDelete
    Replies
    1. The "info qtree" QEMU monitor command will show x-data-plane = on.

      Or alternatively, if you compare /proc/$(pidof qemu)/fd between x-data-plane=on and x-data-plane=off there will be an additional eventfd listed.
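
      For example, a quick way to count the eventfds (a sketch; it assumes a single QEMU process named qemu-kvm — run it with x-data-plane=off and again with x-data-plane=on and compare the counts):

      $ ls -l /proc/$(pidof qemu-kvm)/fd | grep -c eventfd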

      Delete
    2. Stefan, thanks! It took me a bit of time to figure out how to get access to "info qtree". For the benefit of those who might also wish to do this, here is what worked for me.

      First, I set up console access to my KVM guest using this WordPress blog post, which worked perfectly (though I don't know if this is actually required for getting 'info qtree'):

      http://rwmj.wordpress.com/2011/07/08/setting-up-a-serial-console-in-qemu-and-libvirt/

      Then, after I did that, I connected and got info qtree using your other blog post here:

      http://blog.vmsplice.net/2011/03/how-to-access-qemu-monitor-through.html

      and specifically used this command to get the 'info qtree' output, which had eluded my earlier efforts today (as a bit of a KVM newb), efforts that had to be discontinued in favor of the money-making kind of work. So here is how I got 'info qtree'. Run this command from the host, not the guest:

      sudo virsh qemu-monitor-command --hmp Oracle65-1 'info qtree'

      and put in the name of your KVM guest in place of "Oracle65-1"

      And yes, it appears I am in fact using virtio-blk-data-plane (see snippet from my info qtree output below):

      dev: virtio-blk-pci, id "virtio-disk0"
      class = 0x0
      ioeventfd = on
      vectors = 2
      x-data-plane = on

      which is really great. FYI, I am running on an Ubuntu W520 workstation with a Toshiba Q Series Pro SSD, which is DRAM-less and uses 19nm toggle-MLC NAND flash. Using perf_test (with similar results from fio), I get the following results in my KVM guest with virtio-blk-data-plane = on at 4K and 8K block sizes:

      "4096 BW
      MB/sec" "4096
      IOPS" "8192 BW
      MB/sec" "8192
      IOPS"
      450.495 109983 632.04 77152
      436.068 106461 872.674 106527
      454.422 110942 649.34 79264
      442.113 107937 659.15 80462

      which is basically passing through the same speed and IOPS I would get at the host layer.

      Thanks Stefan, great thinking, in a VM...

      Delete
  3. Here's something else that has me wondering...

    Running our storage benchmark tool in KVM host vs. KVM guest.

    Running the benchmark first in the KVM host against physical device /dev/sda

    gstanden@ubuntu-GS1:~$ sudo perf_test -p /dev/sda -t 8 -b 4096 -d 30 -R -C -S
    [sudo] password for gstanden:
    Violin Memory, Inc.
    Version: vtms-linux-utils-D6.2.0.6, 09/12/2013
    Command: perf_test -p /dev/sda -t 8 -b 4096 -d 30 -R -C -S

    Running with options:
    threads = 8
    block_size = 4096
    memory size = 476 GB
    memory start addr = 0x0
    write:read ratio = 0:1
    random mode = 0
    duration = 30 secs
    seconds to skip = 0 secs
    path = /dev/sda
    MB = 1000000 bytes
    incrementing mode = 1 non-repeat
    no-cache mode = 1
    AIO depth = 16

    and summary result for above settings is:

    503.208 MB/s 122853 IOPS

    The above results are in the ballpark for the rated speed of this consumer grade SSD.

    Next, running the exact same perf_test command in the KVM guest, with x-data-plane = on, against the .img KVM backing image file (seen as /dev/vda in the guest), gives:

    1120.335 MB/s 273518 IOPS

    The physical SSD is definitely not rated at 1120 MB/sec and 273000 IOPS ! How are these numbers possible? Do you think these are accurate benchmarks, or is this somehow a flaw in the benchmarking tool?

    ReplyDelete
  4. Hi Stefan

    I was looking at the SPECvirt 2013 results for KVM

    http://www.spec.org/virt_sc2013/results/res2013q3/virt_sc2013-20130730-00003-perf.html#VirtNotes

    and noticed that under "VM Configuration details", it says storage used was "virtio_blk".

    Is this the old virtio which is limited by the big QEMU lock to 150K IOPs or have any of the proposed changes (data-plane, vhost-blk, vhost-scsi) been integrated into virtio-blk?

    What is the status of the big QEMU lock which was hindering IO scalability ?


    ReplyDelete
    Replies
    1. data-plane is available as an experimental option in QEMU. Please check the benchmark run details to see if it was used (the option is -device virtio-blk-pci,...,x-data-plane=on).

      vhost-blk was never merged; I think the issue there was that the performance wins were not clear. And there is a big feature gap (no image formats, no block jobs, etc.) if storage emulation bypasses QEMU.

      vhost-scsi is merged but not commonly shipped by distros at the moment. Again, it has the same drawbacks as vhost-blk with regards to feature-parity.

      Scalability improvements in QEMU are ongoing. Some parts no longer need a global mutex but overall the mutex is still there and will be until dataplane fully replaces the current virtio-blk implementation.

      Delete
  5. I am using the libvirt domain XML file below to enable virtio-blk data plane on a CentOS 6.5 host with the qemu-kvm-0.12.1.2-2.415.el6.8.x86_64 and libvirt-0.10.2-29.el6.9.x86_64 RPMs installed.


    .....
    3a933e22-f032-4dd4-b99c-dfc83acf6de8
    .....

    I am able to define the domain without any error. After starting, qemu-kvm is launched with the args below.

    # tail -f /var/log/libvirt/qemu/instance-00000019.log
    LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name instance-00000018 -S -M rhel6.5.0 -cpu core2duo,+lahf_lm,+dca,+pdcm,+xtpr,+cx16,+tm2,+vmx,+ds_cpl,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds -enable-kvm -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid ebe64ba7-047c-448c-b5de-5a31ec8464c6 -smbios type=1,manufacturer=Red Hat Inc.,product=OpenStack Nova,version=2014.1-4.el6,serial=44454c4c-4b00-1053-804a-c7c04f523153,uuid=ebe64ba7-047c-448c-b5de-5a31ec8464c6 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000018.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/ebe64ba7-047c-448c-b5de-5a31ec8464c6/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/nova/mnt/8d8f616f22b16b4933291b91b6cf164e/volume-3a933e22-f032-4dd4-b99c-dfc83acf6de8,if=none,id=drive-virtio-disk1,format=raw,serial=3a933e22-f032-4dd4-b99c-dfc83acf6de8,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:ec:80:1c,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/ebe64ba7-047c-448c-b5de-5a31ec8464c6/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -set device.virtio-disk1.scsi=off -set device.virtio-disk1.x-data-plane=on
    Domain id=7 is tainted: custom-argv
    char device redirected to /dev/pts/3
    Domain id=7 is tainted: custom-monitor

    # virsh qemu-monitor-command instance-00000018 --hmp info qtree
    dev: virtio-blk-pci, id "virtio-disk1"
    dev-prop: class = 0x100
    dev-prop: drive = drive-virtio-disk1
    dev-prop: ioeventfd = on
    dev-prop: x-data-plane = on
    dev-prop: scsi = off

    But the libvirt daemon is terminating the qemu-kvm process with signal 15 after a few minutes of the guest running successfully. I want to run the command below inside the guest with x-data-plane=on.

    # fio --filename=/dev/vdb --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=50 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest

    My intention is to take performance results with data-plane. Please let me know what I am doing wrong or if you need more information to answer. My guest OS is CentOS 6.5. Does an extra package need to be installed inside the guest to make it work?

    ReplyDelete
    Replies
    1. Signal 15 is SIGTERM. That means libvirt is deliberately shutting down the guest.

      No extra packages are needed to make virtio-blk dataplane work. Look at libvirtd's own logs to find out why it is shutting down the guest. It looks like this is not a QEMU issue.
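
      One way to do that (a sketch using the usual RHEL/CentOS 6 defaults; paths may differ on your system) is to turn up libvirtd's own logging and inspect it when the guest gets killed:

      # in /etc/libvirt/libvirtd.conf:
      log_level = 1
      log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
      # then restart libvirtd and reproduce the shutdown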

      Delete
    2. Hi Stefanha,

      libvirtd is sending SIGTERM to qemu-kvm after a few minutes, but only when data-plane is configured in the domain XML file. Without data-plane, it works fine with the same guest. Below is the relevant part of the libvirtd log. I am not able to understand from the log why virDomainDestroy is suddenly being called, which is the reason libvirtd is killing QEMU. Please suggest where to look.

      8972: debug : virKeepAliveCheckMessage:384 : ka=0x17a3770, client=0x17a5f00, msg=0x17a7800
      8972: debug : virObjectRef:168 : OBJECT_REF: obj=0x17a5f00
      ....
      8972: debug : virNetServerDispatchNewMessage:218 : server=0x179e190 client=0x17a5f00 message=0x17a7800
      8973: debug : virNetServerProgramDispatch:284 : prog=536903814 ver=1 type=0 status=0 serial=210255 proc=12
      8973: debug : virDomainDestroy:2172 : dom=0x7f34940a96a0, (VM: name=instance-0000001a, uuid=.....)
      8973: debug : virObjectRef:168 : OBJECT_REF: obj=0x7f34900c1e40
      8973: debug : qemuProcessKill:4160 : vm=instance-0000001a pid=27243 flags=1
      ...
      8972: debug : qemuMonitorIOProcess:354 : QEMU_MONITOR_IO_PROCESS: mon=0x7f34941284c0 buf={"timestamp": {"seconds": 1403763563, "microseconds": 740372}, "event": "SHUTDOWN"}
      len=85
      ..
      8972: debug : qemuProcessHandleShutdown:663 : vm=0x7f34900c1e40

      Delete
    3. I don't see any reason for the shutdown in the log lines you posted. Please ask libvirt folks for help (#virt on irc.oftc.net).

      Delete
  6. Hi Stefanha,
    I'm running a CentOS 6.5 host with libvirt-0.10.2-29.el6_5.9.x86_64 and qemu-kvm-0.12.1.2-2.415.el6_5.10.x86_64 installed. These are the latest libvirt and qemu-kvm RPMs for CentOS 6.5. From the changelog of the qemu-kvm RPM, I believe x-data-plane is supported.
    The weird thing is that I always failed to add the section with the qemu args via "virsh edit". After I added that section to support x-data-plane via "virsh edit", I checked whether the qemu args were added successfully with "virsh dumpxml", and found that the args I added were not in the XML file. I suspect virsh ignored those args automatically, and I don't know why.
    What's your libvirt version? Do these qemu args require a specific libvirt version?
    Thanks,
    Kenny

    ReplyDelete
    Replies
    1. Yes, dataplane is available in CentOS 6.5.

      Did you modify the <domain> tag as shown in the blog post?


      You must have the xmlns:qemu part, otherwise virsh will discard your XML changes.

      Delete
    2. blogger.com dropped the XML from my reply. Retrying with HTML entities. Hopefully it won't filter it out again:

      <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

      You need the xmlns:qemu attribute.

      Delete
    3. Hi Stefanha,
      I happened to find that it's because I used "virsh edit" while the VM was running. In that case, those args (including the xmlns:qemu attribute) were ignored and not saved. When I powered off the VM and then added those qemu args, it worked.
      Thanks for this great feature.
      Is there any plan to have x-data-plane in virtio-scsi?

      Thanks,
      Kenny

      Delete
    4. virsh edit should work while the guest is running. But you have to reboot to get the effects of your changes.

      If dumpxml does not show your changes then libvirt rejected them because it was unhappy with the XML syntax.

      The plan is for the QEMU 2.1 release (end of July 2014) to lift many of the limitations of dataplane (no image formats, no I/O throttling, no rerror/werror policy, etc).

      virtio-scsi dataplane is interesting but there is still more work necessary before it will be ready.

      Delete
    5. Thanks, Stefanha.
      I'm working to port a backup appliance from VMware to KVM. I'm using virtio-scsi in my VM with a customized Linux kernel based on 3.2. It's mainly a backup appliance (I/O bound). I found a significant backup performance drop (up to 30%) with KVM compared to running on VMware with the same hardware configuration. So I'm trying to use virtio-blk with x-data-plane to see if the performance could be better.

      The reason I'm asking if there is a plan to have x-data-plane in virtio-scsi is that my appliance assumes only SCSI disks are used; in particular, the disk management logic heavily depends on sdX names, HCTL, device paths in /sys, and so on. If virtio-scsi gets x-data-plane, that will be wonderful for compatibility, scalability, and performance.

      Thanks for your guidance.

      -Kenny

      Delete
    6. Kenny,
      virtio-scsi (currently without dataplane) is actively developed and you are welcome to post KVM benchmark results to qemu-devel@nongnu.org so the developers can investigate.

      Information to include:
      1. QEMU command-line (ps aux | grep qemu)
      2. Storage configuration on the host (e.g. raw file on xfs file system).
      3. Host/guest kernel and QEMU versions
      4. Benchmark configuration (e.g. sequential write 8KB, queue depth 16)
      5. Benchmark results

      Stefan

      Delete
  7. Hi Stefanha,

    Just wanted to understand the use case of the .bdrv_co_discard interface of the block driver. I mean, what are the things I should do to implement this interface in a block driver? I am guessing this can be used to reclaim the blocks which are discarded.

    Please confirm and advise.

    Thanks
    Sanjay

    ReplyDelete
    Replies
    1. Hi Sanjay,
      Please email general questions about QEMU to qemu-devel@nongnu.org. This is not related to the blog post.

      .bdrv_co_discard() may be invoked by the guest or by QEMU to free blocks. If you are implementing an image format it should free allocated blocks. If you are implementing a protocol driver, like a network storage system, it should tell the remote storage device to free blocks.

      Some image formats don't have a sophisticated enough block allocator to perform discard. Some network storage protocols don't have a command to free blocks. In these cases you don't need to implement .bdrv_co_discard().

      Stefan

      Delete
  8. Hi Stefan,
    Please let me know whether a network-based protocol supports the 'qcow2' format, as my block driver is network based. It works perfectly for the raw format. When I change the type to 'qcow2', it does not work because bdrv_get_geometry() returns 0. I am working with qemu-kvm-0.12.1.2.

    Thanks & Regards,
    Sanjay Kumar

    ReplyDelete
    Replies
    1. Yes, qcow2 works on non-file protocols including network protocols.

      Delete
  9. Hi Stefan,
    Our qemu version is 0.12.1.2 and our kernel version is 2.6.32, both provided by RHEL.
    Performance test results show that data-plane works very well, but QEMU I/O throttling and network protocols (rbd) are not supported on our version of QEMU; sadly, those features are badly needed by us. Currently we are not going to upgrade our operating system to CentOS 7 yet. My question is: do you have a plan to backport your patches to CentOS 6? I think it is not easy because the versions differ a lot, but I really want to experience this feature in our production environment.
    thanks~

    ReplyDelete
    Replies
    1. Hi,
      The latest virtio-blk dataplane features are available in RHEL7. If you need them in RHEL6, please contact Red Hat to express your interest.

      Stefan

      Delete
  10. Hi Stefan,

    I'm developing for the libiscsi userspace initiator, and for that I tried to understand the data-path flow in QEMU virtio-scsi. I couldn't quite get it, so if you can help me with a few things I will be more than grateful.
    - Where does QEMU create the thread that takes care of virtio-scsi I/O?
    - Where is the poll on some fd to know a socket is available?
    - Where can I find the block queue depth in QEMU? Who is responsible for assigning it?
    - Also, can you explain the difference between virtio-scsi-dataplane and regular virtio-scsi?

    Thanks, Roy

    ReplyDelete
    Replies
    1. Please email the QEMU mailing list at qemu-devel@nongnu.org and CC famz@redhat.com and stefanha@redhat.com.

      In short, the difference between dataplane and non-dataplane is that dataplane has an IOThread (iothread.c) running. The virtio-scsi virtqueue processing is done in the IOThread when dataplane is enabled (see hw/scsi/virtio-scsi.c). This avoids taking the QEMU global mutex and scales better when there are many devices.
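
      As a rough command-line sketch (this assumes a QEMU version where virtio-scsi-pci accepts an iothread property; the image path is an example):

      $ qemu-system-x86_64 -object iothread,id=iothread0 \
                           -device virtio-scsi-pci,id=scsi0,iothread=iothread0 \
                           -drive if=none,id=drive0,file=vm.img,format=raw,cache=none,aio=native \
                           -device scsi-hd,drive=drive0,bus=scsi0.0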

      Delete
  11. Hi stefan,

    After reading the dataplane code, I gather that for the wrapping block device the AioContext is that of the dataplane's internal I/O thread. However, I have also seen in the code that it's possible to give an iothread config parameter when using data plane. What are the scenarios under which it makes sense to do so? Shouldn't the iothread parameter and data plane be mutually exclusive?

    Or is there something missing in my understanding?

    Thanks

    -abhijit

    ReplyDelete
    Replies
    1. They are the same thing. The -device virtio-blk-pci,iothread=foo syntax is the latest and recommended syntax.

      The -device virtio-blk-pci,x-data-plane=on syntax is an experimental and older syntax that implicitly creates an IOThread.

      Nowadays -device virtio-blk-pci,iothread=foo is the syntax that should be used.
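
      Side by side (both forms appear earlier in this post; the drive options are examples):

      # older experimental syntax, which implicitly creates an IOThread:
      $ qemu-system-x86_64 -drive if=none,id=drive0,file=vm.img,format=raw,cache=none,aio=native \
                           -device virtio-blk-pci,drive=drive0,scsi=off,config-wce=off,x-data-plane=on

      # recommended syntax with an explicit IOThread object:
      $ qemu-system-x86_64 -object iothread,id=iothread0 \
                           -drive if=none,id=drive0,file=vm.img,format=raw,cache=none,aio=native \
                           -device virtio-blk-pci,drive=drive0,iothread=iothread0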

      Delete