Wednesday, 7 September 2011

QEMU Internals: vhost architecture

This post explains how vhost provides in-kernel virtio devices for KVM. I have been hacking on vhost-scsi and have answered questions about ioeventfd, irqfd, and vhost recently, so I thought this would be a useful QEMU Internals post.

Vhost overview

The vhost drivers in Linux provide in-kernel virtio device emulation. Normally the QEMU userspace process emulates I/O accesses from the guest. Vhost puts virtio emulation code into the kernel, taking QEMU userspace out of the picture. This allows device emulation code to directly call into kernel subsystems instead of performing system calls from userspace.

The vhost-net driver emulates the virtio-net network card in the host kernel. Vhost-net is the oldest vhost device and the only one which is available in mainline Linux. Experimental vhost-blk and vhost-scsi devices have also been developed.

In Linux 3.0 the vhost code lives in drivers/vhost/. Common code that is used by all devices is in drivers/vhost/vhost.c. This includes the virtio vring access functions which all virtio devices need in order to communicate with the guest. The vhost-net code lives in drivers/vhost/net.c.

The vhost driver model

The vhost-net driver creates a /dev/vhost-net character device on the host. This character device serves as the interface for configuring the vhost-net instance.

When QEMU is launched with -netdev tap,vhost=on it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl(2) calls. These are necessary to associate the QEMU process with the vhost-net instance, prepare for virtio feature negotiation, and pass the guest physical memory mapping to the vhost-net driver.
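The first of these ioctl(2) calls can be sketched as follows. This is a minimal illustration, not QEMU's actual code: the helper name vhost_dev_open() is made up for this example, while VHOST_SET_OWNER and VHOST_GET_FEATURES are the ioctl names from <linux/vhost.h>. Passing the memory mapping and setting up the vrings are left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>

/*
 * Hypothetical helper: open a vhost character device, claim it for the
 * calling process, and read the feature bits the kernel backend offers.
 * Returns the open file descriptor, or -1 on failure.
 */
int vhost_dev_open(const char *path, uint64_t *features)
{
    assert(path != NULL && features != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Associate the calling process with this vhost instance. */
    if (ioctl(fd, VHOST_SET_OWNER) < 0)
        goto fail;

    /* Input for feature negotiation: ask what the backend supports. */
    if (ioctl(fd, VHOST_GET_FEATURES, features) < 0)
        goto fail;

    return fd;

fail:
    close(fd);
    return -1;
}
```

A caller would use it as vhost_dev_open("/dev/vhost-net", &features) and treat -1 as "vhost is not available", for example when the vhost_net module is not loaded or the process lacks permission to open the device.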

During initialization the vhost driver creates a kernel thread called vhost-$pid, where $pid is the QEMU process pid. This thread is called the "vhost worker thread". The job of the worker thread is to handle I/O events and perform the device emulation.

In-kernel virtio emulation

Vhost does not emulate a complete virtio PCI adapter. Instead it restricts itself to virtqueue operations only. QEMU is still used to perform virtio feature negotiation and live migration, for example. This means a vhost driver is not a self-contained virtio device implementation: it depends on userspace to handle the control plane, while the data plane is handled in the kernel.

The vhost worker thread waits for virtqueue kicks and then handles buffers that have been placed on the virtqueue. In vhost-net this means taking packets from the tx virtqueue and transmitting them over the tap file descriptor.

File descriptor polling is also done by the vhost worker thread. In vhost-net the worker thread wakes up when packets come in over the tap file descriptor and it places them into the rx virtqueue so the guest can receive them.

Vhost as a userspace interface

One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if it finds them a convenient high-performance I/O interface.

When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.
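The eventfd semantics that make this work can be tried out from ordinary userspace. The sketch below is only an analogue: the function names kick() and drain_kicks() are invented for this example, and the real vhost worker is woken inside the kernel rather than by calling read(2). It shows that several kicks coalesce into a single counter which the consumer drains in one go.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal the eventfd once, as a guest kick routed through ioeventfd would. */
int kick(int efd)
{
    uint64_t one = 1;

    assert(efd >= 0);
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consume pending kicks on a non-blocking eventfd. Returns how many kicks
 * accumulated since the last call, or 0 if none are pending.
 */
uint64_t drain_kicks(int efd)
{
    uint64_t count = 0;

    assert(efd >= 0);
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

With an eventfd created by eventfd(0, EFD_NONBLOCK), two calls to kick() followed by one drain_kicks() report two pending kicks, and a second drain_kicks() reports none. The consumer therefore does not need to wake up once per kick.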

On the return trip from the vhost worker thread to interrupting the guest a similar approach is used. Vhost takes a "call" file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.

In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.
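As a rough sketch of how these three pieces of state are handed over, the code below uses the VHOST_SET_MEM_TABLE, VHOST_SET_VRING_KICK, and VHOST_SET_VRING_CALL ioctls from <linux/vhost.h>. The helper names make_mem_table() and vhost_wire_up() are hypothetical, the memory table has a single region for brevity, and the remaining setup a real device needs (vring size and addresses, attaching the tap backend) is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Build a one-region table describing where guest RAM is mapped. */
struct vhost_memory *make_mem_table(uint64_t gpa, uint64_t size, void *hva)
{
    struct vhost_memory *mem =
        calloc(1, sizeof(*mem) + sizeof(struct vhost_memory_region));

    if (!mem)
        return NULL;
    mem->nregions = 1;
    mem->regions[0].guest_phys_addr = gpa;
    mem->regions[0].memory_size = size;
    mem->regions[0].userspace_addr = (uint64_t)(uintptr_t)hva;
    return mem;
}

/* Pass memory table, kick eventfd, and call eventfd to a vhost instance. */
int vhost_wire_up(int vhost_fd, struct vhost_memory *mem,
                  unsigned int queue, int kick_fd, int call_fd)
{
    struct vhost_vring_file kick = { .index = queue, .fd = kick_fd };
    struct vhost_vring_file call = { .index = queue, .fd = call_fd };

    assert(mem != NULL);
    if (ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem) < 0)
        return -1;
    if (ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick) < 0)
        return -1;
    if (ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call) < 0)
        return -1;
    return 0;
}
```

The same kick_fd would also be registered with KVM as an ioeventfd and the same call_fd as an irqfd, which is what connects the two kernel modules without either knowing about the other.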

Where to find out more

Here are the main points to begin exploring the code:
  • drivers/vhost/vhost.c - common vhost driver code
  • drivers/vhost/net.c - vhost-net driver
  • virt/kvm/eventfd.c - ioeventfd and irqfd
The QEMU userspace code shows how to initialize the vhost instance:
  • hw/vhost.c - common vhost initialization code
  • hw/vhost_net.c - vhost-net initialization


  1. This comment has been removed by the author.

  2. Nice write-up! If you don't mind I have a networking related question. We noticed that the KVM *host* networking performance dropped after we installed the virtualization software on the base server (there were no guests configured etc.). Would like to understand why this happens.

    1. I saw you posted this question on the KVM mailing list. It's best to discuss it there where others in the community can help.

  3. Hello Stefan,

    First of all: good work. Your articles are very nice!

    in the graphic there is "dma-access" in qemu-space.
    Could you point out, where a DMA-Transfer is initiated?
    If I understand correctly, then only the low-level network card driver does any DMA into its own RX/TX buffers, not vhost itself.
    The communication with the low-level driver and the vhost is through sockets?!
    The Transfer from RX/TX buffers is then a normal memcpy?

    It would be nice if you have the time to clarify some things for me.



    1. Hi Fuzolan,

      > Could you point out, where a DMA-Transfer is initiated?
      > The Transfer from RX/TX buffers is then a normal memcpy?

      vhost_net supports zero-copy transmit. This means that guest RAM is mapped so that the physical NIC can DMA from it.

      The receive path still requires a memory copy from host kernel socket buffers into guest RAM (which is mapped from the QEMU userspace process).

      Take a look at drivers/vhost/net.c handle_tx() and handle_rx().

      > The communication with the low-level driver and the vhost is through sockets?!

      vhost_net does not directly communicate with a physical NIC driver. It talks to a tun-like (tap or macvtap) driver. tap interfaces are often placed on a software bridge which forwards packets to the physical NIC.

      vhost_net uses the in-kernel struct socket interface. But it only works with tun (tap) or macvtap driver instances - it will refuse to use regular socket file descriptors, see drivers/vhost/net.c:get_tap_socket().

      Note that the tun driver struct socket is never exposed to user-space as a socket file descriptor. It's just used inside the kernel like AF_PACKET sockets in userspace.

  4. Hi Stefan. I liked your post. I am interested in adding traffic shaping to the vhost architecture. I have been googling around and, as you suggested, I am looking at drivers/vhost/vhost.c and net.c. It would be great if you could help me with it. I have used the Linux QoS tool tc. The problem with tc is that it applies ingress policing to traffic leaving the virtual machine. The workaround is to use ifb, but tc with ifb drops excess packets after they have already left the VM, which creates additional overhead. If traffic shaping were implemented in the vhost worker thread instead, it would not pick up packets from the guest OS unless the guest had tokens available.

    1. By the time the packet has reached vhost_net.c, the guest has already burnt CPU assembling the packet and preparing it for virtio-net. When the TCP protocol is used the guest will adjust to the network performance, thereby actually holding off on creating packets that would just get dropped.

      Therefore I'm not sure if traffic shaping in vhost_net.c is a big win. For UDP it's more of a win.

      Also keep in mind that the throttling you propose basically makes the vring a packet queue which increases latency. Imagine the vring is being throttled and is currently full. Now if the guest wants to send an urgent packet it cannot until all previous packets have drained! (This is the "bufferbloat" problem.)

      The first step for performance work is a benchmark. Implementing packet-per-second TX throttling in vhost_net.c should be fairly straightforward. Then you can compare the host CPU utilization between the approaches.

  5. Hi Stefan,

    I am running a VM on Linux that uses the vhost driver for sending/receiving traffic in the backend. The VM has two ports, and the stack running in the VM functions as a simple router that receives packets on one virtual port and routes them to the other virtual port.

    In the backend, the Linux vhost driver does the actual packet handling for the VM.

    In my routing test, the vhost TX path seems to be twice as expensive as the vhost RX path.

    The "sendmsg" socket call in the "handle_tx" routine takes almost twice as long as the "recvmsg" socket call in the "handle_rx" routine:

    sendmsg takes ~12000 cycles
    recvmsg takes ~6000 cycles

    Is there anything wrong?
    Is there any way to minimize the time taken in sendmsg call?


    1. Hi,

      I'm not Stefan but I want to share my results.
      I ran some tests with iperf. The RX result inside and outside a guest was always better than TX, and the CPU consumption was always lower for RX. Because I get the same results on the bare-metal host and in guests, I think this is kernel-specific and not related to vhost.

      Maybe you could get slightly better performance if you activate zerocopy in your vhost module.



    2. Please send your question to so the vhost developers can participate.


  6. Hi,

    I am trying to use the balloon driver in qemu to do working set estimation.
    For this I intend to dynamically configure the size of the balloon and study the
    swap-in and swap-out events from the balloon stats.

    In this regard, I would like to clarify my understanding of virtio:

    a) The qemu balloon driver talks to the guest balloon driver and vice versa directly, without any intervention by the host balloon driver. Is this correct?

    b) It is the responsibility of the qemu balloon driver to alert the host through the
    ioeventfd (or irqfd) regarding any such virtio communication. So if I set the balloon size through qemu, is the loop like this:
    qemu-user space ---->guest io---->qemu-user-space------>host---->qemu-user space

    or like

    qemu-user space ---->guest io---->host---->qemu-user space

    Another question: I do not fully understand the virtqueues and their operations in the qemu driver. For example, in the balloon_init driver function there is an add_queue function. Does this mean that only a guest driver can talk to the qemu driver over these queues, or can the host also interact with the qemu driver over these queues?

    Secondly, I am also not clear about the virtque_notify function in virtio.c. Does it notify only the guest driver? Also, there is no kick function call in it. Russell's paper says kick functions are used for notifying the other end of the queue. There are signal-related vring members being set in the code, but I do not really understand the mechanism.

    Thirdly, does the vhost abstraction come into the picture anywhere in these interactions?


    1. a. There is no host balloon driver, at least no special host kernel code for virtio-balloon. virtio-balloon is implemented in QEMU userspace, see hw/virtio/virtio-balloon.c.

      b. guest virtio_balloon -> QEMU userspace virtio-balloon -> madvise(MADV_DONTNEED or MADV_WILLNEED) on host

      The details of ioeventfd, PCI emulation, etc don't matter here. The guest is talking to QEMU. QEMU uses the madvise(2) syscall to tell the host kernel to drop pages or bring them back.

      c. The guest only talks to QEMU for virtio-balloon, there is no host kernel code for virtio-balloon.

      d. The QEMU virtio-pci.c code ensures that virtio_balloon_handle_output() gets called when the guest kicks the host. And QEMU's virtio_notify() kicks the guest by raising an interrupt.

      The guest kernel balloon_ack() function is invoked by the virtio-pci driver when the host raises an interrupt. The guest kicks the host using virtqueue_kick(), which will perform an I/O BAR write on the virtio PCI adapter (see vp_notify()).

      e. vhost doesn't come into play with virtio-balloon since it's only implemented in QEMU userspace, not the host kernel. Therefore your questions are not really related to this blog post. Please ask general technical questions on so the community can help.

  7. Hello Stefan.
    Your articles are very good.
    I have two questions about vhost worker thread.

    > File descriptor polling is also done by the vhost worker thread.
    a) What is the default polling interval?
    b) How can I change the polling interval?

    Thank you.

    1. Hi Kazu,
      "Polling" means Linux kernel file_operations->poll(). vhost does not repeatedly check the file descriptor. Instead, it gets notified when the status changes similar to the poll(2) system call.

    2. Thank you for answering my question.

      > "Polling" means Linux kernel file_operations->poll().
      Oh ... I had misunderstood.
      What would be the best way to change the interval at which the Linux kernel polls a file descriptor?

      Thank you.

    3. The poll(2) mechanism does not busy wait. Therefore there is no interval. The kernel knows when the status of a file descriptor changes so it just wakes up waiting tasks at that point.
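
      The same behaviour can be observed from userspace with poll(2). This is only an illustration of the "no interval" point: the helper name wait_readable() is made up for this example, and vhost itself uses the in-kernel file_operations->poll() hook rather than the system call.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait until fd becomes readable or timeout_ms elapses (-1 waits forever).
 * Returns 1 if readable, 0 on timeout, -1 on error. While waiting, the
 * caller sleeps in the kernel; nothing is re-checked on a timer.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    assert(fd >= 0);
    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

      On a freshly created eventfd, wait_readable(efd, 0) reports a timeout. After an 8-byte counter value is written to the eventfd it reports readable immediately, because the write itself wakes up any waiters.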

  8. Hi,

    I am new to virtualization and am running into a problem when packets are sent at a high rate from a guest running on a KVM-based host. I see the 'overruns' counters incrementing for the vnetXX interfaces. To work around this I increased txqueuelen from the default value of 500 to 10k, which solves the issue, but I would like to understand the root cause: is it happening because packets are drained slowly from the guest, or because qemu is not able to drain packets from the guest? Essentially I would like to understand in which datapath (from guest or to guest) the txqueuelen parameter is used.

    some info about my environment
    qemu net creation options
    -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:ac:00:00:01:01,bus=pci.0,addr=0x3,bootindex=1

    Note: although vhost=on is set, the vhost driver is not being used, since the guest doesn't support MSI-X.

    and type of vnet

    ethtool -i vnet0
    driver: tun
    version: 1.6
    firmware-version: N/A
    bus-info: tap
    supports-statistics: no

    Any help or pointers would be highly appreciated.

    1. Please email your question to so others in the community can help too.

      If you aren't actually using vhost, then expect much lower performance. Nearly all the optimization work in recent years has been in vhost_net.

  9. hello

    I read your post and it was very interesting,
    but I have some questions about vhost.

    I am studying the vhost-net and vhost-scsi source code in order to write an example vhost driver,
    and I want to understand the data path from the guest to the host kernel in the vhost architecture.
    However, I am unsure where the frontend source code for vhost-net and vhost-scsi lives.

    I guess the frontend source code must be in the guest OS kernel,
    so drivers/net/virtio_net.c and drivers/scsi/virtio_scsi.c may be the frontend source code.

    Is this the frontend source code for vhost?
    And in the vhost architecture, does the frontend code need any changes to support vhost?

    1. The guest sees no difference between a vhost-net and a virtio-net device. Both implement the virtio-net specification. The difference is purely where the emulation code in the host lives (host kernel vs host userspace).

      This means the same virtio_net.ko guest driver is used in both cases.

    2. Thank you for your answer.

      In the guest OS the frontend driver uses virtqueue and the virtio API for data communication,
      but in the host OS the vhost driver uses vhost_virtqueue for data communication.

      How are these different structures bound together?
      And who binds them (e.g. common vhost code in QEMU, common vhost code in the host kernel, etc.)?

    3. vhost implements the virtio data structures. It uses the memory layout defined in the virtio specification.

      QEMU userspace sets up the vhost kernel driver and gives it a list of guest RAM regions.

      For more info, please read the source code.

  10. This comment has been removed by the author.

  11. What driver is used in a Linux guest for the virtio-net device? For the control path (for kick events), is it virtio-pci?

    1. The normal virtio guest drivers that are also used for virtio-blk, virtio-scsi, virtio-rng, etc are used with vhost.

      The virtio-net drivers on x86 machines are: virtio_net.ko, virtio_ring.ko, virtio.ko, virtio_pci.ko.

  12. I am a bit late on this. I am trying to trace a packet tx from a virtio-net driver. I have managed to trace up to virtio_kick, which in turn calls
    virtqueue_kick_prepare, then virtqueue_notify, and then vp_notify (as vq->notify in virtqueue_notify) in the guest kernel. That does an iowrite16 to the VIRTIO_PCI_QUEUE_NOTIFY register. This should now arrive in 'qemu', right?

    In qemu, virtio_ioport_write is called with VIRTIO_PCI_QUEUE_NOTIFY, then virtio_queue_notify, and eventually vq->handle_output, which is virtio_net_handle_tx_bh (set up in hw/net/virtio-net.c; I am not sure about this last part). What I am not getting is how it gets into vhost_net inside the host from here. I tried looking up the tap setup with vhost=on and tracing, but couldn't make much progress. Is my understanding correct? What should be the next place to look at?

    1. The VIRTIO_PCI_QUEUE_NOTIFY write is handled first by the kvm kernel module.

      When vhost_net is active there is an ioeventfd file descriptor registered with kvm.ko. Instead of returning from ioctl(KVM_RUN) back to the QEMU userspace as you observed, the ioeventfd signals the vhost_net kernel module. QEMU will not handle the virtqueue processing.

      The vhost_net code (drivers/vhost/net.c) will then process the virtqueue. That means you need to trace in the kernel, not QEMU.

      Note that during vhost setup there are code paths that exit back to QEMU. Once the virtio_net device is fully initialized all rx/tx virtqueue processing happens in vhost_net.ko and not QEMU.

    2. Thanks for answering this.

      So the code above in qemu - would come into play if one starts - without vhost=on parameters. Is that right?

      For tracing how the KVM ioeventfd is set, I need to look at qemu/kvm-all.c and then trace the ioctl in kvm sources in kernel right?

    3. > So the code above in qemu - would come into play if one starts - without vhost=on parameters. Is that right?

      No, as I mentioned in my reply the QEMU code still runs during vhost setup. vhost_net.ko only processes virtqueues once device initialization is complete. Before that QEMU will do some of the setup and may even receive the first guest->host notify.

      > For tracing how the KVM ioeventfd is set, I need to look at qemu/kvm-all.c and then trace the ioctl in kvm sources in kernel right?

      The ioeventfd code is in virt/kvm/eventfd.c.
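
      For orientation, this is roughly what the registration looks like from userspace. It is a sketch under assumptions, not a copy of QEMU's code: the helper names are invented, io_base stands for wherever the device's legacy I/O BAR happens to be mapped, and only the legacy port I/O register layout from <linux/virtio_pci.h> is covered. The struct kvm_ioeventfd fields and the KVM_IOEVENTFD ioctl come from <linux/kvm.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/virtio_pci.h>

/*
 * Describe an ioeventfd that fires when the guest writes queue_index to
 * the legacy VIRTIO_PCI_QUEUE_NOTIFY register of a device whose I/O BAR
 * starts at io_base (io_base is an assumption of this sketch).
 */
struct kvm_ioeventfd make_notify_ioeventfd(uint64_t io_base,
                                           uint16_t queue_index, int efd)
{
    struct kvm_ioeventfd ioev;

    memset(&ioev, 0, sizeof(ioev));
    ioev.addr = io_base + VIRTIO_PCI_QUEUE_NOTIFY;
    ioev.len = 2;                  /* the notify register is 16 bits wide */
    ioev.datamatch = queue_index;  /* only kicks for this queue match */
    ioev.fd = efd;
    ioev.flags = KVM_IOEVENTFD_FLAG_PIO | KVM_IOEVENTFD_FLAG_DATAMATCH;
    return ioev;
}

/* Register the ioeventfd on a KVM VM file descriptor. Returns 0 or -1. */
int register_notify_ioeventfd(int vm_fd, uint64_t io_base,
                              uint16_t queue_index, int efd)
{
    struct kvm_ioeventfd ioev =
        make_notify_ioeventfd(io_base, queue_index, efd);

    assert(efd >= 0);
    return ioctl(vm_fd, KVM_IOEVENTFD, &ioev) < 0 ? -1 : 0;
}
```

      Once such a registration is in place, a matching guest write signals the eventfd inside the kernel and the vcpu does not have to return to userspace for it.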

    4. > No, as I mentioned in my reply the QEMU code still runs during vhost setup. vhost_net.ko only processes virtqueues once device initialization is complete. Before that QEMU will do some of the setup and may even receive the first guest->host notify.

      Thanks for this clarification. Sorry, I missed this point in original reply or rather didn't fully understand!

  13. With virtio, my vring is shared with the kernel, but what about each individual skb? Are they shared or copied from the ring?

    If shared, how are they shared? Please point to the code.
    If they are copied, where are they copied from the ring? Is it in the tap-to-driver interface? Please point to the code.

  14. > With virt io, my virt ring is shared with the kernel, but what about each individual skb. Are they shared or copied from the ring ?

    Take a look at handle_tx() for guest->host packets and handle_rx() for host->guest packets in drivers/vhost/net.c.

    The tx code path has zero-copy support. The rx code path does not have zero-copy support since it does recvmsg() into the virtqueue buffer (i.e. a copy).

  15. Thanks a lot.

    One more confusion I have is with how the guest space is shared with qemu,
    or what is the sharing design for the PCI transport layer.

    The guest ring consists of gva and gpa; how are these translated to hva?
    Again, it would be great if you could point to the code.

    1. QEMU allocates the guest physical memory. This means QEMU has full access to guest physical RAM.

      All memory addresses in virtio are guest physical memory addresses. Since QEMU knows the location of the guest physical RAM it can translate gpa to hva.
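
      The idea behind the translation can be sketched as a lookup in a table of RAM regions. This is a simplified model, not QEMU's or vhost's actual code: struct ram_region and gpa_to_hva() are names invented for this example, and real implementations use more elaborate data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous chunk of guest RAM and where it lives in the host process. */
struct ram_region {
    uint64_t gpa;   /* guest physical start address */
    uint64_t size;  /* length in bytes */
    uint8_t *hva;   /* host virtual address of the first byte */
};

/* Translate a guest physical address; returns NULL if it is not RAM. */
void *gpa_to_hva(const struct ram_region *regions, size_t n, uint64_t gpa)
{
    assert(regions != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const struct ram_region *r = &regions[i];

        /* Written as a subtraction so gpa + size cannot overflow. */
        if (gpa >= r->gpa && gpa - r->gpa < r->size)
            return r->hva + (gpa - r->gpa);
    }
    return NULL;
}
```

      A descriptor address taken from the vring is looked up this way before the buffer is touched. Addresses that fall outside every region yield NULL and must be rejected rather than dereferenced.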

      For an overview of how guest physical memory is organized, see

    2. Hi Stefan, the packet that is transmitted using tun_sendmsg() will still be addressed by HVA, and the network driver will convert it to HPA for the DMA. Is my understanding correct?

  16. Hi Stefan,

    If we increase [#define] VHOST_MAX_PEND from 128 to 256, will there be any side effects? How was this value arrived at? [We are seeing some tx failures in our product due to tx descriptor exhaustion, so we are planning to increase this value.]

    1. Please email with your question so Michael Tsirkin and other vhost developers can participate in the discussion.

      In general, what you want to do sounds questionable. Descriptor rings are finite and code must be able to handle exhaustion. Increasing a limit might appear to help temporarily, but if the workload increases you will hit the same problem again.

      By the way, VHOST_MAX_PEND is just related to the number of in-flight zero-copy transmissions but the actual descriptor ring size is the virtqueue size. The virtqueues are initialized in QEMU's hw/net/virtio-net.c:virtio_net_add_queue() function with 256 descriptors.

  18. Hi Stefan, can the vhost module access the network device buffers directly instead of going through a TAP socket or RAW socket?

    1. The vhost_net.ko kernel module only supports tap-like socket interfaces. In theory it could be modified to support other approaches too.

  19. Hi Stefan,

    Thanks for the knowledge.
    If I want to use vhost-blk for an iSCSI block device, what is the appropriate way to do it?
    With QEMU userspace virtio-blk I can create an iSCSI connection (attached to a block device) and pass it to the guest using the paravirtualized virtio-blk method.


    1. vhost-blk is not being actively developed and was never merged into Linux or QEMU.

      I suggest using userspace virtio-blk. Libvirt can help you manage iSCSI LUNs and assign them to guests:

  20. Hi stefan
    Nice write up.
    In my setup I would like to have mac filtering in the virtio driver level, I have gone through the virtio spec and understood that VIRTIO_NET_CTRL_MAC_TABLE would do that, can you provide steps how to achieve this.

    1. Please send an email with your question to and CC Michael Tsirkin, Jason Wang. They maintain virtio and networking in Linux and QEMU.

      In the email, please clarify whether you are writing your own virtio-net driver or are using a Linux guest with the kernel virtio_net.ko driver.

  21. Hey Stefan

    I am trying a similar setup where I send FS-specific commands from the guest to the vhost using virtio (something like 9P_FS, extended for a FreeBSD guest with Linux as the host). I was wondering which virtio-blk to use to send the requests. Now it makes sense why I couldn't find that in the tree. Do you have any suggestions on the approach? What do you mean by virtio-blk in userspace? I couldn't really find anything in the code. Any pointers here would help. Thanks for sharing your knowledge, and I appreciate your help.

    1. This article is about vhost, the framework used to implement vhost_net.ko in the Linux kernel. virtio-blk and virtio-9p do not use vhost, they are implemented in QEMU.

      I'm not sure exactly what you're trying to do, but if you want to write a FreeBSD virtio-9p driver then you don't need vhost.

      For general technical questions, please email