Saturday, 5 March 2011

QEMU Internals: Overall architecture and threading model

This is the first post in a series on QEMU Internals aimed at developers. It is designed to share knowledge of how QEMU works and make it easier for new contributors to learn about the QEMU codebase.

Running a guest involves executing guest code, handling timers, processing I/O, and responding to monitor commands. Doing all these things at once requires an architecture capable of mediating resources in a safe way without pausing guest execution if a disk I/O or monitor command takes a long time to complete. There are two popular architectures for programs that need to respond to events from multiple sources:
  1. Parallel architecture splits work into processes or threads that can execute simultaneously. I will call this threaded architecture.
  2. Event-driven architecture reacts to events by running a main loop that dispatches to event handlers. This is commonly implemented using the select(2) or poll(2) family of system calls to wait on multiple file descriptors.

QEMU actually uses a hybrid architecture that combines event-driven programming with threads. It makes sense to do this because an event loop cannot take advantage of multiple cores since it only has a single thread of execution. In addition, sometimes it is simpler to write a dedicated thread to offload one specific task rather than integrate it into an event-driven architecture. Nevertheless, the core of QEMU is event-driven and most code executes in that environment.

The event-driven core of QEMU


An event-driven architecture is centered around the event loop which dispatches events to handler functions. QEMU's main event loop is main_loop_wait() and it performs the following tasks:

  1. Waits for file descriptors to become readable or writable. File descriptors play a critical role because files, sockets, pipes, and various other resources are all file descriptors. File descriptors can be added using qemu_set_fd_handler().
  2. Runs expired timers. Timers can be added using qemu_mod_timer().
  3. Runs bottom-halves (BHs), which are like timers that expire immediately. BHs are used to avoid reentrancy and overflowing the call stack. BHs can be added using qemu_bh_schedule(). A short sketch of all three registration calls follows this list.
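
The following minimal example (not taken from the QEMU source) shows how a subsystem might register all three kinds of event sources. The calls shown existed in QEMU around this time, but the exact headers, clock objects, and time units vary between versions, so treat the details as approximate:

    /* Sketch only: registration calls roughly as they looked circa QEMU 0.14.
     * Exact headers, clock names, and time units differ between versions. */
    #include "qemu-common.h"
    #include "qemu-timer.h"

    static void my_fd_read(void *opaque)  { /* fd became readable */ }
    static void my_timer_cb(void *opaque) { /* timer expired */ }
    static void my_bh_cb(void *opaque)    { /* bottom half ran */ }

    static void register_event_sources(int fd, void *opaque)
    {
        /* 1. File descriptor: callback runs when fd is readable
         *    (NULL means no write handler). */
        qemu_set_fd_handler(fd, my_fd_read, NULL, opaque);

        /* 2. Timer: create it once, then arm it for a point in the future. */
        QEMUTimer *t = qemu_new_timer(rt_clock, my_timer_cb, opaque);
        qemu_mod_timer(t, qemu_get_clock(rt_clock) + 1000 /* clock-dependent units */);

        /* 3. Bottom half: runs from the event loop as soon as possible,
         *    without recursing into the caller. */
        QEMUBH *bh = qemu_bh_new(my_bh_cb, opaque);
        qemu_bh_schedule(bh);
    }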

When a file descriptor becomes ready, a timer expires, or a BH is scheduled, the event loop invokes a callback that responds to the event. Callbacks have two simple rules about their environment:
  1. No other core code is executing at the same time so synchronization is not necessary. Callbacks execute sequentially and atomically with respect to other core code. There is only one thread of control executing core code at any given time.
  2. No blocking system calls or long-running computations should be performed. Since the event loop waits for the callback to return before continuing with other events, it is important to avoid spending an unbounded amount of time in a callback. Breaking this rule causes the guest to pause and the monitor to become unresponsive.

This second rule is sometimes hard to honor and there is code in QEMU which blocks. In fact there is even a nested event loop in qemu_aio_wait() that waits on a subset of the events that the top-level event loop handles. Hopefully these violations will be removed in the future by restructuring the code. New code almost never has a legitimate reason to block and one solution is to use dedicated worker threads to offload long-running or blocking code.

Offloading specific tasks to worker threads


Although many I/O operations can be performed in a non-blocking fashion, there are system calls which have no non-blocking equivalent. Furthermore, sometimes long-running computations simply hog the CPU and are difficult to break up into callbacks. In these cases dedicated worker threads can be used to carefully move these tasks out of core QEMU.

One example user of worker threads is posix-aio-compat.c, an asynchronous file I/O implementation. When core QEMU issues an aio request it is placed on a queue. Worker threads take requests off the queue and execute them outside of core QEMU. They may perform blocking operations since they execute in their own threads and do not block the rest of QEMU. The implementation takes care to perform necessary synchronization and communication between worker threads and core QEMU.
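
The general shape of this pattern is sketched below. This is not the actual posix-aio-compat.c code, only an illustration of a request queue shared between core QEMU and blocking worker threads; struct request and the helper names are made up for the example:

    /* Minimal sketch of the worker-thread offload pattern (illustrative,
     * not the posix-aio-compat.c implementation). */
    #include <pthread.h>

    struct request {
        struct request *next;
        /* ...description of the blocking operation to perform... */
    };

    void execute_blocking_io(struct request *req); /* blocking work, defined elsewhere */
    void notify_core_qemu(struct request *req);    /* see the notification sketch below */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static struct request *queue;

    void submit_request(struct request *req)       /* called from core QEMU */
    {
        pthread_mutex_lock(&lock);
        req->next = queue;
        queue = req;
        pthread_cond_signal(&cond);                /* wake one idle worker */
        pthread_mutex_unlock(&lock);
    }

    static void *worker_thread(void *arg)
    {
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!queue) {
                pthread_cond_wait(&cond, &lock);
            }
            struct request *req = queue;
            queue = req->next;
            pthread_mutex_unlock(&lock);

            execute_blocking_io(req);  /* may block: preadv(2), fdatasync(2), ... */
            notify_core_qemu(req);     /* completion is reported back to core QEMU */
        }
        return NULL;
    }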

Another example is ui/vnc-jobs-async.c which performs compute-intensive image compression and encoding in worker threads.

Since the majority of core QEMU code is not thread-safe, worker threads cannot call into core QEMU code. Simple utilities like qemu_malloc() are thread-safe but that is the exception rather than the rule. This poses a problem for communicating worker thread events back to core QEMU.

When a worker thread needs to notify core QEMU, a pipe or a qemu_eventfd() file descriptor is added to the event loop. The worker thread can write to the file descriptor and the callback will be invoked by the event loop when the file descriptor becomes readable. In addition, a signal must be used to ensure that the event loop is able to run under all circumstances. This approach is used by posix-aio-compat.c and makes more sense (especially the use of signals) after understanding how guest code is executed.
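
A hedged sketch of that notification path follows. The pipe-plus-fd-handler idea matches what posix-aio-compat.c does, but the function names here are illustrative, and the real code additionally sends a signal for the reason explained in the next section:

    /* Sketch of worker-to-event-loop notification over a pipe (illustrative
     * names, not the actual QEMU code; the header that declares
     * qemu_set_fd_handler() varies by QEMU version). */
    #include <unistd.h>
    #include "qemu-common.h"

    struct request;                              /* finished requests, tracked elsewhere */
    static int notify_fds[2];                    /* [0] read end, [1] write end */

    static void completion_cb(void *opaque)      /* runs in the event loop */
    {
        char buf[16];
        read(notify_fds[0], buf, sizeof(buf));   /* drain (the real code makes the
                                                  * fd non-blocking and loops) */
        /* ...walk the list of finished requests and invoke their callbacks... */
    }

    void notification_init(void)                 /* called once from core QEMU */
    {
        pipe(notify_fds);                        /* qemu_eventfd() is an alternative */
        qemu_set_fd_handler(notify_fds[0], completion_cb, NULL, NULL);
    }

    void notify_core_qemu(struct request *req)   /* called from a worker thread */
    {
        /* Mark req as finished (details omitted), then wake the event loop;
         * a single byte is enough. */
        char byte = 0;
        write(notify_fds[1], &byte, 1);
    }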

Executing guest code


So far we have mainly looked at the event loop and its central role in QEMU. Equally important is the ability to execute guest code, without which QEMU could respond to events but would not be very useful.

There are two mechanisms for executing guest code: Tiny Code Generator (TCG) and KVM. TCG emulates the guest using dynamic binary translation, also known as Just-in-Time (JIT) compilation. KVM takes advantage of hardware virtualization extensions present in modern Intel and AMD CPUs to safely execute guest code directly on the host CPU. For the purposes of this post the actual techniques do not matter; what matters is that both TCG and KVM allow us to jump into guest code and execute it.

Jumping into guest code takes away our control of execution and gives control to the guest. While a thread is running guest code it cannot simultaneously be in the event loop because the guest has (safe) control of the CPU. Typically the amount of time spent in guest code is limited because reads and writes to emulated device registers and other exceptions cause us to leave the guest and give control back to QEMU. In extreme cases a guest can spend an unbounded amount of time without giving up control and this would make QEMU unresponsive.

In order to solve the problem of guest code hogging QEMU's thread of control, signals are used to break out of the guest. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler function. This allows QEMU to take steps to leave guest code and return to its main loop where the event loop can get a chance to process pending events.
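
As a rough illustration of this kick mechanism, here is a hedged sketch. The names mirror what QEMU uses (SIG_IPI in cpus.c, cpu_exit() in the signal path), but the helpers and the choice of SIGUSR1 are illustrative rather than the actual code:

    /* Hedged sketch of kicking a vcpu thread out of guest code with a signal.
     * SIGUSR1 stands in for QEMU's SIG_IPI; the helpers are illustrative. */
    #include <pthread.h>
    #include <signal.h>

    static void sig_ipi_handler(int sig)
    {
        /* Ask the vcpu to stop running guest/translated code at the next
         * opportunity; in QEMU this is roughly what cpu_exit() arranges. */
    }

    void install_kick_handler(void)
    {
        struct sigaction act = { .sa_handler = sig_ipi_handler };
        sigaction(SIGUSR1 /* stand-in for SIG_IPI */, &act, NULL);
    }

    void kick_vcpu(pthread_t vcpu_thread)
    {
        /* Interrupt whatever the thread is doing, including guest code,
         * so it returns to QEMU and the event loop gets a chance to run. */
        pthread_kill(vcpu_thread, SIGUSR1);
    }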

The upshot of this is that new events may not be detected immediately if QEMU is currently in guest code. Most of the time QEMU eventually gets around to processing events but this additional latency is a performance problem in itself. For this reason timers, I/O completion, and notifications from worker threads to core QEMU use signals to ensure that the event loop will be run immediately.

You might be wondering what the overall picture looks like when the event loop is combined with an SMP guest that has multiple vcpus. Now that the threading model and guest code execution have been covered we can discuss the overall architecture.

iothread and non-iothread architecture


The traditional architecture is a single QEMU thread that executes guest code and the event loop. This model is also known as non-iothread or !CONFIG_IOTHREAD and is the default when QEMU is built with ./configure && make. The QEMU thread executes guest code until an exception or signal yields back control. Then it runs one iteration of the event loop without blocking in select(2). Afterwards it dives back into guest code and repeats until QEMU is shut down.
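
In pseudocode the non-iothread flow looks roughly like the sketch below; the real loop is spread across vl.c and cpus.c, and run_vcpus_for_a_while() is only an illustrative placeholder for TCG or KVM execution:

    /* Pseudocode sketch of the single-threaded (!CONFIG_IOTHREAD) flow. */
    void run_vcpus_for_a_while(void);   /* placeholder: run guest code until an
                                         * exception, I/O access, or signal exits */

    static void non_iothread_main(void)
    {
        while (!qemu_shutdown_requested()) {
            run_vcpus_for_a_while();

            /* One non-blocking pass over file descriptors, timers, and BHs. */
            main_loop_wait(1 /* nonblocking */);
        }
    }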

If the guest is started with multiple vcpus using -smp 2, for example, no additional QEMU threads will be created. Instead the single QEMU thread multiplexes between two vcpus executing guest code and the event loop. Therefore non-iothread fails to exploit multicore hosts and can result in poor performance for SMP guests.

Note that despite there being only one QEMU thread there may be zero or more worker threads. These threads may be temporary or permanent. Remember that they perform specialized tasks and do not execute guest code or process events. I wanted to emphasise this because it is easy, when monitoring the host, to mistake worker threads for vcpu threads. Remember that non-iothread only ever has one QEMU thread.

The newer architecture is one QEMU thread per vcpu plus a dedicated event loop thread. This model is known as iothread or CONFIG_IOTHREAD and can be enabled with ./configure --enable-io-thread at build time. Each vcpu thread can execute guest code in parallel, offering true SMP support, while the iothread runs the event loop. The rule that core QEMU code never runs simultaneously is maintained through a global mutex that synchronizes core QEMU code across the vcpus and iothread. Most of the time vcpus will be executing guest code and do not need to hold the global mutex. Most of the time the iothread is blocked in select(2) and does not need to hold the global mutex.
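
The shape of a vcpu thread under this model is sketched below. The global mutex helpers qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() are real QEMU names, but the loop and the other helpers are illustrative only:

    /* Sketch of a CONFIG_IOTHREAD vcpu thread: the global mutex is dropped
     * while guest code runs, so core QEMU code still executes one thread at
     * a time while vcpus run guest code in parallel. */
    int  vcpu_should_stop(void *vcpu);   /* illustrative helper */
    void enter_guest(void *vcpu);        /* KVM_RUN ioctl or TCG execution */
    void handle_guest_exit(void *vcpu);  /* emulate I/O, inject interrupts, ... */

    static void *vcpu_thread_fn(void *arg)
    {
        qemu_mutex_lock_iothread();

        while (!vcpu_should_stop(arg)) {
            qemu_mutex_unlock_iothread();   /* guest code needs no core QEMU state */
            enter_guest(arg);
            qemu_mutex_lock_iothread();     /* reacquire before touching core QEMU */
            handle_guest_exit(arg);
        }

        qemu_mutex_unlock_iothread();
        return NULL;
    }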

Note that TCG is not thread-safe so even under the iothread model it multiplexes vcpus across a single QEMU thread. Only KVM can take advantage of per-vcpu threads.

Conclusion and words about the future

Hopefully this helps communicate the overall architecture of QEMU (which KVM inherits). Feel free to leave questions in the comments below.

In the future the details are likely to change and I hope we will see a move to CONFIG_IOTHREAD by default and maybe even a removal of !CONFIG_IOTHREAD.

I will try to update this post as qemu.git changes.

42 comments:

  1. Great Start. Very useful and nice reading.

  2. Hi,

    Nice article. Very informative. I found this when searching for answers on a KVM performance problem I'm having. Hopefully your knowledge can help me :)

    I am using Qemu-KVM-0.12.5 on Intel Xeon (VT-x enabled) processors and monitoring the system using htop on the host. On the processors that are running Qemu-KVM I am seeing a 50/50 split between userspace and guest ("gu:" in htop). I have pinned the vCPU qemu-kvm threads to specific host CPUs using taskset. In the guest the CPUs are nearly 100% userspace in htop.

    I want to know why there is a 50/50 split on the host and is there a way to increase the utilization of the virtual CPU?

    I'm using very little emulation. Mostly shared memory interaction with another application on the host.

  3. I suggest running perf(1) on the host to understand where host userspace, probably qemu-kvm, is spending its time. The perf(1) tool ships as linux-tools under Debian and perf under Red Hat.

    Make sure to have qemu-kvm debuginfo installed so that symbol names can be looked up. Look for the qemu-kvm-debuginfo package under Red Hat and qemu-kvm-dbg under Debian.

    It would be interesting to try the latest qemu-kvm too.

    Please move this to a KVM mailing list email thread at kvm@vger.kernel.org so others can help.

  4. One small addition: scheduling a bottom half _is_ thread-safe (and wait-free), so a bottom half can replace the pipe or eventfd in worker threads. Other operations on bottom halves (creating, destroying) are not thread-safe though, and you have to be sure that the core thread won't destroy the bottom half at the same time you're scheduling it, but these conditions are quite easy to satisfy.

    Using a pipe can still be useful if you have to queue some information from the worker thread to the core thread (SPICE does that), but you could as well use an in-memory queue for that.
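
    A minimal sketch of this pattern (illustrative names only, not SPICE or QEMU code): the core thread creates the bottom half up front, worker threads push finished work onto a locked list and then only call qemu_bh_schedule(), relying on scheduling being safe from other threads as described above.

        /* Illustrative sketch; the BH is created once in the core thread. */
        #include <pthread.h>
        #include "qemu-common.h"   /* QEMUBH, qemu_bh_new(), qemu_bh_schedule() */

        struct work { struct work *next; /* ...result data... */ };

        static QEMUBH *done_bh;
        static struct work *finished;
        static pthread_mutex_t finished_lock = PTHREAD_MUTEX_INITIALIZER;

        static void done_bh_cb(void *opaque)    /* runs in the core thread */
        {
            pthread_mutex_lock(&finished_lock);
            struct work *list = finished;       /* take the whole list */
            finished = NULL;
            pthread_mutex_unlock(&finished_lock);
            /* ...complete each item on 'list' here... */
        }

        void completion_init(void)              /* core thread, before workers start */
        {
            done_bh = qemu_bh_new(done_bh_cb, NULL);
        }

        void worker_report_done(struct work *w) /* called from any worker thread */
        {
            pthread_mutex_lock(&finished_lock);
            w->next = finished;
            finished = w;
            pthread_mutex_unlock(&finished_lock);
            qemu_bh_schedule(done_bh);          /* scheduling only, as noted above */
        }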

  5. Thanks for the explanation. I am wondering though because in the !CONFIG_IOTHREAD case qemu_bh_schedule() does not seem thread-safe to me. Calling qemu_notify_event() from a non-QEMU thread looks dangerous since it uses cpu_single_env and may invoke cpu_unlink_tb(). Can that really be done concurrently with the TCG thread without any locking?

  6. Stefan -- thanks much for a very clear explanation. Keep up the good work.

  7. Nice.

    A few other things I wish you talked about are:

    1. How do guest interrupts reach user space and get handled inside qemu?

    2. How does device emulation take place using this event loop?

    3. In the case of KVM, how does the memory model work, namely gva->gpa->hva->hpa?

  8. Thanks for those ideas, I have added them to the list of ideas to blog about. Recently I have not been writing new posts frequently but I'll try to write more soon.

  9. Hi Stefan, thanks a lot for the QEMU internals info. I am interested in how QEMU/KVM emulate device DMAs (assuming the guest doesn't have para-virtualized drivers). If you or someone you know have already covered this topic, please post a link; otherwise please add this to your blog ideas.

  10. hi,

    I have just started to look into qemu/kvm and your post was a nice read. Looking forward to your future posts.

  11. thanks for a great article...

  12. Very good explanation of how qemu/kvm works. Keep blogging.

  13. Thanks for your post. It's very useful for me and many others.

  14. Well explained, very helpful.
    I have a question though:
    "In order to solve the problem of guest code hogging QEMU's thread of control, signals are used to break out of the guest. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler function."

    Can you please elaborate on this part, such as when the signal is sent, which signal it is, and where I can find the relevant code?
    Thanks.

    Replies
    1. cpus.c:qemu_cpu_kick_thread() sends SIG_IPI to a vcpu thread. In cpus.c:qemu_kvm_cpu_thread_fn() you can see that the vcpu main loop will leave kvm_cpu_exec() and check whether to keep running and other guest lifecycle state stuff.

      In TCG mode (when KVM is not in use) the cpu_signal() function will be invoked as the SIG_IPI handler. This calls cpu_exit() to stop running translated code and return to the TCG main loop.

      You can find out more using git grep SIG_IPI in the qemu.git source tree.

  15. I still don't understand:
    - what will worker-threads do ?
    - what kind of event will the main loop receive?
    - what kind of guest code will be executed ?
    Can you give a specific example ?

    Replies
    1. Here are examples for each of your questions:

      - what will worker-threads do ?

      posix-aio-compat.c worker threads perform preadv(2)/pwritev(2)/fdatasync(2)/ioctl(2) system calls. These calls are blocking so they need to be performed in a worker thread, allowing the rest of QEMU to continue executing.

      - what kind of event will the main loop receive?

      Incoming network packets (except when using vhost-net), keypresses from VNC clients, QEMU monitor commands sent by the user or by libvirt. These events are all messages received over file descriptors that QEMU has open.

      The QEMU main loop also processes timers in addition to file descriptor readable/writeable events.

      - what kind of guest code will be executed ?

      Guest code means any code that runs inside the virtual machine. For example, the guest kernel, userspace applications, BIOS, bootloader, etc.

  16. "While a thread is running guest code it cannot simultaneously be in the event loop because the guest has (safe) control of the CPU". Does this mean that the entire guest code(multiple user threads, os process context, softirqs, tasklets, workque, kernel threads) is handled using one qemu thread ? Qemu runs as a process on host OS. how are different guest processes/threads/softirq, etc mapped to this single qemu process ?

    Replies
    1. QEMU emulates hardware. Guest processes/threads/softirqs are Linux kernel concepts - they are software concepts that QEMU does not know about.

      QEMU running a KVM guest has 1 vCPU thread per guest CPU. In other words, QEMU provides 1 thread per guest CPU. It's up to the guest OS scheduler to dispatch processes, threads, softirqs onto the CPUs that it sees, just like on physical hardware.

  17. Hi Stefan,

    Great article! I am a beginner in this field and it really helped me.
    However, it would be really helpful if you could explain how i/o requests and interrupts are handled by QEMU in a multi-vm environment.

    Replies
    1. The QEMU process only knows about the single VM it is emulating. QEMU does not have host-wide knowledge.

      Therefore all host-wide concerns are handled by the Linux kernel: the scheduler, memory management, I/O schedulers, cgroups resource controllers, etc.

    2. In that case, how are interrupts from/to guest vm handled and by whom?

    3. Interrupt controllers are emulated individually for each guest. There is nothing magic about them, they are just another hardware device that each guest has.

      If a real interrupt comes in while running in guest mode, we trap into the host kernel and process the interrupt. For example, if the interrupt signals that the physical NIC has completed transferring a network packet then the emulated virtio-net will inject an interrupt into the guest.

      Upon re-entering the guest, the guest's interrupt handler will be executed and its virtio-net guest driver can handle the interrupt.

      There are a lot of details that I'm skipping but the main idea is that the host interrupt controller is *not* passed through. Instead the guest has its own (software) emulated interrupt controller.

      In the future it will become possible to pass through interrupts for PCI pass-through devices but that's a performance optimization and not the general case today.

  18. Also, can you suggest a book or recent paper that elaborates on how interrupts are handled by kvm-qemu?

  19. I am working on developing an algorithm for coalescing interrupts in kvm but I haven't been able to find an elaborate source from where I could read. I know the basic process but I still need to fill in the details.

    Replies
    1. I suggest emailing kvm@vger.kernel.org with an outline of your idea. Also, interrupt coalescing can take advantage of device-specific knowledge to balance between latency and batching. Not sure if you have a particular device in mind - some of them already use interrupt coalescing strategies.

      Anyway, please discuss the details on the KVM mailing list if you need help.

  20. Hi Stefan,

    Sorry to bug you with this but the stated email address is generating an error. Can you please tell me how to approach the KVM mailing list?

    Replies
    1. kvm@vger.kernel.org is the correct mailing list address, not sure why it isn't working for you.

      Here is the link for more information: http://www.linux-kvm.org/page/Lists,_IRC

  21. Hi Stefan,
    Can you explain how an interrupt is generated and control is transferred to the guest when a key is pressed? Are there any code files that you have in mind related to this? It would be a great help from your side...

    Replies
    1. Here is the PC keyboard emulation code that hooks up keypresses in hw/input/pckbd.c:i8042_realizefn():
      s->kbd = ps2_kbd_init(kbd_update_kbd_irq, s);

      kbd_update_kbd_irq() is invoked when a new keypress was queued for the guest. kbd_update_kbd_irq() raises an interrupt and the guest will come back to read keypresses from the keyboard controller.

      If this doesn't help you understand, make sure you know how PS/2 keyboard and OS keyboard interrupt handlers work. The emulation code should be clear if you understand how hardware behaves.

      Please send general questions to qemu-devel@nongnu.org or ask on #qemu on irc.oftc.net in the future. That way, others in the QEMU community can help.

  22. Hi, I am trying to figure out the interrupt flow in qemu... I have understood how interrupt-related registers are set for a particular architecture, i.e. how control passes through qemu_set_irq() [in irq.c] to i8259.c and on to the architecture-specific files (e.g. in the case of mips: cpu_mips_irq_request in mips_int.c and cpu_reset_interrupt in exec.c).
    What I am not able to understand is how the guest is able to read those interrupt-related registers while it is busy executing its own code? If I am not wrong, this is happening in the event loop...

  23. Please send general questions to qemu-devel@nongnu.org or ask on #qemu on irc.oftc.net in the future. That way, others in the QEMU community can help.

    If you're asking how vcpus are "kicked" out of the code they are currently executing and forced to dispatch interrupts, then it depends on whether TCG or KVM is being used to execute guest code. In both cases we record the fact that an interrupt must be injected and notify the vcpu.

    See cpus.c:qemu_cpu_kick_thread() and SIG_IPI handling in QEMU as an example. Similar things happen in the kvm.ko kernel module.

    Basically: a signal is sent to interrupt guest code execution. Then TCG or KVM figures out there is a new interrupt to dispatch and re-enters the guest with that interrupt.

  24. Thanks... While debugging using gdb, I found that this interrupt-related flow is not handled in the iothread, i.e. using info threads I found: Thread 2 (iothread) is at __lll_lock_wait() while Thread 1 is at qemu_set_irq() (in irq.c). Initially I thought that this interrupt-related code (kbd_update_irq -> qemu_set_irq -> ...) would be handled through the iothread (event loop). Can you help me regarding this?

    Replies
    1. Please send general questions to qemu-devel@nongnu.org or ask on #qemu on irc.oftc.net in the future. That way, others in the QEMU community can help.

  25. Hi Stefan,
    I have a few questions on qemu device emulation.

    For example, ARM versatilepb machine emulation on an x86 PC:
    case 1> There is no flash on the host PC, so qemu creates a virtual flash device in RAM and reads and writes into this virtual flash inside RAM.
    case 2> For the keyboard device, is the I/O performed by host OS drivers, or is the guest driver code executed using KVM?

    Replies
    1. Please send general technical questions to qemu-devel@nongnu.org.

      Also please make them actual questions with a question mark :). I'm not really sure what your questions are.

    2. Hi Stefan,
      Thanks for the reply. I had posted this question on the qemu mailing list but got no responses.
      It would be of much help if you could answer these queries.

      From the information I have on qemu, what I understand is: "QEMU submits I/Os to the host on behalf of the guest". My questions are:

      1> Is the I/O performed by host OS drivers? What is the role of guest OS drivers?
      2> Does qemu emulate hardware functionality of devices in software?

      PART-B
      1> I would like to know if WinCE 6.5 is supported on qemu-system-arm for any of the omap3 machines.
      2> If no, what effort would it take to boot WinCE 6.5 on qemu-omap3 using the BeagleBoard as reference?

      Thanks Much
      Wasim

    3. Don't know much about part B of your questions, but regarding part A: Qemu emulates hardware devices in software, and translates guest code to the particular assembly so that it may run on the host architecture. The running guest OS does not know that it is being run on software (not actual hardware). Let's take the example of networking. The guest's networking driver will perform I/O operations, which will be handled by the networking device emulation code in QEMU, and then QEMU will use the networking device on the actual host to perform the networking functionality.

    4. Hi AZ,
      In your reply "and then QEMU will use the networking device on the actual host to perform the networking functionality" ---> Does this mean using host drivers?

      Thanks
      Wasim

  26. Hi Stefan,

    The article is very informative!

    I started a virtual machine with 2 virtual cpus (KVM assisted). It generated a number of threads in qemu, though I was not doing any CPU-intensive or I/O-intensive operations. I understand from your article that qemu creates one thread per vcpu (vcpu1, vcpu2) and an iothread. That sums up to three permanent threads. Could you kindly clarify the following:

    1. Are all other threads worker threads?
    2. Can I schedule worker threads with a different scheduling policy if they are doing the task of the vcpu1 thread?

    Thanks,
    Tamilselvan

    Replies
    1. 1. Are all other threads worker threads?

      There are a number of threads that might be running depending on your configuration. VNC, live migration, Gluster/Ceph, audio, and other features may use threads.

      Some of them may be part of a "thread pool" which is reused for different operations. Others may be dedicated threads doing just one thing (such as audio processing).

      2. Can I schedule worker threads with a different scheduling policy if they are doing the task of the vcpu1 thread?

      In general threads will have the same CPU affinity as the main loop/iothread. There are QMP APIs to query the tids of some threads - this makes it possible to customize the CPU affinity and other per-thread attributes for them.

      If you have a concrete question about certain types of threads, please email the QEMU mailing list qemu-devel@nongnu.org.
