Wednesday, 9 March 2011

QEMU Internals: Big picture overview

Last week I started the QEMU Internals series to share knowledge of how QEMU works. I dove straight into the threading model without first giving a high-level overview. I want to go back and provide the big picture so that the details of the threading model can be understood more easily.

The story of a guest


A guest is created by running the qemu program, also known as qemu-kvm or just kvm. On a host that is running 3 virtual machines, there are 3 qemu processes:


When a guest shuts down, the qemu process exits. Reboot is performed without restarting the qemu process for convenience, although it would work just as well to shut down and then start qemu again.

Guest RAM


Guest RAM is simply allocated when qemu starts up. It is also possible to use file-backed memory with -mem-path so that hugetlbfs can be used. Either way, the RAM is mapped into the qemu process' address space and acts as the "physical" memory seen by the guest:


QEMU supports both big-endian and little-endian target architectures so guest memory needs to be accessed with care from QEMU code. Endian conversion is performed by helper functions instead of accessing guest RAM directly. This makes it possible to run a target with a different endianness from the host.

KVM virtualization


KVM is a virtualization feature in the Linux kernel that lets a program like qemu safely execute guest code directly on the host CPU. This is only possible when the target architecture is supported by the host CPU. Today KVM is available on x86, ARMv8, ppc, s390, and MIPS CPUs.

In order to execute guest code using KVM, the qemu process opens /dev/kvm, creates file descriptors for the VM and its vcpus, and issues the KVM_RUN ioctl on each vcpu. The KVM kernel module uses hardware virtualization extensions found on modern Intel and AMD CPUs to directly execute guest code. When the guest accesses a hardware device register, halts the guest CPU, or performs other special operations, KVM exits back to qemu. At that point qemu can emulate the desired outcome of the operation or, in the case of a halted guest CPU, simply wait for the next guest interrupt.

The basic flow of a guest CPU is as follows:
open("/dev/kvm")
ioctl(KVM_CREATE_VM)
ioctl(KVM_CREATE_VCPU)
for (;;) {
    ioctl(KVM_RUN)
    switch (exit_reason) {
    case KVM_EXIT_IO:  /* ... */
    case KVM_EXIT_HLT: /* ... */
    }
}

The host's view of a running guest


The host kernel schedules qemu like a regular process. Multiple guests run alongside each other without any knowledge of one another. Applications like Firefox or Apache compete for the same host resources as qemu, although resource controls can be used to isolate and prioritize qemu.

Since qemu system emulation provides a full virtual machine inside the qemu userspace process, the details of what processes are running inside the guest are not directly visible from the host. One way of understanding this is that qemu provides a slab of guest RAM, the ability to execute guest code, and emulated hardware devices; therefore any operating system (or no operating system at all) can run inside the guest. There is no ability for the host to peek inside an arbitrary guest.

Guests have one so-called vcpu thread per virtual CPU. A dedicated iothread runs a select(2) event loop to process I/O such as network packets and disk I/O completions. For more details and possible alternative configurations, see the threading model post.

The following diagram illustrates the qemu process as seen from the host:




Further information


Hopefully this gives you an overview of QEMU and KVM architecture. Feel free to leave questions in the comments and check out other QEMU Internals posts for details on these aspects of QEMU.

Here are two presentations on KVM architecture that cover similar areas if you are interested in reading more:

39 comments:

  1. This is such a poorly documented area... thanks for providing explanations and pointers

  2. Hi Stefan,
    Nice informative article.
    I have a few basic questions.

    Is open("/dev/kvm") getting called from vcpu thread or io thread ?
    How does vcpu thread give illusion of a cpu ? Does it have to do with maintaining cpu context before context switch ? or does it require VT hardware support ?

    1. > Is open("/dev/kvm") getting called from vcpu thread or io thread ?

      It is called from the main thread when QEMU starts up. Remember each vcpu has its own file descriptor, so open("/dev/kvm") is the global file descriptor for the VM and not specific to a vcpu.

      > How does vcpu thread give illusion of a cpu ? Does it have to do with maintaining cpu context before context switch ? or does it require VT hardware support ?

      Yes, the kvm.ko ioctl(2) API allows the vcpu register state to be manipulated. QEMU initializes the vcpu on reset just like a physical CPU has an initial register state when the machine is turned on.

      Hardware support is required for KVM. On Intel CPUs the instruction set extension is called VMX (the feature is marketed as "VT"). On AMD CPUs the instruction set extension is called SVM. They are not compatible and therefore kvm.ko has 2 separate codepaths for Intel and AMD.

  3. Hey Stefan. Thank you for a wonderful post. I was wondering if there is any way for the hypervisor and the guest to communicate with each other directly. I am trying to establish a direct bi-directional communication channel between the hypervisor and the guest. Do you know of any approaches, or do you have any pointers for solving this problem?

    1. It depends on what sort of communication you need. virtio-serial can be used for arbitrary guest/host communication.

      The QEMU guest agent (qga) builds on top of virtio-serial and introduces a JSON RPC API. It allows the host to invoke a set of commands inside the guest (to query the primary IP address, prepare applications for backup, etc).

      Please discuss on qemu-devel@nongnu.org if you want to go into more detail.

  4. Hi Stefan,
    Thanks for the great article. It's dated 2011, so I wonder, now at the end
    of 2013, is QEMU still needed as part of KVM to run a guest OS?
    What is qemu-kvm? Most articles only mention KVM and say little about
    QEMU, so I wonder, has everything been integrated into KVM?
    Thanks.

    1. "KVM" is often used to describe the entire virtualization software stack. But really the components are:
      1. kvm.ko, the kernel module that uses VMX or SVM CPU instructions to run guest code
      2. QEMU, the userspace process that performs most device emulation and controls the guest

      For historic reasons QEMU has sometimes been called "qemu-kvm" or just "kvm". I described that in detail here:
      http://blog.vmsplice.net/2012/12/qemu-kvmgit-has-unforked-back-into.html

  5. Stefan, excellent docs!

    A question: Is there any reason why KVM still emulates x86 instructions inside the kernel, when it can just run them directly on bare metal?

    I understand that KVM must emulate missing instructions, but the kernel even models general instructions like PUSH, POP, etc., which confuses me a lot.

    Thanks!

    1. Emulation is used in a couple of different cases: older CPUs don't support real-mode in VMX so it must be emulated, a faulting instruction may need emulation if it accesses an MMIO address, etc.

      That said, I'm not an expert on kvm.ko internals so try asking on kvm@vger.kernel.org if this answer isn't enough.

  6. Thanks for your reply. I will post to the list, but just one more question: you said "... a faulting instruction may need emulation if it accesses an MMIO address". I don't get why a faulting instruction must be emulated in this case. Can you elaborate a bit?

    Thanks.

    1. Imagine an instruction like:
      mov $0x1, MMIO_ADDRESS

      The guest is storing a value to a memory-mapped I/O address. The CPU cannot execute that in VMX mode since I/O device emulation is performed in software. So my guess is that kvm.ko will emulate the instruction - but check the KVM source code or Intel Architecture Manual to be sure how it works.

    2. @Anonymous: did you find a satisfactory answer somewhere? I'd be also curious to learn it :) Thanks!

  7. Thanks for the information, Stefan.
    I did not find any developer documentation for QEMU.
    Consider a scenario where the disk cache (in main memory) is used inside the virtual machine. I want a mapping between any disk-cache memory page and the actual physical block backing that page. Which files would I have to look at, and what changes are required? Just the files, please.

    1. Hi Alex,
      Your question does not make sense to me. Maybe I have misunderstood it.

      A process inside the guest has memory-mapped data from disk. You want to know the mapping between memory pages and disk blocks.

      This information is not within QEMU. The guest kernel does the memory management you are interested in. The management of virtual memory is a guest concept, QEMU or physical hardware doesn't know which page in RAM is mapped from disk - the kernel memory management code creates the illusion by mapping/unmapping pages on-demand.

    2. The host creates an image file and the guest sees it as a file system, or maybe a disk. Suppose the guest wants to access block no. 10 of some file on some file system. At the host level that request is actually converted into an access to the image file. If you can help me: which function or file(s) convert the virtual I/O request to a physical one? Forget memory this time.

    3. Thank you for the reply.
      I can show you an example:
      a file access from virtual memory
      is converted by virtio at the host level.

      So I want to know the function or file which actually performs this conversion. Thank you in advance.

    4. hw/block/virtio-blk.c:virtio_blk_handle_read() performs a read request on behalf of the guest. The key function is bdrv_aio_readv(), which reads sectors from the disk image.

  8. stefanha,

    Awesome doc, I really enjoyed and followed it. I spent some time looking at the KVM source code and scribbled some notes, and thought of sharing them with people who may find them useful. I'm still working on adding some text, shaping it, and making it more readable; I'd appreciate any comments and suggestions to make it more readable and useful:
    http://linux-kvm-internals.blogspot.in/2014/08/kvm-linux-kernel-virtual-machine.html

  9. How do VM network packets get passed down (or transferred) to the data-link layer of the physical machine? Which part of the kvm, kvm-kmod, or qemu source code is responsible for this functionality?

    1. Please ask general technical questions on the QEMU mailing list: qemu-devel@nongnu.org.

  10. Hi Stefan,
    A very good post!!
    I believe it is possible to emulate an ARM core on x86, but you have mentioned... "today that means x86-on-x86 virtualization only". Can you please elaborate?

    1. This section is about hardware virtualization support, which makes it possible to "safely execute guest code directly on the host CPU". The key word is "directly" because KVM gives the CPU the address of some guest code and the CPU executes it without any translation.

      Emulation does not execute guest code directly on the host CPU. It either uses an interpreter or a just-in-time compiler to translate guest code to native code. Emulation is slower than virtualization because of this extra software layer.

    2. Hi Stefan,
      "Emulation does not execute guest code directly on the host CPU. It either uses an interpreter or a just-in-time compiler to translate guest code to native code." -> Could it be that the guest code is ARM code and the native code is x86 code? Thanks.

    3. downriver119: Yes, QEMU can run ARM guest code on an x86 host.

    4. Hi Stefan,
      Thanks for your quick reply.
      "QEMU can run ARM guest code on an x86 host" means that QEMU runs the ARM guest code by emulation directly, and not with the help of KVM, right? Thanks a lot.

    5. Yes, it requires emulation and not KVM. KVM only supports the guest and host being the same architecture.

  11. Thank you very much for your kindly reply. :)

  12. How to monitor guest memory usage through libvirt or qemu-kvm?

    1. Please send general questions about libvirt to libvirt-users@redhat.com or about QEMU to qemu-devel@nongnu.org.

  13. Can you explain the flow of an I/O access from an application running on the guest OS?

    1. There are multiple code paths depending on whether you are using KVM or TCG, whether the access is MMIO or PIO, and whether ioeventfd is on or off.

      The basic flow is that a guest memory or I/O read/write access traps into kvm.ko. kvm.ko returns from ioctl(KVM_RUN) with a return code that QEMU handles.

      QEMU finds the emulated device responsible for the memory address and invokes its handler function. When the device's handler function completes QEMU re-enters the guest using ioctl(KVM_RUN).

    2. What do you mean by emulate? Doesn't QEMU really access the physical device? What if the I/O read/write is done to a device which physically exists? Does KVM perform the I/O action on the physical device?

      How does QEMU emulation work with virtio?

    3. Please read more on this blog or see these slides for an overview of how KVM works:
      https://vmsplice.net/~stefan/qemu-kvm-architecture-2015.pdf
      http://www.linux-kongress.org/2010/slides/KVM-Architecture-LK2010.pdf

      Most devices are emulated. Some devices can be passed through.

    4. Thanks for sharing those. I have gone through them and am still confused. Let me pick one example to point out the confusion. Say I want to write something to a physical device (e.g. a NIC) from an application running on the guest OS. It would execute an OUT instruction and trap into the hypervisor (if the guest is paravirtualized, it may do a hypercall instead). Does KVM then write to the NIC itself, or hand it to QEMU, which emulates the device and then writes to the NIC?

      The second part is: where in the whole process does virtio come into action?

    5. Your questions are general and it's hard to give a useful answer because many configurations are possible. You'll need to research your specific configuration if you want to know exactly what is happening.

      The guest sees an emulated NIC unless PCI passthrough is used. kvm.ko and QEMU do not write directly to a physical NIC when emulating a NIC. Instead they hand packets to the Linux network stack (e.g. tap device) which sends them.

      virtio is a family of emulated devices (networking, storage, etc) designed for virtualization. Some optimizations are made but mostly it's the same as emulating "real" devices.
