Friday, 8 January 2016

QEMU Internals: How guest physical RAM works

Memory is one of the key aspects of emulating computer systems. Inside QEMU the guest RAM is modelled by several components that work together. This post gives an overview of the design of guest physical RAM in QEMU 2.5 by explaining the most important components without going into all the details. After reading this post you will know enough to dig into the QEMU source code yourself.

Note that guest virtual memory is not covered here since it deserves its own post. KVM virtualization relies on hardware memory translation support and does not use QEMU's software MMU.

Guest RAM configuration

The QEMU command-line option -m [size=]megs[,slots=n,maxmem=size] specifies the initial guest RAM size as well as the maximum guest RAM size and number of slots for memory chips (DIMMs).

The reason for the maximum size and slots is that QEMU emulates DIMM hotplug so the guest operating system can detect when new memory is added or removed using the same mechanism as on real hardware. This involves plugging a DIMM into a slot or unplugging it, just like on a physical machine. In other words, the amount of available memory isn't changed in byte units; it is changed by altering the set of DIMMs plugged into the emulated machine.
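
For example, the following command line (the sizes here are just an illustration) starts a guest with 1 GB of initial RAM, three empty DIMM slots, and a 4 GB upper limit on total guest RAM:

    qemu-system-x86_64 -m 1G,slots=3,maxmem=4G ...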

Hotpluggable guest memory

The "pc-dimm" device (hw/mem/pc-dimm.c) models a DIMM. Memory is hotplugged by creating a new "pc-dimm" device. Although the name includes "pc" this device is also used with ppc and s390 machine types.

As a side-note, the initial RAM that the guest started with might not be modelled with a "pc-dimm" device and it can't be unplugged.

The guest RAM itself isn't contained inside the "pc-dimm" object. Instead the "pc-dimm" must be associated with a "memory-backend" object.

Memory backends

The "memory-backend" device (backends/hostmem.c) contains the actual host memory that backs guest RAM. This can either be anonymous mmapped memory or file-backed mmapped memory. File-backed guest RAM allows Linux hugetlbfs usage for huge pages on the host and also shared-memory so other host applications can access to guest RAM.

The "pc-dimm" and "memory-backend" objects are the user-visible parts of guest RAM in QEMU. They can be managed using the QEMU command-line and QMP monitor interface. This is just the tip of the iceberg though because there are still several aspects of guest RAM internal to QEMU that will be covered next.

Putting the components explained below together: the user-visible "pc-dimm" and "memory-backend" objects are backed by RAMBlocks in the ram_addr_t address space, and MemoryRegions inside an AddressSpace place that RAM into the guest physical memory map.

RAM blocks and the ram_addr_t address space

Memory inside a "memory-backend" is actually mmapped by a RAMBlock through qemu_ram_alloc() (exec.c). Each RAMBlock has a pointer to the mmapped memory and also a ram_addr_t offset. This ram_addr_t offset is interesting because it is in a global namespace, so the RAMBlock can be looked up by its offset.

The ram_addr_t namespace is different from the guest physical memory space. The ram_addr_t namespace is a tightly packed address space of all RAMBlocks. Guest physical address 0x100001000 might not be ram_addr_t 0x100001000 since ram_addr_t does not include guest physical memory regions that are reserved, memory-mapped I/O, etc. Furthermore, the ram_addr_t offset is dependent on the order in which RAMBlocks were created, unlike the guest physical memory space where everything has a fixed location.
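
A simplified view of the RAMBlock structure (paraphrased from the QEMU 2.5 sources, with several fields omitted) looks roughly like this:

    struct RAMBlock {
        struct MemoryRegion *mr;     /* the MemoryRegion that owns this block */
        uint8_t *host;               /* mmapped host memory holding the guest RAM */
        ram_addr_t offset;           /* offset into the global ram_addr_t space */
        ram_addr_t used_length;      /* size of the block in bytes */
        char idstr[256];             /* unique name, e.g. used to match blocks during migration */
        QLIST_ENTRY(RAMBlock) next;  /* linked into the global ram_list */
        /* ... */
    };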

All RAMBlocks are kept in a global RAMList object called ram_list, which holds the RAMBlocks and also the dirty memory bitmaps.
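
Again paraphrasing the QEMU 2.5 sources (some fields omitted), the global list looks roughly like this:

    typedef struct RAMList {
        QemuMutex mutex;                                /* protects the block list */
        RAMBlock *mru_block;                            /* most-recently-used lookup cache */
        QLIST_HEAD(, RAMBlock) blocks;                  /* all RAMBlocks */
        unsigned long *dirty_memory[DIRTY_MEMORY_NUM];  /* one dirty bitmap per client (see below) */
        /* ... */
    } RAMList;

    extern RAMList ram_list;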

Dirty memory tracking

When the guest CPU or device DMA stores to guest RAM this needs to be noticed by several users:

  1. The live migration feature relies on tracking dirty memory pages so they can be resent if they change during live migration.
  2. TCG relies on tracking self-modifying code so it can recompile changed instructions.
  3. Graphics card emulation relies on tracking dirty video memory to redraw only scanlines that have changed.

ram_list holds a separate dirty memory bitmap for each of these users because dirty memory tracking can be enabled or disabled independently for each of them.
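
These users correspond to the dirty memory "client" constants that index the bitmaps (from include/exec/memory.h):

    #define DIRTY_MEMORY_VGA       0   /* graphics emulation redraws dirty areas */
    #define DIRTY_MEMORY_CODE      1   /* TCG detects self-modifying code */
    #define DIRTY_MEMORY_MIGRATION 2   /* live migration resends dirty pages */
    #define DIRTY_MEMORY_NUM       3   /* number of dirty memory bitmaps */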

Address spaces

All CPU architectures have a memory space and some also have an I/O address space. These are represented by AddressSpace objects, each of which contains a tree of MemoryRegions (include/exec/memory.h).

The MemoryRegion is the link between guest physical address space and the RAMBlocks containing the memory. Each MemoryRegion has the ram_addr_t offset of the RAMBlock and each RAMBlock has a MemoryRegion pointer.

Note that MemoryRegion is more general than just RAM. It can also represent I/O memory where read/write callback functions are invoked on access. This is how hardware register accesses from a guest CPU are dispatched to emulated devices.
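
As a sketch of how this looks in board code (not taken from any particular machine; my_dev_ops, the region names, addresses, and sizes are made up for illustration), a RAM region and an MMIO region might be created and mapped like this:

    #include "exec/memory.h"          /* MemoryRegion, MemoryRegionOps */
    #include "exec/address-spaces.h"  /* get_system_memory() */
    #include "qapi/error.h"           /* error_abort */

    static uint64_t my_dev_read(void *opaque, hwaddr addr, unsigned size)
    {
        return 0;   /* a real device would return register contents here */
    }

    static void my_dev_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
    {
        /* a real device would update its register state here */
    }

    static const MemoryRegionOps my_dev_ops = {
        .read = my_dev_read,
        .write = my_dev_write,
        .endianness = DEVICE_NATIVE_ENDIAN,
    };

    static void example_board_init(void)
    {
        MemoryRegion *sysmem = get_system_memory();  /* root of address_space_memory */
        MemoryRegion *ram = g_new(MemoryRegion, 1);
        MemoryRegion *mmio = g_new(MemoryRegion, 1);

        /* RAM region: this allocates a RAMBlock behind the scenes */
        memory_region_init_ram(ram, NULL, "example.ram", 128 * 1024 * 1024, &error_abort);
        memory_region_add_subregion(sysmem, 0x00000000, ram);

        /* MMIO region: guest accesses invoke the read/write callbacks above */
        memory_region_init_io(mmio, NULL, &my_dev_ops, NULL, "example.mmio", 0x1000);
        memory_region_add_subregion(sysmem, 0xfe000000, mmio);
    }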

The address_space_rw() function dispatches load/store accesses to the appropriate MemoryRegions. If a MemoryRegion is a RAM region then the data will be accessed from the RAMBlock's mmapped guest RAM. The address_space_memory global variable is the guest physical memory space.
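
For example, a device model could read a buffer from guest physical memory roughly like this (the address and length are made up for illustration):

    uint8_t buf[64];
    hwaddr addr = 0x100000;   /* guest physical address, illustration only */

    /* Copy 64 bytes of guest RAM (or dispatch to an MMIO region) into buf */
    address_space_rw(&address_space_memory, addr, MEMTXATTRS_UNSPECIFIED,
                     buf, sizeof(buf), false /* is_write */);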

Conclusion

There are a few different layers involved in managing guest physical memory. The "pc-dimm" and "memory-backend" objects are the user-visible configuration objects for DIMMs and memory. The RAMBlock is the mmapped memory chunk. The AddressSpace and its MemoryRegion elements place guest RAM into the memory map.

Comments:

  1. This comment has been removed by the author.

  2. Hi,
    Could you please tell me if it is possible to track the VM memory dirtying rate (page dirty rate) when it is idle (no live VM migration)? If so, how can we measure the page dirty rate for a specific VM with QEMU or KVM? Thanks.

    Replies
    1. There is no specific feature for doing that, but there are mechanisms available to find that information. First check the memory statistics in /proc on the host (but remember that QEMU itself also uses memory so not all accesses you see are guest RAM). If that doesn't provide the information you need, then you could modify QEMU to collect the page dirtying rate by turning on dirty memory logging all the time.

    2. http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/page-modification-logging-vmm-white-paper.pdf

    3. To add some context to the Page Modification Logging link that was posted: the kvm.ko kernel module uses PML internally, if available, but the userspace dirty logging ABI remains the same.

      If you want to track the dirtying rate without modifying the kernel module, you need to enable dirty memory logging and calculate/sample the rate yourself in userspace.

  3. Hi,
    Would you please tell me how QEMU loads a guest OS and then executes it?
    What are the differences between "Real machine boot OS" and "QEMU load OS"?
    Thanks.

    Replies
    1. QEMU tries to behave the same as a physical machine. This helps ensure that software runs correctly inside QEMU.

      Therefore the boot process is similar to that of a physical machine. QEMU resets the CPU and jumps into the BIOS like a real machine. The BIOS enumerates disks and PCI Option ROMs and can boot from them.

      This means QEMU follows the same BIOS -> bootloader -> OS boot process as a physical machine.

    2. Hi stefanha,
      Thanks for your kindly reply.
      1. As you said "QEMU follows the same BIOS -> bootloader -> OS boot process as a physical machine" => means QEMU need to implement BIOS (enumerates devices) and bootload (for loading guest OS), right ?
      2. Does QEMU need to create physical/virtual address space for guest OS for executing guest OS ? (the same as exec system call preparing process address space (code, data, stack, heap VMAs) for executing process)
      3. If above item 2 is yes, then could you please describe the detailed implementation for QEMU to create guest OS's physical/virtual address space ? Thanks a lot.

    3. > 1. QEMU need to implement BIOS (enumerates devices) and bootload (for loading guest OS), right ?

      QEMU ships with SeaBIOS, an open source BIOS. QEMU does not ship with a bootloader, just like a physical machine does not ship with a bootloader (that is provided by the operating system that you install).

      > 2. Does QEMU need to create physical/virtual address space for guest OS for executing guest OS ? (the same as exec system call preparing process address space (code, data, stack, heap VMAs) for executing process)

      The guest sees a CPU just like on real hardware. This means the guest can set up its own page tables, etc.

      > 3. If above item 2 is yes, then could you please describe the detailed implementation for QEMU to create guest OS's physical/virtual address space ?

      There are two possibilities:
      1. Using KVM to take advantage of hardware virtualization support. On Intel the instruction set extensions are called VMX. Search the web for "vmx" "virtualization" if you want to learn about it.

      2. Emulating the CPU using the TCG just-in-time compiler. How virtual memory access works in TCG is too complex for this reply but you can read the source if you are interested.

  4. Hello Stefan, thanks for your blogs. I have found them really helpful.
    I have a question: if the guest is running a 32-bit virtio-net frontend driver and the host is running a 64-bit vhost-net based backend, would the memory mapping in the vhost architecture work? I have been seeing issues where, if I use a 64-bit guest driver, all is fine, but if I use a 32-bit guest driver QEMU takes over and packets are processed by QEMU and not by vhost-net.

    Replies
    1. The memory address width should not matter because virtio always uses 64-bit physical RAM addresses, regardless of the guest CPU mode. As far as I am aware 64-bit vhost_net.ko works with 32-bit guests.

      If you have trouble debugging it, please email the KVM mailing list at kvm@vger.kernel.org.

    2. I solved the issue. The issue was on the guest side. My guest operating system was older, with kernel 2.6.3x, and I was trying to use the DPDK virtio driver in the guest, which resulted in incorrect reading of the virtio device registers. This was because the PMD driver was not able to detect the use of MSI-X, and as a result it was reading from an incorrect offset.

  5. Hi, could you tell me how I can get the memory data of a domain?

    Replies
    1. Please look at the "virsh dump" command or the QMP dump-guest-memory command.

  6. Is it possible to use a file-backed DIMM as a RAM disk with persistent changes? If so, does it have a special layout, or is it possible to pass a (page-aligned) initrd image?

    Replies
    1. Take a look at the -mem-path option to use a file as RAM.

      You can pass in an initrd image from the host using the -kernel/-initrd options. The contents of the files are copied into guest RAM during startup.

    2. Thanks, will have a look at the mem-path option.

      Yes, I know about the -initrd option, but the idea is to have a persistent RAM disk, one which is not copied but used directly. A sort of NVRAM disk.

  7. Hi Stefan, thank you for this very informative blog post.
    Could you give me some pointers on how the guest VM memory caches (L1-L3) work? Or from which function in the source code should I start?

    Replies
    1. QEMU does not model CPU caches. QEMU is not a full simulator. For speed it focusses just on emulating the functional properties necessary for running guest code.

  8. This comment has been removed by the author.

  9. Please email general questions to qemu-devel@nongnu.org. That way others in the QEMU community can participate in the discussion. Thanks!

  10. Hi Stefanha,
    Can you elaborate on how RAM emulation changes when using hardware virtualization support (KVM)?
    1. How is guest memory mapped onto the host system? Does the answer lie in 'kvm_set_user_memory_region' and the 'KVM_SET_USER_MEMORY_REGION' ioctl call?
    2. How does the sync mechanism work between guest and host?
    3. In the case of a multi-core guest, how do the different vCPUs access guest memory?

    Thanks

    Replies
    1. 1. Yes, in KVM mode QEMU passes the list of physical memory ranges to the kvm.ko kernel module. This allows kvm.ko to convert guest physical addresses to host userspace QEMU virtual addresses. kvm.ko can access QEMU virtual addresses to copy in, copy out, pin, unpin, etc as necessary.

      2. I'm not sure what you mean by "sync". If you mean dirty logging then there is the KVM_GET_DIRTY_LOG ioctl which you can read about in Linux Documentation/virtual/kvm/api.txt.

      3. Physical memory access for multi-core guests is pretty much the same as unicore guests. Keep in mind that each core has its own MMU and they are independent.

  11. Hi Stefanha,
    Why does the qemu-kvm process's physical memory (RES) always increase even though not that much is used in the guest VM? Can you explain this or give me some directions?

    Replies
    1. Please send details to qemu-devel@nongnu.org including your QEMU command-line and the guest OS version.

  12. Hi stefanha,
    Can you elaborate a bit on the FlatView memory model incorporated since version 2.0? Is it primarily done for RCU?

    Replies
    1. No, RCU was added later to make the memory API thread-safe.

      FlatView supports logarithmic search time and the ability to easily iterate over every MemoryRegion (taking into account overlap). It also has a transactional update model, which is useful for communicating memory layout changes to MemoryListeners like the KVM kernel module efficiently.

  13. Hi Stephan,

    Great write up! I was wondering if you can help me on something QEMU related. I am trying to emulate some external registers so that a bare-metal program I created can read/write to that address. I am using QEMU's implementation of pflash to represent the registers. Is this the best approach to emulating this?

    Replies
    1. It depends on what you are trying to do. Please send general technical questions to the QEMU mailing list at qemu-devel@nongnu.org.

  14. Thanks Stephan, I sent out an email in greater detail.
