From f9416fd5d1232f47af1366c8099003a88dab4a21 Mon Sep 17 00:00:00 2001
From: Alex Williamson <alex.williamson@redhat.com>
Date: Mon, 3 Dec 2018 22:01:48 +0000
Subject: [PATCH 12/16] vfio: Inhibit ballooning based on group attachment to a
 container

RH-Author: Alex Williamson <alex.williamson@redhat.com>
Message-id: <154387450879.27651.3509144221336190827.stgit@gimli.home>
Patchwork-id: 83238
O-Subject: [RHEL-8.0 qemu-kvm PATCH 3/7] vfio: Inhibit ballooning based on group attachment to a container
Bugzilla: 1650272
RH-Acked-by: Peter Xu <peterx@redhat.com>
RH-Acked-by: Auger Eric <eric.auger@redhat.com>
RH-Acked-by: Cornelia Huck <cohuck@redhat.com>
RH-Acked-by: David Hildenbrand <david@redhat.com>

Bugzilla: 1650272

We use a VFIOContainer to associate an AddressSpace to one or more
VFIOGroups.  The VFIOContainer represents the DMA context for that
AddressSpace for those VFIOGroups and is synchronized to changes in
that AddressSpace via a MemoryListener.  For IOMMU backed devices,
maintaining the DMA context for a VFIOGroup generally involves
pinning a host virtual address in order to create a stable host
physical address and then mapping a translation from the associated
guest physical address to that host physical address into the IOMMU.

While the above maintains the VFIOContainer synchronized to the QEMU
memory API of the VM, memory ballooning occurs outside of that API.
Inflating the memory balloon (i.e. cooperatively capturing pages from
the guest for use by the host) simply uses MADV_DONTNEED to "zap"
pages from QEMU's host virtual address space.  The page pinning and
IOMMU mapping above remain in place, negating the host's ability to
reuse the page, but the host virtual to host physical mapping of the
page is invalidated outside of QEMU's memory API.

When the balloon is later deflated, attempting to cooperatively
return pages to the guest, the page is simply freed by the guest
balloon driver, allowing it to be used in the guest and incurring a
page fault when that occurs.  The page fault maps a new host physical
page backing the existing host virtual address; meanwhile, the
VFIOContainer still maintains the translation to the original host
physical address.  At this point the guest vCPU and any assigned
devices will map different host physical addresses to the same guest
physical address.  Badness.

The IOMMU typically does not have page level granularity with which
it can track this mapping without also incurring inefficiencies in
using page size mappings throughout.  MMU notifiers in the host
kernel also provide indicators for invalidating the mapping on
balloon inflation, not for updating the mapping when the balloon is
deflated.  For these reasons we assume a default behavior that the
mapping of each VFIOGroup into the VFIOContainer is incompatible
with memory ballooning and increment the balloon inhibitor to match
the attached VFIOGroups.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
(cherry picked from commit c65ee433153b5925e183a00ebf568e160077c694)
Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>
---
 hw/vfio/common.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 07ffa0b..7e8f289 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -32,6 +32,7 @@
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/range.h"
+#include "sysemu/balloon.h"
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "qapi/error.h"
@@ -1039,6 +1040,33 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
     space = vfio_get_address_space(as);
 
+    /*
+     * VFIO is currently incompatible with memory ballooning insofar as the
+     * madvise to purge (zap) the page from QEMU's address space does not
+     * interact with the memory API and therefore leaves stale virtual to
+     * physical mappings in the IOMMU if the page was previously pinned.  We
+     * therefore add a balloon inhibit for each group added to a container,
+     * whether the container is used individually or shared.  This provides
+     * us with options to allow devices within a group to opt-in and allow
+     * ballooning, so long as it is done consistently for a group (for instance
+     * if the device is an mdev device where it is known that the host vendor
+     * driver will never pin pages outside of the working set of the guest
+     * driver, which would thus not be ballooning candidates).
+     *
+     * The first opportunity to induce pinning occurs here where we attempt to
+     * attach the group to existing containers within the AddressSpace.  If any
+     * pages are already zapped from the virtual address space, such as from a
+     * previous ballooning opt-in, new pinning will cause valid mappings to be
+     * re-established.  Likewise, when the overall MemoryListener for a new
+     * container is registered, a replay of mappings within the AddressSpace
+     * will occur, re-establishing any previously zapped pages as well.
+     *
+     * NB. Balloon inhibiting does not currently block operation of the
+     * balloon driver or revoke previously pinned pages, it only prevents
+     * calling madvise to modify the virtual mapping of ballooned pages.
+     */
+    qemu_balloon_inhibit(true);
+
     QLIST_FOREACH(container, &space->containers, next) {
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             group->container = container;
@@ -1227,6 +1255,7 @@ close_fd_exit:
     close(fd);
 
 put_space_exit:
+    qemu_balloon_inhibit(false);
     vfio_put_address_space(space);
 
     return ret;
@@ -1347,6 +1376,7 @@ void vfio_put_group(VFIOGroup *group)
         return;
     }
 
+    qemu_balloon_inhibit(false);
     vfio_kvm_device_del_group(group);
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
-- 
1.8.3.1