Pablo Greco 40546a
From 5347b12008842b5c86f766e391c6f3756afbff7d Mon Sep 17 00:00:00 2001
Message-Id: <5347b12008842b5c86f766e391c6f3756afbff7d@dist-git>
From: Daniel Henrique Barboza <danielhb413@gmail.com>
Date: Fri, 3 May 2019 13:54:53 +0200
Subject: [PATCH] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

The NVIDIA V100 GPU has onboard RAM that is mapped into the
host memory and accessible as normal RAM via an NVLink2 bridge. When
passed through to a guest, QEMU puts the NVIDIA RAM window in a
non-contiguous area, above the PCI MMIO area that starts at 32TiB.
This means that the NVIDIA RAM window starts at 64TiB and goes all the
way to 128TiB.

As a result, the guest might request a 64-bit window, for each PCI
Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
window isn't counted as regular RAM, thus this window is considered
only for the allocation of the Translation Control Entry (TCE) table.
For more information about how NVLink2 support works in QEMU,
refer to the accepted implementation [1].

This memory layout differs from the existing VFIO case, requiring its
own formula. This patch changes the PPC64 code of
@qemuDomainGetMemLockLimitBytes to:

- detect if we have an NVLink2 bridge being passed through to the
guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
added in the previous patch. The existence of the NVLink2 bridge in
the guest means that we are dealing with the NVLink2 memory layout;

- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
different way to account for the extra memory the TCE table can allocate.
The 64TiB..128TiB window is more than enough to fit all possible
GPUs, thus the memLimit is the same regardless of passing through 1 or
multiple V100 GPUs.

Further reading explaining the background
[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html
[2] https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html
[3] https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Reviewed-by: Erik Skultety <eskultet@redhat.com>
(cherry picked from commit 1a922648f67f56c4374d647feebf2adb9a642f96)

https://bugzilla.redhat.com/show_bug.cgi?id=1505998

Conflicts:
    The upstream commit relied on:
        - v4.7.0-37-gb72183223f
        - v4.7.0-38-ga14f597266
    which were not backported so virPCIDeviceAddressAsString had to
    be swapped for the former virDomainPCIAddressAsString in order to
    compile.

Signed-off-by: Erik Skultety <eskultet@redhat.com>
Message-Id: <03c00ebf46d85b0615134ef8655e67a4c909b7da.1556884443.git.eskultet@redhat.com>
Reviewed-by: Andrea Bolognani <abologna@redhat.com>
---
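Note: the two passthroughLimit formulas described in the commit message
can be tried out in isolation with the sketch below. It is a minimal
standalone illustration, not the libvirt code: the function names and the
KIB_PER_GIB / KIB_PER_TIB macros are hypothetical, and it assumes, as the
patch does, that every quantity is expressed in KiB.

```c
/* Hypothetical helpers; all sizes are in KiB, as in the patch. */
#define KIB_PER_GIB (1ULL << 20)
#define KIB_PER_TIB (1ULL << 30)

static inline unsigned long long
max_ull(unsigned long long a, unsigned long long b)
{
    return a > b ? a : b;
}

/* Regular VFIO case:
 * max(2 GiB * #PHBs, memory + memory / 512 * #PHBs + 8 MiB) */
unsigned long long
vfio_passthrough_limit_kib(unsigned long long memory_kib,
                           unsigned long long n_phbs)
{
    return max_ull(2 * KIB_PER_GIB * n_phbs,
                   memory_kib + memory_kib / 512 * n_phbs + 8192);
}

/* NVLink2 case: the TCE table is sized for a window reaching 128 TiB,
 * while only maxMemory is counted as regular RAM:
 * maxMemory + 128 TiB / 512 * #PHBs + 8 MiB */
unsigned long long
nvlink2_passthrough_limit_kib(unsigned long long max_memory_kib,
                              unsigned long long n_phbs)
{
    return max_memory_kib + 128 * KIB_PER_TIB / 512 * n_phbs + 8192;
}
```

For a hypothetical guest with 32 GiB of (max) memory and one PHB, the
regular VFIO formula gives 33628160 KiB, while the NVLink2 formula gives
301998080 KiB, since 128 TiB / 512 adds 256 GiB per PHB no matter how
many GPUs are passed through.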
 src/qemu/qemu_domain.c | 80 ++++++++++++++++++++++++++++++++----------
 1 file changed, 61 insertions(+), 19 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index a8bc618389..21f0722495 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -9813,7 +9813,7 @@ qemuDomainUpdateCurrentMemorySize(virQEMUDriverPtr driver,
  * such as '0004:04:00.0', and tells if the device is a NVLink2
  * bridge.
  */
-static ATTRIBUTE_UNUSED bool
+static bool
 ppc64VFIODeviceIsNV2Bridge(const char *device)
 {
     const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
@@ -9851,7 +9851,9 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
     unsigned long long maxMemory = 0;
     unsigned long long passthroughLimit = 0;
     size_t i, nPCIHostBridges = 0;
+    virPCIDeviceAddressPtr pciAddr;
     bool usesVFIO = false;
+    bool nvlink2Capable = false;
 
     for (i = 0; i < def->ncontrollers; i++) {
         virDomainControllerDefPtr cont = def->controllers[i];
@@ -9869,7 +9871,17 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
             dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
             dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
             usesVFIO = true;
-            break;
+
+            pciAddr = &dev->source.subsys.u.pci.addr;
+            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                VIR_AUTOFREE(char *) pciAddrStr = NULL;
+
+                pciAddrStr = virDomainPCIAddressAsString(pciAddr);
+                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
+                    nvlink2Capable = true;
+                    break;
+                }
+            }
         }
     }
 
@@ -9896,29 +9908,59 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
                 4096 * nPCIHostBridges +
                 8192;
 
-    /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
-     *                          memory                               (d)
-     *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
+    /* NVLink2 support in QEMU is a special case of the passthrough
+     * mechanics explained in the usesVFIO case below. The GPU RAM
+     * is placed with a gap after maxMemory. The current QEMU
+     * implementation puts the NVIDIA RAM above the PCI MMIO, which
+     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
      *
-     * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2 GiB
-     * rather than 1 GiB
+     * This window ends at 64TiB, and this is where the GPUs are being
+     * placed. The next available window size is at 128TiB, and
+     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
      *
-     * (d) is the with-DDW (and memory pre-registration and related
-     * features) DMA window accounting - assuming that we only account RAM
-     * once, even if mapped to multiple PHBs
+     * The same assumption as the most common case applies here:
+     * the guest will request a 64-bit DMA window, per PHB, that is
+     * big enough to map all its RAM, which is now at 128TiB due
+     * to the GPUs.
      *
-     * (e) is the with-DDW userspace view and overhead for the 64-bit DMA
-     * window. This is based a bit on expected guest behaviour, but there
-     * really isn't a way to completely avoid that. We assume the guest
-     * requests a 64-bit DMA window (per PHB) just big enough to map all
-     * its RAM. 4 kiB page size gives the 1/512; it will be less with 64
-     * kiB pages, less still if the guest is mapped with hugepages (unlike
-     * the default 32-bit DMA window, DDW windows can use large IOMMU
-     * pages). 8 MiB is for second and further level overheads, like (b) */
-    if (usesVFIO)
+     * Note that the NVIDIA RAM window must be accounted for the TCE
+     * table size, but *not* for the main RAM (maxMemory). This gives
+     * us the following passthroughLimit for the NVLink2 case:
+     *
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable) {
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+    } else if (usesVFIO) {
+        /* For regular (non-NVLink2 present) VFIO passthrough, the value
+         * of passthroughLimit is:
+         *
+         * passthroughLimit := max( 2 GiB * #PHBs,                       (c)
+         *                          memory                               (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2
+         * GiB rather than 1 GiB
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account
+         * RAM once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 64-bit
+         * DMA window. This is based a bit on expected guest behaviour, but
+         * there really isn't a way to completely avoid that. We assume the
+         * guest requests a 64-bit DMA window (per PHB) just big enough to
+         * map all its RAM. 4 kiB page size gives the 1/512; it will be
+         * less with 64 kiB pages, less still if the guest is mapped with
+         * hugepages (unlike the default 32-bit DMA window, DDW windows
+         * can use large IOMMU pages). 8 MiB is for second and further level
+         * overheads, like (b) */
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);
+    }
 
     memKB = baseLimit + passthroughLimit;
 
-- 
2.21.0