|
Pablo Greco |
40546a |
From 5347b12008842b5c86f766e391c6f3756afbff7d Mon Sep 17 00:00:00 2001
Message-Id: <5347b12008842b5c86f766e391c6f3756afbff7d@dist-git>
From: Daniel Henrique Barboza <danielhb413@gmail.com>
Date: Fri, 3 May 2019 13:54:53 +0200
Subject: [PATCH] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

The NVIDIA V100 GPU has onboard RAM that is mapped into the
host memory and accessible as normal RAM via an NVLink2 bridge. When
passed through to a guest, QEMU puts the NVIDIA RAM window in a
non-contiguous area, above the PCI MMIO area that starts at 32TiB.
This means that the NVIDIA RAM window starts at 64TiB and goes all the
way to 128TiB.

This means that the guest might request a 64-bit window, for each PCI
Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
window isn't counted as regular RAM, thus this window is considered
only for the allocation of the Translation Control Entry (TCE) table.
For more information about how NVLink2 support works in QEMU,
refer to the accepted implementation [1].

This memory layout differs from the existing VFIO case, requiring its
own formula. This patch changes the PPC64 code of
@qemuDomainGetMemLockLimitBytes to:

- detect if we have an NVLink2 bridge being passed through to the
  guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
  added in the previous patch. The existence of the NVLink2 bridge in
  the guest means that we are dealing with the NVLink2 memory layout;

- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
  different way to account for the extra memory the TCE table can
  allocate. The 64TiB..128TiB window is more than enough to fit all
  possible GPUs, thus the memLimit is the same regardless of passing
  through 1 or multiple V100 GPUs.

Further reading explaining the background:
[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html
[2] https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html
[3] https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Reviewed-by: Erik Skultety <eskultet@redhat.com>
(cherry picked from commit 1a922648f67f56c4374d647feebf2adb9a642f96)

https://bugzilla.redhat.com/show_bug.cgi?id=1505998

Conflicts:
    The upstream commit relied on:
        - v4.7.0-37-gb72183223f
        - v4.7.0-38-ga14f597266
    which were not backported, so virPCIDeviceAddressAsString had to be
    swapped for the former virDomainPCIAddressAsString in order to
    compile.

Signed-off-by: Erik Skultety <eskultet@redhat.com>
Message-Id: <03c00ebf46d85b0615134ef8655e67a4c909b7da.1556884443.git.eskultet@redhat.com>
Reviewed-by: Andrea Bolognani <abologna@redhat.com>
---
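Illustrative note: the two passthroughLimit formulas described in the
commit message can be sketched as standalone C. This is a minimal sketch
under stated assumptions, not the libvirt implementation: the helper
names and the KIB_PER_TIB macro are hypothetical, and only the arithmetic
mirrors the patch. All values are in KiB, as in getPPC64MemLockLimitBytes.

```c
#include <stddef.h>

/* Hypothetical constant: number of KiB in one TiB (2^30). */
#define KIB_PER_TIB (1ULL << 30)

/* NVLink2 case (hypothetical helper name): the guest RAM is accounted
 * once, while the TCE table is sized for the full 128TiB window,
 * per PHB, plus 8 MiB (8192 KiB) of overhead. */
unsigned long long
nvlink2PassthroughLimitKiB(unsigned long long maxMemoryKiB, size_t nPHBs)
{
    return maxMemoryKiB +
           128 * KIB_PER_TIB / 512 * nPHBs +
           8192;
}

/* Regular VFIO case (hypothetical helper name): the larger of the
 * pre-DDW accounting (2 GiB per PHB) and the with-DDW accounting
 * (RAM + RAM/512 per PHB + 8 MiB). */
unsigned long long
vfioPassthroughLimitKiB(unsigned long long memoryKiB, size_t nPHBs)
{
    unsigned long long preDDW = 2ULL * 1024 * 1024 * nPHBs;
    unsigned long long withDDW = memoryKiB +
                                 memoryKiB / 512 * nPHBs +
                                 8192;

    return preDDW > withDDW ? preDDW : withDDW;
}
```

For a guest with 32 GiB of maxMemory (33554432 KiB) and one PHB, the
NVLink2 helper yields 301998080 KiB: the guest RAM, 256 GiB of TCE
accounting for the 128TiB window, and 8 MiB of overhead. The regular VFIO
helper yields 33628160 KiB for the same guest.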
 src/qemu/qemu_domain.c | 80 ++++++++++++++++++++++++++++++++----------
 1 file changed, 61 insertions(+), 19 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index a8bc618389..21f0722495 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -9813,7 +9813,7 @@ qemuDomainUpdateCurrentMemorySize(virQEMUDriverPtr driver,
  * such as '0004:04:00.0', and tells if the device is a NVLink2
  * bridge.
  */
-static ATTRIBUTE_UNUSED bool
+static bool
 ppc64VFIODeviceIsNV2Bridge(const char *device)
 {
     const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
@@ -9851,7 +9851,9 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
     unsigned long long maxMemory = 0;
     unsigned long long passthroughLimit = 0;
     size_t i, nPCIHostBridges = 0;
+    virPCIDeviceAddressPtr pciAddr;
     bool usesVFIO = false;
+    bool nvlink2Capable = false;

     for (i = 0; i < def->ncontrollers; i++) {
         virDomainControllerDefPtr cont = def->controllers[i];
@@ -9869,7 +9871,17 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
             dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
             dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
             usesVFIO = true;
-            break;
+
+            pciAddr = &dev->source.subsys.u.pci.addr;
+            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                VIR_AUTOFREE(char *) pciAddrStr = NULL;
+
+                pciAddrStr = virDomainPCIAddressAsString(pciAddr);
+                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
+                    nvlink2Capable = true;
+                    break;
+                }
+            }
         }
     }

@@ -9896,29 +9908,59 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
                 4096 * nPCIHostBridges +
                 8192;

-    /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
-     *                          memory                               (d)
-     *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
+    /* NVLink2 support in QEMU is a special case of the passthrough
+     * mechanics explained in the usesVFIO case below. The GPU RAM
+     * is placed with a gap after maxMemory. The current QEMU
+     * implementation puts the NVIDIA RAM above the PCI MMIO, which
+     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
      *
-     * (c) is the pre-DDW VFIO DMA window accounting.  We're allowing 2 GiB
-     * rather than 1 GiB
+     * This window ends at 64TiB, and this is where the GPUs are being
+     * placed. The next available window size is at 128TiB, and
+     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
      *
-     * (d) is the with-DDW (and memory pre-registration and related
-     * features) DMA window accounting - assuming that we only account RAM
-     * once, even if mapped to multiple PHBs
+     * The same assumption as the most common case applies here:
+     * the guest will request a 64-bit DMA window, per PHB, that is
+     * big enough to map all its RAM, which is now at 128TiB due
+     * to the GPUs.
      *
-     * (e) is the with-DDW userspace view and overhead for the 64-bit DMA
-     * window. This is based a bit on expected guest behaviour, but there
-     * really isn't a way to completely avoid that. We assume the guest
-     * requests a 64-bit DMA window (per PHB) just big enough to map all
-     * its RAM. 4 kiB page size gives the 1/512; it will be less with 64
-     * kiB pages, less still if the guest is mapped with hugepages (unlike
-     * the default 32-bit DMA window, DDW windows can use large IOMMU
-     * pages). 8 MiB is for second and further level overheads, like (b) */
-    if (usesVFIO)
+     * Note that the NVIDIA RAM window must be accounted for the TCE
+     * table size, but *not* for the main RAM (maxMemory). This gives
+     * us the following passthroughLimit for the NVLink2 case:
+     *
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable) {
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+    } else if (usesVFIO) {
+        /* For regular (non-NVLink2 present) VFIO passthrough, the value
+         * of passthroughLimit is:
+         *
+         * passthroughLimit := max( 2 GiB * #PHBs,                       (c)
+         *                          memory                               (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2
+         * GiB rather than 1 GiB
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account
+         * RAM once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 64-bit
+         * DMA window. This is based a bit on expected guest behaviour, but
+         * there really isn't a way to completely avoid that. We assume the
+         * guest requests a 64-bit DMA window (per PHB) just big enough to
+         * map all its RAM. 4 kiB page size gives the 1/512; it will be
+         * less with 64 kiB pages, less still if the guest is mapped with
+         * hugepages (unlike the default 32-bit DMA window, DDW windows
+         * can use large IOMMU pages). 8 MiB is for second and further level
+         * overheads, like (b) */
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);
+    }

     memKB = baseLimit + passthroughLimit;

--
2.21.0