Tree - rpms/libvirt - CentOS Git server

render / rpms / libvirt

Forked from rpms/libvirt 9 months ago

Blame SOURCES/libvirt-PPC64-support-for-NVIDIA-V100-GPU-with-NVLink2-passthrough.patch

Pablo Greco	40546a	`From 5347b12008842b5c86f766e391c6f3756afbff7d Mon Sep 17 00:00:00 2001`
Pablo Greco	40546a	`Message-Id: <5347b12008842b5c86f766e391c6f3756afbff7d@dist-git>`
Pablo Greco	40546a	`From: Daniel Henrique Barboza <danielhb413@gmail.com>`
Pablo Greco	40546a	`Date: Fri, 3 May 2019 13:54:53 +0200`
Pablo Greco	40546a	`Subject: [PATCH] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough`
Pablo Greco	40546a
Pablo Greco	40546a	`The NVIDIA V100 GPU has onboard RAM that is mapped into the`
Pablo Greco	40546a	`host memory and accessible as normal RAM via an NVLink2 bridge. When`
Pablo Greco	40546a	`passed through in a guest, QEMU puts the NVIDIA RAM window in a`
Pablo Greco	40546a	`non-contiguous area, above the PCI MMIO area that starts at 32TiB.`
Pablo Greco	40546a	`This means that the NVIDIA RAM window starts at 64TiB and goes all the`
Pablo Greco	40546a	`way to 128TiB.`
Pablo Greco	40546a
Pablo Greco	40546a	`This means that the guest might request a 64-bit window, for each PCI`
Pablo Greco	40546a	`Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM`
Pablo Greco	40546a	`window isn't counted as regular RAM, thus this window is considered`
Pablo Greco	40546a	`only for the allocation of the Translation and Control Entry (TCE).`
Pablo Greco	40546a	`For more information about how NVLink2 support works in QEMU,`
Pablo Greco	40546a	`refer to the accepted implementation [1].`
Pablo Greco	40546a
Pablo Greco	40546a	`This memory layout differs from the existing VFIO case, requiring its`
Pablo Greco	40546a	`own formula. This patch changes the PPC64 code of`
Pablo Greco	40546a	`@qemuDomainGetMemLockLimitBytes to:`
Pablo Greco	40546a
Pablo Greco	40546a	`- detect if we have an NVLink2 bridge being passed through to the`
Pablo Greco	40546a	`guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function`
Pablo Greco	40546a	`added in the previous patch. The existence of the NVLink2 bridge in`
Pablo Greco	40546a	`the guest means that we are dealing with the NVLink2 memory layout;`
Pablo Greco	40546a
Pablo Greco	40546a	`- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a`
Pablo Greco	40546a	`different way to account for the extra memory the TCE table can allocate.`
Pablo Greco	40546a	`The 64TiB..128TiB window is more than enough to fit all possible`
Pablo Greco	40546a	`GPUs, thus the memLimit is the same regardless of passing through 1 or`
Pablo Greco	40546a	`multiple V100 GPUs.`
Pablo Greco	40546a
Pablo Greco	40546a	`Further reading explaining the background`
Pablo Greco	40546a	`[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html`
Pablo Greco	40546a	`[2] https://www.redhat.com/archives/libvir-list/2019-March/msg00660.html`
Pablo Greco	40546a	`[3] https://www.redhat.com/archives/libvir-list/2019-April/msg00527.html`
Pablo Greco	40546a
Pablo Greco	40546a	`Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>`
Pablo Greco	40546a	`Reviewed-by: Erik Skultety <eskultet@redhat.com>`
Pablo Greco	40546a	`(cherry picked from commit 1a922648f67f56c4374d647feebf2adb9a642f96)`
Pablo Greco	40546a
Pablo Greco	40546a	`https://bugzilla.redhat.com/show_bug.cgi?id=1505998`
Pablo Greco	40546a
Pablo Greco	40546a	`Conflicts:`
Pablo Greco	40546a	`The upstream commit relied on:`
Pablo Greco	40546a	`- v4.7.0-37-gb72183223f`
Pablo Greco	40546a	`- v4.7.0-38-ga14f597266`
Pablo Greco	40546a	`which were not backported so virPCIDeviceAddressAsString had to be`
Pablo Greco	40546a	`swapped for the former virDomainPCIAddressAsString in order to`
Pablo Greco	40546a	`compile.`
Pablo Greco	40546a
Pablo Greco	40546a	`Signed-off-by: Erik Skultety <eskultet@redhat.com>`
Pablo Greco	40546a	`Message-Id: <03c00ebf46d85b0615134ef8655e67a4c909b7da.1556884443.git.eskultet@redhat.com>`
Pablo Greco	40546a	`Reviewed-by: Andrea Bolognani <abologna@redhat.com>`
Pablo Greco	40546a	`---`
Pablo Greco	40546a	`src/qemu/qemu_domain.c \| 80 ++++++++++++++++++++++++++++++++----------`
Pablo Greco	40546a	`1 file changed, 61 insertions(+), 19 deletions(-)`
Pablo Greco	40546a
Pablo Greco	40546a	`diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c`
Pablo Greco	40546a	`index a8bc618389..21f0722495 100644`
Pablo Greco	40546a	`--- a/src/qemu/qemu_domain.c`
Pablo Greco	40546a	`+++ b/src/qemu/qemu_domain.c`
Pablo Greco	40546a	`@@ -9813,7 +9813,7 @@ qemuDomainUpdateCurrentMemorySize(virQEMUDriverPtr driver,`
Pablo Greco	40546a	`* such as '0004:04:00.0', and tells if the device is a NVLink2`
Pablo Greco	40546a	`* bridge.`
Pablo Greco	40546a	`*/`
Pablo Greco	40546a	`-static ATTRIBUTE_UNUSED bool`
Pablo Greco	40546a	`+static bool`
Pablo Greco	40546a	`ppc64VFIODeviceIsNV2Bridge(const char *device)`
Pablo Greco	40546a	`{`
Pablo Greco	40546a	`const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",`
Pablo Greco	40546a	`@@ -9851,7 +9851,9 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)`
Pablo Greco	40546a	`unsigned long long maxMemory = 0;`
Pablo Greco	40546a	`unsigned long long passthroughLimit = 0;`
Pablo Greco	40546a	`size_t i, nPCIHostBridges = 0;`
Pablo Greco	40546a	`+ virPCIDeviceAddressPtr pciAddr;`
Pablo Greco	40546a	`bool usesVFIO = false;`
Pablo Greco	40546a	`+ bool nvlink2Capable = false;`
Pablo Greco	40546a
Pablo Greco	40546a	`for (i = 0; i < def->ncontrollers; i++) {`
Pablo Greco	40546a	`virDomainControllerDefPtr cont = def->controllers[i];`
Pablo Greco	40546a	`@@ -9869,7 +9871,17 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)`
Pablo Greco	40546a	`dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&`
Pablo Greco	40546a	`dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {`
Pablo Greco	40546a	`usesVFIO = true;`
Pablo Greco	40546a	`- break;`
Pablo Greco	40546a	`+`
Pablo Greco	40546a	`+ pciAddr = &dev->source.subsys.u.pci.addr;`
Pablo Greco	40546a	`+ if (virPCIDeviceAddressIsValid(pciAddr, false)) {`
Pablo Greco	40546a	`+ VIR_AUTOFREE(char *) pciAddrStr = NULL;`
Pablo Greco	40546a	`+`
Pablo Greco	40546a	`+ pciAddrStr = virDomainPCIAddressAsString(pciAddr);`
Pablo Greco	40546a	`+ if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {`
Pablo Greco	40546a	`+ nvlink2Capable = true;`
Pablo Greco	40546a	`+ break;`
Pablo Greco	40546a	`+ }`
Pablo Greco	40546a	`+ }`
Pablo Greco	40546a	`}`
Pablo Greco	40546a	`}`
Pablo Greco	40546a
Pablo Greco	40546a	`@@ -9896,29 +9908,59 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)`
Pablo Greco	40546a	`4096 * nPCIHostBridges +`
Pablo Greco	40546a	`8192;`
Pablo Greco	40546a
Pablo Greco	40546a	`- /* passthroughLimit := max( 2 GiB * #PHBs, (c)`
Pablo Greco	40546a	`- * memory (d)`
Pablo Greco	40546a	`- * + memory * 1/512 * #PHBs + 8 MiB ) (e)`
Pablo Greco	40546a	`+ /* NVLink2 support in QEMU is a special case of the passthrough`
Pablo Greco	40546a	`+ * mechanics explained in the usesVFIO case below. The GPU RAM`
Pablo Greco	40546a	`+ * is placed with a gap after maxMemory. The current QEMU`
Pablo Greco	40546a	`+ * implementation puts the NVIDIA RAM above the PCI MMIO, which`
Pablo Greco	40546a	`+ * starts at 32TiB and is the MMIO reserved for the guest main RAM.`
Pablo Greco	40546a	`*`
Pablo Greco	40546a	`- * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2 GiB`
Pablo Greco	40546a	`- * rather than 1 GiB`
Pablo Greco	40546a	`+ * This window ends at 64TiB, and this is where the GPUs are being`
Pablo Greco	40546a	`+ * placed. The next available window size is at 128TiB, and`
Pablo Greco	40546a	`+ * 64TiB..128TiB will fit all possible NVIDIA GPUs.`
Pablo Greco	40546a	`*`
Pablo Greco	40546a	`- * (d) is the with-DDW (and memory pre-registration and related`
Pablo Greco	40546a	`- * features) DMA window accounting - assuming that we only account RAM`
Pablo Greco	40546a	`- * once, even if mapped to multiple PHBs`
Pablo Greco	40546a	`+ * The same assumption as the most common case applies here:`
Pablo Greco	40546a	`+ * the guest will request a 64-bit DMA window, per PHB, that is`
Pablo Greco	40546a	`+ * big enough to map all its RAM, which is now at 128TiB due`
Pablo Greco	40546a	`+ * to the GPUs.`
Pablo Greco	40546a	`*`
Pablo Greco	40546a	`- * (e) is the with-DDW userspace view and overhead for the 64-bit DMA`
Pablo Greco	40546a	`- * window. This is based a bit on expected guest behaviour, but there`
Pablo Greco	40546a	`- * really isn't a way to completely avoid that. We assume the guest`
Pablo Greco	40546a	`- * requests a 64-bit DMA window (per PHB) just big enough to map all`
Pablo Greco	40546a	`- * its RAM. 4 kiB page size gives the 1/512; it will be less with 64`
Pablo Greco	40546a	`- * kiB pages, less still if the guest is mapped with hugepages (unlike`
Pablo Greco	40546a	`- * the default 32-bit DMA window, DDW windows can use large IOMMU`
Pablo Greco	40546a	`- * pages). 8 MiB is for second and further level overheads, like (b) */`
Pablo Greco	40546a	`- if (usesVFIO)`
Pablo Greco	40546a	`+ * Note that the NVIDIA RAM window must be accounted for the TCE`
Pablo Greco	40546a	`+ * table size, but not for the main RAM (maxMemory). This gives`
Pablo Greco	40546a	`+ * us the following passthroughLimit for the NVLink2 case:`
Pablo Greco	40546a	`+ *`
Pablo Greco	40546a	`+ * passthroughLimit = maxMemory +`
Pablo Greco	40546a	`+ * 128TiB/512KiB * #PHBs + 8 MiB */`
Pablo Greco	40546a	`+ if (nvlink2Capable) {`
Pablo Greco	40546a	`+ passthroughLimit = maxMemory +`
Pablo Greco	40546a	`+ 128 * (1ULL<<30) / 512 * nPCIHostBridges +`
Pablo Greco	40546a	`+ 8192;`
Pablo Greco	40546a	`+ } else if (usesVFIO) {`
Pablo Greco	40546a	`+ /* For regular (non-NVLink2 present) VFIO passthrough, the value`
Pablo Greco	40546a	`+ * of passthroughLimit is:`
Pablo Greco	40546a	`+ *`
Pablo Greco	40546a	`+ * passthroughLimit := max( 2 GiB * #PHBs, (c)`
Pablo Greco	40546a	`+ * memory (d)`
Pablo Greco	40546a	`+ * + memory * 1/512 * #PHBs + 8 MiB ) (e)`
Pablo Greco	40546a	`+ *`
Pablo Greco	40546a	`+ * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2`
Pablo Greco	40546a	`+ * GiB rather than 1 GiB`
Pablo Greco	40546a	`+ *`
Pablo Greco	40546a	`+ * (d) is the with-DDW (and memory pre-registration and related`
Pablo Greco	40546a	`+ * features) DMA window accounting - assuming that we only account`
Pablo Greco	40546a	`+ * RAM once, even if mapped to multiple PHBs`
Pablo Greco	40546a	`+ *`
Pablo Greco	40546a	`+ * (e) is the with-DDW userspace view and overhead for the 64-bit`
Pablo Greco	40546a	`+ * DMA window. This is based a bit on expected guest behaviour, but`
Pablo Greco	40546a	`+ * there really isn't a way to completely avoid that. We assume the`
Pablo Greco	40546a	`+ * guest requests a 64-bit DMA window (per PHB) just big enough to`
Pablo Greco	40546a	`+ * map all its RAM. 4 kiB page size gives the 1/512; it will be`
Pablo Greco	40546a	`+ * less with 64 kiB pages, less still if the guest is mapped with`
Pablo Greco	40546a	`+ * hugepages (unlike the default 32-bit DMA window, DDW windows`
Pablo Greco	40546a	`+ * can use large IOMMU pages). 8 MiB is for second and further level`
Pablo Greco	40546a	`+ * overheads, like (b) */`
Pablo Greco	40546a	`passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,`
Pablo Greco	40546a	`memory +`
Pablo Greco	40546a	`memory / 512 * nPCIHostBridges + 8192);`
Pablo Greco	40546a	`+ }`
Pablo Greco	40546a
Pablo Greco	40546a	`memKB = baseLimit + passthroughLimit;`
Pablo Greco	40546a
Pablo Greco	40546a	`--`
Pablo Greco	40546a	`2.21.0`
Pablo Greco	40546a
