Tree - rpms/qemu-kvm - CentOS Git server


Forked from rpms/qemu-kvm 2 years ago


Blame SOURCES/kvm-spapr-Support-NVIDIA-V100-GPU-with-NVLink2.patch


		b38b0f	`From 5dc7b745eb04e799b95e7e8d17868970a65621df Mon Sep 17 00:00:00 2001`
		b38b0f	`From: David Gibson <dgibson@redhat.com>`
		b38b0f	`Date: Thu, 30 May 2019 04:37:28 +0100`
		b38b0f	`Subject: [PATCH 7/8] spapr: Support NVIDIA V100 GPU with NVLink2`
		b38b0f
		b38b0f	`RH-Author: David Gibson <dgibson@redhat.com>`
		b38b0f	`Message-id: <20190530043728.32575-7-dgibson@redhat.com>`
		b38b0f	`Patchwork-id: 88423`
		b38b0f	`O-Subject: [RHEL-8.1 qemu-kvm PATCH 6/6] spapr: Support NVIDIA V100 GPU with NVLink2`
		b38b0f	`Bugzilla: 1710662`
		b38b0f	`RH-Acked-by: Laurent Vivier <lvivier@redhat.com>`
		b38b0f	`RH-Acked-by: Auger Eric <eric.auger@redhat.com>`
		b38b0f	`RH-Acked-by: Cornelia Huck <cohuck@redhat.com>`
		b38b0f
		b38b0f	`From: Alexey Kardashevskiy <aik@ozlabs.ru>`
		b38b0f
		b38b0f	`NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory`
		b38b0f	`space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver`
		b38b0f	`implements special regions for such GPUs and emulates an NVLink bridge.`
		b38b0f	`NVLink2-enabled POWER9 CPUs also provide address translation services`
		b38b0f	`which include an ATS shootdown (ATSD) register exported via the NVLink`
		b38b0f	`bridge device.`
		b38b0f
		b38b0f	`This adds a quirk to VFIO to map the GPU memory and create an MR;`
		b38b0f	`the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses`
		b38b0f	`this to get the MR and map it to the system address space.`
		b38b0f	`Another quirk does the same for ATSD.`
		b38b0f
		b38b0f	`This adds additional steps to sPAPR PHB setup:`
		b38b0f
		b38b0f	`1. Search for specific GPUs and NPUs, collect findings in`
		b38b0f	`sPAPRPHBState::nvgpus, manage system address space mappings;`
		b38b0f
		b38b0f	`2. Add device-specific properties such as "ibm,npu", "ibm,gpu",`
		b38b0f	`"memory-block", "link-speed" to advertise the NVLink2 function to`
		b38b0f	`the guest;`
		b38b0f
		b38b0f	`3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;`
		b38b0f
		b38b0f	`4. Add new memory blocks (with extra "linux,usable-memory" to prevent`
		b38b0f	`the guest OS from accessing the new memory until it is onlined) and`
		b38b0f	`npuphb# nodes representing an NPU unit for every vPHB as the GPU driver`
		b38b0f	`uses it for link discovery.`
		b38b0f
		b38b0f	`This allocates space for GPU RAM and ATSD like we do for MMIOs by`
		b38b0f	`adding 2 new parameters to the phb_placement() hook. Older machine types`
		b38b0f	`set these to zero.`
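The window placement described above can be sketched in plain C, outside QEMU. This is a minimal illustration under stated assumptions: the `EX_` constants and their values are hypothetical (they are not the patch's `SPAPR_PCI_NV2*` definitions), `uint64_t` stands in for `hwaddr`, and both function names are invented for the example. Each vPHB index gets its own slice of a GPU RAM window and of an ATSD window, while a legacy placement hook reports zero for both.

```c
#include <stdint.h>

/* Hypothetical constants for illustration; not the values used by the patch. */
#define EX_NV2RAM_WIN_BASE  0x10000000000ULL /* assumed start of GPU RAM windows */
#define EX_NV2RAM_WIN_SIZE  0x02000000000ULL /* assumed per-vPHB GPU RAM window */
#define EX_NV2ATSD_WIN_BASE 0x90000000000ULL /* assumed start of ATSD windows */
#define EX_NV2ATSD_WIN_SIZE 0x00000010000ULL /* assumed per-vPHB ATSD window */

/* New-style placement: window base plus index times window size. */
static void ex_phb_placement(uint32_t index, uint64_t *nv2gpa,
                             uint64_t *nv2atsd)
{
    *nv2gpa = EX_NV2RAM_WIN_BASE + index * EX_NV2RAM_WIN_SIZE;
    *nv2atsd = EX_NV2ATSD_WIN_BASE + index * EX_NV2ATSD_WIN_SIZE;
}

/* Legacy placement: no NVLink2 windows, so both addresses are zero. */
static void ex_phb_placement_legacy(uint32_t index, uint64_t *nv2gpa,
                                    uint64_t *nv2atsd)
{
    (void)index;
    *nv2gpa = 0;
    *nv2atsd = 0;
}
```

A zero address is what lets later setup code treat NVLink2 support as absent for older machine types.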
		b38b0f
		b38b0f	`This puts new memory nodes in a separate NUMA node as the GPU RAM`
		b38b0f	`needs to be configured equally distant from any other node in the system.`
		b38b0f	`Unlike the host setup which assigns numa ids from 255 downwards, this`
		b38b0f	`adds new NUMA nodes after the user configures nodes or from 1 if none`
		b38b0f	`were configured.`
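The numbering rule above can be sketched as a small standalone C helper. It assumes a hypothetical `ex_assign_gpu_numa_ids()` function (not part of the patch) that mirrors the described behaviour: the counter starts at the number of user-configured nodes, or at 1 if none were configured, and each GPU takes the next ID.

```c
#include <stddef.h>

/*
 * Hypothetical helper: fill ids[0..ngpus-1] with guest NUMA node IDs for
 * GPU RAM and return the counter value after the last assignment.
 */
static unsigned ex_assign_gpu_numa_ids(unsigned nb_numa_nodes,
                                       unsigned *ids, size_t ngpus)
{
    /* Same starting point as MAX(1, nb_numa_nodes). */
    unsigned next = nb_numa_nodes > 1 ? nb_numa_nodes : 1;
    size_t i;

    for (i = 0; i < ngpus; ++i) {
        ids[i] = next++;
    }
    return next;
}
```

With no configured nodes and two GPUs, this yields IDs 1 and 2 and a final counter of 3. According to the comment in the patch, the final counter value is what gets written to max-associativity-domains.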
		b38b0f
		b38b0f	`This adds a requirement similar to EEH - one IOMMU group per vPHB.`
		b38b0f	`The reason for this is that ATSD registers belong to a physical NPU`
		b38b0f	`so they cannot invalidate translations on GPUs attached to another NPU.`
		b38b0f	`It is guaranteed by the host platform as it does not mix NVLink bridges`
		b38b0f	`or GPUs from different NPUs in the same IOMMU group. If more than one`
		b38b0f	`IOMMU group is detected on a vPHB, this disables ATSD support for that`
		b38b0f	`vPHB and prints a warning.`
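The one-group-per-vPHB rule can be illustrated with a standalone check. This sketch assumes a hypothetical array of IOMMU group IDs, one per passed-through device on a vPHB, and a hypothetical helper name; the patch itself relies on the existing `spapr_phb_eeh_available()` check rather than a helper like this.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if every device on the vPHB shares one
 * IOMMU group (an empty list also counts as a single group).
 */
static bool ex_single_iommu_group(const int *group_ids, size_t n)
{
    size_t i;

    for (i = 1; i < n; ++i) {
        if (group_ids[i] != group_ids[0]) {
            return false; /* mixed groups: ATSD would not be advertised */
        }
    }
    return true;
}
```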
		b38b0f
		b38b0f	`Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>`
		b38b0f	`[aw: for vfio portions]`
		b38b0f	`Acked-by: Alex Williamson <alex.williamson@redhat.com>`
		b38b0f	`Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>`
		b38b0f	`Signed-off-by: David Gibson <david@gibson.dropbear.id.au>`
		b38b0f	`(cherry picked from commit ec132efaa81f09861a3bd6afad94827e74543b3f)`
		b38b0f
		b38b0f	`Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>`
		b38b0f
		b38b0f	`Conflicts:`
		b38b0f	`hw/ppc/spapr.c`
		b38b0f	`hw/ppc/spapr_pci.c`
		b38b0f	`hw/vfio/trace-events`
		b38b0f	`include/hw/pci-host/spapr.h`
		b38b0f	`include/hw/ppc/spapr.h`
		b38b0f
		b38b0f	`Conflicts come for several reasons:`
		b38b0f	`1) Some contextual conflicts`
		b38b0f	`2) Downstream tree does not have PHB hotplug, so upstream changes to`
		b38b0f	`that code need to be dropped, we also need to adapt some hunks to`
		b38b0f	`apply to the code as it existed before PHB hotplug was added`
		b38b0f	`3) Upstream had a mass renaming of spapr types to give more`
		b38b0f	`consistent CamelCasing. We don't have that change downstream, so`
		b38b0f	`we need to adjust accordingly.`
		b38b0f	`4) We add an explicit include of qemu/units.h, since it's not indirectly`
		b38b0f	`included downstream (and it's messy to backport the patch which adds`
		b38b0f	`that)`
		b38b0f
		b38b0f	`Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1710662`
		b38b0f
		b38b0f	`Signed-off-by: David Gibson <dgibson@redhat.com>`
		b38b0f	`Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>`
		b38b0f	`---`
		b38b0f	`hw/ppc/Makefile.objs | 2 +-`
		b38b0f	`hw/ppc/spapr.c | 31 ++-`
		b38b0f	`hw/ppc/spapr_pci.c | 21 ++-`
		b38b0f	`hw/ppc/spapr_pci_nvlink2.c | 450 ++++++++++++++++++++++++++++++++++++++++++++`
		b38b0f	`hw/vfio/pci-quirks.c | 131 +++++++++++++`
		b38b0f	`hw/vfio/pci.c | 14 ++`
		b38b0f	`hw/vfio/pci.h | 2 +`
		b38b0f	`hw/vfio/trace-events | 4 +`
		b38b0f	`include/hw/pci-host/spapr.h | 46 +++++`
		b38b0f	`include/hw/ppc/spapr.h | 5 +-`
		b38b0f	`10 files changed, 697 insertions(+), 9 deletions(-)`
		b38b0f	`create mode 100644 hw/ppc/spapr_pci_nvlink2.c`
		b38b0f
		b38b0f	`diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs`
		b38b0f	`index a46a989..d07e999 100644`
		b38b0f	`--- a/hw/ppc/Makefile.objs`
		b38b0f	`+++ b/hw/ppc/Makefile.objs`
		b38b0f	`@@ -8,7 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o`
		b38b0f	`# IBM PowerNV`
		b38b0f	`obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o`
		b38b0f	`ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)`
		b38b0f	`-obj-y += spapr_pci_vfio.o`
		b38b0f	`+obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o`
		b38b0f	`endif`
		b38b0f	`obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o`
		b38b0f	`# PowerPC 4xx boards`
		b38b0f	`diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c`
		b38b0f	`index b57c0be..c72aad1 100644`
		b38b0f	`--- a/hw/ppc/spapr.c`
		b38b0f	`+++ b/hw/ppc/spapr.c`
		b38b0f	`@@ -910,12 +910,13 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)`
		b38b0f	`0, cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE),`
		b38b0f	`cpu_to_be32(max_cpus / smp_threads),`
		b38b0f	`};`
		b38b0f	`+ uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);`
		b38b0f	`uint32_t maxdomains[] = {`
		b38b0f	`cpu_to_be32(4),`
		b38b0f	`- cpu_to_be32(0),`
		b38b0f	`- cpu_to_be32(0),`
		b38b0f	`- cpu_to_be32(0),`
		b38b0f	`- cpu_to_be32(nb_numa_nodes ? nb_numa_nodes : 1),`
		b38b0f	`+ maxdomain,`
		b38b0f	`+ maxdomain,`
		b38b0f	`+ maxdomain,`
		b38b0f	`+ cpu_to_be32(spapr->gpu_numa_id),`
		b38b0f	`};`
		b38b0f
		b38b0f	`_FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));`
		b38b0f	`@@ -1515,6 +1516,16 @@ static void spapr_machine_reset(void)`
		b38b0f	`ppc_set_compat(first_ppc_cpu, spapr->max_compat_pvr, &error_fatal);`
		b38b0f	`}`
		b38b0f
		b38b0f	`+ /*`
		b38b0f	`+ * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.`
		b38b0f	`+ * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is`
		b38b0f	`+ * called from vPHB reset handler so we initialize the counter here.`
		b38b0f	`+ * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM`
		b38b0f	`+ * must be equally distant from any other node.`
		b38b0f	`+ * The final value of spapr->gpu_numa_id is going to be written to`
		b38b0f	`+ * max-associativity-domains in spapr_build_fdt().`
		b38b0f	`+ */`
		b38b0f	`+ spapr->gpu_numa_id = MAX(1, nb_numa_nodes);`
		b38b0f	`qemu_devices_reset();`
		b38b0f
		b38b0f	`/* DRC reset may cause a device to be unplugged. This will cause troubles`
		b38b0f	`@@ -3601,7 +3612,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)`
		b38b0f	`static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,`
		b38b0f	`uint64_t *buid, hwaddr *pio,`
		b38b0f	`hwaddr *mmio32, hwaddr *mmio64,`
		b38b0f	`- unsigned n_dma, uint32_t *liobns, Error **errp)`
		b38b0f	`+ unsigned n_dma, uint32_t *liobns,`
		b38b0f	`+ hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)`
		b38b0f	`{`
		b38b0f	`/*`
		b38b0f	`* New-style PHB window placement.`
		b38b0f	`@@ -3648,6 +3660,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,`
		b38b0f	`*pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;`
		b38b0f	`*mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;`
		b38b0f	`*mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;`
		b38b0f	`+`
		b38b0f	`+ *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;`
		b38b0f	`+ *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;`
		b38b0f	`}`
		b38b0f
		b38b0f	`static ICSState *spapr_ics_get(XICSFabric *dev, int irq)`
		b38b0f	`@@ -4133,7 +4148,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);`
		b38b0f	`static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,`
		b38b0f	`uint64_t *buid, hwaddr *pio,`
		b38b0f	`hwaddr *mmio32, hwaddr *mmio64,`
		b38b0f	`- unsigned n_dma, uint32_t *liobns, Error **errp)`
		b38b0f	`+ unsigned n_dma, uint32_t *liobns,`
		b38b0f	`+ hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)`
		b38b0f	`{`
		b38b0f	`/* Legacy PHB placement for pseries-2.7 and earlier machine types */`
		b38b0f	`const uint64_t base_buid = 0x800000020000000ULL;`
		b38b0f	`@@ -4177,6 +4193,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,`
		b38b0f	`* fallback behaviour of automatically splitting a large "32-bit"`
		b38b0f	`* window into contiguous 32-bit and 64-bit windows`
		b38b0f	`*/`
		b38b0f	`+`
		b38b0f	`+ *nv2gpa = 0;`
		b38b0f	`+ *nv2atsd = 0;`
		b38b0f	`}`
		b38b0f
		b38b0f	`#if 0 /* Disabled for Red Hat Enterprise Linux */`
		b38b0f	`diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c`
		b38b0f	`index f936ce6..d82f957 100644`
		b38b0f	`--- a/hw/ppc/spapr_pci.c`
		b38b0f	`+++ b/hw/ppc/spapr_pci.c`
		b38b0f	`@@ -1326,6 +1326,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,`
		b38b0f	`if (sphb->pcie_ecs && pci_is_express(dev)) {`
		b38b0f	`_FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));`
		b38b0f	`}`
		b38b0f	`+`
		b38b0f	`+ spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);`
		b38b0f	`}`
		b38b0f
		b38b0f	`/* create OF node for pci device and required OF DT properties */`
		b38b0f	`@@ -1559,7 +1561,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)`
		b38b0f	`smc->phb_placement(spapr, sphb->index,`
		b38b0f	`&sphb->buid, &sphb->io_win_addr,`
		b38b0f	`&sphb->mem_win_addr, &sphb->mem64_win_addr,`
		b38b0f	`- windows_supported, sphb->dma_liobn, &local_err);`
		b38b0f	`+ windows_supported, sphb->dma_liobn,`
		b38b0f	`+ &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,`
		b38b0f	`+ &local_err);`
		b38b0f	`if (local_err) {`
		b38b0f	`error_propagate(errp, local_err);`
		b38b0f	`return;`
		b38b0f	`@@ -1764,8 +1768,14 @@ void spapr_phb_dma_reset(sPAPRPHBState *sphb)`
		b38b0f	`static void spapr_phb_reset(DeviceState *qdev)`
		b38b0f	`{`
		b38b0f	`sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);`
		b38b0f	`+ Error *errp = NULL;`
		b38b0f
		b38b0f	`spapr_phb_dma_reset(sphb);`
		b38b0f	`+ spapr_phb_nvgpu_free(sphb);`
		b38b0f	`+ spapr_phb_nvgpu_setup(sphb, &errp);`
		b38b0f	`+ if (errp) {`
		b38b0f	`+ error_report_err(errp);`
		b38b0f	`+ }`
		b38b0f
		b38b0f	`/* Reset the IOMMU state */`
		b38b0f	`object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);`
		b38b0f	`@@ -1798,6 +1808,8 @@ static Property spapr_phb_properties[] = {`
		b38b0f	`pre_2_8_migration, false),`
		b38b0f	`DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,`
		b38b0f	`pcie_ecs, true),`
		b38b0f	`+ DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),`
		b38b0f	`+ DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),`
		b38b0f	`DEFINE_PROP_END_OF_LIST(),`
		b38b0f	`};`
		b38b0f
		b38b0f	`@@ -2089,6 +2101,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,`
		b38b0f	`sPAPRTCETable *tcet;`
		b38b0f	`PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;`
		b38b0f	`sPAPRFDT s_fdt;`
		b38b0f	`+ Error *errp = NULL;`
		b38b0f
		b38b0f	`/* Start populating the FDT */`
		b38b0f	`nodename = g_strdup_printf("pci@%" PRIx64, phb->buid);`
		b38b0f	`@@ -2170,6 +2183,12 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,`
		b38b0f	`return ret;`
		b38b0f	`}`
		b38b0f
		b38b0f	`+ spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off, &errp);`
		b38b0f	`+ if (errp) {`
		b38b0f	`+ error_report_err(errp);`
		b38b0f	`+ }`
		b38b0f	`+ spapr_phb_nvgpu_ram_populate_dt(phb, fdt);`
		b38b0f	`+`
		b38b0f	`return 0;`
		b38b0f	`}`
		b38b0f
		b38b0f	`diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c`
		b38b0f	`new file mode 100644`
		b38b0f	`index 0000000..60b14d8`
		b38b0f	`--- /dev/null`
		b38b0f	`+++ b/hw/ppc/spapr_pci_nvlink2.c`
		b38b0f	`@@ -0,0 +1,450 @@`
		b38b0f	`+/*`
		b38b0f	`+ * QEMU sPAPR PCI for NVLink2 pass through`
		b38b0f	`+ *`
		b38b0f	`+ * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.`
		b38b0f	`+ *`
		b38b0f	`+ * Permission is hereby granted, free of charge, to any person obtaining a copy`
		b38b0f	`+ * of this software and associated documentation files (the "Software"), to deal`
		b38b0f	`+ * in the Software without restriction, including without limitation the rights`
		b38b0f	`+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell`
		b38b0f	`+ * copies of the Software, and to permit persons to whom the Software is`
		b38b0f	`+ * furnished to do so, subject to the following conditions:`
		b38b0f	`+ *`
		b38b0f	`+ * The above copyright notice and this permission notice shall be included in`
		b38b0f	`+ * all copies or substantial portions of the Software.`
		b38b0f	`+ *`
		b38b0f	`+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR`
		b38b0f	`+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,`
		b38b0f	`+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL`
		b38b0f	`+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER`
		b38b0f	`+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,`
		b38b0f	`+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN`
		b38b0f	`+ * THE SOFTWARE.`
		b38b0f	`+ */`
		b38b0f	`+#include "qemu/osdep.h"`
		b38b0f	`+#include "qapi/error.h"`
		b38b0f	`+#include "qemu-common.h"`
		b38b0f	`+#include "hw/pci/pci.h"`
		b38b0f	`+#include "hw/pci-host/spapr.h"`
		b38b0f	`+#include "qemu/error-report.h"`
		b38b0f	`+#include "hw/ppc/fdt.h"`
		b38b0f	`+#include "hw/pci/pci_bridge.h"`
		b38b0f	`+`
		b38b0f	`+#define PHANDLE_PCIDEV(phb, pdev) (0x12000000 | \`
		b38b0f	`+ (((phb)->index) << 16) | ((pdev)->devfn))`
		b38b0f	`+#define PHANDLE_GPURAM(phb, n) (0x110000FF | ((n) << 8) | \`
		b38b0f	`+ (((phb)->index) << 16))`
		b38b0f	`+#define PHANDLE_NVLINK(phb, gn, nn) (0x00130000 | (((phb)->index) << 8) | \`
		b38b0f	`+ ((gn) << 4) | (nn))`
		b38b0f	`+`
		b38b0f	`+#define SPAPR_GPU_NUMA_ID (cpu_to_be32(1))`
		b38b0f	`+`
		b38b0f	`+struct spapr_phb_pci_nvgpu_config {`
		b38b0f	`+ uint64_t nv2_ram_current;`
		b38b0f	`+ uint64_t nv2_atsd_current;`
		b38b0f	`+ int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_slot {`
		b38b0f	`+ uint64_t tgt;`
		b38b0f	`+ uint64_t gpa;`
		b38b0f	`+ unsigned numa_id;`
		b38b0f	`+ PCIDevice *gpdev;`
		b38b0f	`+ int linknum;`
		b38b0f	`+ struct {`
		b38b0f	`+ uint64_t atsd_gpa;`
		b38b0f	`+ PCIDevice *npdev;`
		b38b0f	`+ uint32_t link_speed;`
		b38b0f	`+ } links[NVGPU_MAX_LINKS];`
		b38b0f	`+ } slots[NVGPU_MAX_NUM];`
		b38b0f	`+ Error *errp;`
		b38b0f	`+};`
		b38b0f	`+`
		b38b0f	`+static struct spapr_phb_pci_nvgpu_slot *`
		b38b0f	`+spapr_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus, uint64_t tgt)`
		b38b0f	`+{`
		b38b0f	`+ int i;`
		b38b0f	`+`
		b38b0f	`+ /* Search for partially collected "slot" */`
		b38b0f	`+ for (i = 0; i < nvgpus->num; ++i) {`
		b38b0f	`+ if (nvgpus->slots[i].tgt == tgt) {`
		b38b0f	`+ return &nvgpus->slots[i];`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {`
		b38b0f	`+ return NULL;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ i = nvgpus->num;`
		b38b0f	`+ nvgpus->slots[i].tgt = tgt;`
		b38b0f	`+ ++nvgpus->num;`
		b38b0f	`+`
		b38b0f	`+ return &nvgpus->slots[i];`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,`
		b38b0f	`+ PCIDevice *pdev, uint64_t tgt,`
		b38b0f	`+ MemoryRegion *mr, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+ MachineState *machine = MACHINE(qdev_get_machine());`
		b38b0f	`+ sPAPRMachineState *spapr = SPAPR_MACHINE(machine);`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_slot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);`
		b38b0f	`+`
		b38b0f	`+ if (!nvslot) {`
		b38b0f	`+ error_setg(errp, "Found too many GPUs per vPHB");`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+ g_assert(!nvslot->gpdev);`
		b38b0f	`+ nvslot->gpdev = pdev;`
		b38b0f	`+`
		b38b0f	`+ nvslot->gpa = nvgpus->nv2_ram_current;`
		b38b0f	`+ nvgpus->nv2_ram_current += memory_region_size(mr);`
		b38b0f	`+ nvslot->numa_id = spapr->gpu_numa_id;`
		b38b0f	`+ ++spapr->gpu_numa_id;`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,`
		b38b0f	`+ PCIDevice *pdev, uint64_t tgt,`
		b38b0f	`+ MemoryRegion *mr, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_slot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);`
		b38b0f	`+ int j;`
		b38b0f	`+`
		b38b0f	`+ if (!nvslot) {`
		b38b0f	`+ error_setg(errp, "Found too many NVLink bridges per vPHB");`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ j = nvslot->linknum;`
		b38b0f	`+ if (j == ARRAY_SIZE(nvslot->links)) {`
		b38b0f	`+ error_setg(errp, "Found too many NVLink bridges per GPU");`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+ ++nvslot->linknum;`
		b38b0f	`+`
		b38b0f	`+ g_assert(!nvslot->links[j].npdev);`
		b38b0f	`+ nvslot->links[j].npdev = pdev;`
		b38b0f	`+ nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;`
		b38b0f	`+ nvgpus->nv2_atsd_current += memory_region_size(mr);`
		b38b0f	`+ nvslot->links[j].link_speed =`
		b38b0f	`+ object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,`
		b38b0f	`+ void *opaque)`
		b38b0f	`+{`
		b38b0f	`+ PCIBus *sec_bus;`
		b38b0f	`+ Object *po = OBJECT(pdev);`
		b38b0f	`+ uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);`
		b38b0f	`+`
		b38b0f	`+ if (tgt) {`
		b38b0f	`+ Error *local_err = NULL;`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_config *nvgpus = opaque;`
		b38b0f	`+ Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);`
		b38b0f	`+ Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",`
		b38b0f	`+ NULL);`
		b38b0f	`+`
		b38b0f	`+ g_assert(mr_gpu || mr_npu);`
		b38b0f	`+ if (mr_gpu) {`
		b38b0f	`+ spapr_pci_collect_nvgpu(nvgpus, pdev, tgt, MEMORY_REGION(mr_gpu),`
		b38b0f	`+ &local_err);`
		b38b0f	`+ } else {`
		b38b0f	`+ spapr_pci_collect_nvnpu(nvgpus, pdev, tgt, MEMORY_REGION(mr_npu),`
		b38b0f	`+ &local_err);`
		b38b0f	`+ }`
		b38b0f	`+ error_propagate(&nvgpus->errp, local_err);`
		b38b0f	`+ }`
		b38b0f	`+ if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=`
		b38b0f	`+ PCI_HEADER_TYPE_BRIDGE)) {`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));`
		b38b0f	`+ if (!sec_bus) {`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ pci_for_each_device(sec_bus, pci_bus_num(sec_bus),`
		b38b0f	`+ spapr_phb_pci_collect_nvgpu, opaque);`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+ int i, j, valid_gpu_num;`
		b38b0f	`+ PCIBus *bus;`
		b38b0f	`+`
		b38b0f	`+ /* Search for GPUs and NPUs */`
		b38b0f	`+ if (!sphb->nv2_gpa_win_addr || !sphb->nv2_atsd_win_addr) {`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);`
		b38b0f	`+ sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;`
		b38b0f	`+ sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;`
		b38b0f	`+`
		b38b0f	`+ bus = PCI_HOST_BRIDGE(sphb)->bus;`
		b38b0f	`+ pci_for_each_device(bus, pci_bus_num(bus),`
		b38b0f	`+ spapr_phb_pci_collect_nvgpu, sphb->nvgpus);`
		b38b0f	`+`
		b38b0f	`+ if (sphb->nvgpus->errp) {`
		b38b0f	`+ error_propagate(errp, sphb->nvgpus->errp);`
		b38b0f	`+ sphb->nvgpus->errp = NULL;`
		b38b0f	`+ goto cleanup_exit;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ /* Add found GPU RAM and ATSD MRs if found */`
		b38b0f	`+ for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {`
		b38b0f	`+ Object *nvmrobj;`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];`
		b38b0f	`+`
		b38b0f	`+ if (!nvslot->gpdev) {`
		b38b0f	`+ continue;`
		b38b0f	`+ }`
		b38b0f	`+ nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),`
		b38b0f	`+ "nvlink2-mr[0]", NULL);`
		b38b0f	`+ /* ATSD is pointless without GPU RAM MR so skip those */`
		b38b0f	`+ if (!nvmrobj) {`
		b38b0f	`+ continue;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ ++valid_gpu_num;`
		b38b0f	`+ memory_region_add_subregion(get_system_memory(), nvslot->gpa,`
		b38b0f	`+ MEMORY_REGION(nvmrobj));`
		b38b0f	`+`
		b38b0f	`+ for (j = 0; j < nvslot->linknum; ++j) {`
		b38b0f	`+ Object *atsdmrobj;`
		b38b0f	`+`
		b38b0f	`+ atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),`
		b38b0f	`+ "nvlink2-atsd-mr[0]", NULL);`
		b38b0f	`+ if (!atsdmrobj) {`
		b38b0f	`+ continue;`
		b38b0f	`+ }`
		b38b0f	`+ memory_region_add_subregion(get_system_memory(),`
		b38b0f	`+ nvslot->links[j].atsd_gpa,`
		b38b0f	`+ MEMORY_REGION(atsdmrobj));`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ if (valid_gpu_num) {`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+ /* We did not find any interesting GPU */`
		b38b0f	`+cleanup_exit:`
		b38b0f	`+ g_free(sphb->nvgpus);`
		b38b0f	`+ sphb->nvgpus = NULL;`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)`
		b38b0f	`+{`
		b38b0f	`+ int i, j;`
		b38b0f	`+`
		b38b0f	`+ if (!sphb->nvgpus) {`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ for (i = 0; i < sphb->nvgpus->num; ++i) {`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];`
		b38b0f	`+ Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),`
		b38b0f	`+ "nvlink2-mr[0]", NULL);`
		b38b0f	`+`
		b38b0f	`+ if (nv_mrobj) {`
		b38b0f	`+ memory_region_del_subregion(get_system_memory(),`
		b38b0f	`+ MEMORY_REGION(nv_mrobj));`
		b38b0f	`+ }`
		b38b0f	`+ for (j = 0; j < nvslot->linknum; ++j) {`
		b38b0f	`+ PCIDevice *npdev = nvslot->links[j].npdev;`
		b38b0f	`+ Object *atsd_mrobj;`
		b38b0f	`+ atsd_mrobj = object_property_get_link(OBJECT(npdev),`
		b38b0f	`+ "nvlink2-atsd-mr[0]", NULL);`
		b38b0f	`+ if (atsd_mrobj) {`
		b38b0f	`+ memory_region_del_subregion(get_system_memory(),`
		b38b0f	`+ MEMORY_REGION(atsd_mrobj));`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+ g_free(sphb->nvgpus);`
		b38b0f	`+ sphb->nvgpus = NULL;`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off,`
		b38b0f	`+ Error **errp)`
		b38b0f	`+{`
		b38b0f	`+ int i, j, atsdnum = 0;`
		b38b0f	`+ uint64_t atsd[8]; /* The existing limitation of known guests */`
		b38b0f	`+`
		b38b0f	`+ if (!sphb->nvgpus) {`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];`
		b38b0f	`+`
		b38b0f	`+ if (!nvslot->gpdev) {`
		b38b0f	`+ continue;`
		b38b0f	`+ }`
		b38b0f	`+ for (j = 0; j < nvslot->linknum; ++j) {`
		b38b0f	`+ if (!nvslot->links[j].atsd_gpa) {`
		b38b0f	`+ continue;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ if (atsdnum == ARRAY_SIZE(atsd)) {`
		b38b0f	`+ error_report("Only %"PRIuPTR" ATSD registers supported",`
		b38b0f	`+ ARRAY_SIZE(atsd));`
		b38b0f	`+ break;`
		b38b0f	`+ }`
		b38b0f	`+ atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);`
		b38b0f	`+ ++atsdnum;`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ if (!atsdnum) {`
		b38b0f	`+ error_setg(errp, "No ATSD registers found");`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ if (!spapr_phb_eeh_available(sphb)) {`
		b38b0f	`+ /*`
		b38b0f	`+ * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB`
		b38b0f	`+ * which we do not emulate as a separate device. Instead we put`
		b38b0f	`+ * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not`
		b38b0f	`+ * put GPUs from different IOMMU groups to the same vPHB to ensure`
		b38b0f	`+ * that the guest will use ATSDs from the corresponding NPU.`
		b38b0f	`+ */`
		b38b0f	`+ error_setg(errp, "ATSD requires separate vPHB per GPU IOMMU group");`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd", atsd,`
		b38b0f	`+ atsdnum * sizeof(atsd[0]))));`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)`
		b38b0f	`+{`
		b38b0f	`+ int i, j, linkidx, npuoff;`
		b38b0f	`+ char *npuname;`
		b38b0f	`+`
		b38b0f	`+ if (!sphb->nvgpus) {`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ npuname = g_strdup_printf("npuphb%d", sphb->index);`
		b38b0f	`+ npuoff = fdt_add_subnode(fdt, 0, npuname);`
		b38b0f	`+ _FDT(npuoff);`
		b38b0f	`+ _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));`
		b38b0f	`+ _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));`
		b38b0f	`+ /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */`
		b38b0f	`+ _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));`
		b38b0f	`+ g_free(npuname);`
		b38b0f	`+`
		b38b0f	`+ for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {`
		b38b0f	`+ for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {`
		b38b0f	`+ char *linkname = g_strdup_printf("link@%d", linkidx);`
		b38b0f	`+ int off = fdt_add_subnode(fdt, npuoff, linkname);`
		b38b0f	`+`
		b38b0f	`+ _FDT(off);`
		b38b0f	`+ /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx))); */`
		b38b0f	`+ _FDT((fdt_setprop_string(fdt, off, "compatible",`
		b38b0f	`+ "ibm,npu-link")));`
		b38b0f	`+ _FDT((fdt_setprop_cell(fdt, off, "phandle",`
		b38b0f	`+ PHANDLE_NVLINK(sphb, i, j))));`
		b38b0f	`+ _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));`
		b38b0f	`+ g_free(linkname);`
		b38b0f	`+ ++linkidx;`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ /* Add memory nodes for GPU RAM and mark them unusable */`
		b38b0f	`+ for (i = 0; i < sphb->nvgpus->num; ++i) {`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];`
		b38b0f	`+ Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),`
		b38b0f	`+ "nvlink2-mr[0]", NULL);`
		b38b0f	`+ uint32_t associativity[] = {`
		b38b0f	`+ cpu_to_be32(0x4),`
		b38b0f	`+ SPAPR_GPU_NUMA_ID,`
		b38b0f	`+ SPAPR_GPU_NUMA_ID,`
		b38b0f	`+ SPAPR_GPU_NUMA_ID,`
		b38b0f	`+ cpu_to_be32(nvslot->numa_id)`
		b38b0f	`+ };`
		b38b0f	`+ uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);`
		b38b0f	`+ uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };`
		b38b0f	`+ char *mem_name = g_strdup_printf("memory@%"PRIx64, nvslot->gpa);`
		b38b0f	`+ int off = fdt_add_subnode(fdt, 0, mem_name);`
		b38b0f	`+`
		b38b0f	`+ _FDT(off);`
		b38b0f	`+ _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));`
		b38b0f	`+ _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));`
		b38b0f	`+ _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,`
		b38b0f	`+ sizeof(associativity))));`
		b38b0f	`+`
		b38b0f	`+ _FDT((fdt_setprop_string(fdt, off, "compatible",`
		b38b0f	`+ "ibm,coherent-device-memory")));`
		b38b0f	`+`
		b38b0f	`+ mem_reg[1] = cpu_to_be64(0);`
		b38b0f	`+ _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,`
		b38b0f	`+ sizeof(mem_reg))));`
		b38b0f	`+ _FDT((fdt_setprop_cell(fdt, off, "phandle",`
		b38b0f	`+ PHANDLE_GPURAM(sphb, i))));`
		b38b0f	`+ g_free(mem_name);`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,`
		b38b0f	`+ sPAPRPHBState *sphb)`
		b38b0f	`+{`
		b38b0f	`+ int i, j;`
		b38b0f	`+`
		b38b0f	`+ if (!sphb->nvgpus) {`
		b38b0f	`+ return;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ for (i = 0; i < sphb->nvgpus->num; ++i) {`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];`
		b38b0f	`+`
		b38b0f	`+ /* Skip "slot" without attached GPU */`
		b38b0f	`+ if (!nvslot->gpdev) {`
		b38b0f	`+ continue;`
		b38b0f	`+ }`
		b38b0f	`+ if (dev == nvslot->gpdev) {`
		b38b0f	`+ uint32_t npus[nvslot->linknum];`
		b38b0f	`+`
		b38b0f	`+ for (j = 0; j < nvslot->linknum; ++j) {`
		b38b0f	`+ PCIDevice *npdev = nvslot->links[j].npdev;`
		b38b0f	`+`
		b38b0f	`+ npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));`
		b38b0f	`+ }`
		b38b0f	`+ _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,`
		b38b0f	`+ j * sizeof(npus[0])));`
		b38b0f	`+ _FDT((fdt_setprop_cell(fdt, offset, "phandle",`
		b38b0f	`+ PHANDLE_PCIDEV(sphb, dev))));`
		b38b0f	`+ continue;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ for (j = 0; j < nvslot->linknum; ++j) {`
		b38b0f	`+ if (dev != nvslot->links[j].npdev) {`
		b38b0f	`+ continue;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ _FDT((fdt_setprop_cell(fdt, offset, "phandle",`
		b38b0f	`+ PHANDLE_PCIDEV(sphb, dev))));`
		b38b0f	`+ _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",`
		b38b0f	`+ PHANDLE_PCIDEV(sphb, nvslot->gpdev)));`
		b38b0f	`+ _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",`
		b38b0f	`+ PHANDLE_NVLINK(sphb, i, j))));`
		b38b0f	`+ /*`
		b38b0f	`+ * If we ever want to emulate GPU RAM at the same location as on`
		b38b0f	`+ * the host - here is the encoding GPA->TGT:`
		b38b0f	`+ *`
		b38b0f	`+ * gta = ((sphb->nv2_gpa >> 42) & 0x1) << 42;`
		b38b0f	`+ * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;`
		b38b0f	`+ * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;`
		b38b0f	`+ * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);`
		b38b0f	`+ */`
		b38b0f	`+ _FDT(fdt_setprop_cell(fdt, offset, "memory-region",`
		b38b0f	`+ PHANDLE_GPURAM(sphb, i)));`
		b38b0f	`+ _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",`
		b38b0f	`+ nvslot->tgt));`
		b38b0f	`+ _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",`
		b38b0f	`+ nvslot->links[j].link_speed));`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+}`
		b38b0f	`diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c`
		b38b0f	`index 92457ed..1beedca 100644`
		b38b0f	`--- a/hw/vfio/pci-quirks.c`
		b38b0f	`+++ b/hw/vfio/pci-quirks.c`
		b38b0f	`@@ -1968,3 +1968,134 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)`
		b38b0f
		b38b0f	`return 0;`
		b38b0f	`}`
		b38b0f	`+`
		b38b0f	`+static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,`
		b38b0f	`+ const char *name,`
		b38b0f	`+ void *opaque, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+ uint64_t tgt = (uintptr_t) opaque;`
		b38b0f	`+ visit_type_uint64(v, name, &tgt, errp);`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,`
		b38b0f	`+ const char *name,`
		b38b0f	`+ void *opaque, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+ uint32_t link_speed = (uint32_t)(uintptr_t) opaque;`
		b38b0f	`+ visit_type_uint32(v, name, &link_speed, errp);`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+ int ret;`
		b38b0f	`+ void *p;`
		b38b0f	`+ struct vfio_region_info *nv2reg = NULL;`
		b38b0f	`+ struct vfio_info_cap_header *hdr;`
		b38b0f	`+ struct vfio_region_info_cap_nvlink2_ssatgt *cap;`
		b38b0f	`+ VFIOQuirk *quirk;`
		b38b0f	`+`
		b38b0f	`+ ret = vfio_get_dev_region_info(&vdev->vbasedev,`
		b38b0f	`+ VFIO_REGION_TYPE_PCI_VENDOR_TYPE |`
		b38b0f	`+ PCI_VENDOR_ID_NVIDIA,`
		b38b0f	`+ VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,`
		b38b0f	`+ &nv2reg);`
		b38b0f	`+ if (ret) {`
		b38b0f	`+ return ret;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ hdr = vfio_get_region_info_cap(nv2reg, VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);`
		b38b0f	`+ if (!hdr) {`
		b38b0f	`+ ret = -ENODEV;`
		b38b0f	`+ goto free_exit;`
		b38b0f	`+ }`
		b38b0f	`+ cap = (void *) hdr;`
		b38b0f	`+`
		b38b0f	`+ p = mmap(NULL, nv2reg->size, PROT_READ | PROT_WRITE | PROT_EXEC,`
		b38b0f	`+ MAP_SHARED, vdev->vbasedev.fd, nv2reg->offset);`
		b38b0f	`+ if (p == MAP_FAILED) {`
		b38b0f	`+ ret = -errno;`
		b38b0f	`+ goto free_exit;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ quirk = vfio_quirk_alloc(1);`
		b38b0f	`+ memory_region_init_ram_ptr(&quirk->mem[0], OBJECT(vdev), "nvlink2-mr",`
		b38b0f	`+ nv2reg->size, p);`
		b38b0f	`+ QLIST_INSERT_HEAD(&vdev->bars[0].quirks, quirk, next);`
		b38b0f	`+`
		b38b0f	`+ object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",`
		b38b0f	`+ vfio_pci_nvlink2_get_tgt, NULL, NULL,`
		b38b0f	`+ (void *) (uintptr_t) cap->tgt, NULL);`
		b38b0f	`+ trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,`
		b38b0f	`+ nv2reg->size);`
		b38b0f	`+free_exit:`
		b38b0f	`+ g_free(nv2reg);`
		b38b0f	`+`
		b38b0f	`+ return ret;`
		b38b0f	`+}`
		b38b0f	`+`
		b38b0f	`+int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+ int ret;`
		b38b0f	`+ void *p;`
		b38b0f	`+ struct vfio_region_info *atsdreg = NULL;`
		b38b0f	`+ struct vfio_info_cap_header *hdr;`
		b38b0f	`+ struct vfio_region_info_cap_nvlink2_ssatgt *captgt;`
		b38b0f	`+ struct vfio_region_info_cap_nvlink2_lnkspd *capspeed;`
		b38b0f	`+ VFIOQuirk *quirk;`
		b38b0f	`+`
		b38b0f	`+ ret = vfio_get_dev_region_info(&vdev->vbasedev,`
		b38b0f	`+ VFIO_REGION_TYPE_PCI_VENDOR_TYPE |`
		b38b0f	`+ PCI_VENDOR_ID_IBM,`
		b38b0f	`+ VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,`
		b38b0f	`+ &atsdreg);`
		b38b0f	`+ if (ret) {`
		b38b0f	`+ return ret;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ hdr = vfio_get_region_info_cap(atsdreg,`
		b38b0f	`+ VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);`
		b38b0f	`+ if (!hdr) {`
		b38b0f	`+ ret = -ENODEV;`
		b38b0f	`+ goto free_exit;`
		b38b0f	`+ }`
		b38b0f	`+ captgt = (void *) hdr;`
		b38b0f	`+`
		b38b0f	`+ hdr = vfio_get_region_info_cap(atsdreg,`
		b38b0f	`+ VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);`
		b38b0f	`+ if (!hdr) {`
		b38b0f	`+ ret = -ENODEV;`
		b38b0f	`+ goto free_exit;`
		b38b0f	`+ }`
		b38b0f	`+ capspeed = (void *) hdr;`
		b38b0f	`+`
		b38b0f	`+ /* Some NVLink bridges may not have assigned ATSD */`
		b38b0f	`+ if (atsdreg->size) {`
		b38b0f	`+ p = mmap(NULL, atsdreg->size, PROT_READ | PROT_WRITE | PROT_EXEC,`
		b38b0f	`+ MAP_SHARED, vdev->vbasedev.fd, atsdreg->offset);`
		b38b0f	`+ if (p == MAP_FAILED) {`
		b38b0f	`+ ret = -errno;`
		b38b0f	`+ goto free_exit;`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ quirk = vfio_quirk_alloc(1);`
		b38b0f	`+ memory_region_init_ram_device_ptr(&quirk->mem[0], OBJECT(vdev),`
		b38b0f	`+ "nvlink2-atsd-mr", atsdreg->size, p);`
		b38b0f	`+ QLIST_INSERT_HEAD(&vdev->bars[0].quirks, quirk, next);`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",`
		b38b0f	`+ vfio_pci_nvlink2_get_tgt, NULL, NULL,`
		b38b0f	`+ (void *) (uintptr_t) captgt->tgt, NULL);`
		b38b0f	`+ trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, captgt->tgt,`
		b38b0f	`+ atsdreg->size);`
		b38b0f	`+`
		b38b0f	`+ object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",`
		b38b0f	`+ vfio_pci_nvlink2_get_link_speed, NULL, NULL,`
		b38b0f	`+ (void *) (uintptr_t) capspeed->link_speed, NULL);`
		b38b0f	`+ trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,`
		b38b0f	`+ capspeed->link_speed);`
		b38b0f	`+free_exit:`
		b38b0f	`+ g_free(atsdreg);`
		b38b0f	`+`
		b38b0f	`+ return ret;`
		b38b0f	`+}`
		b38b0f	`diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c`
		b38b0f	`index ba3a393..735dcae 100644`
		b38b0f	`--- a/hw/vfio/pci.c`
		b38b0f	`+++ b/hw/vfio/pci.c`
		b38b0f	`@@ -3078,6 +3078,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)`
		b38b0f	`}`
		b38b0f	`}`
		b38b0f
		b38b0f	`+ if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {`
		b38b0f	`+ ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);`
		b38b0f	`+ if (ret && ret != -ENODEV) {`
		b38b0f	`+ error_report("Failed to setup NVIDIA V100 GPU RAM");`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`+ if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {`
		b38b0f	`+ ret = vfio_pci_nvlink2_init(vdev, errp);`
		b38b0f	`+ if (ret && ret != -ENODEV) {`
		b38b0f	`+ error_report("Failed to setup NVlink2 bridge");`
		b38b0f	`+ }`
		b38b0f	`+ }`
		b38b0f	`+`
		b38b0f	`vfio_register_err_notifier(vdev);`
		b38b0f	`vfio_register_req_notifier(vdev);`
		b38b0f	`vfio_setup_resetfn_quirk(vdev);`
		b38b0f	`diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h`
		b38b0f	`index 629c875..bf07b43 100644`
		b38b0f	`--- a/hw/vfio/pci.h`
		b38b0f	`+++ b/hw/vfio/pci.h`
		b38b0f	`@@ -175,6 +175,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);`
		b38b0f	`int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,`
		b38b0f	`struct vfio_region_info *info,`
		b38b0f	`Error **errp);`
		b38b0f	`+int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);`
		b38b0f	`+int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);`
		b38b0f
		b38b0f	`int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);`
		b38b0f	`void vfio_display_finalize(VFIOPCIDevice *vdev);`
		b38b0f	`diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events`
		b38b0f	`index 9487887..c9a9c14 100644`
		b38b0f	`--- a/hw/vfio/trace-events`
		b38b0f	`+++ b/hw/vfio/trace-events`
		b38b0f	`@@ -84,6 +84,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"`
		b38b0f	`vfio_pci_igd_host_bridge_enabled(const char *name) "%s"`
		b38b0f	`vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"`
		b38b0f
		b38b0f	`+vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64`
		b38b0f	`+vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64`
		b38b0f	`+vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"`
		b38b0f	`+`
		b38b0f	`# hw/vfio/common.c`
		b38b0f	`vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"`
		b38b0f	`vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64`
		b38b0f	`diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h`
		b38b0f	`index 0fae4fc..cd29c59 100644`
		b38b0f	`--- a/include/hw/pci-host/spapr.h`
		b38b0f	`+++ b/include/hw/pci-host/spapr.h`
		b38b0f	`@@ -24,6 +24,7 @@`
		b38b0f	`#include "hw/pci/pci.h"`
		b38b0f	`#include "hw/pci/pci_host.h"`
		b38b0f	`#include "hw/ppc/xics.h"`
		b38b0f	`+#include "qemu/units.h"`
		b38b0f
		b38b0f	`#define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"`
		b38b0f
		b38b0f	`@@ -87,6 +88,9 @@ struct sPAPRPHBState {`
		b38b0f	`uint32_t mig_liobn;`
		b38b0f	`hwaddr mig_mem_win_addr, mig_mem_win_size;`
		b38b0f	`hwaddr mig_io_win_addr, mig_io_win_size;`
		b38b0f	`+ hwaddr nv2_gpa_win_addr;`
		b38b0f	`+ hwaddr nv2_atsd_win_addr;`
		b38b0f	`+ struct spapr_phb_pci_nvgpu_config *nvgpus;`
		b38b0f	`};`
		b38b0f
		b38b0f	`#define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL`
		b38b0f	`@@ -104,6 +108,22 @@ struct sPAPRPHBState {`
		b38b0f
		b38b0f	`#define SPAPR_PCI_MSI_WINDOW 0x40000000000ULL`
		b38b0f
		b38b0f	`+#define SPAPR_PCI_NV2RAM64_WIN_BASE SPAPR_PCI_LIMIT`
		b38b0f	`+#define SPAPR_PCI_NV2RAM64_WIN_SIZE (2 * TiB) /* For up to 6 GPUs 256GB each */`
		b38b0f	`+`
		b38b0f	`+/* Max number of these GPUs per physical box */`
		b38b0f	`+#define NVGPU_MAX_NUM 6`
		b38b0f	`+/* Max number of NVLinks per GPU in any physical box */`
		b38b0f	`+#define NVGPU_MAX_LINKS 3`
		b38b0f	`+`
		b38b0f	`+/*`
		b38b0f	`+ * GPU RAM starts at 64TiB so huge DMA window to cover it all ends at 128TiB`
		b38b0f	`+ * which is enough. We do not need DMA for ATSD so we put them at 128TiB.`
		b38b0f	`+ */`
		b38b0f	`+#define SPAPR_PCI_NV2ATSD_WIN_BASE (128 * TiB)`
		b38b0f	`+#define SPAPR_PCI_NV2ATSD_WIN_SIZE (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * \`
		b38b0f	`+ 64 * KiB)`
		b38b0f	`+`
		b38b0f	`static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)`
		b38b0f	`{`
		b38b0f	`sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());`
		b38b0f	`@@ -135,6 +155,13 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);`
		b38b0f	`int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);`
		b38b0f	`int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);`
		b38b0f	`void spapr_phb_vfio_reset(DeviceState *qdev);`
		b38b0f	`+void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp);`
		b38b0f	`+void spapr_phb_nvgpu_free(sPAPRPHBState *sphb);`
		b38b0f	`+void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off,`
		b38b0f	`+ Error **errp);`
		b38b0f	`+void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);`
		b38b0f	`+void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,`
		b38b0f	`+ sPAPRPHBState *sphb);`
		b38b0f	`#else`
		b38b0f	`static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)`
		b38b0f	`{`
		b38b0f	`@@ -161,6 +188,25 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)`
		b38b0f	`static inline void spapr_phb_vfio_reset(DeviceState *qdev)`
		b38b0f	`{`
		b38b0f	`}`
		b38b0f	`+static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+}`
		b38b0f	`+static inline void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)`
		b38b0f	`+{`
		b38b0f	`+}`
		b38b0f	`+static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,`
		b38b0f	`+ int bus_off, Error **errp)`
		b38b0f	`+{`
		b38b0f	`+}`
		b38b0f	`+static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,`
		b38b0f	`+ void *fdt)`
		b38b0f	`+{`
		b38b0f	`+}`
		b38b0f	`+static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,`
		b38b0f	`+ int offset,`
		b38b0f	`+ sPAPRPHBState *sphb)`
		b38b0f	`+{`
		b38b0f	`+}`
		b38b0f	`#endif`
		b38b0f
		b38b0f	`void spapr_phb_dma_reset(sPAPRPHBState *sphb);`
		b38b0f	`diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h`
		b38b0f	`index beb42bc..72cfa49 100644`
		b38b0f	`--- a/include/hw/ppc/spapr.h`
		b38b0f	`+++ b/include/hw/ppc/spapr.h`
		b38b0f	`@@ -104,7 +104,8 @@ struct sPAPRMachineClass {`
		b38b0f	`void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,`
		b38b0f	`uint64_t *buid, hwaddr *pio,`
		b38b0f	`hwaddr *mmio32, hwaddr *mmio64,`
		b38b0f	`- unsigned n_dma, uint32_t *liobns, Error **errp);`
		b38b0f	`+ unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,`
		b38b0f	`+ hwaddr *nv2atsd, Error **errp);`
		b38b0f	`sPAPRResizeHPT resize_hpt_default;`
		b38b0f	`sPAPRCapabilities default_caps;`
		b38b0f	`};`
		b38b0f	`@@ -171,6 +172,8 @@ struct sPAPRMachineState {`
		b38b0f
		b38b0f	`bool cmd_line_caps[SPAPR_CAP_NUM];`
		b38b0f	`sPAPRCapabilities def, eff, mig;`
		b38b0f	`+`
		b38b0f	`+ unsigned gpu_numa_id;`
		b38b0f	`};`
		b38b0f
		b38b0f	`#define H_SUCCESS 0`
		b38b0f	`--`
		b38b0f	`1.8.3.1`
		b38b0f
