SOURCES/kvm-spapr-Support-NVIDIA-V100-GPU-with-NVLink2.patch

From 5dc7b745eb04e799b95e7e8d17868970a65621df Mon Sep 17 00:00:00 2001
From: David Gibson <dgibson@redhat.com>
Date: Thu, 30 May 2019 04:37:28 +0100
Subject: [PATCH 7/8] spapr: Support NVIDIA V100 GPU with NVLink2

RH-Author: David Gibson <dgibson@redhat.com>
Message-id: <20190530043728.32575-7-dgibson@redhat.com>
Patchwork-id: 88423
O-Subject: [RHEL-8.1 qemu-kvm PATCH 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
Bugzilla: 1710662
RH-Acked-by: Laurent Vivier <lvivier@redhat.com>
RH-Acked-by: Auger Eric <eric.auger@redhat.com>
RH-Acked-by: Cornelia Huck <cohuck@redhat.com>

From: Alexey Kardashevskiy <aik@ozlabs.ru>

NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which include an ATS shootdown (ATSD) register exported via the NVLink
bridge device.

This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.

This adds additional steps to sPAPR PHB setup:

1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;

2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-region", "ibm,nvlink-speed" to advertise the NVLink2 function
to the guest;

3. Add "ibm,mmio-atsd" to vPHB to advertise the ATSD capability;

4. Add new memory blocks (with extra "linux,usable-memory" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses it for link discovery.

This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.
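
Editorial illustration, not part of the original commit: a minimal sketch
of the per-vPHB window arithmetic described above. The EXAMPLE_* bases
and sizes are hypothetical placeholders (the real SPAPR_PCI_NV2* constants
live in include/hw/pci-host/spapr.h), and example_nv2_placement() is an
invented helper that only mirrors the "base + index * size" pattern and
the zeroing done for legacy machine types.

```c
#include <stdint.h>

/* Hypothetical window layout, for illustration only. */
#define EXAMPLE_NV2RAM64_WIN_BASE (1ULL << 50)
#define EXAMPLE_NV2RAM64_WIN_SIZE (1ULL << 41) /* 2 TiB per vPHB */
#define EXAMPLE_NV2ATSD_WIN_BASE  (1ULL << 49)
#define EXAMPLE_NV2ATSD_WIN_SIZE  (1ULL << 16) /* 64 KiB per vPHB */

struct example_nv2_windows {
    uint64_t gpa;  /* start of the GPU RAM window for this vPHB */
    uint64_t atsd; /* start of the ATSD window for this vPHB */
};

/*
 * Hypothetical helper: each vPHB gets its own fixed-size slice of both
 * windows; legacy machine types get zero, which leaves NVLink2 disabled.
 */
static struct example_nv2_windows example_nv2_placement(uint32_t index,
                                                        int legacy)
{
    struct example_nv2_windows w = { 0, 0 };

    if (!legacy) {
        w.gpa = EXAMPLE_NV2RAM64_WIN_BASE +
                (uint64_t)index * EXAMPLE_NV2RAM64_WIN_SIZE;
        w.atsd = EXAMPLE_NV2ATSD_WIN_BASE +
                 (uint64_t)index * EXAMPLE_NV2ATSD_WIN_SIZE;
    }
    return w;
}
```

With a zero window address, spapr_phb_nvgpu_setup() returns early, which
is how older machine types skip the GPU and NPU search.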

This puts new memory nodes in a separate NUMA node as the GPU RAM
needs to be configured equally distant from any other node in the system.
Unlike the host setup which assigns NUMA ids from 255 downwards, this
adds new NUMA nodes after the user-configured nodes, or from 1 if none
were configured.
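
Editorial illustration, not part of the original commit: a small sketch
of the id assignment described above, assuming one NUMA node per GPU.
example_assign_gpu_numa_ids() is a hypothetical helper; in the patch the
counter is spapr->gpu_numa_id, reset to MAX(1, nb_numa_nodes) on machine
reset and incremented once per collected GPU.

```c
#include <stddef.h>

/*
 * Hypothetical helper: hand out one guest NUMA id per GPU, starting
 * right after the user-configured nodes, or at 1 if there are none.
 * Returns the next unused id, i.e. the final value of the counter.
 */
static unsigned example_assign_gpu_numa_ids(unsigned nb_numa_nodes,
                                            unsigned *ids, size_t n_gpus)
{
    unsigned next = nb_numa_nodes > 1 ? nb_numa_nodes : 1;
    size_t i;

    for (i = 0; i < n_gpus; ++i) {
        ids[i] = next++;
    }
    return next;
}
```

The returned value corresponds to the counter that the spapr.c hunk
writes into the last cell of the max-associativity-domains array.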

This adds a requirement similar to EEH - one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
It is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPUs in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for vfio portions]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit ec132efaa81f09861a3bd6afad94827e74543b3f)

Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>

Conflicts:
	hw/ppc/spapr.c
	hw/ppc/spapr_pci.c
	hw/vfio/trace-events
	include/hw/pci-host/spapr.h
	include/hw/ppc/spapr.h

Conflicts come for several reasons:
  1) Some contextual conflicts
  2) Downstream tree does not have PHB hotplug, so upstream changes to
     that code need to be dropped; we also need to adapt some hunks to
     apply to the code as it existed before PHB hotplug was added
  3) Upstream had a mass renaming of spapr types to give more
     consistent CamelCasing.  We don't have that change downstream, so
     we need to adjust accordingly.
  4) We add an explicit include of qemu/units.h, since it's not indirectly
     included downstream (and it's messy to backport the patch which adds
     that)

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1710662

Signed-off-by: David Gibson <dgibson@redhat.com>
Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>
---
 hw/ppc/Makefile.objs        |   2 +-
 hw/ppc/spapr.c              |  31 ++-
 hw/ppc/spapr_pci.c          |  21 ++-
 hw/ppc/spapr_pci_nvlink2.c  | 450 ++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci-quirks.c        | 131 +++++++++++++
 hw/vfio/pci.c               |  14 ++
 hw/vfio/pci.h               |   2 +
 hw/vfio/trace-events        |   4 +
 include/hw/pci-host/spapr.h |  46 +++++
 include/hw/ppc/spapr.h      |   5 +-
 10 files changed, 697 insertions(+), 9 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_nvlink2.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index a46a989..d07e999 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -8,7 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o
 # IBM PowerNV
 obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
-obj-y += spapr_pci_vfio.o
+obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
 endif
 obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b57c0be..c72aad1 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -910,12 +910,13 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
         0, cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE),
         cpu_to_be32(max_cpus / smp_threads),
     };
+    uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
     uint32_t maxdomains[] = {
         cpu_to_be32(4),
-        cpu_to_be32(0),
-        cpu_to_be32(0),
-        cpu_to_be32(0),
-        cpu_to_be32(nb_numa_nodes ? nb_numa_nodes : 1),
+        maxdomain,
+        maxdomain,
+        maxdomain,
+        cpu_to_be32(spapr->gpu_numa_id),
     };
 
     _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
@@ -1515,6 +1516,16 @@ static void spapr_machine_reset(void)
         ppc_set_compat(first_ppc_cpu, spapr->max_compat_pvr, &error_fatal);
     }
 
+    /*
+     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
+     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
+     * called from vPHB reset handler so we initialize the counter here.
+     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
+     * must be equally distant from any other node.
+     * The final value of spapr->gpu_numa_id is going to be written to
+     * max-associativity-domains in spapr_build_fdt().
+     */
+    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
     qemu_devices_reset();
 
     /* DRC reset may cause a device to be unplugged. This will cause troubles
@@ -3601,7 +3612,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
 static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
                                 uint64_t *buid, hwaddr *pio,
                                 hwaddr *mmio32, hwaddr *mmio64,
-                                unsigned n_dma, uint32_t *liobns, Error **errp)
+                                unsigned n_dma, uint32_t *liobns,
+                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
 {
     /*
      * New-style PHB window placement.
@@ -3648,6 +3660,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
     *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
     *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
     *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
+
+    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
+    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
 }
 
 static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
@@ -4133,7 +4148,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
 static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
                               uint64_t *buid, hwaddr *pio,
                               hwaddr *mmio32, hwaddr *mmio64,
-                              unsigned n_dma, uint32_t *liobns, Error **errp)
+                              unsigned n_dma, uint32_t *liobns,
+                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
 {
     /* Legacy PHB placement for pseries-2.7 and earlier machine types */
     const uint64_t base_buid = 0x800000020000000ULL;
@@ -4177,6 +4193,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
      * fallback behaviour of automatically splitting a large "32-bit"
      * window into contiguous 32-bit and 64-bit windows
      */
+
+    *nv2gpa = 0;
+    *nv2atsd = 0;
 }
 
 #if 0 /* Disabled for Red Hat Enterprise Linux */
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index f936ce6..d82f957 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1326,6 +1326,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
     if (sphb->pcie_ecs && pci_is_express(dev)) {
         _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
     }
+
+    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
 }
 
 /* create OF node for pci device and required OF DT properties */
@@ -1559,7 +1561,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         smc->phb_placement(spapr, sphb->index,
                            &sphb->buid, &sphb->io_win_addr,
                            &sphb->mem_win_addr, &sphb->mem64_win_addr,
-                           windows_supported, sphb->dma_liobn, &local_err);
+                           windows_supported, sphb->dma_liobn,
+                           &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
+                           &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -1764,8 +1768,14 @@ void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 static void spapr_phb_reset(DeviceState *qdev)
 {
     sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+    Error *errp = NULL;
 
     spapr_phb_dma_reset(sphb);
+    spapr_phb_nvgpu_free(sphb);
+    spapr_phb_nvgpu_setup(sphb, &errp);
+    if (errp) {
+        error_report_err(errp);
+    }
 
     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
@@ -1798,6 +1808,8 @@ static Property spapr_phb_properties[] = {
                      pre_2_8_migration, false),
     DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
                      pcie_ecs, true),
+    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
+    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -2089,6 +2101,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
+    Error *errp = NULL;
 
     /* Start populating the FDT */
     nodename = g_strdup_printf("pci@%" PRIx64, phb->buid);
@@ -2170,6 +2183,12 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
         return ret;
     }
 
+    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off, &errp);
+    if (errp) {
+        error_report_err(errp);
+    }
+    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
+
     return 0;
 }
 
diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
new file mode 100644
index 0000000..60b14d8
--- /dev/null
+++ b/hw/ppc/spapr_pci_nvlink2.c
@@ -0,0 +1,450 @@
+/*
+ * QEMU sPAPR PCI for NVLink2 pass through
+ *
+ * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "hw/pci/pci.h"
+#include "hw/pci-host/spapr.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/fdt.h"
+#include "hw/pci/pci_bridge.h"
+
+#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
+                                     (((phb)->index) << 16) | ((pdev)->devfn))
+#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
+                                     (((phb)->index) << 16))
+#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
+                                     ((gn) << 4) | (nn))
+
+#define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
+
+struct spapr_phb_pci_nvgpu_config {
+    uint64_t nv2_ram_current;
+    uint64_t nv2_atsd_current;
+    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
+    struct spapr_phb_pci_nvgpu_slot {
+        uint64_t tgt;
+        uint64_t gpa;
+        unsigned numa_id;
+        PCIDevice *gpdev;
+        int linknum;
+        struct {
+            uint64_t atsd_gpa;
+            PCIDevice *npdev;
+            uint32_t link_speed;
+        } links[NVGPU_MAX_LINKS];
+    } slots[NVGPU_MAX_NUM];
+    Error *errp;
+};
+
+static struct spapr_phb_pci_nvgpu_slot *
+spapr_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus, uint64_t tgt)
+{
+    int i;
+
+    /* Search for partially collected "slot" */
+    for (i = 0; i < nvgpus->num; ++i) {
+        if (nvgpus->slots[i].tgt == tgt) {
+            return &nvgpus->slots[i];
+        }
+    }
+
+    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
+        return NULL;
+    }
+
+    i = nvgpus->num;
+    nvgpus->slots[i].tgt = tgt;
+    ++nvgpus->num;
+
+    return &nvgpus->slots[i];
+}
+
+static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
+                                    PCIDevice *pdev, uint64_t tgt,
+                                    MemoryRegion *mr, Error **errp)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    sPAPRMachineState *spapr = SPAPR_MACHINE(machine);
+    struct spapr_phb_pci_nvgpu_slot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
+
+    if (!nvslot) {
+        error_setg(errp, "Found too many GPUs per vPHB");
+        return;
+    }
+    g_assert(!nvslot->gpdev);
+    nvslot->gpdev = pdev;
+
+    nvslot->gpa = nvgpus->nv2_ram_current;
+    nvgpus->nv2_ram_current += memory_region_size(mr);
+    nvslot->numa_id = spapr->gpu_numa_id;
+    ++spapr->gpu_numa_id;
+}
+
+static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
+                                    PCIDevice *pdev, uint64_t tgt,
+                                    MemoryRegion *mr, Error **errp)
+{
+    struct spapr_phb_pci_nvgpu_slot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
+    int j;
+
+    if (!nvslot) {
+        error_setg(errp, "Found too many NVLink bridges per vPHB");
+        return;
+    }
+
+    j = nvslot->linknum;
+    if (j == ARRAY_SIZE(nvslot->links)) {
+        error_setg(errp, "Found too many NVLink bridges per GPU");
+        return;
+    }
+    ++nvslot->linknum;
+
+    g_assert(!nvslot->links[j].npdev);
+    nvslot->links[j].npdev = pdev;
+    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
+    nvgpus->nv2_atsd_current += memory_region_size(mr);
+    nvslot->links[j].link_speed =
+        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
+}
+
+static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
+                                        void *opaque)
+{
+    PCIBus *sec_bus;
+    Object *po = OBJECT(pdev);
+    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
+
+    if (tgt) {
+        Error *local_err = NULL;
+        struct spapr_phb_pci_nvgpu_config *nvgpus = opaque;
+        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
+        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
+                                                  NULL);
+
+        g_assert(mr_gpu || mr_npu);
+        if (mr_gpu) {
+            spapr_pci_collect_nvgpu(nvgpus, pdev, tgt, MEMORY_REGION(mr_gpu),
+                                    &local_err);
+        } else {
+            spapr_pci_collect_nvnpu(nvgpus, pdev, tgt, MEMORY_REGION(mr_npu),
+                                    &local_err);
+        }
+        error_propagate(&nvgpus->errp, local_err);
+    }
+    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
+         PCI_HEADER_TYPE_BRIDGE)) {
+        return;
+    }
+
+    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
+    if (!sec_bus) {
+        return;
+    }
+
+    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
+                        spapr_phb_pci_collect_nvgpu, opaque);
+}
+
+void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp)
+{
+    int i, j, valid_gpu_num;
+    PCIBus *bus;
+
+    /* Search for GPUs and NPUs */
+    if (!sphb->nv2_gpa_win_addr || !sphb->nv2_atsd_win_addr) {
+        return;
+    }
+
+    sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
+    sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
+    sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
+
+    bus = PCI_HOST_BRIDGE(sphb)->bus;
+    pci_for_each_device(bus, pci_bus_num(bus),
+                        spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
+
+    if (sphb->nvgpus->errp) {
+        error_propagate(errp, sphb->nvgpus->errp);
+        sphb->nvgpus->errp = NULL;
+        goto cleanup_exit;
+    }
+
+    /* Add found GPU RAM and ATSD MRs if found */
+    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
+        Object *nvmrobj;
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+
+        if (!nvslot->gpdev) {
+            continue;
+        }
+        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                           "nvlink2-mr[0]", NULL);
+        /* ATSD is pointless without GPU RAM MR so skip those */
+        if (!nvmrobj) {
+            continue;
+        }
+
+        ++valid_gpu_num;
+        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
+                                    MEMORY_REGION(nvmrobj));
+
+        for (j = 0; j < nvslot->linknum; ++j) {
+            Object *atsdmrobj;
+
+            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
+                                                 "nvlink2-atsd-mr[0]", NULL);
+            if (!atsdmrobj) {
+                continue;
+            }
+            memory_region_add_subregion(get_system_memory(),
+                                        nvslot->links[j].atsd_gpa,
+                                        MEMORY_REGION(atsdmrobj));
+        }
+    }
+
+    if (valid_gpu_num) {
+        return;
+    }
+    /* We did not find any interesting GPU */
+cleanup_exit:
+    g_free(sphb->nvgpus);
+    sphb->nvgpus = NULL;
+}
+
+void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)
+{
+    int i, j;
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    for (i = 0; i < sphb->nvgpus->num; ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                                    "nvlink2-mr[0]", NULL);
+
+        if (nv_mrobj) {
+            memory_region_del_subregion(get_system_memory(),
+                                        MEMORY_REGION(nv_mrobj));
+        }
+        for (j = 0; j < nvslot->linknum; ++j) {
+            PCIDevice *npdev = nvslot->links[j].npdev;
+            Object *atsd_mrobj;
+            atsd_mrobj = object_property_get_link(OBJECT(npdev),
+                                                  "nvlink2-atsd-mr[0]", NULL);
+            if (atsd_mrobj) {
+                memory_region_del_subregion(get_system_memory(),
+                                            MEMORY_REGION(atsd_mrobj));
+            }
+        }
+    }
+    g_free(sphb->nvgpus);
+    sphb->nvgpus = NULL;
+}
+
+void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off,
+                                 Error **errp)
+{
+    int i, j, atsdnum = 0;
+    uint64_t atsd[8]; /* The existing limitation of known guests */
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+
+        if (!nvslot->gpdev) {
+            continue;
+        }
+        for (j = 0; j < nvslot->linknum; ++j) {
+            if (!nvslot->links[j].atsd_gpa) {
+                continue;
+            }
+
+            if (atsdnum == ARRAY_SIZE(atsd)) {
+                error_report("Only %"PRIuPTR" ATSD registers supported",
+                             ARRAY_SIZE(atsd));
+                break;
+            }
+            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
+            ++atsdnum;
+        }
+    }
+
+    if (!atsdnum) {
+        error_setg(errp, "No ATSD registers found");
+        return;
+    }
+
+    if (!spapr_phb_eeh_available(sphb)) {
+        /*
+         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
+         * which we do not emulate as a separate device. Instead we put
+         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
+         * put GPUs from different IOMMU groups to the same vPHB to ensure
+         * that the guest will use ATSDs from the corresponding NPU.
+         */
+        error_setg(errp, "ATSD requires separate vPHB per GPU IOMMU group");
+        return;
+    }
+
+    _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd", atsd,
+                      atsdnum * sizeof(atsd[0]))));
+}
+
+void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
+{
+    int i, j, linkidx, npuoff;
+    char *npuname;
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    npuname = g_strdup_printf("npuphb%d", sphb->index);
+    npuoff = fdt_add_subnode(fdt, 0, npuname);
+    _FDT(npuoff);
+    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
+    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
+    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
+    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
+    g_free(npuname);
+
+    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
+        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
+            char *linkname = g_strdup_printf("link@%d", linkidx);
+            int off = fdt_add_subnode(fdt, npuoff, linkname);
+
+            _FDT(off);
+            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx))); */
+            _FDT((fdt_setprop_string(fdt, off, "compatible",
+                                     "ibm,npu-link")));
+            _FDT((fdt_setprop_cell(fdt, off, "phandle",
+                                   PHANDLE_NVLINK(sphb, i, j))));
+            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
+            g_free(linkname);
+            ++linkidx;
+        }
+    }
+
+    /* Add memory nodes for GPU RAM and mark them unusable */
+    for (i = 0; i < sphb->nvgpus->num; ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                                    "nvlink2-mr[0]", NULL);
+        uint32_t associativity[] = {
+            cpu_to_be32(0x4),
+            SPAPR_GPU_NUMA_ID,
+            SPAPR_GPU_NUMA_ID,
+            SPAPR_GPU_NUMA_ID,
+            cpu_to_be32(nvslot->numa_id)
+        };
+        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
+        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
+        char *mem_name = g_strdup_printf("memory@%"PRIx64, nvslot->gpa);
+        int off = fdt_add_subnode(fdt, 0, mem_name);
+
+        _FDT(off);
+        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
+        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
+        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
+                          sizeof(associativity))));
+
+        _FDT((fdt_setprop_string(fdt, off, "compatible",
+                                 "ibm,coherent-device-memory")));
+
+        mem_reg[1] = cpu_to_be64(0);
+        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
+                          sizeof(mem_reg))));
+        _FDT((fdt_setprop_cell(fdt, off, "phandle",
+                               PHANDLE_GPURAM(sphb, i))));
+        g_free(mem_name);
+    }
+
+}
+
+void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
+                                        sPAPRPHBState *sphb)
+{
+    int i, j;
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    for (i = 0; i < sphb->nvgpus->num; ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+
+        /* Skip "slot" without attached GPU */
+        if (!nvslot->gpdev) {
+            continue;
+        }
+        if (dev == nvslot->gpdev) {
+            uint32_t npus[nvslot->linknum];
+
+            for (j = 0; j < nvslot->linknum; ++j) {
+                PCIDevice *npdev = nvslot->links[j].npdev;
+
+                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
+            }
+            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
+                             j * sizeof(npus[0])));
+            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
+                                   PHANDLE_PCIDEV(sphb, dev))));
+            continue;
+        }
+
+        for (j = 0; j < nvslot->linknum; ++j) {
+            if (dev != nvslot->links[j].npdev) {
+                continue;
+            }
+
+            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
+                                   PHANDLE_PCIDEV(sphb, dev))));
+            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
+                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
+            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
+                                   PHANDLE_NVLINK(sphb, i, j))));
+            /*
+             * If we ever want to emulate GPU RAM at the same location as on
+             * the host - here is the encoding GPA->TGT:
+             *
+             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
+             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
+             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
+             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
+             */
+            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
+                                  PHANDLE_GPURAM(sphb, i)));
+            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
+                                 nvslot->tgt));
+            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
+                                  nvslot->links[j].link_speed));
+        }
+    }
+}
b38b0f
diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
b38b0f
index 92457ed..1beedca 100644
b38b0f
--- a/hw/vfio/pci-quirks.c
b38b0f
+++ b/hw/vfio/pci-quirks.c
b38b0f
@@ -1968,3 +1968,134 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
b38b0f
 
b38b0f
     return 0;
b38b0f
 }
b38b0f
+
b38b0f
+static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
b38b0f
+                                     const char *name,
b38b0f
+                                     void *opaque, Error **errp)
b38b0f
+{
b38b0f
+    uint64_t tgt = (uintptr_t) opaque;
b38b0f
+    visit_type_uint64(v, name, &tgt, errp);
b38b0f
+}
b38b0f
+
b38b0f
+static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
b38b0f
+                                                 const char *name,
b38b0f
+                                                 void *opaque, Error **errp)
b38b0f
+{
b38b0f
+    uint32_t link_speed = (uint32_t)(uintptr_t) opaque;
b38b0f
+    visit_type_uint32(v, name, &link_speed, errp);
b38b0f
+}
b38b0f
+
b38b0f
+int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
b38b0f
+{
b38b0f
+    int ret;
b38b0f
+    void *p;
b38b0f
+    struct vfio_region_info *nv2reg = NULL;
b38b0f
+    struct vfio_info_cap_header *hdr;
b38b0f
+    struct vfio_region_info_cap_nvlink2_ssatgt *cap;
b38b0f
+    VFIOQuirk *quirk;
b38b0f
+
b38b0f
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
b38b0f
+                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
b38b0f
+                                   PCI_VENDOR_ID_NVIDIA,
b38b0f
+                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
b38b0f
+                                   &nv2reg);
b38b0f
+    if (ret) {
b38b0f
+        return ret;
b38b0f
+    }
b38b0f
+
b38b0f
+    hdr = vfio_get_region_info_cap(nv2reg, VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
b38b0f
+    if (!hdr) {
b38b0f
+        ret = -ENODEV;
b38b0f
+        goto free_exit;
b38b0f
+    }
b38b0f
+    cap = (void *) hdr;
b38b0f
+
b38b0f
+    p = mmap(NULL, nv2reg->size, PROT_READ | PROT_WRITE | PROT_EXEC,
b38b0f
+             MAP_SHARED, vdev->vbasedev.fd, nv2reg->offset);
b38b0f
+    if (p == MAP_FAILED) {
b38b0f
+        ret = -errno;
b38b0f
+        goto free_exit;
b38b0f
+    }
b38b0f
+
b38b0f
+    quirk = vfio_quirk_alloc(1);
b38b0f
+    memory_region_init_ram_ptr(&quirk->mem[0], OBJECT(vdev), "nvlink2-mr",
b38b0f
+                               nv2reg->size, p);
b38b0f
+    QLIST_INSERT_HEAD(&vdev->bars[0].quirks, quirk, next);
b38b0f
+
b38b0f
+    object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
b38b0f
+                        vfio_pci_nvlink2_get_tgt, NULL, NULL,
b38b0f
+                        (void *) (uintptr_t) cap->tgt, NULL);
b38b0f
+    trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
b38b0f
+                                          nv2reg->size);
b38b0f
+free_exit:
b38b0f
+    g_free(nv2reg);
b38b0f
+
b38b0f
+    return ret;
b38b0f
+}
b38b0f
+
b38b0f
+int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
b38b0f
+{
b38b0f
+    int ret;
b38b0f
+    void *p;
b38b0f
+    struct vfio_region_info *atsdreg = NULL;
b38b0f
+    struct vfio_info_cap_header *hdr;
b38b0f
+    struct vfio_region_info_cap_nvlink2_ssatgt *captgt;
b38b0f
+    struct vfio_region_info_cap_nvlink2_lnkspd *capspeed;
b38b0f
+    VFIOQuirk *quirk;
b38b0f
+
b38b0f
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
b38b0f
+                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
b38b0f
+                                   PCI_VENDOR_ID_IBM,
b38b0f
+                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
b38b0f
+                                   &atsdreg);
b38b0f
+    if (ret) {
b38b0f
+        return ret;
b38b0f
+    }
b38b0f
+
b38b0f
+    hdr = vfio_get_region_info_cap(atsdreg,
b38b0f
+                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
b38b0f
+    if (!hdr) {
b38b0f
+        ret = -ENODEV;
b38b0f
+        goto free_exit;
b38b0f
+    }
b38b0f
+    captgt = (void *) hdr;
b38b0f
+
b38b0f
+    hdr = vfio_get_region_info_cap(atsdreg,
b38b0f
+                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
b38b0f
+    if (!hdr) {
b38b0f
+        ret = -ENODEV;
b38b0f
+        goto free_exit;
b38b0f
+    }
b38b0f
+    capspeed = (void *) hdr;
b38b0f
+
b38b0f
+    /* Some NVLink bridges may not have an ATSD assigned */
b38b0f
+    if (atsdreg->size) {
b38b0f
+        p = mmap(NULL, atsdreg->size, PROT_READ | PROT_WRITE | PROT_EXEC,
b38b0f
+                 MAP_SHARED, vdev->vbasedev.fd, atsdreg->offset);
b38b0f
+        if (p == MAP_FAILED) {
b38b0f
+            ret = -errno;
b38b0f
+            goto free_exit;
b38b0f
+        }
b38b0f
+
b38b0f
+        quirk = vfio_quirk_alloc(1);
b38b0f
+        memory_region_init_ram_device_ptr(&quirk->mem[0], OBJECT(vdev),
b38b0f
+                                          "nvlink2-atsd-mr", atsdreg->size, p);
b38b0f
+        QLIST_INSERT_HEAD(&vdev->bars[0].quirks, quirk, next);
b38b0f
+    }
b38b0f
+
b38b0f
+    object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
b38b0f
+                        vfio_pci_nvlink2_get_tgt, NULL, NULL,
b38b0f
+                        (void *) (uintptr_t) captgt->tgt, NULL);
b38b0f
+    trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, captgt->tgt,
b38b0f
+                                              atsdreg->size);
b38b0f
+
b38b0f
+    object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
b38b0f
+                        vfio_pci_nvlink2_get_link_speed, NULL, NULL,
b38b0f
+                        (void *) (uintptr_t) capspeed->link_speed, NULL);
b38b0f
+    trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
b38b0f
+                                              capspeed->link_speed);
b38b0f
+free_exit:
b38b0f
+    g_free(atsdreg);
b38b0f
+
b38b0f
+    return ret;
b38b0f
+}
b38b0f
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
b38b0f
index ba3a393..735dcae 100644
b38b0f
--- a/hw/vfio/pci.c
b38b0f
+++ b/hw/vfio/pci.c
b38b0f
@@ -3078,6 +3078,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
b38b0f
         }
b38b0f
     }
b38b0f
 
b38b0f
+    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
b38b0f
+        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
b38b0f
+        if (ret && ret != -ENODEV) {
b38b0f
+            error_report("Failed to setup NVIDIA V100 GPU RAM");
b38b0f
+        }
b38b0f
+    }
b38b0f
+
b38b0f
+    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
b38b0f
+        ret = vfio_pci_nvlink2_init(vdev, errp);
b38b0f
+        if (ret && ret != -ENODEV) {
b38b0f
+            error_report("Failed to setup NVLink2 bridge");
b38b0f
+        }
b38b0f
+    }
b38b0f
+
b38b0f
     vfio_register_err_notifier(vdev);
b38b0f
     vfio_register_req_notifier(vdev);
b38b0f
     vfio_setup_resetfn_quirk(vdev);
b38b0f
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
b38b0f
index 629c875..bf07b43 100644
b38b0f
--- a/hw/vfio/pci.h
b38b0f
+++ b/hw/vfio/pci.h
b38b0f
@@ -175,6 +175,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
b38b0f
 int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
b38b0f
                                struct vfio_region_info *info,
b38b0f
                                Error **errp);
b38b0f
+int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
b38b0f
+int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
b38b0f
 
b38b0f
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
b38b0f
 void vfio_display_finalize(VFIOPCIDevice *vdev);
b38b0f
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
b38b0f
index 9487887..c9a9c14 100644
b38b0f
--- a/hw/vfio/trace-events
b38b0f
+++ b/hw/vfio/trace-events
b38b0f
@@ -84,6 +84,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
b38b0f
 vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
b38b0f
 vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
b38b0f
 
b38b0f
+vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
b38b0f
+vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
b38b0f
+vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
b38b0f
+
b38b0f
 # hw/vfio/common.c
b38b0f
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
b38b0f
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
b38b0f
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
b38b0f
index 0fae4fc..cd29c59 100644
b38b0f
--- a/include/hw/pci-host/spapr.h
b38b0f
+++ b/include/hw/pci-host/spapr.h
b38b0f
@@ -24,6 +24,7 @@
b38b0f
 #include "hw/pci/pci.h"
b38b0f
 #include "hw/pci/pci_host.h"
b38b0f
 #include "hw/ppc/xics.h"
b38b0f
+#include "qemu/units.h"
b38b0f
 
b38b0f
 #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
b38b0f
 
b38b0f
@@ -87,6 +88,9 @@ struct sPAPRPHBState {
b38b0f
     uint32_t mig_liobn;
b38b0f
     hwaddr mig_mem_win_addr, mig_mem_win_size;
b38b0f
     hwaddr mig_io_win_addr, mig_io_win_size;
b38b0f
+    hwaddr nv2_gpa_win_addr;
b38b0f
+    hwaddr nv2_atsd_win_addr;
b38b0f
+    struct spapr_phb_pci_nvgpu_config *nvgpus;
b38b0f
 };
b38b0f
 
b38b0f
 #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
b38b0f
@@ -104,6 +108,22 @@ struct sPAPRPHBState {
b38b0f
 
b38b0f
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
b38b0f
 
b38b0f
+#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
b38b0f
+#define SPAPR_PCI_NV2RAM64_WIN_SIZE  (2 * TiB) /* For up to 6 GPUs 256GB each */
b38b0f
+
b38b0f
+/* Max number of these GPUs per physical box */
b38b0f
+#define NVGPU_MAX_NUM                6
b38b0f
+/* Max number of NVLinks per GPU in any physical box */
b38b0f
+#define NVGPU_MAX_LINKS              3
b38b0f
+
b38b0f
+/*
b38b0f
+ * GPU RAM starts at 64TiB, so a huge DMA window covering all of it ends at
b38b0f
+ * 128TiB, which is enough. ATSD needs no DMA, so we put it at 128TiB.
b38b0f
+ */
b38b0f
+#define SPAPR_PCI_NV2ATSD_WIN_BASE   (128 * TiB)
b38b0f
+#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * \
b38b0f
+                                      64 * KiB)
b38b0f
+
b38b0f
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
b38b0f
 {
b38b0f
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
b38b0f
@@ -135,6 +155,13 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
b38b0f
 int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
b38b0f
 int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
b38b0f
 void spapr_phb_vfio_reset(DeviceState *qdev);
b38b0f
+void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp);
b38b0f
+void spapr_phb_nvgpu_free(sPAPRPHBState *sphb);
b38b0f
+void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off,
b38b0f
+                                 Error **errp);
b38b0f
+void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
b38b0f
+void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
b38b0f
+                                        sPAPRPHBState *sphb);
b38b0f
 #else
b38b0f
 static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
b38b0f
 {
b38b0f
@@ -161,6 +188,25 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
b38b0f
 static inline void spapr_phb_vfio_reset(DeviceState *qdev)
b38b0f
 {
b38b0f
 }
b38b0f
+static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp)
b38b0f
+{
b38b0f
+}
b38b0f
+static inline void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)
b38b0f
+{
b38b0f
+}
b38b0f
+static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
b38b0f
+                                               int bus_off, Error **errp)
b38b0f
+{
b38b0f
+}
b38b0f
+static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
b38b0f
+                                                   void *fdt)
b38b0f
+{
b38b0f
+}
b38b0f
+static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
b38b0f
+                                                      int offset,
b38b0f
+                                                      sPAPRPHBState *sphb)
b38b0f
+{
b38b0f
+}
b38b0f
 #endif
b38b0f
 
b38b0f
 void spapr_phb_dma_reset(sPAPRPHBState *sphb);
b38b0f
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
b38b0f
index beb42bc..72cfa49 100644
b38b0f
--- a/include/hw/ppc/spapr.h
b38b0f
+++ b/include/hw/ppc/spapr.h
b38b0f
@@ -104,7 +104,8 @@ struct sPAPRMachineClass {
b38b0f
     void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
b38b0f
                           uint64_t *buid, hwaddr *pio, 
b38b0f
                           hwaddr *mmio32, hwaddr *mmio64,
b38b0f
-                          unsigned n_dma, uint32_t *liobns, Error **errp);
b38b0f
+                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
b38b0f
+                          hwaddr *nv2atsd, Error **errp);
b38b0f
     sPAPRResizeHPT resize_hpt_default;
b38b0f
     sPAPRCapabilities default_caps;
b38b0f
 };
b38b0f
@@ -171,6 +172,8 @@ struct sPAPRMachineState {
b38b0f
 
b38b0f
     bool cmd_line_caps[SPAPR_CAP_NUM];
b38b0f
     sPAPRCapabilities def, eff, mig;
b38b0f
+
b38b0f
+    unsigned gpu_numa_id;
b38b0f
 };
b38b0f
 
b38b0f
 #define H_SUCCESS         0
b38b0f
-- 
b38b0f
1.8.3.1
b38b0f
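
Note (not part of the patch): the GPA->TGT encoding quoted in the comment in spapr_phb_nvgpu_populate_pcidev_dt(), and the window sizes defined in include/hw/pci-host/spapr.h, can be checked in isolation. The following is a minimal standalone sketch, not code from QEMU. The helper name nv2_gpa_to_tgt() is hypothetical, KiB/GiB/TiB are local stand-ins for the qemu/units.h constants, and the "256GB each" figure from the header comment is assumed to mean 256 GiB.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the qemu/units.h constants (assumption). */
#define KiB (UINT64_C(1) << 10)
#define GiB (UINT64_C(1) << 30)
#define TiB (UINT64_C(1) << 40)

/* Values copied from the include/hw/pci-host/spapr.h hunk of the patch. */
#define NVGPU_MAX_NUM                6
#define NVGPU_MAX_LINKS              3
#define SPAPR_PCI_NV2RAM64_WIN_SIZE  (2 * TiB)
#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * \
                                      64 * KiB)

/* 6 GPUs * 3 links * 64 KiB of ATSD space each = 1152 KiB per PHB. */
static_assert(SPAPR_PCI_NV2ATSD_WIN_SIZE == 1152 * KiB,
              "ATSD window size");
/* 6 GPUs with 256 GiB each need 1.5 TiB, which fits the 2 TiB window. */
static_assert(NVGPU_MAX_NUM * 256 * GiB <= SPAPR_PCI_NV2RAM64_WIN_SIZE,
              "GPU RAM window size");

/*
 * Hypothetical helper: applies the four steps from the patch comment
 * to a guest physical address instead of sphb->nv2_gpa.
 */
static inline uint64_t nv2_gpa_to_tgt(uint64_t gpa)
{
    uint64_t gta;

    gta  = ((gpa >> 42) & 0x1) << 42;        /* GPA bit 42      -> bit 42      */
    gta |= ((gpa >> 45) & 0x3) << 43;        /* GPA bits 46..45 -> bits 44..43 */
    gta |= ((gpa >> 49) & 0x3) << 45;        /* GPA bits 50..49 -> bits 46..45 */
    gta |= gpa & ((UINT64_C(1) << 43) - 1);  /* GPA bits 42..0 pass through    */

    return gta;
}
```

The sketch passes the low 43 bits through unchanged and compacts GPA bits 45, 46, 49 and 50 into bits 43 to 46. GPA bits 43, 44, 47 and 48 are not represented in the result. The patch itself does not use this encoding: it advertises nvslot->tgt, taken from the VFIO region capability, in "ibm,device-tgt-addr".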