cryptospore / rpms / qemu-kvm

Forked from rpms/qemu-kvm 2 years ago
Pablo Greco e6a3ae
From 5dc7b745eb04e799b95e7e8d17868970a65621df Mon Sep 17 00:00:00 2001
From: David Gibson <dgibson@redhat.com>
Date: Thu, 30 May 2019 04:37:28 +0100
Subject: [PATCH 7/8] spapr: Support NVIDIA V100 GPU with NVLink2

RH-Author: David Gibson <dgibson@redhat.com>
Message-id: <20190530043728.32575-7-dgibson@redhat.com>
Patchwork-id: 88423
O-Subject: [RHEL-8.1 qemu-kvm PATCH 6/6] spapr: Support NVIDIA V100 GPU with NVLink2
Bugzilla: 1710662
RH-Acked-by: Laurent Vivier <lvivier@redhat.com>
RH-Acked-by: Auger Eric <eric.auger@redhat.com>
RH-Acked-by: Cornelia Huck <cohuck@redhat.com>

From: Alexey Kardashevskiy <aik@ozlabs.ru>

NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which include an ATS shootdown (ATSD) register exported via the NVLink
bridge device.

This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.

This adds additional steps to sPAPR PHB setup:

1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;

2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-block", "link-speed" to advertise the NVLink2 function to
the guest;

3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;

4. Add new memory blocks (with extra "linux,memory-usable" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses it for link discovery.
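
As a rough illustration of step 1 (a simplified sketch, not the code added
by this patch), the collection pass can be pictured as a get-or-create
lookup keyed by the NVLink2 target address, so that a GPU and the bridges
pointing at it end up in the same entry. The sketch is standalone C; every
demo_* name and the DEMO_MAX_GPUS limit are hypothetical stand-ins chosen
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vPHB limit; stands in for the real array size. */
#define DEMO_MAX_GPUS 6

struct demo_slot {
    uint64_t tgt;   /* NVLink2 target address identifying the GPU */
    int linknum;    /* number of NVLink bridges collected so far */
};

struct demo_config {
    int num;                               /* used entries in slots[] */
    struct demo_slot slots[DEMO_MAX_GPUS];
};

/* Return the slot for tgt, creating it on first use; NULL when full. */
static struct demo_slot *demo_get_slot(struct demo_config *cfg, uint64_t tgt)
{
    int i;

    assert(cfg != NULL);

    /* Reuse a partially collected slot if one matches */
    for (i = 0; i < cfg->num; ++i) {
        if (cfg->slots[i].tgt == tgt) {
            return &cfg->slots[i];
        }
    }

    if (cfg->num == DEMO_MAX_GPUS) {
        return NULL;
    }

    cfg->slots[cfg->num].tgt = tgt;
    return &cfg->slots[cfg->num++];
}
```

Calling demo_get_slot() twice with the same target returns the same entry;
a NULL return corresponds to the "too many GPUs per vPHB" error path.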

This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.
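
A minimal sketch of that placement rule, assuming made-up window bases and
sizes (the DEMO_* values below are placeholders for illustration, not the
constants this patch defines): each vPHB index gets its own GPU RAM window
and ATSD window at base + index * size, while a legacy machine type reports
zero for both, which leaves the NVLink2 setup disabled.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values for illustration only */
#define DEMO_NV2RAM_WIN_BASE   0x1000000000ULL
#define DEMO_NV2RAM_WIN_SIZE   0x0100000000ULL
#define DEMO_NV2ATSD_WIN_BASE  0x2000000000ULL
#define DEMO_NV2ATSD_WIN_SIZE  0x0000100000ULL

struct demo_windows {
    uint64_t nv2gpa;   /* start of the GPU RAM window for this vPHB */
    uint64_t nv2atsd;  /* start of the ATSD window for this vPHB */
};

/* Compute per-vPHB window addresses; legacy machine types get zeros. */
static struct demo_windows demo_phb_placement(uint32_t index, int legacy)
{
    struct demo_windows w = { 0, 0 };

    if (!legacy) {
        w.nv2gpa = DEMO_NV2RAM_WIN_BASE +
                   (uint64_t)index * DEMO_NV2RAM_WIN_SIZE;
        w.nv2atsd = DEMO_NV2ATSD_WIN_BASE +
                    (uint64_t)index * DEMO_NV2ATSD_WIN_SIZE;
    }
    return w;
}
```

Windows of consecutive vPHBs are adjacent and do not overlap as long as the
GPU RAM and ATSD regions behind one vPHB fit into one window.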

This puts new memory nodes in a separate NUMA node, as the GPU RAM
needs to be configured equally distant from any other node in the system.
Unlike the host setup, which assigns NUMA IDs from 255 downwards, this
numbers the new NUMA nodes after the user-configured nodes, or from 1 if
none were configured.
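
The numbering can be sketched as a counter that is reset to
MAX(1, nb_numa_nodes) on machine reset and incremented once per GPU found.
The demo_* names below are hypothetical; only the reset value and the
per-GPU increment mirror what the patch does with spapr->gpu_numa_id.

```c
#include <assert.h>

struct demo_machine {
    unsigned gpu_numa_id;  /* next NUMA ID to hand out to a GPU */
};

/* Reset the counter: start after the user-configured nodes, or at 1. */
static void demo_reset_gpu_numa(struct demo_machine *m, unsigned nb_numa_nodes)
{
    m->gpu_numa_id = nb_numa_nodes > 1 ? nb_numa_nodes : 1;
}

/* Hand out one NUMA ID per GPU and advance the counter. */
static unsigned demo_assign_gpu_node(struct demo_machine *m)
{
    return m->gpu_numa_id++;
}
```

With no NUMA nodes configured, two GPUs would get IDs 1 and 2; with four
configured nodes (0 to 3), the first GPU would get ID 4. The counter's final
value is what ends up in the last max-associativity-domains cell.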

This adds a requirement similar to EEH: one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
This is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPUs in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for vfio portions]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit ec132efaa81f09861a3bd6afad94827e74543b3f)

Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>

Conflicts:
	hw/ppc/spapr.c
	hw/ppc/spapr_pci.c
	hw/vfio/trace-events
	include/hw/pci-host/spapr.h
	include/hw/ppc/spapr.h

Conflicts arise for several reasons:
  1) Some contextual conflicts
  2) The downstream tree does not have PHB hotplug, so upstream changes to
     that code need to be dropped; we also need to adapt some hunks to
     apply to the code as it existed before PHB hotplug was added
  3) Upstream had a mass renaming of spapr types to give more
     consistent CamelCasing.  We don't have that change downstream, so
     we need to adjust accordingly.
  4) We add an explicit include of qemu/units.h, since it's not indirectly
     included downstream (and it's messy to backport the patch which adds
     that)

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1710662

Signed-off-by: David Gibson <dgibson@redhat.com>
Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>
---
 hw/ppc/Makefile.objs        |   2 +-
 hw/ppc/spapr.c              |  31 ++-
 hw/ppc/spapr_pci.c          |  21 ++-
 hw/ppc/spapr_pci_nvlink2.c  | 450 ++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci-quirks.c        | 131 +++++++++++++
 hw/vfio/pci.c               |  14 ++
 hw/vfio/pci.h               |   2 +
 hw/vfio/trace-events        |   4 +
 include/hw/pci-host/spapr.h |  46 +++++
 include/hw/ppc/spapr.h      |   5 +-
 10 files changed, 697 insertions(+), 9 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_nvlink2.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index a46a989..d07e999 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -8,7 +8,7 @@ obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o
 # IBM PowerNV
 obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
 ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
-obj-y += spapr_pci_vfio.o
+obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
 endif
 obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
 # PowerPC 4xx boards
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b57c0be..c72aad1 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -910,12 +910,13 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
         0, cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE),
         cpu_to_be32(max_cpus / smp_threads),
     };
+    uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
     uint32_t maxdomains[] = {
         cpu_to_be32(4),
-        cpu_to_be32(0),
-        cpu_to_be32(0),
-        cpu_to_be32(0),
-        cpu_to_be32(nb_numa_nodes ? nb_numa_nodes : 1),
+        maxdomain,
+        maxdomain,
+        maxdomain,
+        cpu_to_be32(spapr->gpu_numa_id),
     };

     _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
@@ -1515,6 +1516,16 @@ static void spapr_machine_reset(void)
         ppc_set_compat(first_ppc_cpu, spapr->max_compat_pvr, &error_fatal);
     }

+    /*
+     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
+     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
+     * called from vPHB reset handler so we initialize the counter here.
+     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
+     * must be equally distant from any other node.
+     * The final value of spapr->gpu_numa_id is going to be written to
+     * max-associativity-domains in spapr_build_fdt().
+     */
+    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
     qemu_devices_reset();

     /* DRC reset may cause a device to be unplugged. This will cause troubles
@@ -3601,7 +3612,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
 static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
                                 uint64_t *buid, hwaddr *pio,
                                 hwaddr *mmio32, hwaddr *mmio64,
-                                unsigned n_dma, uint32_t *liobns, Error **errp)
+                                unsigned n_dma, uint32_t *liobns,
+                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
 {
     /*
      * New-style PHB window placement.
@@ -3648,6 +3660,9 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
     *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
     *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
     *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
+
+    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
+    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
 }

 static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
@@ -4133,7 +4148,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
 static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
                               uint64_t *buid, hwaddr *pio,
                               hwaddr *mmio32, hwaddr *mmio64,
-                              unsigned n_dma, uint32_t *liobns, Error **errp)
+                              unsigned n_dma, uint32_t *liobns,
+                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
 {
     /* Legacy PHB placement for pseries-2.7 and earlier machine types */
     const uint64_t base_buid = 0x800000020000000ULL;
@@ -4177,6 +4193,9 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
      * fallback behaviour of automatically splitting a large "32-bit"
      * window into contiguous 32-bit and 64-bit windows
      */
+
+    *nv2gpa = 0;
+    *nv2atsd = 0;
 }

 #if 0 /* Disabled for Red Hat Enterprise Linux */
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index f936ce6..d82f957 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1326,6 +1326,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
     if (sphb->pcie_ecs && pci_is_express(dev)) {
         _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
     }
+
+    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
 }

 /* create OF node for pci device and required OF DT properties */
@@ -1559,7 +1561,9 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
         smc->phb_placement(spapr, sphb->index,
                            &sphb->buid, &sphb->io_win_addr,
                            &sphb->mem_win_addr, &sphb->mem64_win_addr,
-                           windows_supported, sphb->dma_liobn, &local_err);
+                           windows_supported, sphb->dma_liobn,
+                           &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
+                           &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -1764,8 +1768,14 @@ void spapr_phb_dma_reset(sPAPRPHBState *sphb)
 static void spapr_phb_reset(DeviceState *qdev)
 {
     sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
+    Error *errp = NULL;

     spapr_phb_dma_reset(sphb);
+    spapr_phb_nvgpu_free(sphb);
+    spapr_phb_nvgpu_setup(sphb, &errp);
+    if (errp) {
+        error_report_err(errp);
+    }

     /* Reset the IOMMU state */
     object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
@@ -1798,6 +1808,8 @@ static Property spapr_phb_properties[] = {
                      pre_2_8_migration, false),
     DEFINE_PROP_BOOL("pcie-extended-configuration-space", sPAPRPHBState,
                      pcie_ecs, true),
+    DEFINE_PROP_UINT64("gpa", sPAPRPHBState, nv2_gpa_win_addr, 0),
+    DEFINE_PROP_UINT64("atsd", sPAPRPHBState, nv2_atsd_win_addr, 0),
     DEFINE_PROP_END_OF_LIST(),
 };

@@ -2089,6 +2101,7 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     sPAPRTCETable *tcet;
     PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
     sPAPRFDT s_fdt;
+    Error *errp = NULL;

     /* Start populating the FDT */
     nodename = g_strdup_printf("pci@%" PRIx64, phb->buid);
@@ -2170,6 +2183,12 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
         return ret;
     }

+    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off, &errp);
+    if (errp) {
+        error_report_err(errp);
+    }
+    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
+
     return 0;
 }

diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
new file mode 100644
index 0000000..60b14d8
--- /dev/null
+++ b/hw/ppc/spapr_pci_nvlink2.c
@@ -0,0 +1,450 @@
+/*
+ * QEMU sPAPR PCI for NVLink2 pass through
+ *
+ * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "hw/pci/pci.h"
+#include "hw/pci-host/spapr.h"
+#include "qemu/error-report.h"
+#include "hw/ppc/fdt.h"
+#include "hw/pci/pci_bridge.h"
+
+#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
+                                     (((phb)->index) << 16) | ((pdev)->devfn))
+#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
+                                     (((phb)->index) << 16))
+#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
+                                     ((gn) << 4) | (nn))
+
+#define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
+
+struct spapr_phb_pci_nvgpu_config {
+    uint64_t nv2_ram_current;
+    uint64_t nv2_atsd_current;
+    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
+    struct spapr_phb_pci_nvgpu_slot {
+        uint64_t tgt;
+        uint64_t gpa;
+        unsigned numa_id;
+        PCIDevice *gpdev;
+        int linknum;
+        struct {
+            uint64_t atsd_gpa;
+            PCIDevice *npdev;
+            uint32_t link_speed;
+        } links[NVGPU_MAX_LINKS];
+    } slots[NVGPU_MAX_NUM];
+    Error *errp;
+};
+
+static struct spapr_phb_pci_nvgpu_slot *
+spapr_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus, uint64_t tgt)
+{
+    int i;
+
+    /* Search for partially collected "slot" */
+    for (i = 0; i < nvgpus->num; ++i) {
+        if (nvgpus->slots[i].tgt == tgt) {
+            return &nvgpus->slots[i];
+        }
+    }
+
+    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
+        return NULL;
+    }
+
+    i = nvgpus->num;
+    nvgpus->slots[i].tgt = tgt;
+    ++nvgpus->num;
+
+    return &nvgpus->slots[i];
+}
+
+static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
+                                    PCIDevice *pdev, uint64_t tgt,
+                                    MemoryRegion *mr, Error **errp)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    sPAPRMachineState *spapr = SPAPR_MACHINE(machine);
+    struct spapr_phb_pci_nvgpu_slot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
+
+    if (!nvslot) {
+        error_setg(errp, "Found too many GPUs per vPHB");
+        return;
+    }
+    g_assert(!nvslot->gpdev);
+    nvslot->gpdev = pdev;
+
+    nvslot->gpa = nvgpus->nv2_ram_current;
+    nvgpus->nv2_ram_current += memory_region_size(mr);
+    nvslot->numa_id = spapr->gpu_numa_id;
+    ++spapr->gpu_numa_id;
+}
+
+static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
+                                    PCIDevice *pdev, uint64_t tgt,
+                                    MemoryRegion *mr, Error **errp)
+{
+    struct spapr_phb_pci_nvgpu_slot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
+    int j;
+
+    if (!nvslot) {
+        error_setg(errp, "Found too many NVLink bridges per vPHB");
+        return;
+    }
+
+    j = nvslot->linknum;
+    if (j == ARRAY_SIZE(nvslot->links)) {
+        error_setg(errp, "Found too many NVLink bridges per GPU");
+        return;
+    }
+    ++nvslot->linknum;
+
+    g_assert(!nvslot->links[j].npdev);
+    nvslot->links[j].npdev = pdev;
+    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
+    nvgpus->nv2_atsd_current += memory_region_size(mr);
+    nvslot->links[j].link_speed =
+        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
+}
+
+static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
+                                        void *opaque)
+{
+    PCIBus *sec_bus;
+    Object *po = OBJECT(pdev);
+    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
+
+    if (tgt) {
+        Error *local_err = NULL;
+        struct spapr_phb_pci_nvgpu_config *nvgpus = opaque;
+        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
+        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
+                                                  NULL);
+
+        g_assert(mr_gpu || mr_npu);
+        if (mr_gpu) {
+            spapr_pci_collect_nvgpu(nvgpus, pdev, tgt, MEMORY_REGION(mr_gpu),
+                                    &local_err);
+        } else {
+            spapr_pci_collect_nvnpu(nvgpus, pdev, tgt, MEMORY_REGION(mr_npu),
+                                    &local_err);
+        }
+        error_propagate(&nvgpus->errp, local_err);
+    }
+    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
+         PCI_HEADER_TYPE_BRIDGE)) {
+        return;
+    }
+
+    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
+    if (!sec_bus) {
+        return;
+    }
+
+    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
+                        spapr_phb_pci_collect_nvgpu, opaque);
+}
+
+void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp)
+{
+    int i, j, valid_gpu_num;
+    PCIBus *bus;
+
+    /* Search for GPUs and NPUs */
+    if (!sphb->nv2_gpa_win_addr || !sphb->nv2_atsd_win_addr) {
+        return;
+    }
+
+    sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
+    sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
+    sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
+
+    bus = PCI_HOST_BRIDGE(sphb)->bus;
+    pci_for_each_device(bus, pci_bus_num(bus),
+                        spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
+
+    if (sphb->nvgpus->errp) {
+        error_propagate(errp, sphb->nvgpus->errp);
+        sphb->nvgpus->errp = NULL;
+        goto cleanup_exit;
+    }
+
+    /* Add found GPU RAM and ATSD MRs if found */
+    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
+        Object *nvmrobj;
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+
+        if (!nvslot->gpdev) {
+            continue;
+        }
+        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                           "nvlink2-mr[0]", NULL);
+        /* ATSD is pointless without GPU RAM MR so skip those */
+        if (!nvmrobj) {
+            continue;
+        }
+
+        ++valid_gpu_num;
+        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
+                                    MEMORY_REGION(nvmrobj));
+
+        for (j = 0; j < nvslot->linknum; ++j) {
+            Object *atsdmrobj;
+
+            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
+                                                 "nvlink2-atsd-mr[0]", NULL);
+            if (!atsdmrobj) {
+                continue;
+            }
+            memory_region_add_subregion(get_system_memory(),
+                                        nvslot->links[j].atsd_gpa,
+                                        MEMORY_REGION(atsdmrobj));
+        }
+    }
+
+    if (valid_gpu_num) {
+        return;
+    }
+    /* We did not find any interesting GPU */
+cleanup_exit:
+    g_free(sphb->nvgpus);
+    sphb->nvgpus = NULL;
+}
+
+void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)
+{
+    int i, j;
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    for (i = 0; i < sphb->nvgpus->num; ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                                    "nvlink2-mr[0]", NULL);
+
+        if (nv_mrobj) {
+            memory_region_del_subregion(get_system_memory(),
+                                        MEMORY_REGION(nv_mrobj));
+        }
+        for (j = 0; j < nvslot->linknum; ++j) {
+            PCIDevice *npdev = nvslot->links[j].npdev;
+            Object *atsd_mrobj;
+            atsd_mrobj = object_property_get_link(OBJECT(npdev),
+                                                  "nvlink2-atsd-mr[0]", NULL);
+            if (atsd_mrobj) {
+                memory_region_del_subregion(get_system_memory(),
+                                            MEMORY_REGION(atsd_mrobj));
+            }
+        }
+    }
+    g_free(sphb->nvgpus);
+    sphb->nvgpus = NULL;
+}
Pablo Greco e6a3ae
+
Pablo Greco e6a3ae
+void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off,
Pablo Greco e6a3ae
+                                 Error **errp)
Pablo Greco e6a3ae
+{
Pablo Greco e6a3ae
+    int i, j, atsdnum = 0;
Pablo Greco e6a3ae
+    uint64_t atsd[8]; /* The existing limitation of known guests */
Pablo Greco e6a3ae
+
Pablo Greco e6a3ae
+    if (!sphb->nvgpus) {
Pablo Greco e6a3ae
+        return;
Pablo Greco e6a3ae
+    }
Pablo Greco e6a3ae
+
Pablo Greco e6a3ae
+    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
Pablo Greco e6a3ae
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
Pablo Greco e6a3ae
+
Pablo Greco e6a3ae
+        if (!nvslot->gpdev) {
Pablo Greco e6a3ae
+            continue;
Pablo Greco e6a3ae
+        }
Pablo Greco e6a3ae
+        for (j = 0; j < nvslot->linknum; ++j) {
Pablo Greco e6a3ae
+            if (!nvslot->links[j].atsd_gpa) {
Pablo Greco e6a3ae
+                continue;
+            }
+
+            if (atsdnum == ARRAY_SIZE(atsd)) {
+                error_report("Only %"PRIuPTR" ATSD registers supported",
+                             ARRAY_SIZE(atsd));
+                break;
+            }
+            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
+            ++atsdnum;
+        }
+    }
+
+    if (!atsdnum) {
+        error_setg(errp, "No ATSD registers found");
+        return;
+    }
+
+    if (!spapr_phb_eeh_available(sphb)) {
+        /*
+         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
+         * which we do not emulate as a separate device. Instead we put
+         * ibm,mmio-atsd on the vPHB with the GPU and never place GPUs from
+         * different IOMMU groups on the same vPHB, which ensures that the
+         * guest uses ATSDs from the corresponding NPU.
+         */
+        error_setg(errp, "ATSD requires separate vPHB per GPU IOMMU group");
+        return;
+    }
+
+    _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd", atsd,
+                      atsdnum * sizeof(atsd[0]))));
+}
+
+void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt)
+{
+    int i, j, linkidx, npuoff;
+    char *npuname;
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    npuname = g_strdup_printf("npuphb%d", sphb->index);
+    npuoff = fdt_add_subnode(fdt, 0, npuname);
+    _FDT(npuoff);
+    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
+    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
+    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
+    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
+    g_free(npuname);
+
+    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
+        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
+            char *linkname = g_strdup_printf("link@%d", linkidx);
+            int off = fdt_add_subnode(fdt, npuoff, linkname);
+
+            _FDT(off);
+            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx))); */
+            _FDT((fdt_setprop_string(fdt, off, "compatible",
+                                     "ibm,npu-link")));
+            _FDT((fdt_setprop_cell(fdt, off, "phandle",
+                                   PHANDLE_NVLINK(sphb, i, j))));
+            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
+            g_free(linkname);
+            ++linkidx;
+        }
+    }
+
+    /* Add memory nodes for GPU RAM and mark them unusable */
+    for (i = 0; i < sphb->nvgpus->num; ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
+                                                    "nvlink2-mr[0]", NULL);
+        uint32_t associativity[] = {
+            cpu_to_be32(0x4),
+            SPAPR_GPU_NUMA_ID,
+            SPAPR_GPU_NUMA_ID,
+            SPAPR_GPU_NUMA_ID,
+            cpu_to_be32(nvslot->numa_id)
+        };
+        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
+        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
+        char *mem_name = g_strdup_printf("memory@%"PRIx64, nvslot->gpa);
+        int off = fdt_add_subnode(fdt, 0, mem_name);
+
+        _FDT(off);
+        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
+        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
+        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
+                          sizeof(associativity))));
+
+        _FDT((fdt_setprop_string(fdt, off, "compatible",
+                                 "ibm,coherent-device-memory")));
+
+        mem_reg[1] = cpu_to_be64(0);
+        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
+                          sizeof(mem_reg))));
+        _FDT((fdt_setprop_cell(fdt, off, "phandle",
+                               PHANDLE_GPURAM(sphb, i))));
+        g_free(mem_name);
+    }
+
+}
+
+void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
+                                        sPAPRPHBState *sphb)
+{
+    int i, j;
+
+    if (!sphb->nvgpus) {
+        return;
+    }
+
+    for (i = 0; i < sphb->nvgpus->num; ++i) {
+        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
+
+        /* Skip "slot" without attached GPU */
+        if (!nvslot->gpdev) {
+            continue;
+        }
+        if (dev == nvslot->gpdev) {
+            uint32_t npus[nvslot->linknum];
+
+            for (j = 0; j < nvslot->linknum; ++j) {
+                PCIDevice *npdev = nvslot->links[j].npdev;
+
+                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
+            }
+            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
+                             j * sizeof(npus[0])));
+            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
+                                   PHANDLE_PCIDEV(sphb, dev))));
+            continue;
+        }
+
+        for (j = 0; j < nvslot->linknum; ++j) {
+            if (dev != nvslot->links[j].npdev) {
+                continue;
+            }
+
+            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
+                                   PHANDLE_PCIDEV(sphb, dev))));
+            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
+                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
+            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
+                                   PHANDLE_NVLINK(sphb, i, j))));
+            /*
+             * If we ever want to emulate GPU RAM at the same location as on
+             * the host - here is the encoding GPA->TGT:
+             *
+             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
+             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
+             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
+             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
+             */
+            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
+                                  PHANDLE_GPURAM(sphb, i)));
+            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
+                                 nvslot->tgt));
+            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
+                                  nvslot->links[j].link_speed));
+        }
+    }
+}
diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index 92457ed..1beedca 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -1968,3 +1968,134 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
 
     return 0;
 }
+
+static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
+                                     const char *name,
+                                     void *opaque, Error **errp)
+{
+    uint64_t tgt = (uintptr_t) opaque;
+    visit_type_uint64(v, name, &tgt, errp);
+}
+
+static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
+                                                 const char *name,
+                                                 void *opaque, Error **errp)
+{
+    uint32_t link_speed = (uint32_t)(uintptr_t) opaque;
+    visit_type_uint32(v, name, &link_speed, errp);
+}
+
+int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
+{
+    int ret;
+    void *p;
+    struct vfio_region_info *nv2reg = NULL;
+    struct vfio_info_cap_header *hdr;
+    struct vfio_region_info_cap_nvlink2_ssatgt *cap;
+    VFIOQuirk *quirk;
+
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
+                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
+                                   PCI_VENDOR_ID_NVIDIA,
+                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
+                                   &nv2reg);
+    if (ret) {
+        return ret;
+    }
+
+    hdr = vfio_get_region_info_cap(nv2reg, VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
+    if (!hdr) {
+        ret = -ENODEV;
+        goto free_exit;
+    }
+    cap = (void *) hdr;
+
+    p = mmap(NULL, nv2reg->size, PROT_READ | PROT_WRITE | PROT_EXEC,
+             MAP_SHARED, vdev->vbasedev.fd, nv2reg->offset);
+    if (p == MAP_FAILED) {
+        ret = -errno;
+        goto free_exit;
+    }
+
+    quirk = vfio_quirk_alloc(1);
+    memory_region_init_ram_ptr(&quirk->mem[0], OBJECT(vdev), "nvlink2-mr",
+                               nv2reg->size, p);
+    QLIST_INSERT_HEAD(&vdev->bars[0].quirks, quirk, next);
+
+    object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
+                        vfio_pci_nvlink2_get_tgt, NULL, NULL,
+                        (void *) (uintptr_t) cap->tgt, NULL);
+    trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
+                                          nv2reg->size);
+free_exit:
+    g_free(nv2reg);
+
+    return ret;
+}
+
+int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
+{
+    int ret;
+    void *p;
+    struct vfio_region_info *atsdreg = NULL;
+    struct vfio_info_cap_header *hdr;
+    struct vfio_region_info_cap_nvlink2_ssatgt *captgt;
+    struct vfio_region_info_cap_nvlink2_lnkspd *capspeed;
+    VFIOQuirk *quirk;
+
+    ret = vfio_get_dev_region_info(&vdev->vbasedev,
+                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
+                                   PCI_VENDOR_ID_IBM,
+                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
+                                   &atsdreg);
+    if (ret) {
+        return ret;
+    }
+
+    hdr = vfio_get_region_info_cap(atsdreg,
+                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
+    if (!hdr) {
+        ret = -ENODEV;
+        goto free_exit;
+    }
+    captgt = (void *) hdr;
+
+    hdr = vfio_get_region_info_cap(atsdreg,
+                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
+    if (!hdr) {
+        ret = -ENODEV;
+        goto free_exit;
+    }
+    capspeed = (void *) hdr;
+
+    /* Some NVLink bridges may not have assigned ATSD */
+    if (atsdreg->size) {
+        p = mmap(NULL, atsdreg->size, PROT_READ | PROT_WRITE | PROT_EXEC,
+                 MAP_SHARED, vdev->vbasedev.fd, atsdreg->offset);
+        if (p == MAP_FAILED) {
+            ret = -errno;
+            goto free_exit;
+        }
+
+        quirk = vfio_quirk_alloc(1);
+        memory_region_init_ram_device_ptr(&quirk->mem[0], OBJECT(vdev),
+                                          "nvlink2-atsd-mr", atsdreg->size, p);
+        QLIST_INSERT_HEAD(&vdev->bars[0].quirks, quirk, next);
+    }
+
+    object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
+                        vfio_pci_nvlink2_get_tgt, NULL, NULL,
+                        (void *) (uintptr_t) captgt->tgt, NULL);
+    trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, captgt->tgt,
+                                              atsdreg->size);
+
+    object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
+                        vfio_pci_nvlink2_get_link_speed, NULL, NULL,
+                        (void *) (uintptr_t) capspeed->link_speed, NULL);
+    trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
+                                              capspeed->link_speed);
+free_exit:
+    g_free(atsdreg);
+
+    return ret;
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ba3a393..735dcae 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3078,6 +3078,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
+        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
+        if (ret && ret != -ENODEV) {
+            error_report("Failed to set up NVIDIA V100 GPU RAM");
+        }
+    }
+
+    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
+        ret = vfio_pci_nvlink2_init(vdev, errp);
+        if (ret && ret != -ENODEV) {
+            error_report("Failed to set up NVLink2 bridge");
+        }
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 629c875..bf07b43 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -175,6 +175,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
 int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
                                struct vfio_region_info *info,
                                Error **errp);
+int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
+int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
 
 int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
 void vfio_display_finalize(VFIOPCIDevice *vdev);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 9487887..c9a9c14 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -84,6 +84,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
 vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
 vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
 
+vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
+vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
+vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
+
 # hw/vfio/common.c
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 0fae4fc..cd29c59 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -24,6 +24,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/pci_host.h"
 #include "hw/ppc/xics.h"
+#include "qemu/units.h"
 
 #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
 
@@ -87,6 +88,9 @@ struct sPAPRPHBState {
     uint32_t mig_liobn;
     hwaddr mig_mem_win_addr, mig_mem_win_size;
     hwaddr mig_io_win_addr, mig_io_win_size;
+    hwaddr nv2_gpa_win_addr;
+    hwaddr nv2_atsd_win_addr;
+    struct spapr_phb_pci_nvgpu_config *nvgpus;
 };
 
 #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
@@ -104,6 +108,22 @@ struct sPAPRPHBState {
 
 #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
 
+#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
+#define SPAPR_PCI_NV2RAM64_WIN_SIZE  (2 * TiB) /* For up to 6 GPUs 256GB each */
+
+/* Max number of these GPUs per physical box */
+#define NVGPU_MAX_NUM                6
+/* Max number of NVLinks per GPU in any physical box */
+#define NVGPU_MAX_LINKS              3
+
+/*
+ * GPU RAM starts at 64TiB, so a huge DMA window covering all of it ends at
+ * 128TiB, which is enough. ATSD needs no DMA, so we put ATSDs at 128TiB.
+ */
+#define SPAPR_PCI_NV2ATSD_WIN_BASE   (128 * TiB)
+#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * \
+                                      64 * KiB)
+
 static inline qemu_irq spapr_phb_lsi_qirq(struct sPAPRPHBState *phb, int pin)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
@@ -135,6 +155,13 @@ int spapr_phb_vfio_eeh_get_state(sPAPRPHBState *sphb, int *state);
 int spapr_phb_vfio_eeh_reset(sPAPRPHBState *sphb, int option);
 int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb);
 void spapr_phb_vfio_reset(DeviceState *qdev);
+void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp);
+void spapr_phb_nvgpu_free(sPAPRPHBState *sphb);
+void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt, int bus_off,
+                                 Error **errp);
+void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb, void *fdt);
+void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
+                                        sPAPRPHBState *sphb);
 #else
 static inline bool spapr_phb_eeh_available(sPAPRPHBState *sphb)
 {
@@ -161,6 +188,25 @@ static inline int spapr_phb_vfio_eeh_configure(sPAPRPHBState *sphb)
 static inline void spapr_phb_vfio_reset(DeviceState *qdev)
 {
 }
+static inline void spapr_phb_nvgpu_setup(sPAPRPHBState *sphb, Error **errp)
+{
+}
+static inline void spapr_phb_nvgpu_free(sPAPRPHBState *sphb)
+{
+}
+static inline void spapr_phb_nvgpu_populate_dt(sPAPRPHBState *sphb, void *fdt,
+                                               int bus_off, Error **errp)
+{
+}
+static inline void spapr_phb_nvgpu_ram_populate_dt(sPAPRPHBState *sphb,
+                                                   void *fdt)
+{
+}
+static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
+                                                      int offset,
+                                                      sPAPRPHBState *sphb)
+{
+}
 #endif
 
 void spapr_phb_dma_reset(sPAPRPHBState *sphb);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index beb42bc..72cfa49 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -104,7 +104,8 @@ struct sPAPRMachineClass {
     void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio, 
                           hwaddr *mmio32, hwaddr *mmio64,
-                          unsigned n_dma, uint32_t *liobns, Error **errp);
+                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
+                          hwaddr *nv2atsd, Error **errp);
     sPAPRResizeHPT resize_hpt_default;
     sPAPRCapabilities default_caps;
 };
@@ -171,6 +172,8 @@ struct sPAPRMachineState {
 
     bool cmd_line_caps[SPAPR_CAP_NUM];
     sPAPRCapabilities def, eff, mig;
+
+    unsigned gpu_numa_id;
 };
 
 #define H_SUCCESS         0
-- 
1.8.3.1
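
The window-size constants and the bounded ATSD collection loop in this patch can be sketched outside QEMU as plain C. This is only an illustration under stated assumptions: the unit macros stand in for those in qemu/units.h, the constants are copied from the header hunk with the SPAPR_PCI_ prefix dropped, "256GB" is read as 256 GiB, and `gpu_ram_fits()` and `collect_atsd()` are hypothetical helpers that do not exist in QEMU. Unlike the patch, the sketch skips the cpu_to_be64() conversion and the error reporting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Binary units, standing in for the ones qemu/units.h provides. */
#define KiB (1ULL << 10)
#define GiB (1ULL << 30)
#define TiB (1ULL << 40)

/* Constants copied from the patch (SPAPR_PCI_ prefix dropped). */
#define NVGPU_MAX_NUM      6
#define NVGPU_MAX_LINKS    3
#define NV2RAM64_WIN_SIZE  (2 * TiB)
#define NV2ATSD_WIN_SIZE   (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * 64 * KiB)

/* Hypothetical helper: does the RAM of ngpus GPUs fit in one RAM window? */
static int gpu_ram_fits(unsigned ngpus, uint64_t bytes_per_gpu)
{
    return (uint64_t)ngpus * bytes_per_gpu <= NV2RAM64_WIN_SIZE;
}

/*
 * Hypothetical helper: copy the non-zero ATSD addresses from gpas[] into
 * out[], stopping once cap entries are stored. Returns the number stored.
 * The patch additionally converts each value with cpu_to_be64().
 */
static size_t collect_atsd(const uint64_t *gpas, size_t n,
                           uint64_t *out, size_t cap)
{
    size_t i, stored = 0;

    assert(gpas != NULL && out != NULL);
    for (i = 0; i < n && stored < cap; ++i) {
        if (gpas[i]) {
            out[stored++] = gpas[i];
        }
    }
    return stored;
}
```

With these values, six GPUs of 256 GiB each need 1.5 TiB and fit in the 2 TiB RAM window, and the ATSD window is 6 * 3 * 64 KiB = 1152 KiB. Passing a capacity of 8 to `collect_atsd()` mirrors the `atsd[8]` array in `spapr_phb_nvgpu_populate_dt()`.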