Blame SOURCES/0049-daxctl-Add-Soft-Reservation-theory-of-operation.patch

From 8f4e42c0c526e85b045fd0329df7cb904f511c98 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Thu, 7 Oct 2021 14:59:53 -0700
Subject: [PATCH 049/217] daxctl: Add "Soft Reservation" theory of operation

As systems are starting to ship memory with the EFI "Specific Purpose"
attribute that Linux optionally turns into "Soft Reserved" ranges, one of
the first questions is "where is my special memory, and how do I
access it". Add some documentation to explain the default behaviour of
"Soft Reserved".

Link: https://lore.kernel.org/r/163364399303.201290.6835215953983673447.stgit@dwillia2-desk3.amr.corp.intel.com
Reported-by: John Groves <john@jagalactic.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
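
Note (not part of the commit): the THEORY OF OPERATION text in this patch
suggests dumping /proc/iomem and looking for "Soft Reserved" ranges. A
minimal Python sketch of that check is given here. The function names are
hypothetical (they are not part of daxctl or ndctl), and the parser assumes
the usual "start-end : name" layout of /proc/iomem lines.

```python
import re
from typing import List, Tuple

# Assumed /proc/iomem line layout: "<hex start>-<hex end> : <resource name>",
# optionally indented for nested resources.
_IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def soft_reserved_ranges(iomem_text: str) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) pairs for each "Soft Reserved" entry."""
    ranges = []
    for line in iomem_text.splitlines():
        match = _IOMEM_LINE.match(line)
        if match and match.group(3).strip() == "Soft Reserved":
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            ranges.append((start, end))
    return ranges


def read_soft_reserved(path: str = "/proc/iomem") -> List[Tuple[int, int]]:
    """Hypothetical helper: parse the live /proc/iomem (or a saved copy)."""
    with open(path) as f:
        return soft_reserved_ranges(f.read())
```

Calling read_soft_reserved() as root is the intended use; unprivileged
readers of /proc/iomem see zeroed addresses, so the returned ranges would
not be meaningful. An empty result is consistent with the cases the text
lists (no Soft Reserved support in the kernel, or efi=nosoftreserve), but it
does not distinguish between them.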
 .../daxctl/daxctl-reconfigure-device.txt      | 127 ++++++++++++------
 1 file changed, 88 insertions(+), 39 deletions(-)
diff --git a/Documentation/daxctl/daxctl-reconfigure-device.txt b/Documentation/daxctl/daxctl-reconfigure-device.txt
index f112b3c..132684c 100644
--- a/Documentation/daxctl/daxctl-reconfigure-device.txt
+++ b/Documentation/daxctl/daxctl-reconfigure-device.txt
@@ -12,6 +12,94 @@ SYNOPSIS
 [verse]
 'daxctl reconfigure-device' <dax0.0> [<dax1.0>...<daxY.Z>] [<options>]
 
+DESCRIPTION
+-----------
+
+Reconfigure the operational mode of a dax device. This can be used to convert
+a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
+dax range to be hot-plugged into the system as regular memory.
+
+NOTE: This is a destructive operation. Any data on the dax device *will* be
+lost.
+
+NOTE: Device reconfiguration depends on the dax-bus device model. See
+linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
+in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
+error such as the following:
+----
+# daxctl reconfigure-device --mode=system-ram --region=0 all
+libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
+dax3.0: disable failed: Operation not supported
+error reconfiguring devices: Operation not supported
+reconfigured 0 devices
+----
+
+'daxctl-reconfigure-device' nominally expects that it will online new memory
+blocks as 'movable', so that kernel data doesn't make it into this memory.
+However, there are other potential agents that may be configured to
+automatically online new hot-plugged memory as it appears. Most notably,
+these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
+or system udev rules. If such an agent races to online memory sections, daxctl
+checks if the blocks were onlined as 'movable' memory. If this was not the
+case, and the memory blocks are found to be in a different zone, then a
+warning is displayed. If it is desired that a different agent control the
+onlining of memory blocks, and the associated memory zone, then it is
+recommended to use the --no-online option described below. This will abridge
+the device reconfiguration operation to just hotplugging the memory, and
+refrain from then onlining it.
+
+In case daxctl detects that there is a kernel policy to auto-online blocks
+(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
+system-ram will result in a failure. This can be overridden with '--force'.
+
+
+THEORY OF OPERATION
+-------------------
+The kernel device-dax subsystem surfaces character devices
+that provide DAX-access (direct mappings sans page-cache buffering) to a
+given memory region. The devices are named /dev/daxX.Y where X is a
+region-id and Y is an instance-id within that region. There are 2
+mechanisms that trigger device-dax instances to appear:
+
+1. Persistent Memory (PMEM) namespace configured in "devdax" mode. See
+"ndctl create-namespace --help" and
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_PMEM].
+In this case the device-dax instance is statically sized to its host
+memory region, which is bounded by the physical address range of the host
+namespace.
+
+2. Soft Reserved memory enumerated by platform firmware. On EFI systems
+this is communicated via the so-called EFI_MEMORY_SP "Specific Purpose"
+attribute. See
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_HMEM].
+In this case the device-dax instance(s) associated with the given memory
+region can be resized and divided into multiple devices.
+
+In the Soft Reservation case the expectation for EFI + ACPI based
+platforms is that in addition to the EFI_MEMORY_SP attribute the
+firmware also creates distinct ACPI proximity domains for any address
+range that has different performance characteristics than default
+"System RAM". So, the SRAT will define the proximity domain, the SLIT
+communicates relative distance to other proximity domains, and the HMAT
+is populated with nominal read/write latency and read/write bandwidth
+data. That HMAT data is emitted to the kernel log on bootup, and also
+exported to sysfs. See
+https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html[NUMAPERF],
+for the runtime representation of CPU to Memory node performance
+details.
+
+Outside of the NUMA performance details linked above, the other method to
+detect the presence of "Soft Reserved" memory is to dump /proc/iomem and
+look for "Soft Reserved" ranges. If the kernel was not built with
+CONFIG_EFI_SOFT_RESERVE, predates the introduction of
+CONFIG_EFI_SOFT_RESERVE (v5.5), or was booted with the efi=nosoftreserve
+command line option, then device-dax will not attach and the expectation is
+that the memory shows up as a memory-only NUMA node. Otherwise, the memory
+shows up as a device-dax instance and DAXCTL(1) can be used to
+optionally partition it and assign the memory back to the kernel as
+"System RAM", or the device can be mapped directly as the back end of a
+userspace memory allocator like https://pmem.io/vmem/libvmem/[LIBVMEM].
+
 EXAMPLES
 --------
 
@@ -83,45 +171,6 @@ reconfigured 1 device
 reconfigured 1 device
 ----
 
-DESCRIPTION
------------
-
-Reconfigure the operational mode of a dax device. This can be used to convert
-a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
-dax range to be hot-plugged into the system as regular memory.
-
-NOTE: This is a destructive operation. Any data on the dax device *will* be
-lost.
-
-NOTE: Device reconfiguration depends on the dax-bus device model. See
-linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
-in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
-error such as the following:
-----
-# daxctl reconfigure-device --mode=system-ram --region=0 all
-libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
-dax3.0: disable failed: Operation not supported
-error reconfiguring devices: Operation not supported
-reconfigured 0 devices
-----
-
-'daxctl-reconfigure-device' nominally expects that it will online new memory
-blocks as 'movable', so that kernel data doesn't make it into this memory.
-However, there are other potential agents that may be configured to
-automatically online new hot-plugged memory as it appears. Most notably,
-these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
-or system udev rules. If such an agent races to online memory sections, daxctl
-checks if the blocks were onlined as 'movable' memory. If this was not the
-case, and the memory blocks are found to be in a different zone, then a
-warning is displayed. If it is desired that a different agent control the
-onlining of memory blocks, and the associated memory zone, then it is
-recommended to use the --no-online option described below. This will abridge
-the device reconfiguration operation to just hotplugging the memory, and
-refrain from then onlining it.
-
-In case daxctl detects that there is a kernel policy to auto-online blocks
-(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
-system-ram will result in a failure. This can be overridden with '--force'.
 
 OPTIONS
 -------
-- 
2.27.0