Blame SOURCES/0049-daxctl-Add-Soft-Reservation-theory-of-operation.patch

From 8f4e42c0c526e85b045fd0329df7cb904f511c98 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Thu, 7 Oct 2021 14:59:53 -0700
Subject: [PATCH 049/217] daxctl: Add "Soft Reservation" theory of operation

As systems are starting to ship memory with the EFI "Special Purpose"
attribute that Linux optionally turns into "Soft Reserved" ranges, one of
the immediate first questions is "where is my special memory, and how do I
access it". Add some documentation to explain the default behaviour of
"Soft Reserved".

Link: https://lore.kernel.org/r/163364399303.201290.6835215953983673447.stgit@dwillia2-desk3.amr.corp.intel.com
Reported-by: John Groves <john@jagalactic.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 .../daxctl/daxctl-reconfigure-device.txt      | 127 ++++++++++++------
 1 file changed, 88 insertions(+), 39 deletions(-)

diff --git a/Documentation/daxctl/daxctl-reconfigure-device.txt b/Documentation/daxctl/daxctl-reconfigure-device.txt
index f112b3c..132684c 100644
--- a/Documentation/daxctl/daxctl-reconfigure-device.txt
+++ b/Documentation/daxctl/daxctl-reconfigure-device.txt
@@ -12,6 +12,94 @@ SYNOPSIS
 [verse]
 'daxctl reconfigure-device' <dax0.0> [<dax1.0>...<daxY.Z>] [<options>]
 
+DESCRIPTION
+-----------
+
+Reconfigure the operational mode of a dax device. This can be used to convert
+a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
+dax range to be hot-plugged into the system as regular memory.
+
+NOTE: This is a destructive operation. Any data on the dax device *will* be
+lost.
+
+NOTE: Device reconfiguration depends on the dax-bus device model. See
+linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
+in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
+error such as the following:
+----
+# daxctl reconfigure-device --mode=system-ram --region=0 all
+libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
+dax3.0: disable failed: Operation not supported
+error reconfiguring devices: Operation not supported
+reconfigured 0 devices
+----
+
+'daxctl-reconfigure-device' nominally expects that it will online new memory
+blocks as 'movable', so that kernel data doesn't make it into this memory.
+However, there are other potential agents that may be configured to
+automatically online new hot-plugged memory as it appears. Most notably,
+these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
+or system udev rules. If such an agent races to online memory sections, daxctl
+checks if the blocks were onlined as 'movable' memory. If this was not the
+case, and the memory blocks are found to be in a different zone, then a
+warning is displayed. If it is desired that a different agent control the
+onlining of memory blocks, and the associated memory zone, then it is
+recommended to use the --no-online option described below. This will abridge
+the device reconfiguration operation to just hotplugging the memory, and
+refrain from then onlining it.
+
+In case daxctl detects that there is a kernel policy to auto-online blocks
+(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
+system-ram will result in a failure. This can be overridden with '--force'.
+
+
+THEORY OF OPERATION
+-------------------
+The kernel device-dax subsystem surfaces character devices
+that provide DAX-access (direct mappings sans page-cache buffering) to a
+given memory region. The devices are named /dev/daxX.Y where X is a
+region-id and Y is an instance-id within that region. There are 2
+mechanisms that trigger device-dax instances to appear:
+
+1. Persistent Memory (PMEM) namespace configured in "devdax" mode. See
+"ndctl create-namespace --help" and
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_PMEM].
+In this case the device-dax instance is statically sized to its host
+memory region which is bounded to the physical address range of the host
+namespace.
+
+2. Soft Reserved memory enumerated by platform firmware. On EFI systems
+this is communicated via the so-called EFI_MEMORY_SP "Special Purpose"
+attribute. See
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_HMEM].
+In this case the device-dax instance(s) associated with the given memory
+region can be resized and divided into multiple devices.
+
+In the Soft Reservation case, the expectation for EFI + ACPI based
+platforms is that, in addition to the EFI_MEMORY_SP attribute, the
+firmware also creates distinct ACPI proximity domains for any address
+range that has different performance characteristics than default
+"System RAM". So, the SRAT defines the proximity domain, the SLIT
+communicates relative distance to other proximity domains, and the HMAT
+is populated with nominal read/write latency and read/write bandwidth
+data. That HMAT data is emitted to the kernel log on bootup, and also
+exported to sysfs. See
+https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html[NUMAPERF]
+for the runtime representation of CPU to Memory node performance
+details.
+
+Outside of the NUMA performance details linked above, the other method to
+detect the presence of "Soft Reserved" memory is to dump /proc/iomem and
+look for "Soft Reserved" ranges. If the kernel was not built with
+CONFIG_EFI_SOFTRESERVE, predates the introduction of
+CONFIG_EFI_SOFTRESERVE (v5.5), or was booted with the efi=nosoftreserve
+command line option, then device-dax will not attach and the expectation is
+that the memory shows up as a memory-only NUMA node. Otherwise, the memory
+shows up as a device-dax instance and DAXCTL(1) can be used to
+optionally partition it and assign the memory back to the kernel as
+"System RAM", or the device can be mapped directly as the back end of a
+userspace memory allocator like https://pmem.io/vmem/libvmem/[LIBVMEM].
+
 EXAMPLES
 --------
 
@@ -83,45 +171,6 @@ reconfigured 1 device
 reconfigured 1 device
 ----
 
-DESCRIPTION
------------
-
-Reconfigure the operational mode of a dax device. This can be used to convert
-a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
-dax range to be hot-plugged into the system as regular memory.
-
-NOTE: This is a destructive operation. Any data on the dax device *will* be
-lost.
-
-NOTE: Device reconfiguration depends on the dax-bus device model. See
-linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
-in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
-error such as the following:
-----
-# daxctl reconfigure-device --mode=system-ram --region=0 all
-libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
-dax3.0: disable failed: Operation not supported
-error reconfiguring devices: Operation not supported
-reconfigured 0 devices
-----
-
-'daxctl-reconfigure-device' nominally expects that it will online new memory
-blocks as 'movable', so that kernel data doesn't make it into this memory.
-However, there are other potential agents that may be configured to
-automatically online new hot-plugged memory as it appears. Most notably,
-these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
-or system udev rules. If such an agent races to online memory sections, daxctl
-checks if the blocks were onlined as 'movable' memory. If this was not the
-case, and the memory blocks are found to be in a different zone, then a
-warning is displayed. If it is desired that a different agent control the
-onlining of memory blocks, and the associated memory zone, then it is
-recommended to use the --no-online option described below. This will abridge
-the device reconfiguration operation to just hotplugging the memory, and
-refrain from then onlining it.
-
-In case daxctl detects that there is a kernel policy to auto-online blocks
-(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
-system-ram will result in a failure. This can be overridden with '--force'.
 
 OPTIONS
 -------
-- 
2.27.0