Blame SOURCES/0049-daxctl-Add-Soft-Reservation-theory-of-operation.patch

From 8f4e42c0c526e85b045fd0329df7cb904f511c98 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Thu, 7 Oct 2021 14:59:53 -0700
Subject: [PATCH 049/217] daxctl: Add "Soft Reservation" theory of operation

As systems are starting to ship memory with the EFI "Special Purpose"
attribute that Linux optionally turns into "Soft Reserved" ranges one of
the immediate first questions is "where is my special memory, and how do
I access it". Add some documentation to explain the default behaviour of
"Soft Reserved".

Link: https://lore.kernel.org/r/163364399303.201290.6835215953983673447.stgit@dwillia2-desk3.amr.corp.intel.com
Reported-by: John Groves <john@jagalactic.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
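Note: the "Theory of operation" text in this patch suggests dumping
/proc/iomem and looking for "Soft Reserved" ranges. The following is a
minimal sketch of that check, assuming the usual "start-end : name"
hexadecimal line layout of /proc/iomem. The helper name
find_soft_reserved is hypothetical and is not part of ndctl or daxctl.

```python
import re

# One /proc/iomem line: optional indent, then "start-end : name" (hex addresses).
_IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def find_soft_reserved(iomem_text):
    """Return (start, end) integer pairs for each "Soft Reserved" range.

    Hypothetical helper for illustration only; it just scans text that
    follows the assumed /proc/iomem layout.
    """
    ranges = []
    for line in iomem_text.splitlines():
        match = _IOMEM_LINE.match(line)
        if match and match.group(3).strip() == "Soft Reserved":
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            ranges.append((start, end))
    return ranges
```

To try it, pass in the contents of /proc/iomem, for example
find_soft_reserved(open("/proc/iomem").read()). Reading that file
without root privileges typically reports zeroed addresses, so the
ranges are only meaningful when run as root.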
 .../daxctl/daxctl-reconfigure-device.txt      | 127 ++++++++++++------
 1 file changed, 88 insertions(+), 39 deletions(-)

diff --git a/Documentation/daxctl/daxctl-reconfigure-device.txt b/Documentation/daxctl/daxctl-reconfigure-device.txt
index f112b3c..132684c 100644
--- a/Documentation/daxctl/daxctl-reconfigure-device.txt
+++ b/Documentation/daxctl/daxctl-reconfigure-device.txt
@@ -12,6 +12,94 @@ SYNOPSIS
 [verse]
 'daxctl reconfigure-device' <dax0.0> [<dax1.0>...<daxY.Z>] [<options>]
 
+DESCRIPTION
+-----------
+
+Reconfigure the operational mode of a dax device. This can be used to convert
+a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
+dax range to be hot-plugged into the system as regular memory.
+
+NOTE: This is a destructive operation. Any data on the dax device *will* be
+lost.
+
+NOTE: Device reconfiguration depends on the dax-bus device model. See
+linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
+in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
+error such as the following:
+----
+# daxctl reconfigure-device --mode=system-ram --region=0 all
+libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
+dax3.0: disable failed: Operation not supported
+error reconfiguring devices: Operation not supported
+reconfigured 0 devices
+----
+
+'daxctl-reconfigure-device' nominally expects that it will online new memory
+blocks as 'movable', so that kernel data doesn't make it into this memory.
+However, there are other potential agents that may be configured to
+automatically online new hot-plugged memory as it appears. Most notably,
+these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
+or system udev rules. If such an agent races to online memory sections, daxctl
+checks if the blocks were onlined as 'movable' memory. If this was not the
+case, and the memory blocks are found to be in a different zone, then a
+warning is displayed. If it is desired that a different agent control the
+onlining of memory blocks, and the associated memory zone, then it is
+recommended to use the --no-online option described below. This will abridge
+the device reconfiguration operation to just hotplugging the memory, and
+refrain from then onlining it.
+
+In case daxctl detects that there is a kernel policy to auto-online blocks
+(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
+system-ram will result in a failure. This can be overridden with '--force'.
+
+
+THEORY OF OPERATION
+-------------------
+The kernel device-dax subsystem surfaces character devices
+that provide DAX-access (direct mappings sans page-cache buffering) to a
+given memory region. The devices are named /dev/daxX.Y where X is a
+region-id and Y is an instance-id within that region. There are 2
+mechanisms that trigger device-dax instances to appear:
+
+1. Persistent Memory (PMEM) namespace configured in "devdax" mode. See
+"ndctl create-namespace --help" and
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_PMEM].
+In this case the device-dax instance is statically sized to its host
+memory region which is bounded to the physical address range of the host
+namespace.
+
+2. Soft Reserved memory enumerated by platform firmware. On EFI systems
+this is communicated via the so called EFI_MEMORY_SP "Special Purpose"
+attribute. See
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_HMEM].
+In this case the device-dax instance(s) associated with the given memory
+region can be resized and divided into multiple devices.
+
+In the Soft Reservation case the expectation for EFI + ACPI based
+platforms is that in addition to the EFI_MEMORY_SP attribute the
+firmware also creates distinct ACPI proximity domains for any address
+range that has different performance characteristics than default
+"System RAM". So, the SRAT will define the proximity domain, the SLIT
+communicates relative distance to other proximity domains, and the HMAT
+is populated with nominal read/write latency and read/write bandwidth
+data. That HMAT data is emitted to the kernel log on bootup, and also
+exported to sysfs. See
+https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html[NUMAPERF],
+for the runtime representation of CPU to Memory node performance
+details.
+
+Outside of the NUMA performance details linked above the other method to
+detect the presence of "Soft Reserved" memory is to dump /proc/iomem and
+look for "Soft Reserved" ranges. If the kernel was not built with
+CONFIG_EFI_SOFTRESERVE, predates the introduction of
+CONFIG_EFI_SOFTRESERVE (v5.5), or was booted with the efi=nosoftreserve
+command line then device-dax will not attach and the expectation is that
+the memory shows up as a memory-only NUMA node. Otherwise the memory
+shows up as a device-dax instance and DAXCTL(1) can be used to
+optionally partition it and assign the memory back to the kernel as
+"System RAM", or the device can be mapped directly as the back end of a
+userspace memory allocator like https://pmem.io/vmem/libvmem/[LIBVMEM].
+
 EXAMPLES
 --------
 
@@ -83,45 +171,6 @@ reconfigured 1 device
 reconfigured 1 device
 ----
 
-DESCRIPTION
------------
-
-Reconfigure the operational mode of a dax device. This can be used to convert
-a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
-dax range to be hot-plugged into the system as regular memory.
-
-NOTE: This is a destructive operation. Any data on the dax device *will* be
-lost.
-
-NOTE: Device reconfiguration depends on the dax-bus device model. See
-linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
-in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
-error such as the following:
-----
-# daxctl reconfigure-device --mode=system-ram --region=0 all
-libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
-dax3.0: disable failed: Operation not supported
-error reconfiguring devices: Operation not supported
-reconfigured 0 devices
-----
-
-'daxctl-reconfigure-device' nominally expects that it will online new memory
-blocks as 'movable', so that kernel data doesn't make it into this memory.
-However, there are other potential agents that may be configured to
-automatically online new hot-plugged memory as it appears. Most notably,
-these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
-or system udev rules. If such an agent races to online memory sections, daxctl
-checks if the blocks were onlined as 'movable' memory. If this was not the
-case, and the memory blocks are found to be in a different zone, then a
-warning is displayed. If it is desired that a different agent control the
-onlining of memory blocks, and the associated memory zone, then it is
-recommended to use the --no-online option described below. This will abridge
-the device reconfiguration operation to just hotplugging the memory, and
-refrain from then onlining it.
-
-In case daxctl detects that there is a kernel policy to auto-online blocks
-(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
-system-ram will result in a failure. This can be overridden with '--force'.
 
 OPTIONS
 -------
-- 
2.27.0