From 8f4e42c0c526e85b045fd0329df7cb904f511c98 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Thu, 7 Oct 2021 14:59:53 -0700
Subject: [PATCH 049/217] daxctl: Add "Soft Reservation" theory of operation

As systems are starting to ship memory with the EFI "Special Purpose"
attribute that Linux optionally turns into "Soft Reserved" ranges, one of
the first questions is "where is my special memory, and how do I access
it". Add some documentation to explain the default behaviour of
"Soft Reserved".

Link: https://lore.kernel.org/r/163364399303.201290.6835215953983673447.stgit@dwillia2-desk3.amr.corp.intel.com
Reported-by: John Groves <john@jagalactic.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
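Note (not part of the commit): the documentation added below suggests
dumping /proc/iomem and looking for "Soft Reserved" ranges. A minimal
sketch of that check follows. It assumes the usual "start-end : label"
layout of /proc/iomem lines, and the helper name find_soft_reserved is
hypothetical, not part of daxctl or libdaxctl.

```python
import re

# One /proc/iomem line, e.g. "  100000000-27fffffff : Soft Reserved".
# Nested (child) resources are indented, so leading whitespace is allowed.
_IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def find_soft_reserved(iomem_text):
    """Return (start, end) address pairs for ranges labelled "Soft Reserved".

    iomem_text is the full text of /proc/iomem. Both bounds are inclusive,
    matching how the kernel prints them.
    """
    ranges = []
    for line in iomem_text.splitlines():
        match = _IOMEM_LINE.match(line)
        if match and match.group(3).strip() == "Soft Reserved":
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            ranges.append((start, end))
    return ranges
```

To use it, pass in the contents of /proc/iomem read as root; unprivileged
readers typically see zeroed addresses. An empty result is consistent with
the kernel not having created any "Soft Reserved" ranges, for example
because it was booted with efi=nosoftreserve.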
 .../daxctl/daxctl-reconfigure-device.txt | 127 ++++++++++++------
 1 file changed, 88 insertions(+), 39 deletions(-)

diff --git a/Documentation/daxctl/daxctl-reconfigure-device.txt b/Documentation/daxctl/daxctl-reconfigure-device.txt
index f112b3c..132684c 100644
--- a/Documentation/daxctl/daxctl-reconfigure-device.txt
+++ b/Documentation/daxctl/daxctl-reconfigure-device.txt
@@ -12,6 +12,94 @@ SYNOPSIS
 [verse]
 'daxctl reconfigure-device' <dax0.0> [<dax1.0>...<daxY.Z>] [<options>]

+DESCRIPTION
+-----------
+
+Reconfigure the operational mode of a dax device. This can be used to convert
+a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
+dax range to be hot-plugged into the system as regular memory.
+
+NOTE: This is a destructive operation. Any data on the dax device *will* be
+lost.
+
+NOTE: Device reconfiguration depends on the dax-bus device model. See
+linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
+in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
+error such as the following:
+----
+# daxctl reconfigure-device --mode=system-ram --region=0 all
+libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
+dax3.0: disable failed: Operation not supported
+error reconfiguring devices: Operation not supported
+reconfigured 0 devices
+----
+
+'daxctl-reconfigure-device' nominally expects that it will online new memory
+blocks as 'movable', so that kernel data doesn't make it into this memory.
+However, there are other potential agents that may be configured to
+automatically online new hot-plugged memory as it appears. Most notably,
+these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
+or system udev rules. If such an agent races to online memory sections, daxctl
+checks if the blocks were onlined as 'movable' memory. If this was not the
+case, and the memory blocks are found to be in a different zone, then a
+warning is displayed. If it is desired that a different agent control the
+onlining of memory blocks, and the associated memory zone, then it is
+recommended to use the --no-online option described below. This will abridge
+the device reconfiguration operation to just hotplugging the memory, and
+refrain from then onlining it.
+
+In case daxctl detects that there is a kernel policy to auto-online blocks
+(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
+system-ram will result in a failure. This can be overridden with '--force'.
+
+
+THEORY OF OPERATION
+-------------------
+The kernel device-dax subsystem surfaces character devices
+that provide DAX access (direct mappings without page-cache buffering) to a
+given memory region. The devices are named /dev/daxX.Y where X is a
+region-id and Y is an instance-id within that region. There are two
+mechanisms that trigger device-dax instances to appear:
+
+1. Persistent Memory (PMEM) namespace configured in "devdax" mode. See
+"ndctl create-namespace --help" and
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_PMEM].
+In this case the device-dax instance is statically sized to its host
+memory region, which is bounded by the physical address range of the host
+namespace.
+
+2. Soft Reserved memory enumerated by platform firmware. On EFI systems
+this is communicated via the so-called EFI_MEMORY_SP "Special Purpose"
+attribute. See
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/dax/Kconfig[CONFIG_DEV_DAX_HMEM].
+In this case the device-dax instance(s) associated with the given memory
+region can be resized and divided into multiple devices.
+
+In the Soft Reservation case, the expectation for EFI + ACPI based
+platforms is that, in addition to the EFI_MEMORY_SP attribute, the
+firmware also creates distinct ACPI proximity domains for any address
+range that has different performance characteristics than default
+"System RAM". So, the SRAT defines the proximity domain, the SLIT
+communicates relative distance to other proximity domains, and the HMAT
+is populated with nominal read/write latency and read/write bandwidth
+data. That HMAT data is emitted to the kernel log on bootup, and also
+exported to sysfs. See
+https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html[NUMAPERF]
+for the runtime representation of CPU to Memory node performance
+details.
+
+Outside of the NUMA performance details linked above, the other method to
+detect the presence of "Soft Reserved" memory is to dump /proc/iomem and
+look for "Soft Reserved" ranges. If the kernel was not built with
+CONFIG_EFI_SOFTRESERVE, predates the introduction of
+CONFIG_EFI_SOFTRESERVE (v5.5), or was booted with the efi=nosoftreserve
+command line option, then device-dax will not attach and the expectation is
+that the memory shows up as a memory-only NUMA node. Otherwise the memory
+shows up as a device-dax instance and DAXCTL(1) can be used to
+optionally partition it and assign the memory back to the kernel as
+"System RAM", or the device can be mapped directly as the back end of a
+userspace memory allocator like https://pmem.io/vmem/libvmem/[LIBVMEM].
+
 EXAMPLES
 --------

@@ -83,45 +171,6 @@ reconfigured 1 device
 reconfigured 1 device
 ----

-DESCRIPTION
------------
-
-Reconfigure the operational mode of a dax device. This can be used to convert
-a regular 'devdax' mode device to the 'system-ram' mode which arranges for the
-dax range to be hot-plugged into the system as regular memory.
-
-NOTE: This is a destructive operation. Any data on the dax device *will* be
-lost.
-
-NOTE: Device reconfiguration depends on the dax-bus device model. See
-linkdaxctl:daxctl-migrate-device-model[1] for more information. If dax-class is
-in use (via the dax_pmem_compat driver), the reconfiguration will fail with an
-error such as the following:
-----
-# daxctl reconfigure-device --mode=system-ram --region=0 all
-libdaxctl: daxctl_dev_disable: dax3.0: error: device model is dax-class
-dax3.0: disable failed: Operation not supported
-error reconfiguring devices: Operation not supported
-reconfigured 0 devices
-----
-
-'daxctl-reconfigure-device' nominally expects that it will online new memory
-blocks as 'movable', so that kernel data doesn't make it into this memory.
-However, there are other potential agents that may be configured to
-automatically online new hot-plugged memory as it appears. Most notably,
-these are the '/sys/devices/system/memory/auto_online_blocks' configuration,
-or system udev rules. If such an agent races to online memory sections, daxctl
-checks if the blocks were onlined as 'movable' memory. If this was not the
-case, and the memory blocks are found to be in a different zone, then a
-warning is displayed. If it is desired that a different agent control the
-onlining of memory blocks, and the associated memory zone, then it is
-recommended to use the --no-online option described below. This will abridge
-the device reconfiguration operation to just hotplugging the memory, and
-refrain from then onlining it.
-
-In case daxctl detects that there is a kernel policy to auto-online blocks
-(via /sys/devices/system/memory/auto_online_blocks), then reconfiguring to
-system-ram will result in a failure. This can be overridden with '--force'.

 OPTIONS
 -------
--
2.27.0