Blame SOURCES/kvm-util-mmap-alloc-support-MAP_SYNC-in-qemu_ram_mmap.patch

4ec855
From 4438710f7aa42f55d189d1b6adb09b1c0471495e Mon Sep 17 00:00:00 2001
4ec855
From: "plai@redhat.com" <plai@redhat.com>
4ec855
Date: Tue, 20 Aug 2019 16:12:51 +0100
4ec855
Subject: [PATCH 04/11] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
4ec855
4ec855
RH-Author: plai@redhat.com
4ec855
Message-id: <1566317571-5697-5-git-send-email-plai@redhat.com>
4ec855
Patchwork-id: 90085
4ec855
O-Subject: [RHEL8.2 qemu-kvm PATCH 4/4] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
4ec855
Bugzilla: 1539282
4ec855
RH-Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
4ec855
RH-Acked-by: Pankaj Gupta <pagupta@redhat.com>
4ec855
RH-Acked-by: Eduardo Habkost <ehabkost@redhat.com>
4ec855
4ec855
From: Zhang Yi <yi.z.zhang@linux.intel.com>
4ec855
4ec855
When a file supporting DAX is used as a vNVDIMM backend, mmap it with
4ec855
the MAP_SYNC flag as well. This ensures that file system metadata is
4ec855
synced on each guest write to the backend file, without other QEMU
4ec855
actions (e.g., periodic fsync() by QEMU).
4ec855
4ec855
Currently, we have the following possible use cases:
4ec855
4ec855
1. pmem=on is set, shared=on is set, MAP_SYNC supported:
4ec855
   a: backend is a dax supporting file.
4ec855
    - MAP_SYNC will be active.
4ec855
   b: backend is not a dax supporting file.
4ec855
    - mmap will trigger a warning, then the MAP_SYNC flag will be ignored
4ec855
4ec855
2. The rest of cases:
4ec855
   - we will never pass MAP_SYNC to mmap2
4ec855
4ec855
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
4ec855
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
4ec855
[ehabkost: Rebased patch to latest code on master]
4ec855
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
4ec855
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
4ec855
Tested-by: Wei Yang <richardw.yang@linux.intel.com>
4ec855
Message-Id: <20190422004849.26463-2-richardw.yang@linux.intel.com>
4ec855
[ehabkost: squashed documentation patch]
4ec855
Message-Id: <20190422004849.26463-3-richardw.yang@linux.intel.com>
4ec855
[ehabkost: documentation fixup]
4ec855
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
4ec855
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
4ec855
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
4ec855
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
4ec855
4ec855
(cherry picked from commit 119906afa5ca610adb87c55ab0d8e53c9104bfc3)
4ec855
Signed-off-by: Paul Lai <plai@redhat.com>
4ec855
Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>
4ec855
---
4ec855
 docs/nvdimm.txt   | 22 +++++++++++++++++++---
4ec855
 qemu-options.hx   |  5 +++++
4ec855
 util/mmap-alloc.c | 41 ++++++++++++++++++++++++++++++++++++++++-
4ec855
 3 files changed, 64 insertions(+), 4 deletions(-)
4ec855
4ec855
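Note: the use cases listed in the commit message reduce to one rule:
request MAP_SYNC only when both 'share' and 'pmem' are on. A minimal
sketch of that rule follows. The helper name map_sync_flags_for() is
hypothetical and does not appear in QEMU, and the fallback #define
values are an assumption that holds for the generic Linux mmap flag
layout when older libc headers do not provide them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Assumed fallback values (generic Linux ABI) for older libc headers. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: return the extra mmap flags to request for a
 * backend, mirroring the "shared && is_pmem" check in the patch.
 */
int map_sync_flags_for(bool shared, bool is_pmem)
{
    if (shared && is_pmem) {
        /* MAP_SHARED_VALIDATE makes the kernel reject unknown flags
         * instead of silently ignoring MAP_SYNC. */
        return MAP_SYNC | MAP_SHARED_VALIDATE;
    }
    /* All other combinations never pass MAP_SYNC to mmap. */
    return 0;
}

/* Small self-check covering the four combinations. */
void map_sync_flags_self_check(void)
{
    assert(map_sync_flags_for(true, true) ==
           (MAP_SYNC | MAP_SHARED_VALIDATE));
    assert(map_sync_flags_for(true, false) == 0);
    assert(map_sync_flags_for(false, true) == 0);
    assert(map_sync_flags_for(false, false) == 0);
}
```

Whether MAP_SYNC actually takes effect (case 1a versus 1b) is decided
by the kernel at mmap time, not by this check.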
diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
4ec855
index 5f158a6..33ce9aa 100644
4ec855
--- a/docs/nvdimm.txt
4ec855
+++ b/docs/nvdimm.txt
4ec855
@@ -143,9 +143,25 @@ Guest Data Persistence
4ec855
 ----------------------
4ec855
 
4ec855
 Though QEMU supports multiple types of vNVDIMM backends on Linux,
4ec855
-currently the only one that can guarantee the guest write persistence
4ec855
-is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
4ec855
-which all guest access do not involve any host-side kernel cache.
4ec855
+the only backend that can guarantee the guest write persistence is:
4ec855
+
4ec855
+A. DAX device (e.g., /dev/dax0.0, ) or
4ec855
+B. DAX file(mounted with dax option)
4ec855
+
4ec855
+When using B (A file supporting direct mapping of persistent memory)
4ec855
+as a backend, write persistence is guaranteed if the host kernel has
4ec855
+support for the MAP_SYNC flag in the mmap system call (available
4ec855
+since Linux 4.15 and on certain distro kernels) and additionally
4ec855
+both 'pmem' and 'share' flags are set to 'on' on the backend.
4ec855
+
4ec855
+If these conditions are not satisfied i.e. if either 'pmem' or 'share'
4ec855
+are not set, if the backend file does not support DAX or if MAP_SYNC
4ec855
+is not supported by the host kernel, write persistence is not
4ec855
+guaranteed after a system crash. For compatibility reasons, these
4ec855
+conditions are ignored if not satisfied. Currently, no way is
4ec855
+provided to test for them.
4ec855
+For more details, please reference mmap(2) man page:
4ec855
+http://man7.org/linux/man-pages/man2/mmap.2.html.
4ec855
 
4ec855
 When using other types of backends, it's suggested to set 'unarmed'
4ec855
 option of '-device nvdimm' to 'on', which sets the unarmed flag of the
4ec855
diff --git a/qemu-options.hx b/qemu-options.hx
4ec855
index 1b6786b..1243057 100644
4ec855
--- a/qemu-options.hx
4ec855
+++ b/qemu-options.hx
4ec855
@@ -4057,6 +4057,11 @@ using the SNIA NVM programming model (e.g. Intel NVDIMM).
4ec855
 If @option{pmem} is set to 'on', QEMU will take necessary operations to
4ec855
 guarantee the persistence of its own writes to @option{mem-path}
4ec855
 (e.g. in vNVDIMM label emulation and live migration).
4ec855
+Also, we will map the backend-file with MAP_SYNC flag, which ensures the
4ec855
+file metadata is in sync for @option{mem-path} in case of host crash
4ec855
+or a power failure. MAP_SYNC requires support from both the host kernel
4ec855
+(since Linux kernel 4.15) and the filesystem of @option{mem-path} mounted
4ec855
+with DAX option.
4ec855
 
4ec855
 @item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},share=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
4ec855
 
4ec855
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
4ec855
index bbd9077..4873984 100644
4ec855
--- a/util/mmap-alloc.c
4ec855
+++ b/util/mmap-alloc.c
4ec855
@@ -10,6 +10,13 @@
4ec855
  * later.  See the COPYING file in the top-level directory.
4ec855
  */
4ec855
 
4ec855
+#ifdef CONFIG_LINUX
4ec855
+#include <linux/mman.h>
4ec855
+#else  /* !CONFIG_LINUX */
4ec855
+#define MAP_SYNC              0x0
4ec855
+#define MAP_SHARED_VALIDATE   0x0
4ec855
+#endif /* CONFIG_LINUX */
4ec855
+
4ec855
 #include "qemu/osdep.h"
4ec855
 #include "qemu/mmap-alloc.h"
4ec855
 #include "qemu/host-utils.h"
4ec855
@@ -80,6 +87,7 @@ void *qemu_ram_mmap(int fd,
4ec855
                     bool is_pmem)
4ec855
 {
4ec855
     int flags;
4ec855
+    int map_sync_flags = 0;
4ec855
     int guardfd;
4ec855
     size_t offset;
4ec855
     size_t pagesize;
4ec855
@@ -130,9 +138,40 @@ void *qemu_ram_mmap(int fd,
4ec855
     flags = MAP_FIXED;
4ec855
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
4ec855
     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
4ec855
+    if (shared && is_pmem) {
4ec855
+        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
4ec855
+    }
4ec855
+
4ec855
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
4ec855
 
4ec855
-    ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE, flags, fd, 0);
4ec855
+    ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
4ec855
+               flags | map_sync_flags, fd, 0);
4ec855
+
4ec855
+    if (ptr == MAP_FAILED && map_sync_flags) {
4ec855
+        if (errno == ENOTSUP) {
4ec855
+            char *proc_link, *file_name;
4ec855
+            int len;
4ec855
+            proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
4ec855
+            file_name = g_malloc0(PATH_MAX);
4ec855
+            len = readlink(proc_link, file_name, PATH_MAX - 1);
4ec855
+            if (len < 0) {
4ec855
+                len = 0;
4ec855
+            }
4ec855
+            file_name[len] = '\0';
4ec855
+            fprintf(stderr, "Warning: requesting persistence across crashes "
4ec855
+                    "for backend file %s failed. Proceeding without "
4ec855
+                    "persistence, data might become corrupted in case of host "
4ec855
+                    "crash.\n", file_name);
4ec855
+            g_free(proc_link);
4ec855
+            g_free(file_name);
4ec855
+        }
4ec855
+        /*
4ec855
+         * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
4ec855
+         * we will remove these flags to handle compatibility.
4ec855
+         */
4ec855
+        ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
4ec855
+                   flags, fd, 0);
4ec855
+    }
4ec855
 
4ec855
     if (ptr == MAP_FAILED) {
4ec855
         munmap(guardptr, total);
4ec855
-- 
4ec855
1.8.3.1
4ec855
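
Note: the try-then-fall-back pattern used in the util/mmap-alloc.c hunk
can be sketched outside QEMU as follows. This is an illustration under
stated assumptions, not QEMU code: map_shared_try_sync() and
demo_map_tmpfile() are hypothetical names, the mapping is placed by the
kernel rather than inside a guard region with MAP_FIXED, and the
fallback #define values assume the generic Linux mmap flag layout.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed fallback values (generic Linux ABI) for older libc headers. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: map fd shared, first asking for MAP_SYNC.
 * If that attempt fails for any reason, retry as a plain MAP_SHARED
 * mapping. *got_sync reports whether MAP_SYNC was accepted.
 */
void *map_shared_try_sync(int fd, size_t size, int *got_sync)
{
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (ptr != MAP_FAILED) {
        *got_sync = 1;
        return ptr;
    }
    if (errno == ENOTSUP) {
        /* Kernel knows MAP_SYNC but the file does not support DAX. */
        fprintf(stderr, "warning: MAP_SYNC not supported for fd %d, "
                "mapping without persistence guarantee\n", fd);
    }
    *got_sync = 0;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Usage example: map one page of a temporary file and touch it. */
int demo_map_tmpfile(void)
{
    FILE *f = tmpfile();
    if (!f) {
        return -1;
    }
    int fd = fileno(f);
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, (off_t)size) != 0) {
        fclose(f);
        return -1;
    }
    int got_sync = 0;
    char *p = map_shared_try_sync(fd, size, &got_sync);
    if (p == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    p[0] = 'x';                      /* write through the mapping */
    int ok = (p[0] == 'x') ? 0 : -1;
    munmap(p, size);
    fclose(f);
    return ok;
}
```

On a temporary file that does not live on a DAX mount, the first mmap
is expected to fail and the fallback path to succeed, which corresponds
to case 1b in the commit message.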