Pablo Greco e6a3ae
From 4438710f7aa42f55d189d1b6adb09b1c0471495e Mon Sep 17 00:00:00 2001
From: "plai@redhat.com" <plai@redhat.com>
Date: Tue, 20 Aug 2019 16:12:51 +0100
Subject: [PATCH 04/11] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()

RH-Author: plai@redhat.com
Message-id: <1566317571-5697-5-git-send-email-plai@redhat.com>
Patchwork-id: 90085
O-Subject: [RHEL8.2 qemu-kvm PATCH 4/4] util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
Bugzilla: 1539282
RH-Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
RH-Acked-by: Pankaj Gupta <pagupta@redhat.com>
RH-Acked-by: Eduardo Habkost <ehabkost@redhat.com>

From: Zhang Yi <yi.z.zhang@linux.intel.com>

When a file supporting DAX is used as a vNVDIMM backend, additionally mmap
it with the MAP_SYNC flag, which ensures that file system metadata is
synced on each guest write to the backend file, without other QEMU
actions (e.g., periodic fsync() by QEMU).

Currently, we have the following possible use cases:

1. pmem=on is set, share=on is set, MAP_SYNC supported:
   a: backend is a DAX-supporting file.
    - MAP_SYNC will be active.
   b: backend is not a DAX-supporting file.
    - mmap with MAP_SYNC fails, QEMU prints a warning and retries
      without the MAP_SYNC flag.

2. All other cases:
   - MAP_SYNC is never passed to mmap().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com>
[ehabkost: Rebased patch to latest code on master]
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Wei Yang <richardw.yang@linux.intel.com>
Message-Id: <20190422004849.26463-2-richardw.yang@linux.intel.com>
[ehabkost: squashed documentation patch]
Message-Id: <20190422004849.26463-3-richardw.yang@linux.intel.com>
[ehabkost: documentation fixup]
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>

(cherry picked from commit 119906afa5ca610adb87c55ab0d8e53c9104bfc3)
Signed-off-by: Paul Lai <plai@redhat.com>
Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>
---
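Note: the try-then-fall-back behaviour this patch adds to qemu_ram_mmap()
can be sketched standalone as below. This is a minimal illustration under
stated assumptions, not the QEMU code: map_shared_prefer_sync() and
demo_roundtrip() are hypothetical names, the #ifndef fallback values are the
asm-generic Linux ones (used only if the libc headers lack them), and the
guard-page and alignment handling of qemu_ram_mmap() is left out.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: asm-generic values, used only when libc does not define them. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper (not a QEMU function): map `size` bytes of `fd` as a
 * shared mapping, trying MAP_SYNC first and retrying without it on failure.
 * *got_sync reports whether the MAP_SYNC mapping succeeded.
 */
void *map_shared_prefer_sync(int fd, size_t size, bool *got_sync)
{
    void *ptr;

    assert(got_sync != NULL);

    /* MAP_SHARED_VALIDATE makes the kernel reject flags it cannot honor. */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    *got_sync = (ptr != MAP_FAILED);

    if (ptr == MAP_FAILED) {
        if (errno == ENOTSUP) {
            fprintf(stderr, "warning: MAP_SYNC not supported for fd %d, "
                    "mapping without it\n", fd);
        }
        /* Compatibility fallback: plain shared mapping without MAP_SYNC. */
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }
    return ptr;
}

/* Demo: map a temporary file, write through the mapping, read via the fd. */
int demo_roundtrip(void)
{
    char path[] = "/tmp/mapsync-demo-XXXXXX";
    char buf[5] = {0};
    bool got_sync = false;
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    ssize_t n;
    char *p;
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    unlink(path);                      /* file lives on until fd is closed */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    p = map_shared_prefer_sync(fd, size, &got_sync);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(p, "hello", 5);
    n = pread(fd, buf, sizeof(buf), 0);
    munmap(p, size);
    close(fd);
    return (n == 5 && memcmp(buf, "hello", 5) == 0) ? 0 : -1;
}
```

On a filesystem without DAX the first mmap() is expected to fail, and the
helper then returns a plain MAP_SHARED mapping with *got_sync set to false.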
 docs/nvdimm.txt   | 22 +++++++++++++++++++---
 qemu-options.hx   |  5 +++++
 util/mmap-alloc.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index 5f158a6..33ce9aa 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -143,9 +143,25 @@ Guest Data Persistence
 ----------------------
 
 Though QEMU supports multiple types of vNVDIMM backends on Linux,
-currently the only one that can guarantee the guest write persistence
-is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
-which all guest access do not involve any host-side kernel cache.
+the only backends that can guarantee guest write persistence are:
+
+A. DAX device (e.g., /dev/dax0.0) or
+B. DAX file (mounted with dax option)
+
+When using B (a file supporting direct mapping of persistent memory)
+as a backend, write persistence is guaranteed if the host kernel has
+support for the MAP_SYNC flag in the mmap system call (available
+since Linux 4.15 and on certain distro kernels) and additionally
+both 'pmem' and 'share' flags are set to 'on' on the backend.
+
+If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
+is not set, if the backend file does not support DAX or if MAP_SYNC
+is not supported by the host kernel, write persistence is not
+guaranteed after a system crash. For compatibility reasons, these
+conditions are ignored if not satisfied. Currently, no way is
+provided to test for them.
+For more details, please refer to the mmap(2) man page:
+http://man7.org/linux/man-pages/man2/mmap.2.html.
 
 When using other types of backends, it's suggested to set 'unarmed'
 option of '-device nvdimm' to 'on', which sets the unarmed flag of the
diff --git a/qemu-options.hx b/qemu-options.hx
index 1b6786b..1243057 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4057,6 +4057,11 @@ using the SNIA NVM programming model (e.g. Intel NVDIMM).
 If @option{pmem} is set to 'on', QEMU will take necessary operations to
 guarantee the persistence of its own writes to @option{mem-path}
 (e.g. in vNVDIMM label emulation and live migration).
+Also, we will map the backend file with the MAP_SYNC flag, which ensures the
+file metadata is in sync for @option{mem-path} in case of a host crash
+or a power failure. MAP_SYNC requires support from both the host kernel
+(since Linux kernel 4.15) and the filesystem of @option{mem-path} mounted
+with the DAX option.
 
 @item -object memory-backend-ram,id=@var{id},merge=@var{on|off},dump=@var{on|off},share=@var{on|off},prealloc=@var{on|off},size=@var{size},host-nodes=@var{host-nodes},policy=@var{default|preferred|bind|interleave}
 
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index bbd9077..4873984 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -10,6 +10,13 @@
  * later.  See the COPYING file in the top-level directory.
  */
 
+#ifdef CONFIG_LINUX
+#include <linux/mman.h>
+#else  /* !CONFIG_LINUX */
+#define MAP_SYNC              0x0
+#define MAP_SHARED_VALIDATE   0x0
+#endif /* CONFIG_LINUX */
+
 #include "qemu/osdep.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/host-utils.h"
@@ -80,6 +87,7 @@ void *qemu_ram_mmap(int fd,
                     bool is_pmem)
 {
     int flags;
+    int map_sync_flags = 0;
     int guardfd;
     size_t offset;
     size_t pagesize;
@@ -130,9 +138,40 @@ void *qemu_ram_mmap(int fd,
     flags = MAP_FIXED;
     flags |= fd == -1 ? MAP_ANONYMOUS : 0;
     flags |= shared ? MAP_SHARED : MAP_PRIVATE;
+    if (shared && is_pmem) {
+        map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE;
+    }
+
     offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr;
 
-    ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE, flags, fd, 0);
+    ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
+               flags | map_sync_flags, fd, 0);
+
+    if (ptr == MAP_FAILED && map_sync_flags) {
+        if (errno == ENOTSUP) {
+            char *proc_link, *file_name;
+            int len;
+            proc_link = g_strdup_printf("/proc/self/fd/%d", fd);
+            file_name = g_malloc0(PATH_MAX);
+            len = readlink(proc_link, file_name, PATH_MAX - 1);
+            if (len < 0) {
+                len = 0;
+            }
+            file_name[len] = '\0';
+            fprintf(stderr, "Warning: requesting persistence across crashes "
+                    "for backend file %s failed. Proceeding without "
+                    "persistence, data might become corrupted in case of host "
+                    "crash.\n", file_name);
+            g_free(proc_link);
+            g_free(file_name);
+        }
+        /*
+         * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC,
+         * we will remove these flags to handle compatibility.
+         */
+        ptr = mmap(guardptr + offset, size, PROT_READ | PROT_WRITE,
+                   flags, fd, 0);
+    }
 
     if (ptr == MAP_FAILED) {
         munmap(guardptr, total);
-- 
1.8.3.1