Blame SOURCES/0007-copy-Use-preferred-block-size-for-copying.patch

5893ea
From c8626acc63c4ae1c6cf5d1505e0209ac10f44e81 Mon Sep 17 00:00:00 2001
5893ea
From: "Richard W.M. Jones" <rjones@redhat.com>
5893ea
Date: Tue, 28 Jun 2022 21:58:55 +0100
5893ea
Subject: [PATCH] copy: Use preferred block size for copying
5893ea
5893ea
You're not supposed to read or write NBD servers at a granularity less
5893ea
than the advertised minimum block size.  nbdcopy has ignored this
5893ea
requirement, and this is usually fine because the NBD servers we care
5893ea
about support 512-byte sector granularity, and never advertise sizes /
5893ea
extents less granular than sectors (even if it's a bit suboptimal in a
5893ea
few cases).
5893ea
5893ea
However there is one new case where we do care: When writing to a
5893ea
compressed qcow2 file, qemu advertises a minimum and preferred block
5893ea
size of 64K, and it really means it.  You cannot write blocks smaller
5893ea
than this because of the way qcow2 compression is implemented.
5893ea
5893ea
This commit attempts to do the least work possible to fix this.
5893ea
5893ea
The previous multi-thread-copying loop was driven by the extent map
5893ea
received from the source.  I have modified the loop so that it
5893ea
iterates over request_size blocks.  request_size is set from the
5893ea
command line (--request-size) but will be adjusted upwards if either
5893ea
the source or destination preferred block size is larger.  So this
5893ea
will always copy blocks which are at least the preferred block size
5893ea
(except for the very last block of the disk).
5893ea
5893ea
While copying these blocks we consult the source extent map.  If it
5893ea
contains only zero regions covering the whole block (only_zeroes
5893ea
function) then we can skip straight to zeroing the target
5893ea
(fill_dst_range_with_zeroes), else we do read + write as before.
5893ea
5893ea
I only modified the multi-thread-copying loop, not the synchronous
5893ea
loop.  That should be updated in the same way later.
5893ea
5893ea
One side effect of this change is it always makes larger requests,
5893ea
even for regions we know are sparse.  This is clear in the
5893ea
copy-sparse.sh and copy-sparse-allocated.sh tests which were
5893ea
previously driven by the 32K sparse map granularity of the source.
5893ea
Without changing these tests, they would make make 256K reads & writes
5893ea
(and also read from areas of the disk even though we know they are
5893ea
sparse).  I adjusted these tests to use --request-size=32768 to force
5893ea
the existing behaviour.
5893ea
5893ea
Note this doesn't attempt to limit the maximum block size when reading
5893ea
or writing.  That is for future work.
5893ea
5893ea
This is a partial fix for https://bugzilla.redhat.com/2047660.
5893ea
Further changes will be required in virt-v2v.
5893ea
5893ea
Link: https://lists.gnu.org/archive/html/qemu-block/2022-01/threads.html#00729
5893ea
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2047660
5893ea
(cherry picked from commit 4058fe1ff03fb41156b67302ba1006b9d06b0218)
5893ea
---
5893ea
 TODO                                  |   4 +-
5893ea
 copy/Makefile.am                      |   6 +-
5893ea
 copy/copy-file-to-qcow2-compressed.sh |  64 +++++++++++
5893ea
 copy/copy-sparse-allocated.sh         |   4 +-
5893ea
 copy/copy-sparse.sh                   |   7 +-
5893ea
 copy/main.c                           |  13 +++
5893ea
 copy/multi-thread-copying.c           | 149 +++++++++++++++++++-------
5893ea
 copy/nbdcopy.pod                      |   5 +-
5893ea
 8 files changed, 202 insertions(+), 50 deletions(-)
5893ea
 create mode 100755 copy/copy-file-to-qcow2-compressed.sh
5893ea
5893ea
diff --git a/TODO b/TODO
5893ea
index 7c9c15e..bc38d70 100644
5893ea
--- a/TODO
5893ea
+++ b/TODO
5893ea
@@ -28,7 +28,9 @@ Performance: Chart it over various buffer sizes and threads, as that
5893ea
 Examine other fuzzers: https://gitlab.com/akihe/radamsa
5893ea
 
5893ea
 nbdcopy:
5893ea
- - Minimum/preferred/maximum block size.
5893ea
+ - Enforce maximum block size.
5893ea
+ - Synchronous loop should be adjusted to take into account
5893ea
+   the NBD preferred block size, as was done for multi-thread loop.
5893ea
  - Benchmark.
5893ea
  - Better page cache usage, see nbdkit-file-plugin options
5893ea
    fadvise=sequential cache=none.
5893ea
diff --git a/copy/Makefile.am b/copy/Makefile.am
5893ea
index e729f86..25f75c5 100644
5893ea
--- a/copy/Makefile.am
5893ea
+++ b/copy/Makefile.am
5893ea
@@ -23,6 +23,7 @@ EXTRA_DIST = \
5893ea
 	copy-file-to-nbd.sh \
5893ea
 	copy-file-to-null.sh \
5893ea
 	copy-file-to-qcow2.sh \
5893ea
+	copy-file-to-qcow2-compressed.sh \
5893ea
 	copy-nbd-to-block.sh \
5893ea
 	copy-nbd-to-file.sh \
5893ea
 	copy-nbd-to-hexdump.sh \
5893ea
@@ -142,7 +143,10 @@ TESTS += \
5893ea
 	$(NULL)
5893ea
 
5893ea
 if HAVE_QEMU_NBD
5893ea
-TESTS += copy-file-to-qcow2.sh
5893ea
+TESTS += \
5893ea
+	copy-file-to-qcow2.sh \
5893ea
+	copy-file-to-qcow2-compressed.sh \
5893ea
+	$(NULL)
5893ea
 endif
5893ea
 
5893ea
 if HAVE_GNUTLS
5893ea
diff --git a/copy/copy-file-to-qcow2-compressed.sh b/copy/copy-file-to-qcow2-compressed.sh
5893ea
new file mode 100755
5893ea
index 0000000..dfe4fa5
5893ea
--- /dev/null
5893ea
+++ b/copy/copy-file-to-qcow2-compressed.sh
5893ea
@@ -0,0 +1,64 @@
5893ea
+#!/usr/bin/env bash
5893ea
+# nbd client library in userspace
5893ea
+# Copyright (C) 2020-2022 Red Hat Inc.
5893ea
+#
5893ea
+# This library is free software; you can redistribute it and/or
5893ea
+# modify it under the terms of the GNU Lesser General Public
5893ea
+# License as published by the Free Software Foundation; either
5893ea
+# version 2 of the License, or (at your option) any later version.
5893ea
+#
5893ea
+# This library is distributed in the hope that it will be useful,
5893ea
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
5893ea
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
5893ea
+# Lesser General Public License for more details.
5893ea
+#
5893ea
+# You should have received a copy of the GNU Lesser General Public
5893ea
+# License along with this library; if not, write to the Free Software
5893ea
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
5893ea
+
5893ea
+. ../tests/functions.sh
5893ea
+
5893ea
+set -e
5893ea
+set -x
5893ea
+
5893ea
+requires $QEMU_NBD --version
5893ea
+requires nbdkit --exit-with-parent --version
5893ea
+requires nbdkit sparse-random --dump-plugin
5893ea
+requires qemu-img --version
5893ea
+requires stat --version
5893ea
+
5893ea
+file1=copy-file-to-qcow2-compressed.file1
5893ea
+file2=copy-file-to-qcow2-compressed.file2
5893ea
+rm -f $file1 $file2
5893ea
+cleanup_fn rm -f $file1 $file2
5893ea
+
5893ea
+size=1G
5893ea
+seed=$RANDOM
5893ea
+
5893ea
+# Create a compressed qcow2 file1.
5893ea
+#
5893ea
+# sparse-random files should compress easily because by default each
5893ea
+# block uses repeated bytes.
5893ea
+qemu-img create -f qcow2 $file1 $size
5893ea
+nbdcopy -- [ nbdkit --exit-with-parent sparse-random $size seed=$seed ] \
5893ea
+        [ $QEMU_NBD --image-opts driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=$file1 ]
5893ea
+
5893ea
+ls -l $file1
5893ea
+
5893ea
+# Create an uncompressed qcow2 file2 with the same data.
5893ea
+qemu-img create -f qcow2 $file2 $size
5893ea
+nbdcopy -- [ nbdkit --exit-with-parent sparse-random $size seed=$seed ] \
5893ea
+        [ $QEMU_NBD --image-opts driver=qcow2,file.driver=file,file.filename=$file2 ]
5893ea
+
5893ea
+ls -l $file2
5893ea
+
5893ea
+# file1 < file2 (shows the compression is having some effect).
5893ea
+size1="$( stat -c %s $file1 )"
5893ea
+size2="$( stat -c %s $file2 )"
5893ea
+if [ $size1 -ge $size2 ]; then
5893ea
+    echo "$0: qcow2 compression did not make the file smaller"
5893ea
+    exit 1
5893ea
+fi
5893ea
+
5893ea
+# Logical content of the files should be identical.
5893ea
+qemu-img compare -f qcow2 $file1 -F qcow2 $file2
5893ea
diff --git a/copy/copy-sparse-allocated.sh b/copy/copy-sparse-allocated.sh
5893ea
index 203c3b9..465e347 100755
5893ea
--- a/copy/copy-sparse-allocated.sh
5893ea
+++ b/copy/copy-sparse-allocated.sh
5893ea
@@ -17,8 +17,6 @@
5893ea
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
5893ea
 
5893ea
 # Adapted from copy-sparse.sh.
5893ea
-#
5893ea
-# This test depends on the nbdkit default sparse block size (32K).
5893ea
 
5893ea
 . ../tests/functions.sh
5893ea
 
5893ea
@@ -33,7 +31,7 @@ requires nbdkit eval --version
5893ea
 out=copy-sparse-allocated.out
5893ea
 cleanup_fn rm -f $out
5893ea
 
5893ea
-$VG nbdcopy --allocated -- \
5893ea
+$VG nbdcopy --allocated --request-size=32768 -- \
5893ea
     [ nbdkit --exit-with-parent data data='
5893ea
              1
5893ea
              @1073741823 1
5893ea
diff --git a/copy/copy-sparse.sh b/copy/copy-sparse.sh
5893ea
index 1a6da86..7912a21 100755
5893ea
--- a/copy/copy-sparse.sh
5893ea
+++ b/copy/copy-sparse.sh
5893ea
@@ -16,8 +16,6 @@
5893ea
 # License along with this library; if not, write to the Free Software
5893ea
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
5893ea
 
5893ea
-# This test depends on the nbdkit default sparse block size (32K).
5893ea
-
5893ea
 . ../tests/functions.sh
5893ea
 
5893ea
 set -e
5893ea
@@ -34,8 +32,9 @@ cleanup_fn rm -f $out
5893ea
 # Copy from a sparse data disk to an nbdkit-eval-plugin instance which
5893ea
 # is logging everything.  This allows us to see exactly what nbdcopy
5893ea
 # is writing, to ensure it is writing and zeroing the target as
5893ea
-# expected.
5893ea
-$VG nbdcopy -S 0 -- \
5893ea
+# expected.  Force request size to match nbdkit default sparse
5893ea
+# allocator block size (32K).
5893ea
+$VG nbdcopy -S 0 --request-size=32768 -- \
5893ea
     [ nbdkit --exit-with-parent data data='
5893ea
              1
5893ea
              @1073741823 1
5893ea
diff --git a/copy/main.c b/copy/main.c
5893ea
index 19ec384..0e27db8 100644
5893ea
--- a/copy/main.c
5893ea
+++ b/copy/main.c
5893ea
@@ -40,6 +40,7 @@
5893ea
 
5893ea
 #include "ispowerof2.h"
5893ea
 #include "human-size.h"
5893ea
+#include "minmax.h"
5893ea
 #include "version.h"
5893ea
 #include "nbdcopy.h"
5893ea
 
5893ea
@@ -379,10 +380,22 @@ main (int argc, char *argv[])
5893ea
   if (threads < connections)
5893ea
     connections = threads;
5893ea
 
5893ea
+  /* request_size must always be at least as large as the preferred
5893ea
+   * size of source & destination.
5893ea
+   */
5893ea
+  request_size = MAX (request_size, src->preferred);
5893ea
+  request_size = MAX (request_size, dst->preferred);
5893ea
+
5893ea
   /* Adapt queue to size to request size if needed. */
5893ea
   if (request_size > queue_size)
5893ea
     queue_size = request_size;
5893ea
 
5893ea
+  /* Sparse size (if using) must not be smaller than the destination
5893ea
+   * preferred size, otherwise we end up creating too small requests.
5893ea
+   */
5893ea
+  if (sparse_size > 0 && sparse_size < dst->preferred)
5893ea
+    sparse_size = dst->preferred;
5893ea
+
5893ea
   /* Truncate the destination to the same size as the source.  Only
5893ea
    * has an effect on regular files.
5893ea
    */
5893ea
diff --git a/copy/multi-thread-copying.c b/copy/multi-thread-copying.c
5893ea
index 06cdb8e..9267545 100644
5893ea
--- a/copy/multi-thread-copying.c
5893ea
+++ b/copy/multi-thread-copying.c
5893ea
@@ -166,6 +166,62 @@ decrease_queue_size (struct worker *worker, size_t len)
5893ea
   worker->queue_size -= len;
5893ea
 }
5893ea
 
5893ea
+/* Using the extents map 'exts', check if the region
5893ea
+ * [offset..offset+len-1] intersects only with zero extents.
5893ea
+ *
5893ea
+ * The invariant for '*i' is always an extent which starts before or
5893ea
+ * equal to the current offset.
5893ea
+ */
5893ea
+static bool
5893ea
+only_zeroes (const extent_list exts, size_t *i,
5893ea
+             uint64_t offset, unsigned len)
5893ea
+{
5893ea
+  size_t j;
5893ea
+
5893ea
+  /* Invariant. */
5893ea
+  assert (*i < exts.len);
5893ea
+  assert (exts.ptr[*i].offset <= offset);
5893ea
+
5893ea
+  /* Update the invariant.  Search for the last possible extent in the
5893ea
+   * list which is <= offset.
5893ea
+   */
5893ea
+  for (j = *i + 1; j < exts.len; ++j) {
5893ea
+    if (exts.ptr[j].offset <= offset)
5893ea
+      *i = j;
5893ea
+    else
5893ea
+      break;
5893ea
+  }
5893ea
+
5893ea
+  /* Check invariant again. */
5893ea
+  assert (*i < exts.len);
5893ea
+  assert (exts.ptr[*i].offset <= offset);
5893ea
+
5893ea
+  /* If *i is not the last extent, then the next extent starts
5893ea
+   * strictly beyond our current offset.
5893ea
+   */
5893ea
+  assert (*i == exts.len - 1 || exts.ptr[*i + 1].offset > offset);
5893ea
+
5893ea
+  /* Search forward, look for any non-zero extents overlapping the region. */
5893ea
+  for (j = *i; j < exts.len; ++j) {
5893ea
+    uint64_t start, end;
5893ea
+
5893ea
+    /* [start..end-1] is the current extent. */
5893ea
+    start = exts.ptr[j].offset;
5893ea
+    end = exts.ptr[j].offset + exts.ptr[j].length;
5893ea
+
5893ea
+    assert (end > offset);
5893ea
+
5893ea
+    if (start >= offset + len)
5893ea
+      break;
5893ea
+
5893ea
+    /* Non-zero extent covering this region => test failed. */
5893ea
+    if (!exts.ptr[j].zero)
5893ea
+      return false;
5893ea
+  }
5893ea
+
5893ea
+  return true;
5893ea
+}
5893ea
+
5893ea
 /* There are 'threads' worker threads, each copying work ranges from
5893ea
  * src to dst until there are no more work ranges.
5893ea
  */
5893ea
@@ -177,7 +233,10 @@ worker_thread (void *wp)
5893ea
   extent_list exts = empty_vector;
5893ea
 
5893ea
   while (get_next_offset (&offset, &count)) {
5893ea
-    size_t i;
5893ea
+    struct command *command;
5893ea
+    size_t extent_index;
5893ea
+    bool is_zeroing = false;
5893ea
+    uint64_t zeroing_start = 0; /* initialized to avoid bogus GCC warning */
5893ea
 
5893ea
     assert (0 < count && count <= THREAD_WORK_SIZE);
5893ea
     if (extents)
5893ea
@@ -185,52 +244,64 @@ worker_thread (void *wp)
5893ea
     else
5893ea
       default_get_extents (src, w->index, offset, count, &exts);
5893ea
 
5893ea
-    for (i = 0; i < exts.len; ++i) {
5893ea
-      struct command *command;
5893ea
-      size_t len;
5893ea
+    extent_index = 0; // index into extents array used to optimize only_zeroes
5893ea
+    while (count) {
5893ea
+      const size_t len = MIN (count, request_size);
5893ea
 
5893ea
-      if (exts.ptr[i].zero) {
5893ea
+      if (only_zeroes (exts, &extent_index, offset, len)) {
5893ea
         /* The source is zero so we can proceed directly to skipping,
5893ea
-         * fast zeroing, or writing zeroes at the destination.
5893ea
+         * fast zeroing, or writing zeroes at the destination.  Defer
5893ea
+         * zeroing so we can send it as a single large command.
5893ea
          */
5893ea
-        command = create_command (exts.ptr[i].offset, exts.ptr[i].length,
5893ea
-                                  true, w);
5893ea
-        fill_dst_range_with_zeroes (command);
5893ea
+        if (!is_zeroing) {
5893ea
+          is_zeroing = true;
5893ea
+          zeroing_start = offset;
5893ea
+        }
5893ea
       }
5893ea
-
5893ea
       else /* data */ {
5893ea
-        /* As the extent might be larger than permitted for a single
5893ea
-         * command, we may have to split this into multiple read
5893ea
-         * requests.
5893ea
-         */
5893ea
-        while (exts.ptr[i].length > 0) {
5893ea
-          len = exts.ptr[i].length;
5893ea
-          if (len > request_size)
5893ea
-            len = request_size;
5893ea
-
5893ea
-          command = create_command (exts.ptr[i].offset, len,
5893ea
-                                    false, w);
5893ea
-
5893ea
-          wait_for_request_slots (w);
5893ea
-
5893ea
-          /* NOTE: Must increase the queue size after waiting. */
5893ea
-          increase_queue_size (w, len);
5893ea
-
5893ea
-          /* Begin the asynch read operation. */
5893ea
-          src->ops->asynch_read (src, command,
5893ea
-                                 (nbd_completion_callback) {
5893ea
-                                   .callback = finished_read,
5893ea
-                                   .user_data = command,
5893ea
-                                 });
5893ea
-
5893ea
-          exts.ptr[i].offset += len;
5893ea
-          exts.ptr[i].length -= len;
5893ea
+        /* If we were in the middle of deferred zeroing, do it now. */
5893ea
+        if (is_zeroing) {
5893ea
+          /* Note that offset-zeroing_start can never exceed
5893ea
+           * THREAD_WORK_SIZE, so there is no danger of overflowing
5893ea
+           * size_t.
5893ea
+           */
5893ea
+          command = create_command (zeroing_start, offset-zeroing_start,
5893ea
+                                    true, w);
5893ea
+          fill_dst_range_with_zeroes (command);
5893ea
+          is_zeroing = false;
5893ea
         }
5893ea
+
5893ea
+        /* Issue the asynchronous read command. */
5893ea
+        command = create_command (offset, len, false, w);
5893ea
+
5893ea
+        wait_for_request_slots (w);
5893ea
+
5893ea
+        /* NOTE: Must increase the queue size after waiting. */
5893ea
+        increase_queue_size (w, len);
5893ea
+
5893ea
+        /* Begin the asynch read operation. */
5893ea
+        src->ops->asynch_read (src, command,
5893ea
+                               (nbd_completion_callback) {
5893ea
+                                 .callback = finished_read,
5893ea
+                                 .user_data = command,
5893ea
+                               });
5893ea
       }
5893ea
 
5893ea
-      offset += count;
5893ea
-      count = 0;
5893ea
-    } /* for extents */
5893ea
+      offset += len;
5893ea
+      count -= len;
5893ea
+    } /* while (count) */
5893ea
+
5893ea
+    /* If we were in the middle of deferred zeroing, do it now. */
5893ea
+    if (is_zeroing) {
5893ea
+      /* Note that offset-zeroing_start can never exceed
5893ea
+       * THREAD_WORK_SIZE, so there is no danger of overflowing
5893ea
+       * size_t.
5893ea
+       */
5893ea
+      command = create_command (zeroing_start, offset - zeroing_start,
5893ea
+                                true, w);
5893ea
+      fill_dst_range_with_zeroes (command);
5893ea
+      is_zeroing = false;
5893ea
+    }
5893ea
   }
5893ea
 
5893ea
   /* Wait for in flight NBD requests to finish. */
5893ea
diff --git a/copy/nbdcopy.pod b/copy/nbdcopy.pod
5893ea
index fd10f7c..f06d112 100644
5893ea
--- a/copy/nbdcopy.pod
5893ea
+++ b/copy/nbdcopy.pod
5893ea
@@ -182,8 +182,9 @@ Set the maximum number of requests in flight per NBD connection.
5893ea
 =item B<--sparse=>N
5893ea
 
5893ea
 Detect all zero blocks of size N (bytes) and make them sparse on the
5893ea
-output.  You can also turn off sparse detection using S<I<-S 0>>.
5893ea
-The default is 4096 bytes.
5893ea
+output.  You can also turn off sparse detection using S<I<-S 0>>.  The
5893ea
+default is 4096 bytes, or the destination preferred block size,
5893ea
+whichever is larger.
5893ea
 
5893ea
 =item B<--synchronous>
5893ea
 
5893ea
-- 
5893ea
2.31.1
5893ea